arXiv: v3 [cs.CV] 31 Oct 2016


Universal Correspondence Network

Christopher B. Choy, Stanford University
JunYoung Gwak, Stanford University
Silvio Savarese, Stanford University
Manmohan Chandraker, NEC Laboratories America, Inc.

Abstract

We present a deep learning framework for accurate visual correspondences and demonstrate its effectiveness for both geometric and semantic matching, spanning across rigid motions to intra-class shape or appearance variations. In contrast to previous CNN-based approaches that optimize a surrogate patch similarity objective, we use deep metric learning to directly learn a feature space that preserves either geometric or semantic similarity. Our fully convolutional architecture, along with a novel correspondence contrastive loss, allows faster training by effective reuse of computations, accurate gradient computation through the use of thousands of examples per image pair and faster testing with O(n) feed forward passes for n keypoints, instead of O(n^2) for typical patch similarity methods. We propose a convolutional spatial transformer to mimic patch normalization in traditional features like SIFT, which is shown to dramatically boost accuracy for semantic correspondences across intra-class shape variations. Extensive experiments on KITTI, PASCAL, and CUB-2011 datasets demonstrate the significant advantages of our features over prior works that use either hand-constructed or learned features.

1 Introduction

Correspondence estimation is the workhorse that drives several fundamental problems in computer vision, such as 3D reconstruction, image retrieval or object recognition. Applications such as structure from motion or panorama stitching that demand sub-pixel accuracy rely on sparse keypoint matches using descriptors like SIFT [23]. In other cases, dense correspondences in the form of stereo disparities, optical flow or dense trajectories are used for applications such as surface reconstruction, tracking, video analysis or stabilization.
In yet other scenarios, correspondences are sought not between projections of the same 3D point in different images, but between semantic analogs across different instances within a category, such as beaks of different birds or headlights of cars. Thus, in its most general form, the notion of visual correspondence estimation spans the range from low-level feature matching to high-level object or scene understanding.

Traditionally, correspondence estimation relies on hand-designed features or domain-specific priors. In recent years, there has been an increasing interest in leveraging the power of convolutional neural networks (CNNs) to estimate visual correspondences. For example, a Siamese network may take a pair of image patches and generate their similarity as the output [1, 36, 37]. Intermediate convolution layer activations from the above CNNs are also usable as generic features. However, such intermediate activations are not optimized for the visual correspondence task. Such features are trained for a surrogate objective function (patch similarity) and do not necessarily form a metric space for visual correspondence; thus, a metric operation such as distance has no explicit interpretation. In addition, patch similarity is inherently inefficient, since features have to be extracted even for overlapping regions within patches. Further, it requires O(n^2) feed-forward passes to compare each of n patches with n other patches in a different image.

Figure 1: Various types of correspondence problems have traditionally required different specialized methods: for example, SIFT or SURF for sparse structure from motion, DAISY or DSP for dense matching, SIFT Flow or FlowWeb for semantic matching. The Universal Correspondence Network accurately and efficiently learns a metric space for geometric correspondences, dense trajectories or semantic correspondences.

In contrast, we present the Universal Correspondence Network (UCN), a CNN-based generic discriminative framework that learns both geometric and semantic visual correspondences. Unlike many previous CNNs for patch similarity, we use deep metric learning to directly learn the mapping, or feature, that preserves similarity (either geometric or semantic) for generic correspondences. The mapping is, thus, invariant to projective transformations, intra-class shape or appearance variations, or any other variations that are irrelevant to the considered similarity. We propose a novel correspondence contrastive loss that allows faster training by efficiently sharing computations and effectively encoding neighborhood relations in feature space. At test time, correspondence reduces to a nearest neighbor search in feature space, which is more efficient than evaluating pairwise patch similarities.

The UCN is fully convolutional, allowing efficient generation of dense features. We propose an on-the-fly active hard-negative mining strategy for faster training. In addition, we propose a novel adaptation of the spatial transformer [14], called the convolutional spatial transformer, designed to make our features invariant to particular families of transformations. By learning the optimal feature space that compensates for affine transformations, the convolutional spatial transformer imparts the ability to mimic patch normalization of descriptors such as SIFT. Figure 1 illustrates our framework. The capabilities of UCN are compared to a few important prior approaches in Table 1.
Empirically, the correspondences obtained from the UCN are denser and more accurate than most prior approaches specialized for a particular task. We demonstrate this experimentally by showing state-of-the-art performances on sparse SFM on KITTI, as well as dense geometric or semantic correspondences on both rigid and non-rigid bodies in KITTI, PASCAL and CUB datasets.

To summarize, we propose a novel end-to-end system that optimizes a general correspondence objective, independent of domain, with the following main contributions:

- Deep metric learning with an efficient correspondence contrastive loss for learning a feature representation that is optimized for the given correspondence task.
- Fully convolutional network for dense and efficient feature extraction, along with fast active hard negative mining.
- Fully convolutional spatial transformer for patch normalization.
- State-of-the-art correspondences across sparse SFM, dense matching and semantic matching, encompassing rigid bodies, non-rigid bodies and intra-class shape or appearance variations.

2 Related Works

Correspondences. Visual features form basic building blocks for many computer vision applications. Carefully designed features and kernel methods have influenced many fields such as structure from motion, object recognition and image classification. Several hand-designed features, such as SIFT, HOG, SURF and DAISY have found widespread applications [23, 3, 30, 8].

Figure 2: System overview: The network is fully convolutional, consisting of a series of convolutions, pooling, nonlinearities and a convolutional spatial transformer, followed by channel-wise L2 normalization and correspondence contrastive loss. As inputs, the network takes a pair of images and coordinates of corresponding points in these images (blue: positive, red: negative). Features that correspond to the positive points (from both images) are trained to be closer to each other, while features that correspond to negative points are trained to be a certain margin apart. Before the last L2 normalization and after the FCNN, we placed a convolutional spatial transformer to normalize patches or take larger context into account.

[Table 1. Columns: Features, Dense, Geometric Corr., Semantic Corr., Trainable, Efficient, Metric Space. Rows: SIFT [23], DAISY [30], Conv4 [22], DeepMatching [26], Patch-CNN [36], LIFT [35], Ours. The per-cell marks are not recoverable.]

Table 1: Comparison of prior state-of-the-art methods with UCN (ours). The UCN generates dense and accurate correspondences for either geometric or semantic correspondence tasks. The UCN directly learns the feature space to achieve high accuracy and has distinct efficiency advantages, as discussed in Section 3.

Recently, many CNN-based similarity measures have been proposed. A Siamese network is used in [36] to measure patch similarity. A driving dataset is used to train a CNN for patch similarity in [1], while [37] also uses a Siamese network for measuring patch similarity for stereo matching. A CNN pretrained on ImageNet is analyzed for visual and semantic correspondence in [22]. Correspondences are learned in [17] across both appearance and a global shape deformation by exploiting relationships in fine-grained datasets. In contrast, we learn a metric space in which metric operations have direct interpretations, rather than optimizing the network for patch similarity and using the intermediate features. For this, we implement a fully convolutional architecture with a correspondence contrastive loss that allows faster training and testing and propose a convolutional spatial transformer for local patch normalization.
Metric learning using neural networks. Neural networks are used in [5] for learning a mapping where the Euclidean distance in the space preserves semantic distance. The loss function for learning a similarity metric using Siamese networks is subsequently formalized by [7, 13]. Recently, a triplet loss is used by [31] for fine-grained image ranking, while the triplet loss is also used for face recognition and clustering in [27]. Mini-batches are used for efficiently training the network in [28].

CNN invariances and spatial transformations. A CNN is invariant to some types of transformations such as translation and scale due to convolution and pooling layers. However, explicitly handling such invariances in the form of data augmentation or explicit network structure yields higher accuracy in many tasks [18, 16, 14]. Recently, a spatial transformer network is proposed in [14] to learn how to zoom in, rotate, or apply arbitrary transformations to an object of interest.

Fully convolutional neural network. Fully connected layers are converted into 1 × 1 convolutional filters in [21] to propose a fully convolutional framework for segmentation. Changing a regular CNN to a fully convolutional network for detection leads to speed and accuracy gains in [12]. Similar to these works, we gain the efficiency of a fully convolutional architecture through reusing activations for overlapping regions. Further, since the number of training instances is much larger than the number of images in a batch, variance in the gradient is reduced, leading to faster training and convergence.

Figure 3: Correspondence contrastive loss takes three inputs: two dense features extracted from images and a correspondence table for positive and negative pairs.

3 Universal Correspondence Network

Methods                  # examples per image pair   # feed forwards per test
Siamese Network          1                           O(N^2)
Triplet Loss             2                           O(N)
Contrastive Loss         1                           O(N)
Corres. Contrast. Loss   > 10^3                      O(N)

Table 2: Comparisons between metric learning methods for visual correspondence. Feature learning allows faster test times. Correspondence contrastive loss allows us to use many more correspondences in one pair of images than other methods.

We now present the details of our framework. Recall that the Universal Correspondence Network is trained to directly learn a mapping that preserves similarity instead of relying on surrogate features. We discuss the fully convolutional nature of the architecture, a novel correspondence contrastive loss for faster training and testing, active hard negative mining, as well as the convolutional spatial transformer that enables patch normalization.

Fully Convolutional Feature Learning. To speed up training and use resources efficiently, we implement fully convolutional feature learning, which has several benefits. First, the network can reuse some of the activations computed for overlapping regions. Second, we can train several thousand correspondences for each image pair, which provides the network an accurate gradient for faster learning. Third, hard negative mining is efficient and straightforward, as discussed subsequently. Fourth, unlike patch-based methods, it can be used to extract dense features efficiently from images of arbitrary sizes.

During testing, the fully convolutional network is faster as well. Patch similarity based networks such as [1, 36, 37] require O(n^2) feed forward passes, where n is the number of keypoints in each image, as compared to only O(n) for our network. We note that extracting intermediate layer activations as a surrogate mapping is a comparatively suboptimal choice since those activations are not directly trained on the visual correspondence task.

Correspondence Contrastive Loss. Learning a metric space for visual correspondence requires encoding corresponding points (in different views) to be mapped to neighboring points in the feature space. To encode the constraints, we propose a generalization of the contrastive loss [7, 13], called correspondence contrastive loss. Let $F_I(x)$ denote the feature in image $I$ at location $x = (x, y)$. The loss function takes features from images $I$ and $I'$, at coordinates $x$ and $x'$, respectively (see Figure 3). If the coordinates $x$ and $x'$ correspond to the same 3D point, we use the pair as a positive pair whose features are encouraged to be close in the feature space; otherwise we use it as a negative pair whose features are encouraged to be at least margin $m$ apart. We denote $s = 1$ for a positive pair and $s = 0$ for a negative pair. The full correspondence contrastive loss is given by

$$L = \frac{1}{2N} \sum_{i}^{N} s_i \left\| F_I(x_i) - F_{I'}(x'_i) \right\|^2 + (1 - s_i) \max\left(0,\; m - \left\| F_I(x_i) - F_{I'}(x'_i) \right\|\right)^2 \qquad (1)$$

For each image pair, we sample correspondences from the training set. For instance, for the KITTI dataset, if we use each laser scan point, we can train up to 100k points in a single image pair. However, in practice, we used 3k correspondences to limit memory consumption. This allows more accurate gradient computations than the traditional contrastive loss, which yields one example per image pair. We again note that the number of feed forward passes at test time is O(n) compared to O(n^2) for Siamese network variants [1, 37, 36]. Table 2 summarizes the advantages of a fully convolutional architecture with correspondence contrastive loss.

Hard Negative Mining. The correspondence contrastive loss in Eq. (1) consists of two terms. The first term minimizes the distance between positive pairs and the second term pushes negative pairs to be at least margin $m$ away from each other. Thus, the second term is only active when the distance between the features $F_I(x_i)$ and $F_{I'}(x'_i)$ is smaller than the margin $m$. Such a boundary defines the metric space, so it is crucial to find the negatives that violate the constraint and train the network to push the negatives away.
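As an illustration, Eq. (1) can be sketched in plain Python. This is a minimal sketch and not the Caffe layer used in our implementation; the function name, the list-of-vectors layout and the default margin value are assumptions made for the example.

```python
import math


def correspondence_contrastive_loss(feats_a, feats_b, is_positive, margin=1.0):
    """Evaluate Eq. (1) over N sampled pairs.

    feats_a[i], feats_b[i]: feature vectors F_I(x_i) and F_I'(x'_i).
    is_positive[i]: s_i, 1 for a true correspondence and 0 for a negative pair.
    margin: the margin m (the default here is an arbitrary illustrative value).
    """
    n = len(is_positive)
    total = 0.0
    for fa, fb, s in zip(feats_a, feats_b, is_positive):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(fa, fb)))
        if s:
            # Positive pair: pull the two features together.
            total += dist ** 2
        else:
            # Negative pair: only penalized while closer than the margin.
            total += max(0.0, margin - dist) ** 2
    return total / (2 * n)
```

A negative pair that is already farther apart than the margin contributes nothing to the sum, which is why randomly drawn negatives give little training signal.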
However, random negative pairs do not contribute to training since they are generally far from each other in the embedding space. Instead, we actively mine negative pairs that violate the constraints the most to dramatically speed up training. We extract features from the first image and find the nearest neighbor in the second image. If the location is far from the ground truth correspondence location, we use the pair as a negative. We compute the nearest neighbor for all ground truth points on the first image. Such a mining process is time consuming since it requires O(mn) comparisons for m and n feature points in the two images, respectively. Our experiments use a few thousand points for n, with m being all the features on the second image, which is as large as […]. We use a GPU implementation to speed up the K-NN search [11] and embed it as a Caffe layer to actively mine hard negatives on-the-fly.

Figure 4: (a) SIFT, (b) spatial transformer, (c) convolutional spatial transformer. (a) SIFT normalizes for rotation and scaling. (b) The spatial transformer takes the whole image as an input to estimate a transformation. (c) Our convolutional spatial transformer applies an independent transformation to features.

Convolutional Spatial Transformer. CNNs are known to handle some degree of scale and rotation invariances. However, handling spatial transformations explicitly using data augmentation or a special network structure has been shown to be more successful in many tasks [14, 16, 17, 18]. For visual correspondence, finding the right scale and rotation is crucial, which is traditionally achieved through patch normalization [24, 23]. A series of simple convolutions and poolings cannot mimic such complex spatial transformations.

To mimic patch normalization, we borrow the idea of the spatial transformer layer [14]. However, instead of a global image transformation, each keypoint in the image can undergo an independent transformation. Thus, we propose a convolutional version to generate the transformed activations, called the convolutional spatial transformer. As demonstrated in our experiments, this is especially important for correspondences across large intra-class shape variations.
The proposed transformer takes its input from a lower layer and, for each output feature, applies an independent spatial transformation. The transformation parameters are also extracted convolutionally. Since they go through an independent transformation, the transformed activations are placed inside a larger activation without overlap and then go through a successive convolution with the stride to combine the transformed activations independently. The stride has to be equal to the spatial transformer kernel size. Figure 4 illustrates the convolutional spatial transformer module.

4 Experiments

We use the Caffe [15] package for implementation. Since it does not support the new layers we propose, we implement the correspondence contrastive loss layer and the convolutional spatial transformer layer, the K-NN layer based on [11] and the channel-wise L2 normalization layer. We use neither a flattening layer nor a fully connected layer, so that the network is fully convolutional, generating features at every fourth pixel. For accurate localization, we then extract features densely using bilinear interpolation to mitigate quantization error for sparse correspondences. Please refer to the supplementary materials for the network implementation details and visualization.

For each experiment setup, we train and test three variations of networks. First, the network has hard negative mining and spatial transformer (Ours-HN-ST). Second, the same network without spatial transformer (Ours-HN). Third, the same network without spatial transformer and hard negative mining, providing instead random negative samples that are at least a certain number of pixels apart from the ground truth correspondence location (Ours-RN). With this configuration of networks, we verify the effectiveness of each component of the Universal Correspondence Network.
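The active hard negative mining that separates the Ours-HN variants from Ours-RN can be sketched as follows. This is only an illustrative sketch: it swaps the GPU K-NN layer of [11] for a brute-force O(mn) search in plain Python, and the function name, argument layout and default pixel threshold are assumptions made for the example.

```python
import math


def mine_hard_negatives(query_feats, query_gt_xy, target_feats, target_xy,
                        min_pixel_dist=16.0):
    """Return (query_index, target_index) pairs to be used as hard negatives.

    query_feats[i]:  feature of the i-th ground-truth point in the first image.
    query_gt_xy[i]:  its ground-truth (x, y) location in the second image.
    target_feats[j], target_xy[j]: feature and (x, y) of the j-th candidate
    location in the second image.
    """
    negatives = []
    for qi, (qf, gt) in enumerate(zip(query_feats, query_gt_xy)):
        # Brute-force nearest neighbor in feature space (stand-in for GPU K-NN).
        best = min(range(len(target_feats)),
                   key=lambda ti: math.dist(qf, target_feats[ti]))
        # The match is a hard negative if it lands far from the ground truth.
        if math.dist(target_xy[best], gt) > min_pixel_dist:
            negatives.append((qi, best))
    return negatives
```

The mined pairs enter the loss with s = 0, so the network is pushed to separate exactly those features that it currently confuses.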

method               MPI-Sintel   KITTI
SIFT-NN [23]         68.4         48.9
HOG-NN [8]           71.2         53.7
SIFT-flow [20]       89.0         67.3
DaisyFF [33]         87.3         79.6
DSP [19]             85.3         58.0
DM best (1/2) [26]   89.2         85.6
Ours-HN              91.5         86.…
Ours-HN-ST           90.7         …

Table 3: Matching performance PCK@10px on KITTI Flow 2015 [25] and MPI-Sintel [6]. Note that DaisyFF, DSP, DM use global optimization whereas we only use the raw correspondences from nearest neighbor matches. (… marks digits missing from this copy.)

[Figure 5 plots: accuracy versus pixel threshold for SIFT, DAISY, KAZE, Agrawal et al., Ours-HN and Ours-HN-ST. Panels: (a) PCK performance for dense features NN, (b) PCK performance on keypoints NN.]

Figure 5: Comparison of PCK performance on KITTI raw dataset. (a) PCK performance of the densely extracted feature nearest neighbor. (b) PCK performance for keypoint features nearest neighbor and the dense CNN feature nearest neighbor.

[Figure 6 panels: (a) Original image pair and keypoints, (b) SIFT [23] NN matches, (c) DAISY [30] NN matches, (d) Ours-HN NN matches.]

Figure 6: Visualization of nearest neighbor (NN) matches on KITTI images. (a) From top to bottom, first and second images and FAST keypoints and dense keypoints on the first image. (b) NN of SIFT matches on second image. (c) NN of dense DAISY matches on second image. (d) NN of our dense UCN matches on second image.

Datasets and Metrics. We evaluate our UCN on three different tasks: geometric correspondence, semantic correspondence and accuracy of correspondences for camera localization. For geometric correspondence (matching images of the same 3D point in different views), we use two optical flow datasets, the KITTI 2015 Flow benchmark and the MPI Sintel dataset. For semantic correspondences (finding the same functional part from different instances), we use the PASCAL-Berkeley dataset with keypoint annotations [10, 4] and a subset used by FlowWeb [38]. We also compare against prior state-of-the-art on the Caltech-UCSD Bird dataset [32]. To test the accuracy of correspondences for camera motion estimation, we use the raw KITTI driving sequences which include Velodyne scans, GPS and IMU measurements. Velodyne points are projected in successive frames to establish correspondences and any points on moving objects are removed.

To measure performance, we use the percentage of correct keypoints (PCK) metric [22, 38, 17] (or equivalently accuracy@T [26]). We extract features densely or on a set of sparse keypoints (for semantic correspondence) from a query image and find the nearest neighboring feature in the second image as the predicted correspondence. The correspondence is classified as correct if the predicted keypoint is closer than T pixels to ground-truth (in short, PCK@T). Unlike many prior works, we do not apply any post-processing, such as global optimization with an MRF. This is to capture the performance of raw correspondences from UCN, which already surpasses previous methods.

Geometric Correspondence. We pick 1000 random correspondences in each KITTI or MPI Sintel image during training. We consider a correspondence as a hard negative if the nearest neighbor in the feature space is more than 16 pixels away from the ground truth correspondence. We used the same architecture and training scheme for both datasets. Following convention [26], we measure PCK at a 10 pixel threshold and compare with the state-of-the-art methods in Table 3. SIFT-flow [20], DaisyFF [33], DSP [19], and DM best [26] use additional global optimization to generate more accurate correspondences. On the other hand, just our raw correspondences outperform all the state-of-the-art methods. We note that the spatial transformer does not improve performance in this case, likely due to overfitting to a smaller training set. As we show in the next experiments, its benefits are more apparent with a larger-scale dataset and greater shape variations.

We also used KITTI raw sequences to generate a large number of correspondences, and we split different sequences into train and test sets. The details of the split are in the supplementary material. We plot PCK for different thresholds for various methods with densely extracted features on the larger KITTI raw dataset in Figure 5a. The accuracy of our features outperforms all traditional features including SIFT [23], DAISY [30] and KAZE [2]. Due to dense extraction at the original image scale without rotation, SIFT does not perform well. So, we also extract all features except ours sparsely on SIFT keypoints and plot PCK curves in Figure 5b. All the prior methods improve (SIFT dramatically so), but our UCN features still perform significantly better even with dense extraction. Also note the improved performance of the convolutional spatial transformer. PCK curves for geometric correspondences on individual semantic classes such as road or car are in the supplementary material.

class    conv4 flow   SIFT flow   NN transfer   Ours RN   Ours HN   Ours HN-ST
aero     28.2         …           …             31.5      36.0      …
bike     34.1         …           …             19.6      26.5      …
bird     2…           …           …             30.1      31.9      …
boat     17.1         …           …             23.0      31.3      …
bottle   5…           …           …             53.5      56.4      …
bus      36.7         …           …             36.7      38.2      …
car      20.9         …           …             34.0      36.2      …
cat      19.6         …           …             33.7      34.0      …
chair    15.7         …           …             22.2      25.5      …
cow      25.4         …           …             28.1      31.7      …
table    12.7         …           …             12.8      18.…      …
dog      18.7         …           …             33.9      …         …
horse    25.9         …           …             29.9      …         …
mbike    23.1         …           …             23.4      …         …
person   21.4         …           …             38.4      …         …
plant    4…           …           33.4          39.8      …         …
sheep    21.1         …           14.0          38.6      …         …
sofa     14.5         …           15.5          17.6      …         …
train    18.3         …           14.6          28.4      …         …
tv       33.…         …           3…            6…        …         …
mean     …            …           19.9          36.0      …         …

Table 4: Per-class PCK on PASCAL-Berkeley correspondence dataset [4] (α = 0.1, L = max(w, h)). (… marks values or digits missing from this copy.)

[Figure 7 panels, shown twice: Query, Ground Truth, Ours HN-ST, VGG conv4_3 NN.]

Figure 7: Qualitative semantic correspondence results on PASCAL [10] correspondences with Berkeley keypoint annotation [4] and Caltech-UCSD Bird dataset [32].

Semantic Correspondence. The UCN can also learn semantic correspondences invariant to intra-class appearance or shape variations. We independently train on the PASCAL dataset [10] with various annotations [4, 38] and on the CUB dataset [32], with the same network architecture.

We again use PCK as the metric [34]. To account for variable image size, we consider a predicted keypoint to be correctly matched if it lies within Euclidean distance α · L of the ground truth keypoint, where L is the size of the image and 0 < α < 1 is a variable we control. For comparison, our definition of L varies depending on the baseline. Since intra-class correspondence alignment is a difficult task, preceding works use either geometric [19] or learned [17] spatial priors. However, even our raw correspondences, without spatial priors, achieve stronger results than previous works.

As shown in Tables 4 and 5, our approach outperforms that of Long et al. [22] by a large margin on the PASCAL dataset with Berkeley keypoint annotation, for most classes and also overall. Note that our result is purely from nearest neighbor matching, while [22] uses global optimization too.

We also train and test UCN on the CUB dataset [32], using the same cleaned test subset as WarpNet [17]. As shown in Figure 8, we outperform WarpNet by a large margin. However, please note that WarpNet is an unsupervised method. Please see Figure 7 for qualitative matches. Results on FlowWeb datasets are in the supplementary material, with similar trends.
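Both variants of the PCK metric used in this section can be written in a few lines. The following is a minimal sketch under stated assumptions: the function names are hypothetical, points are (x, y) tuples, a strict "closer than" comparison is used for both variants, and L = max(w, h) is just one of the image-size definitions that appear in the tables.

```python
import math


def pck(predicted, ground_truth, threshold):
    """Fraction of predicted keypoints closer than `threshold` pixels to ground truth."""
    correct = sum(1 for p, g in zip(predicted, ground_truth)
                  if math.dist(p, g) < threshold)
    return correct / len(ground_truth)


def pck_alpha(predicted, ground_truth, alpha, width, height):
    """PCK with an image-size-relative threshold alpha * L, here with L = max(w, h)."""
    return pck(predicted, ground_truth, alpha * max(width, height))
```

For example, `pck(pred, gt, 10)` corresponds to PCK@10px as reported for the geometric experiments, and `pck_alpha(pred, gt, 0.1, w, h)` to the α = 0.1 setting of the semantic experiments.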

[Table 5. Columns: mean PCK at α = 0.1, α = 0.05, α = 0.025. Rows: conv4 flow [22], SIFT flow, fc7 NN, ours-RN, ours-HN, ours-HN-ST. The values are not recoverable.]

Table 5: Mean PCK on PASCAL-Berkeley correspondence dataset [4] (L = max(w, h)). Even without any global optimization, our nearest neighbor search outperforms all methods by a large margin.

[Figure 8 plot: "CUB PCK over alpha", accuracy versus alpha for VGG+DSP, VGG-M conv4, DSP, SIFT, WarpNet, Ours-RN, Ours-HN and Ours-HN-ST.]

Figure 8: PCK on CUB dataset [32], compared with various other approaches including WarpNet [17] ($L = \sqrt{w^2 + h^2}$).

[Table 6. Columns: SIFT [23], DAISY [30], SURF [3], KAZE [2], Agrawal et al. [1], Ours-HN, Ours-HN-ST. Rows: Ang. Dev. (deg), Trans. Dev. (deg). The values are not recoverable.]

Table 6: Essential matrix decomposition performance using various features. The performance is measured as angular deviation from the ground truth rotation and the angle between predicted translation and the ground truth translation. All features generate very accurate estimation.

Finally, we observe that there is a significant performance improvement obtained through use of the convolutional spatial transformer, in both PASCAL and CUB datasets. This shows the utility of estimating an optimal patch normalization in the presence of large shape deformations.

Camera Motion Estimation. We use KITTI raw sequences to get more training examples for this task. To augment the data, we randomly crop and mirror the images and, to make effective use of our fully convolutional structure, we use large images to train thousands of correspondences at once.

We establish correspondences with nearest neighbor matching, use RANSAC to estimate the essential matrix and decompose it to obtain the camera motion. Among the four candidate rotations, we choose the one with the most inliers as the estimate $R_{pred}$, whose angular deviation with respect to the ground truth $R_{gt}$ is reported as $\theta = \arccos\left( (\mathrm{Tr}(R_{pred}^{\top} R_{gt}) - 1)/2 \right)$. Since translation may only be estimated up to scale, we report the angular deviation between unit vectors along the estimated and ground truth translation from GPS-IMU. In Table 6, we list decomposition errors for various features.
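The two error measures above can be sketched as follows. This is an illustrative sketch rather than our evaluation code: the function names are hypothetical, rotations are 3 × 3 nested lists, and the cosine is clamped to [-1, 1] as a guard against floating-point round-off.

```python
import math


def _clamp(value):
    """Keep a cosine inside [-1, 1] despite round-off."""
    return max(-1.0, min(1.0, value))


def rotation_deviation_deg(r_pred, r_gt):
    """theta = arccos((Tr(R_pred^T R_gt) - 1) / 2), in degrees."""
    # Tr(A^T B) equals the sum of element-wise products of A and B.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    return math.degrees(math.acos(_clamp((trace - 1.0) / 2.0)))


def translation_deviation_deg(t_pred, t_gt):
    """Angle between translation directions; scale is ignored by normalizing."""
    dot = sum(a * b for a, b in zip(t_pred, t_gt))
    norms = math.hypot(*t_pred) * math.hypot(*t_gt)
    return math.degrees(math.acos(_clamp(dot / norms)))
```

Because only directions are compared, a predicted translation of (3, 0, 0) and a ground truth of (1, 0, 0) give zero deviation.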
Note that sparse features such as SIFT are designed to perform well in this setting, but our dense UCN features are still quite competitive. Note that intermediate features such as [1] learn to optimize patch similarity; thus, our UCN significantly outperforms them since it is trained directly on the correspondence task.

5 Conclusion

We have proposed a novel deep metric learning approach to visual correspondence estimation that is shown to be advantageous over approaches that optimize a surrogate patch similarity objective. We propose several innovations, such as a correspondence contrastive loss in a fully convolutional architecture, on-the-fly active hard negative mining and a convolutional spatial transformer. These lend capabilities such as more efficient training, accurate gradient computations, faster testing and local patch normalization, which lead to improved speed or accuracy. We demonstrate in experiments that our features perform better than prior state-of-the-art on both geometric and semantic correspondence tasks, even without using any spatial priors or global optimization. In future work, we will explore applications of our correspondences for rigid and non-rigid motion or shape estimation as well as applying global optimization.

Acknowledgments

This work was part of C. Choy's internship at NEC Labs. We acknowledge the support of Korea Foundation of Advanced Studies, Toyota Award #122282, ONR N , and MURI WF911NF

References

[1] P. Agrawal, J. Carreira, and J. Malik. Learning to See by Moving. In ICCV.
[2] P. F. Alcantarilla, A. Bartoli, and A. J. Davison. Kaze features. In ECCV.
[3] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). CVIU.
[4] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d pose annotations. In ICCV.
[5] J. Bromley, I. Guyon, Y. Lecun, E. Säckinger, and R. Shah. Signature verification using a Siamese time delay neural network. In NIPS.
[6] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In ECCV.
[7] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, volume 1, June.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR.
[9] S. Dasgupta. Netscope: network architecture visualizer or something.
[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results.
[11] V. Garcia, E. Debreuve, F. Nielsen, and M. Barlaud. K-nearest neighbor search: Fast gpu-based implementations and application to high-dimensional feature matching. In ICIP.
[12] R. Girshick. Fast R-CNN. ArXiv e-prints, Apr.
[13] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR.
[14] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial Transformer Networks. NIPS.
[15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint.
[16] H. Kaiming, Z. Xiangyu, R. Shaoqing, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV.
[17] A. Kanazawa, D. W. Jacobs, and M. Chandraker. WarpNet: Weakly Supervised Matching for Single-view Reconstruction. ArXiv e-prints, Apr.
[18] A. Kanazawa, A. Sharma, and D. Jacobs. Locally Scale-invariant Convolutional Neural Network. In Deep Learning and Representation Learning Workshop: NIPS.
[19] J. Kim, C. Liu, F. Sha, and K. Grauman. Deformable spatial pyramid matching for fast dense correspondences. In CVPR. IEEE.
[20] C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense correspondence across scenes and its applications. PAMI, 33(5), May.
[21] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR.
[22] J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In NIPS.
[23] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV.
[24] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In BMVC.
[25] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In CVPR.
[26] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. DeepMatching: Hierarchical Deformable Dense Matching. Oct.
[27] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR.
[28] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In Computer Vision and Pattern Recognition (CVPR).
[29] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR 2015.
[30] E. Tola, V. Lepetit, and P. Fua. DAISY: An Efficient Dense Descriptor Applied to Wide Baseline Stereo. PAMI.
[31] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR.
[32] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR, California Institute of Technology.
[33] H. Yang, W. Y. Lin, and J. Lu. DAISY filter flow: A generalized approach to discrete dense correspondences. In CVPR.
[34] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. PAMI.
[35] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned Invariant Feature Transform. In ECCV.
[36] S. Zagoruyko and N. Komodakis. Learning to Compare Image Patches via Convolutional Neural Networks. CVPR.
[37] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a CNN. In CVPR.
[38] T. Zhou, Y. Jae Lee, S. X. Yu, and A. A. Efros. Flowweb: Joint image set alignment by weaving consistent, pixel-wise correspondences. In CVPR, June.

A.1 Network Architecture

We use the ImageNet-pretrained GoogLeNet [29], from the bottom conv1 to the inception_4a layer, but with stride 2 for the bottom 2 layers and stride 1 for the rest of the network. We follow the convention of [27, 28] and normalize the features, which we found to stabilize the gradients during training. Since we extract features densely and convolutionally, we implement a channel-wise normalization layer that gives every feature a unit L2 norm. After the inception_4a layer, we place the correspondence contrastive loss layer, which takes features from both images as well as the respective correspondence coordinates in each image. The correspondences are densely sampled from either flow or matched keypoints. Since the semantic keypoint correspondences are sparse, we augment them with random negative coordinates. When we use active hard-negative sampling, we place the K-NN layer, which returns the nearest neighbor of query image keypoints in the reference image.

We visualize the universal correspondence network in Fig. A1. The model includes the hard negative mining, the convolutional spatial transformer, and the correspondence contrastive loss. The Caffe prototxt file and the interactive web visualization using [9] are available at edu/projets/un/.

A.2 Convolutional Spatial Transformer

The convolutional spatial transformer consists of a number of affine spatial transformers. The number of affine spatial transformers depends on the size of the image. For each spatial transformer, the origin of the coordinate system is at the center of each kernel. We denote by $x^s, y^s$ the $x, y$ coordinates of the sampled points from the previous input $U$, and by $x^t, y^t$ the $x, y$ coordinates of the points on the output layer $V$. Typically, $x^t, y^t$ are the coordinates of nodes on a grid. $\theta_{ij}$ are the affine transformation parameters. The coordinates of the sampled points and the target points satisfy the following equation.

\[
\begin{pmatrix} x^s \\ y^s \end{pmatrix}
=
\begin{bmatrix} \theta_{11} & \theta_{12} \\ \theta_{21} & \theta_{22} \end{bmatrix}
\begin{pmatrix} x^t \\ y^t \end{pmatrix}
\]

To get the output $V$ at $(x^t, y^t)$, we use bilinear interpolation to sample the values of $U$ around $(x^s, y^s)$.
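Let $U_{00}, U_{01}, U_{10}, U_{11}$ be the $U$ values at the lower left, lower right, upper left, and upper right, respectively. Write $x = x^s$ and $y = y^s$, and let $x_0, x_1$ ($y_0, y_1$) be the integer grid coordinates immediately below and above $x$ ($y$), so that $x_1 - x_0 = y_1 - y_0 = 1$. Then

\[
\begin{aligned}
V &= \sum_{n}^{H} \sum_{m}^{W} U_{nm} \max(0, 1 - |x^s - m|) \max(0, 1 - |y^s - n|) \\
  &= (x_1 - x)(y_1 - y)\,U_{00} + (x_1 - x)(y - y_0)\,U_{10} + (x - x_0)(y_1 - y)\,U_{01} + (x - x_0)(y - y_0)\,U_{11} \\
  &= (x_1 - (\theta_{11} x^t + \theta_{12} y^t))(y_1 - (\theta_{21} x^t + \theta_{22} y^t))\,U_{00} \\
  &\quad + (x_1 - (\theta_{11} x^t + \theta_{12} y^t))(\theta_{21} x^t + \theta_{22} y^t - y_0)\,U_{10} \\
  &\quad + (\theta_{11} x^t + \theta_{12} y^t - x_0)(y_1 - (\theta_{21} x^t + \theta_{22} y^t))\,U_{01} \\
  &\quad + (\theta_{11} x^t + \theta_{12} y^t - x_0)(\theta_{21} x^t + \theta_{22} y^t - y_0)\,U_{11}
\end{aligned}
\]

The gradients with respect to the input features are

\[
\begin{aligned}
\frac{\partial L}{\partial U_{00}} &= \frac{\partial L}{\partial V}\frac{\partial V}{\partial U_{00}} = \frac{\partial L}{\partial V}\,(x_1 - x)(y_1 - y) \\
\frac{\partial L}{\partial U_{10}} &= \frac{\partial L}{\partial V}\frac{\partial V}{\partial U_{10}} = \frac{\partial L}{\partial V}\,(x_1 - x)(y - y_0) \\
\frac{\partial L}{\partial U_{01}} &= \frac{\partial L}{\partial V}\frac{\partial V}{\partial U_{01}} = \frac{\partial L}{\partial V}\,(x - x_0)(y_1 - y) \\
\frac{\partial L}{\partial U_{11}} &= \frac{\partial L}{\partial V}\frac{\partial V}{\partial U_{11}} = \frac{\partial L}{\partial V}\,(x - x_0)(y - y_0)
\end{aligned}
\]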
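The affine warp and bilinear sampling above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' Caffe layer: the function name `affine_bilinear_sample`, the nested-list layout `U[n][m]` (row `n` along y, column `m` along x), absolute array indices instead of kernel-centered coordinates, and the zero contribution of out-of-range neighbors are all assumptions made for the example.

```python
import math


def affine_bilinear_sample(U, theta, xt, yt):
    """Sample U at the affine-warped source point of target (xt, yt).

    U is a list of rows (U[n][m]); theta is ((t11, t12), (t21, t22)).
    Returns (V, weights), where weights[(n, m)] equals dV/dU[n][m].
    """
    (t11, t12), (t21, t22) = theta
    # Source coordinates: (x, y) = theta @ (xt, yt)
    x = t11 * xt + t12 * yt
    y = t21 * xt + t22 * yt

    # Integer neighbors just below and above the sampled point
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = x0 + 1, y0 + 1

    # Bilinear weights; these are also the gradients of V w.r.t. each U value
    weights = {
        (y0, x0): (x1 - x) * (y1 - y),  # U00, lower left
        (y1, x0): (x1 - x) * (y - y0),  # U10, upper left
        (y0, x1): (x - x0) * (y1 - y),  # U01, lower right
        (y1, x1): (x - x0) * (y - y0),  # U11, upper right
    }

    height, width = len(U), len(U[0])
    V = 0.0
    for (n, m), w in weights.items():
        # Assumption: neighbors outside the feature map contribute zero
        if 0 <= n < height and 0 <= m < width:
            V += w * U[n][m]
    return V, weights


if __name__ == "__main__":
    U = [[0.0, 1.0], [2.0, 3.0]]
    identity = ((1.0, 0.0), (0.0, 1.0))
    value, grads = affine_bilinear_sample(U, identity, 0.5, 0.5)
    print(value, grads)
```

Because the returned weights are exactly the partial derivatives of $V$ with respect to the four neighbors, a backward pass only needs to scale them by $\partial L / \partial V$.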

Finally, the gradients with respect to the transformation parameters are

\[
\frac{\partial V}{\partial \theta_{11}} = \sum_{n}^{H} \sum_{m}^{W} U_{nm}\, x^t \max(0, 1 - |\theta_{21} x^t + \theta_{22} y^t - n|)
\begin{cases}
0 & \text{if } |m - x^s| \ge 1 \\
1 & \text{if } m \ge x^s \\
-1 & \text{if } m < x^s
\end{cases}
\]

\[
\begin{aligned}
\frac{\partial V}{\partial \theta_{11}} &= -x^t (y_1 - y)\,U_{00} - x^t (y - y_0)\,U_{10} + x^t (y_1 - y)\,U_{01} + x^t (y - y_0)\,U_{11} \\
\frac{\partial V}{\partial \theta_{12}} &= -y^t (y_1 - y)\,U_{00} - y^t (y - y_0)\,U_{10} + y^t (y_1 - y)\,U_{01} + y^t (y - y_0)\,U_{11} \\
\frac{\partial V}{\partial \theta_{21}} &= -x^t (x_1 - x)\,U_{00} + x^t (x_1 - x)\,U_{10} - x^t (x - x_0)\,U_{01} + x^t (x - x_0)\,U_{11} \\
\frac{\partial V}{\partial \theta_{22}} &= -y^t (x_1 - x)\,U_{00} + y^t (x_1 - x)\,U_{10} - y^t (x - x_0)\,U_{01} + y^t (x - x_0)\,U_{11}
\end{aligned}
\]

A.3 Additional tests for semantic correspondence

PASCAL VOC comparison with FlowWeb. We compared the performance of UCN with FlowWeb [38]. As shown in Tab. A1, our approach outperforms FlowWeb. Please note that FlowWeb is an optimization in an unsupervised setting; thus we split their data per class to train and test our network.

[Table A1 columns: aero, bike, boat, bottle, bus, car, chair, table, mbike, sofa, train, tv, mean. Rows: DSP, FlowWeb [38], Ours-RN, Ours-HN, Ours-HN-ST.]

Table A1: PCK on 12 rigid PASCAL VOC classes, as split in FlowWeb [38] (α = 5, L = max(w, h)).

Qualitative semantic match results. Please refer to Fig. A2 and A3 for additional qualitative semantic match results.

A.4 Additional KITTI Raw Results

We used a subset of KITTI raw video sequences for all our experiments. The dataset has 9268 frames, which amounts to 15 minutes of driving. Each frame consists of a Velodyne scan, stereo RGB images, and GPS-IMU sensor input. In addition, we used proprietary segmentation data from NEC to evaluate the performance on different semantic classes.

Scene type: City, Road, Residential
Training: 1, 2, 5, 9, 11, 13, 14, 27, 28, 29, 48, 51, 56, 57, 59, 84, 15, 32, 19, 20, 22, 23, 35, 36, 39, 46, 61, 64, 79,
Testing: 84, 91 52, 70, 79, 86, 87,

Table A2: KITTI Correspondence Dataset: we used a subset of all KITTI raw sequences to construct a dataset. We excluded sequences 17, 18, and 60 since the scenes in those videos are mostly static. We also excluded sequence 93 since its GPS-IMU inputs are too noisy.
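PCK, the metric reported in Table A1 and in the KITTI plots, can be sketched as the fraction of predicted keypoints that land within a distance threshold of their ground-truth location. The snippet below is an illustrative sketch only: the function names `pck` and `pck_alpha`, the inclusive `<=` comparison, and the convention that the threshold equals α · L with L = max(w, h) are assumptions, not details confirmed by the text.

```python
import math


def pck(pred, gt, threshold):
    """Fraction of predictions within `threshold` pixels of ground truth.

    pred and gt are equal-length lists of (x, y) tuples.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    if not pred:
        return 0.0
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold  # inclusive: an assumption
    )
    return correct / len(pred)


def pck_alpha(pred, gt, alpha, width, height):
    """PCK with a size-relative threshold alpha * max(width, height).

    Assumes the common convention threshold = alpha * L, L = max(w, h).
    """
    return pck(pred, gt, alpha * max(width, height))


if __name__ == "__main__":
    predictions = [(0.0, 0.0), (20.0, 0.0), (3.0, 4.0)]
    ground_truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
    print(pck(predictions, ground_truth, 10.0))
    print(pck_alpha(predictions, ground_truth, 0.1, 100, 40))
```

A fixed pixel threshold (as in "PCK at 30 pixels") corresponds to calling `pck` directly, while the size-relative form suits datasets whose images vary in resolution.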

In Figure A5, we plot the variation in PCK at 30 pixels for various camera baselines in our test set. We label semantic classes on the KITTI raw sequences and evaluate the PCK performance on different semantic classes in Figure A4. The curves have the same color codes as Figure 5 in the main paper.

A.5 KITTI Dense Correspondences

In this section, we present more qualitative results of nearest neighbor matches using our universal correspondence network on KITTI images in Fig. A6.

A.6 Sintel Dense Correspondences

In this section, we present more qualitative results of nearest neighbor matches using our universal correspondence network on Sintel images in Fig. A7.

[Figure A1: Netscope rendering of the GoogLeNet-based Siamese network graph]

Figure A1: Visualization of the universal correspondence network with the hard negative mining layer and the convolutional spatial transformer. The Siamese network shares the same weights for all layers. To implement the Siamese network in Caffe, we appended _p to all layer names on the second network. Each image goes through the universal correspondence network, and the output features named feature1 and feature2 are fed into the K-NN layer to find the hard negatives on-the-fly. After the hard negative mining, the pairs are used to compute the correspondence contrastive loss.
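The feature normalization and K-NN hard negative mining steps described for Figure A1 can be illustrated with a small pure-Python sketch. It is not the authors' Caffe layer: the function names, the brute-force search, and the rule that a nearest neighbor counts as a hard negative when its location is more than `pixel_thresh` pixels from the ground-truth correspondence are assumptions introduced for this example.

```python
import math


def l2_normalize(feature, eps=1e-12):
    """Scale a feature vector to unit L2 norm (eps guards against zero norm)."""
    norm = math.sqrt(sum(v * v for v in feature))
    return [v / max(norm, eps) for v in feature]


def nearest_neighbor(query, candidates):
    """Return (index, distance) of the candidate closest to query."""
    best_index, best_dist = -1, math.inf
    for i, cand in enumerate(candidates):
        d = math.dist(query, cand)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist


def mine_hard_negatives(query_feats, ref_feats, ref_coords, gt_coords, pixel_thresh):
    """Pair each query with its nearest reference feature if that match is wrong.

    A match is treated as wrong (a hard negative) when the matched reference
    location is farther than pixel_thresh from the ground-truth location.
    This criterion is an assumption made for the sketch.
    """
    hard = []
    for i, q in enumerate(query_feats):
        j, _ = nearest_neighbor(q, ref_feats)
        if j >= 0 and math.dist(ref_coords[j], gt_coords[i]) > pixel_thresh:
            hard.append((i, j))
    return hard


if __name__ == "__main__":
    queries = [l2_normalize([2.0, 0.0]), l2_normalize([0.0, 3.0])]
    refs = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 1.0])]
    ref_xy = [(0.0, 0.0), (50.0, 50.0)]
    gt_xy = [(0.0, 0.0), (0.0, 0.0)]
    print(mine_hard_negatives(queries, refs, ref_xy, gt_xy, 10.0))
```

A brute-force search keeps the example short; a dense feature map would need a batched or GPU nearest-neighbor search such as the one cited in [11].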

[Figure A2 panels, left to right: Query, Ground Truth, Ours HN-ST, VGG conv4_3 NN]

Figure A2: Additional qualitative semantic correspondence results on PASCAL [10] correspondences with Berkeley keypoint annotation [4].

[Figure A3 panels, left to right: Query, Ground Truth, Ours HN-ST, VGG conv4_3 NN]

Figure A3: Additional qualitative semantic correspondence results on Caltech-UCSD Bird dataset [32].

[Figure A4 panels: PCK Comparison for background, road, sidewalk, vegetation, car, pedestrian, sign, and building. Axes: Accuracy versus Pixel Thresholds.]

Figure A4: PCK evaluations for semantic classes on KITTI raw dataset.

[Figure A5 axes: Accuracy versus Camera Baseline (meter).]

Figure A5: PCK performance for various camera baselines on KITTI raw dataset.

[Figure A6 panels: Query keypoints of frame t (left), Predicted keypoint matches of frame t+1 (right).]

Figure A6: Visualization of dense feature nearest neighbor matches on the Sintel dataset [6]. For each row, we visualize the query points (left) and the nearest neighbor matches (right) on images with 1 frame difference.

[Figure A7 panels: Query keypoints of frame t (left), Predicted keypoint matches of frame t+1 (right).]

Figure A7: Visualization of dense feature nearest neighbor matches on the Sintel dataset [6]. For each row, we visualize the query points (left) and the nearest neighbor matches (right) on images with 1 frame difference.


Universal Correspondence Network Christopher B. Choy Stanford University chrischoy@ai.stanford.edu JunYoung Gwak Stanford University jgwak@ai.stanford.edu Silvio Savarese Stanford University ssilvio@stanford.edu