Representing 3D models with discriminative visual elements


1 Representing 3D models with discriminative visual elements. Mathieu Aubry, June 23rd, 2014. Josef Sivic (INRIA-ENS), Bryan Russell (Intel). Painting-to-3D Model Alignment Via Discriminative Visual Elements, M. Aubry, B. Russell and J. Sivic, TOG 2014 (to be presented at SIGGRAPH 2014).

2

3

4

5 Motivation: browsing visual content Intelligent visual memory: organize and search your visual record

6 Motivation: history / archeology New ways to access and analyse data for archeology, history or architecture [Russell, Sivic, Ponce, Dessales, 2011] Example: evolution of a particular place over time

7 Motivation: reasoning on non-realistic images. There are many non-photographic depictions of our world. Ultimate goal: to reason about these depictions.

8 From the beginning of computer vision. The first PhD in computer vision (MIT, 1963): Lawrence G. Roberts, "Machine perception of three-dimensional solids". [Slide images: photograph and 3D model.]

9 1980s: 2D-3D alignment. [Huttenlocher and Ullman IJCV 1990], [Lowe AI 1987], [Faugeras & Hebert 86], [Grimson & Lozano-Perez 86]. [Figure: object recognition example from the work of Lowe (1987), aligning 3D wire-frame models to an image.]

10 Difficulty. Limits of local feature matching using SIFT. Figure from [A. Shrivastava, T. Malisiewicz, A. Gupta, A. Efros, Data-driven Visual Similarity for Cross-domain Image Matching, SIGGRAPH Asia 2011]. Very large space of possible viewpoints. See also: [Chum & Matas CVPR 2006], [Shechtman & Irani CVPR 2007], [Russell, Sivic, Ponce, Dessales 2011], [Hauagge & Snavely CVPR 2012]

11 Related work: retrieval using a global descriptor [Russell, Sivic, Ponce, Dessales, 2011] used GIST [Oliva and Torralba 2001] [Shrivastava, Malisiewicz, Gupta, Efros, 2011]

12 Related work: retrieval using contours. [Baatz, Saurer, Köser, Pollefeys 12], [Russell, Sivic, Ponce, Dessales 2011]. See also: [Lowe 87], [Huttenlocher & Ullman 90], [Shotton, Blake, Cipolla 05], [Ferrari, Fevrier, Jurie, Schmid 08], [Opelt, Pinz, Zisserman 06], [Arandjelovic & Zisserman 11], [Baboud, Cadik, Eisemann, Seidel 2011]

13 Inspiration: category recognition based on HOG. Introduced in [Dalal and Triggs 2005]. Extension to object parts [Felzenszwalb et al. 2010]. Fast approximation [Hariharan, Malik, Ramanan 2012]

14 Inspiration: mid-level visual elements Learn a vocabulary of discriminative visual elements that characterize a city. [Doersch, Singh, Gupta, Sivic, Efros, What makes Paris look like Paris?, SIGGRAPH 2012] See also [Bourdev & Malik ICCV 2009], [Singh et al. ECCV 2012], [Juneja et al. CVPR 2013], [Jain et al. CVPR 2013],

15 High-level idea. Limitations of edge detection: soft edge representations (HOG, SIFT...), machine learning (DPM, CNN...). Limitations of keypoint detection: dense evaluation, RANSAC, grouping (bag of words...). These tools come from category-level recognition, 3D scene reconstruction, and place/instance recognition. Bring together these methods again!

16 This work: 3D discriminative visual elements. Summarize a 3D model with a set of discriminative elements: view-dependent, distinct 3D fragments.

17 Problem statement. Inputs: 3D model, painting. Output: camera parameters (camera center, rotation, principal point, focal length).

18 I. Synthesize representative views Synthesize ~10,000 viewpoints for an architectural site See also: [Irschara et al. CVPR 2009], [Baatz et al. ECCV 2012]

19 I. Synthesize representative views. Synthesize ~10,000 viewpoints for an architectural site. We will have views similar to the paintings, which makes the problem easier! A sketch of the sampling follows. See also: [Irschara et al. CVPR 2009], [Baatz et al. ECCV 2012]
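A minimal sketch of this sampling procedure, with hypothetical `render` and `visible_fraction` callbacks standing in for the actual renderer; the grid step, pitch angles, and visibility threshold are illustrative values, not the paper's:

```python
import numpy as np

def synthesize_views(ground_bbox, render, visible_fraction,
                     grid_step=1.0, n_yaw=24, pitches=(-10.0, 0.0, 10.0),
                     min_visible=0.1):
    """Sample camera poses on a ground-plane grid, 24 yaw angles per
    position and 2-3 pitch angles per yaw; keep only views where a
    significant portion of the 3D model is visible."""
    (x0, x1), (y0, y1) = ground_bbox
    views = []
    for x in np.arange(x0, x1, grid_step):
        for y in np.arange(y0, y1, grid_step):
            for yaw in np.linspace(0.0, 360.0, n_yaw, endpoint=False):
                for pitch in pitches:
                    pose = (float(x), float(y), yaw, pitch)
                    if visible_fraction(pose) >= min_visible:  # hypothetical callback
                        views.append(render(pose))             # hypothetical callback
    return views
```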

20 II. Select informative patches Idea: match the views using informative patches.

21 II. Select informative patches. Idea: match the views using informative patches. Problem: how to select the informative patches?

22 II. Select informative patches. Discriminability in patch space (candidate patches m, n).

23 II. Select informative patches. Discriminability in patch space: a feature map F sends each patch m to a descriptor F(m).

24 II. Select informative patches. Discriminability in patch space: F(m) for each patch m.

25 II. Select informative patches. Discriminability in patch space. [Slide overlays Fig. 3 of the paper: matching as classification; given a region and its HOG descriptor q in a rendered view, the aim is to find the corresponding region in a painting.]

26 II. Select informative patches. Discriminability in patch space. [Slide overlays paper text: Fig. 2, camera positions sampled on a ground-plane grid with 24 orientations each, cameras not viewing the model discarded, about 45,000 valid views for this model; and Section 4.2, image matching formulated as discriminative classification, with a closed-form classifier for a specific loss enabling efficient training for thousands of densely sampled candidate elements.]

27 II. Select informative patches. Discriminability in patch space. [Same overlaid paper text as the previous slide: viewpoint sampling (Fig. 2) and the discriminative-classification formulation of matching (Section 4.2).]

28 II. Select informative patches. Discriminability in patch space: patches q1, q2 map to descriptors F(q1), F(q2). Big ||F(q)|| = discriminative.

29 II. Select informative patches. We evaluate the whitened norm densely on the rendered view and perform non-max suppression. [Fig. 4 of the paper: first row, discriminability shown as a heat-map at three scales, red high, blue low; discriminability is inversely proportional to the training cost of a classifier learnt from a patch at that location; second row, example visual elements at the local maxima.] Notes: can be thought of as a generalization of local feature detection; equivalent to minimizing a least-squares classification loss.
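A minimal numpy sketch of this dense scan, assuming HOG cells are precomputed into an (H, W, D) array and that the mean and inverse covariance of flattened windows were estimated offline from generic negative data; the double loop is for clarity only (a real implementation would be vectorized):

```python
import numpy as np

def discriminability_map(feat, mu, inv_cov, win=(8, 8)):
    """Score every win-sized window of a HOG feature map feat (H, W, D)
    by the norm of its whitened descriptor; mu has length h*w*D and
    inv_cov is the (regularized) inverse covariance of flattened windows,
    both estimated offline from generic negatives."""
    H, W, D = feat.shape
    h, w = win
    scores = np.full((H - h + 1, W - w + 1), -np.inf)
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            q = feat[i:i+h, j:j+w].ravel() - mu
            scores[i, j] = np.sqrt(q @ inv_cov @ q)   # whitened norm ||F(q)||
    return scores

def non_max_suppression(scores, radius=4, topk=50):
    """Greedily pick score-map peaks, suppressing a square neighborhood."""
    s = scores.copy()
    peaks = []
    while len(peaks) < topk:
        i, j = np.unravel_index(np.argmax(s), s.shape)
        if not np.isfinite(s[i, j]):
            break
        peaks.append((i, j, float(s[i, j])))
        s[max(0, i - radius):i + radius + 1,
          max(0, j - radius):j + radius + 1] = -np.inf
    return peaks
```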

30 III. Select stable patches. Filter out elements that are unstable across viewpoints. The 3D model provides ground-truth matches in nearby views. Require elements to be reliably detectable in nearby views (a sketch follows). See also [Doersch et al. SIGGRAPH 2012], [Singh et al. ECCV 2012], [Juneja et al. CVPR 2013]
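A sketch of the stability test under two assumed callbacks: `project_element`, giving the element's ground-truth location in a nearby rendered view (known exactly since the 3D model is given), and `detect`, returning the classifier's best detection in that view; the tolerance and hit-rate threshold are illustrative:

```python
def is_stable(elem, w, nearby_views, project_element, detect,
              tol=8, min_hit_rate=0.8):
    """Keep an element only if its detector fires near the ground-truth
    location in a sufficient fraction of nearby rendered views."""
    hits = 0
    for view in nearby_views:
        gt = project_element(elem, view)   # hypothetical: from the 3D model
        if gt is None:                     # element not visible in this view
            continue
        x, y = detect(w, view)             # hypothetical: best detection
        if abs(x - gt[0]) <= tol and abs(y - gt[1]) <= tol:
            hits += 1
    return hits >= min_hit_rate * len(nearby_views)
```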

31 Summary: representation of the 3D model Back-project learnt discriminative elements onto the 3D model

32 IV. Matching patches. Query region q: 1. Represent the query region q using its HOG descriptor.

33 IV. Matching patches. Query region q: 1. Represent the query region q using its HOG descriptor. [Slide overlays the paper's matching-as-classification text: instead of the Euclidean distance between descriptors, train a linear classifier with q as a single positive example and a large number of negatives x_i; the best match x in the painting maximizes the classification score s(x) = w^T x + b; the learnt w weights components of x differently, unlike the standard Euclidean distance.] See also exemplar SVM [Malisiewicz et al., ICCV 11], [Shrivastava et al., 11]

34 IV. Matching patches. Query region q: 1. Represent the query region q using its HOG descriptor. 2. Train a linear classifier f(x) = w^T x + b using q as a positive example and a large number of negatives. [Same overlaid paper text as the previous slide; here the classifier is used for weighted matching, visualizing positive weights w+ and negative weights w-.] See also exemplar SVM [Malisiewicz et al., ICCV 11], [Shrivastava et al., 11]

35 Classifier training. Train a classifier for each candidate region q: {q, +1}, {x_i, -1} for i = 1..N (a set of generic negatives). Cost: E(w, b) = L(1, w^T q + b) + (1/N) sum_{i=1..N} L(-1, w^T x_i + b). Example: hinge loss (exemplar SVM). Example: square loss.

36 Classifier training. For the square loss, the cost E can be minimized in closed form [Bach & Harchaoui 2008; Gharbi et al. 2012; Hariharan et al. 2012].
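Concretely, with the negatives summarized by their mean mu and covariance Sigma, the square-loss solution is, up to scale, w = Sigma^{-1}(q - mu): the whitened-HOG / LDA-style classifier of the cited works. A minimal sketch; the bias uses the symmetric LDA choice, which is an assumption rather than the paper's exact expression:

```python
import numpy as np

def closed_form_classifier(q, mu_neg, cov_neg, lam=1e-2):
    """Closed-form square-loss classifier: w proportional to
    Sigma^{-1}(q - mu). lam regularizes the covariance; the bias b is the
    symmetric LDA choice (an assumption, not the paper's exact form)."""
    D = cov_neg.shape[0]
    w = np.linalg.solve(cov_neg + lam * np.eye(D), q - mu_neg)
    b = -0.5 * float(w @ (q + mu_neg))
    return w, b
```

Because mu and Sigma depend only on the generic negative set, they are computed once, after which training a classifier per candidate element reduces to a single linear solve.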

37 IV. Matching patches. Query region q: 1. Represent the query region q using its HOG descriptor. 2. Train a linear classifier f(x) = w^T x + b using q as a positive example and a large number of negatives. [Same overlaid paper text as slides 33-34.]

38 IV. Matching patches. Query region q: 1. Represent the query region q using its HOG descriptor. 2. Train a linear classifier f(x) = w^T x + b with q as a single positive and a large number of negatives. 3. Find the best match: the patch x in the painting maximizing the classification score f(x). [Slide overlays duplicated paper text on matching as classification and on selecting discriminative visual elements via least-squares regression (Section 4.2.2).]

39 IV. Matching patches. Query region q: 1. Represent the query region q using its HOG descriptor. 2. Train a linear classifier f(x) = w^T x + b. 3. Best match: the patch x in the painting maximizing the classification score f(x). [Same overlaid paper text as the previous slide.]
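Step 3 amounts to a sliding-window search over the painting's HOG features. A minimal single-scale sketch; in practice the search runs over a HOG pyramid and the scoring is implemented as a convolution:

```python
import numpy as np

def best_match(feat, w, b, win):
    """Exhaustively score every win-sized window of a HOG feature map
    feat (H, W, D) with s(x) = w^T x + b and return the argmax location."""
    H, W, D = feat.shape
    h, ww = win
    best_score, best_loc = -np.inf, None
    for i in range(H - h + 1):
        for j in range(W - ww + 1):
            s = float(w @ feat[i:i+h, j:j+ww].ravel()) + b
            if s > best_score:
                best_score, best_loc = s, (i, j)
    return best_loc, best_score
```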

40 V. Recover viewpoint. 1. Run all the element detectors and find the best matches. We need to compare the detector scores; this can be seen as a calibration problem. We use an affine calibration (a sketch follows). Results on the [Hauagge and Snavely 2012] dataset.
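One simple affine calibration consistent with the slide, though the paper's exact fitting procedure may differ: standardize each detector's scores on a shared negative set, which is itself an affine map s -> a*s + b:

```python
import numpy as np

def affine_calibration(neg_scores):
    """Fit a per-detector affine map s -> a*s + b from its scores on a
    shared negative set, so that scores of different detectors live on a
    common scale (here: zero mean, unit variance on the negatives)."""
    mu = float(np.mean(neg_scores))
    sigma = float(np.std(neg_scores)) + 1e-8
    return 1.0 / sigma, -mu / sigma   # calibrated score = a * s + b
```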

41 V. Recover viewpoint. 1. Run all the element detectors. 2. Run RANSAC on the top 25 matches.
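A sketch of this step using OpenCV's RANSAC PnP solver: each matched element supplies a 2D-3D correspondence (its 3D location on the model from back-projection, its 2D detection in the painting). For simplicity this assumes known intrinsics K, whereas the paper also recovers the focal length:

```python
import numpy as np
import cv2

def recover_viewpoint(pts3d, pts2d, K):
    """Estimate the camera pose by RANSAC + PnP over the 2D-3D
    correspondences given by the top matched elements. Assumes known
    intrinsics K; the paper additionally recovers the focal length."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(pts3d, dtype=np.float32),
        np.asarray(pts2d, dtype=np.float32),
        K, None, reprojectionError=8.0)
    return (rvec, tvec, inliers) if ok else None
```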

42 Historical photograph Results

43 Historical photograph Results

44 Drawings Results

45 Drawings Results

46 Engravings Results

47 Engravings Results

48 Paintings Results

49 Watercolors Results

50 Watercolors Results

51 Results

52 Quantitative Results On a database of 337 depictions with 4 3D models:

53 CVPR 2014 work. Oral session 9B, Friday 13:45. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. M. Aubry, B. Russell, A. Efros and J. Sivic, CVPR 2014 (oral).

54 CVPR 2014 work. Use of a large database of 3D models (1,300) to help recognition.

55 Approach. Use rendered views from the 3D models (62 views per model). [Figure axes: style, viewpoint.]

56 Main difference: calibration of the detectors. [Plot: density of detector scores before (initial score) and after calibration; annotations: 0.01% of the scores, -1.]

57

58 Conclusion. Mid-level visual elements and 2D-3D alignment can help: understanding non-photographic depictions; representing object categories. Oral session 9B, Friday 13:45.

59 Questions?


Recognition. Topics that we will try to cover: Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) Object classification (we did this one already) Neural Networks Object class detection Hough-voting techniques

More information

Learning to Match Images in Large-Scale Collections

Learning to Match Images in Large-Scale Collections Learning to Match Images in Large-Scale Collections Song Cao and Noah Snavely Cornell University Ithaca, NY, 14853 Abstract. Many computer vision applications require computing structure and feature correspondence

More information

Lecture 12 Recognition

Lecture 12 Recognition Institute of Informatics Institute of Neuroinformatics Lecture 12 Recognition Davide Scaramuzza 1 Lab exercise today replaced by Deep Learning Tutorial Room ETH HG E 1.1 from 13:15 to 15:00 Optional lab

More information

Adaptive Rendering for Large-Scale Skyline Characterization and Matching

Adaptive Rendering for Large-Scale Skyline Characterization and Matching Adaptive Rendering for Large-Scale Skyline Characterization and Matching Jiejie Zhu, Mayank Bansal, Nick Vander Valk, and Hui Cheng Vision Technologies Lab, SRI International, Princeton, NJ 08540, USA

More information

Local Image Features

Local Image Features Local Image Features Ali Borji UWM Many slides from James Hayes, Derek Hoiem and Grauman&Leibe 2008 AAAI Tutorial Overview of Keypoint Matching 1. Find a set of distinctive key- points A 1 A 2 A 3 B 3

More information

HOG-based Pedestriant Detector Training

HOG-based Pedestriant Detector Training HOG-based Pedestriant Detector Training evs embedded Vision Systems Srl c/o Computer Science Park, Strada Le Grazie, 15 Verona- Italy http: // www. embeddedvisionsystems. it Abstract This paper describes

More information

CS6670: Computer Vision

CS6670: Computer Vision CS6670: Computer Vision Noah Snavely Lecture 16: Bag-of-words models Object Bag of words Announcements Project 3: Eigenfaces due Wednesday, November 11 at 11:59pm solo project Final project presentations:

More information

CS229: Action Recognition in Tennis

CS229: Action Recognition in Tennis CS229: Action Recognition in Tennis Aman Sikka Stanford University Stanford, CA 94305 Rajbir Kataria Stanford University Stanford, CA 94305 asikka@stanford.edu rkataria@stanford.edu 1. Motivation As active

More information

OBJECT CATEGORIZATION

OBJECT CATEGORIZATION OBJECT CATEGORIZATION Ing. Lorenzo Seidenari e-mail: seidenari@dsi.unifi.it Slides: Ing. Lamberto Ballan November 18th, 2009 What is an Object? Merriam-Webster Definition: Something material that may be

More information

Supervised learning. y = f(x) function

Supervised learning. y = f(x) function Supervised learning y = f(x) output prediction function Image feature Training: given a training set of labeled examples {(x 1,y 1 ),, (x N,y N )}, estimate the prediction function f by minimizing the

More information

Local Features and Bag of Words Models

Local Features and Bag of Words Models 10/14/11 Local Features and Bag of Words Models Computer Vision CS 143, Brown James Hays Slides from Svetlana Lazebnik, Derek Hoiem, Antonio Torralba, David Lowe, Fei Fei Li and others Computer Engineering

More information

Recap Image Classification with Bags of Local Features

Recap Image Classification with Bags of Local Features Recap Image Classification with Bags of Local Features Bag of Feature models were the state of the art for image classification for a decade BoF may still be the state of the art for instance retrieval

More information

ROBUST SCENE CLASSIFICATION BY GIST WITH ANGULAR RADIAL PARTITIONING. Wei Liu, Serkan Kiranyaz and Moncef Gabbouj

ROBUST SCENE CLASSIFICATION BY GIST WITH ANGULAR RADIAL PARTITIONING. Wei Liu, Serkan Kiranyaz and Moncef Gabbouj Proceedings of the 5th International Symposium on Communications, Control and Signal Processing, ISCCSP 2012, Rome, Italy, 2-4 May 2012 ROBUST SCENE CLASSIFICATION BY GIST WITH ANGULAR RADIAL PARTITIONING

More information

Shape Anchors for Data-driven Multi-view Reconstruction

Shape Anchors for Data-driven Multi-view Reconstruction Shape Anchors for Data-driven Multi-view Reconstruction Andrew Owens MIT CSAIL Jianxiong Xiao Princeton University Antonio Torralba MIT CSAIL William Freeman MIT CSAIL andrewo@mit.edu xj@princeton.edu

More information

Selective Search for Object Recognition

Selective Search for Object Recognition Selective Search for Object Recognition Uijlings et al. Schuyler Smith Overview Introduction Object Recognition Selective Search Similarity Metrics Results Object Recognition Kitten Goal: Problem: Where

More information

Object Category Detection: Sliding Windows

Object Category Detection: Sliding Windows 04/10/12 Object Category Detection: Sliding Windows Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem Today s class: Object Category Detection Overview of object category detection Statistical

More information

6.819 / 6.869: Advances in Computer Vision

6.819 / 6.869: Advances in Computer Vision 6.819 / 6.869: Advances in Computer Vision Image Retrieval: Retrieval: Information, images, objects, large-scale Website: http://6.869.csail.mit.edu/fa15/ Instructor: Yusuf Aytar Lecture TR 9:30AM 11:00AM

More information

Modeling 3D viewpoint for part-based object recognition of rigid objects

Modeling 3D viewpoint for part-based object recognition of rigid objects Modeling 3D viewpoint for part-based object recognition of rigid objects Joshua Schwartz Department of Computer Science Cornell University jdvs@cs.cornell.edu Abstract Part-based object models based on

More information

Category vs. instance recognition

Category vs. instance recognition Category vs. instance recognition Category: Find all the people Find all the buildings Often within a single image Often sliding window Instance: Is this face James? Find this specific famous building

More information

Part Discovery from Partial Correspondence

Part Discovery from Partial Correspondence Part Discovery from Partial Correspondence Subhransu Maji Gregory Shakhnarovich Toyota Technological Institute at Chicago, IL, USA Abstract We study the problem of part discovery when partial correspondence

More information

Bag-of-features. Cordelia Schmid

Bag-of-features. Cordelia Schmid Bag-of-features for category classification Cordelia Schmid Visual search Particular objects and scenes, large databases Category recognition Image classification: assigning a class label to the image

More information

Three things everyone should know to improve object retrieval. Relja Arandjelović and Andrew Zisserman (CVPR 2012)

Three things everyone should know to improve object retrieval. Relja Arandjelović and Andrew Zisserman (CVPR 2012) Three things everyone should know to improve object retrieval Relja Arandjelović and Andrew Zisserman (CVPR 2012) University of Oxford 2 nd April 2012 Large scale object retrieval Find all instances of

More information

Selection of Scale-Invariant Parts for Object Class Recognition

Selection of Scale-Invariant Parts for Object Class Recognition Selection of Scale-Invariant Parts for Object Class Recognition Gy. Dorkó and C. Schmid INRIA Rhône-Alpes, GRAVIR-CNRS 655, av. de l Europe, 3833 Montbonnot, France fdorko,schmidg@inrialpes.fr Abstract

More information

Local features and image matching. Prof. Xin Yang HUST

Local features and image matching. Prof. Xin Yang HUST Local features and image matching Prof. Xin Yang HUST Last time RANSAC for robust geometric transformation estimation Translation, Affine, Homography Image warping Given a 2D transformation T and a source

More information

Evaluation of image features using a photorealistic virtual world

Evaluation of image features using a photorealistic virtual world Evaluation of image features using a photorealistic virtual world The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Instance-level recognition I. - Camera geometry and image alignment

Instance-level recognition I. - Camera geometry and image alignment Reconnaissance d objets et vision artificielle 2011 Instance-level recognition I. - Camera geometry and image alignment Josef Sivic http://www.di.ens.fr/~josef INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548 Laboratoire

More information

What are we trying to achieve? Why are we doing this? What do we learn from past history? What will we talk about today?

What are we trying to achieve? Why are we doing this? What do we learn from past history? What will we talk about today? Introduction What are we trying to achieve? Why are we doing this? What do we learn from past history? What will we talk about today? What are we trying to achieve? Example from Scott Satkin 3D interpretation

More information

Fuzzy based Multiple Dictionary Bag of Words for Image Classification

Fuzzy based Multiple Dictionary Bag of Words for Image Classification Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2196 2206 International Conference on Modeling Optimisation and Computing Fuzzy based Multiple Dictionary Bag of Words for Image

More information

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking Feature descriptors Alain Pagani Prof. Didier Stricker Computer Vision: Object and People Tracking 1 Overview Previous lectures: Feature extraction Today: Gradiant/edge Points (Kanade-Tomasi + Harris)

More information

Local Features: Detection, Description & Matching

Local Features: Detection, Description & Matching Local Features: Detection, Description & Matching Lecture 08 Computer Vision Material Citations Dr George Stockman Professor Emeritus, Michigan State University Dr David Lowe Professor, University of British

More information

Bundling Features for Large Scale Partial-Duplicate Web Image Search

Bundling Features for Large Scale Partial-Duplicate Web Image Search Bundling Features for Large Scale Partial-Duplicate Web Image Search Zhong Wu, Qifa Ke, Michael Isard, and Jian Sun Microsoft Research Abstract In state-of-the-art image retrieval systems, an image is

More information

Local features: detection and description May 12 th, 2015

Local features: detection and description May 12 th, 2015 Local features: detection and description May 12 th, 2015 Yong Jae Lee UC Davis Announcements PS1 grades up on SmartSite PS1 stats: Mean: 83.26 Standard Dev: 28.51 PS2 deadline extended to Saturday, 11:59

More information

UNSUPERVISED OBJECT MATCHING AND CATEGORIZATION VIA AGGLOMERATIVE CORRESPONDENCE CLUSTERING

UNSUPERVISED OBJECT MATCHING AND CATEGORIZATION VIA AGGLOMERATIVE CORRESPONDENCE CLUSTERING UNSUPERVISED OBJECT MATCHING AND CATEGORIZATION VIA AGGLOMERATIVE CORRESPONDENCE CLUSTERING Md. Shafayat Hossain, Ahmedullah Aziz and Mohammad Wahidur Rahman Department of Electrical and Electronic Engineering,

More information

Midterm Wed. Local features: detection and description. Today. Last time. Local features: main components. Goal: interest operator repeatability

Midterm Wed. Local features: detection and description. Today. Last time. Local features: main components. Goal: interest operator repeatability Midterm Wed. Local features: detection and description Monday March 7 Prof. UT Austin Covers material up until 3/1 Solutions to practice eam handed out today Bring a 8.5 11 sheet of notes if you want Review

More information

3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry

3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry 3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry Bryan C. Russell et al. SIGGRAPH Asia 2013 Presented by YoungBin Kim 2014. 01. 10 Abstract Produce annotated 3D

More information

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Marc Pollefeys Joined work with Nikolay Savinov, Christian Haene, Lubor Ladicky 2 Comparison to Volumetric Fusion Higher-order ray

More information

Single Image Super-resolution. Slides from Libin Geoffrey Sun and James Hays

Single Image Super-resolution. Slides from Libin Geoffrey Sun and James Hays Single Image Super-resolution Slides from Libin Geoffrey Sun and James Hays Cs129 Computational Photography James Hays, Brown, fall 2012 Types of Super-resolution Multi-image (sub-pixel registration) Single-image

More information

Can Similar Scenes help Surface Layout Estimation?

Can Similar Scenes help Surface Layout Estimation? Can Similar Scenes help Surface Layout Estimation? Santosh K. Divvala, Alexei A. Efros, Martial Hebert Robotics Institute, Carnegie Mellon University. {santosh,efros,hebert}@cs.cmu.edu Abstract We describe

More information

A Novel Extreme Point Selection Algorithm in SIFT

A Novel Extreme Point Selection Algorithm in SIFT A Novel Extreme Point Selection Algorithm in SIFT Ding Zuchun School of Electronic and Communication, South China University of Technolog Guangzhou, China zucding@gmail.com Abstract. This paper proposes

More information

Object Category Detection. Slides mostly from Derek Hoiem

Object Category Detection. Slides mostly from Derek Hoiem Object Category Detection Slides mostly from Derek Hoiem Today s class: Object Category Detection Overview of object category detection Statistical template matching with sliding window Part-based Models

More information

Spatial Localization and Detection. Lecture 8-1

Spatial Localization and Detection. Lecture 8-1 Lecture 8: Spatial Localization and Detection Lecture 8-1 Administrative - Project Proposals were due on Saturday Homework 2 due Friday 2/5 Homework 1 grades out this week Midterm will be in-class on Wednesday

More information

Multiple Kernel Learning for Emotion Recognition in the Wild

Multiple Kernel Learning for Emotion Recognition in the Wild Multiple Kernel Learning for Emotion Recognition in the Wild Karan Sikka, Karmen Dykstra, Suchitra Sathyanarayana, Gwen Littlewort and Marian S. Bartlett Machine Perception Laboratory UCSD EmotiW Challenge,

More information

Recognition with Bag-ofWords. (Borrowing heavily from Tutorial Slides by Li Fei-fei)

Recognition with Bag-ofWords. (Borrowing heavily from Tutorial Slides by Li Fei-fei) Recognition with Bag-ofWords (Borrowing heavily from Tutorial Slides by Li Fei-fei) Recognition So far, we ve worked on recognizing edges Now, we ll work on recognizing objects We will use a bag-of-words

More information

A model for full local image interpretation

A model for full local image interpretation A model for full local image interpretation Guy Ben-Yosef 1 (guy.ben-yosef@weizmann.ac.il)) Liav Assif 1 (liav.assif@weizmann.ac.il) Daniel Harari 1,2 (hararid@mit.edu) Shimon Ullman 1,2 (shimon.ullman@weizmann.ac.il)

More information

Part-Based Models for Object Class Recognition Part 3

Part-Based Models for Object Class Recognition Part 3 High Level Computer Vision! Part-Based Models for Object Class Recognition Part 3 Bernt Schiele - schiele@mpi-inf.mpg.de Mario Fritz - mfritz@mpi-inf.mpg.de! http://www.d2.mpi-inf.mpg.de/cv ! State-of-the-Art

More information

Image classification Computer Vision Spring 2018, Lecture 18

Image classification Computer Vision Spring 2018, Lecture 18 Image classification http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 2018, Lecture 18 Course announcements Homework 5 has been posted and is due on April 6 th. - Dropbox link because course

More information

Representing 3D models for alignment and recognition

Representing 3D models for alignment and recognition Representing 3D models for alignment and recognition Mathieu Aubry To cite this version: Mathieu Aubry. Representing 3D models for alignment and recognition. Computer Vision and Pattern Recognition [cs.cv].

More information

CS 231A Computer Vision (Winter 2014) Problem Set 3

CS 231A Computer Vision (Winter 2014) Problem Set 3 CS 231A Computer Vision (Winter 2014) Problem Set 3 Due: Feb. 18 th, 2015 (11:59pm) 1 Single Object Recognition Via SIFT (45 points) In his 2004 SIFT paper, David Lowe demonstrates impressive object recognition

More information

Bag of Words Models. CS4670 / 5670: Computer Vision Noah Snavely. Bag-of-words models 11/26/2013

Bag of Words Models. CS4670 / 5670: Computer Vision Noah Snavely. Bag-of-words models 11/26/2013 CS4670 / 5670: Computer Vision Noah Snavely Bag-of-words models Object Bag of words Bag of Words Models Adapted from slides by Rob Fergus and Svetlana Lazebnik 1 Object Bag of words Origin 1: Texture Recognition

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

Combining Appearance and Topology for Wide

Combining Appearance and Topology for Wide Combining Appearance and Topology for Wide Baseline Matching Dennis Tell and Stefan Carlsson Presented by: Josh Wills Image Point Correspondences Critical foundation for many vision applications 3-D reconstruction,

More information

Determining Patch Saliency Using Low-Level Context

Determining Patch Saliency Using Low-Level Context Determining Patch Saliency Using Low-Level Context Devi Parikh 1, C. Lawrence Zitnick 2, and Tsuhan Chen 1 1 Carnegie Mellon University, Pittsburgh, PA, USA 2 Microsoft Research, Redmond, WA, USA Abstract.

More information

Large-scale visual recognition The bag-of-words representation

Large-scale visual recognition The bag-of-words representation Large-scale visual recognition The bag-of-words representation Florent Perronnin, XRCE Hervé Jégou, INRIA CVPR tutorial June 16, 2012 Outline Bag-of-words Large or small vocabularies? Extensions for instance-level

More information

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT SIFT: Scale Invariant Feature Transform; transform image

More information

Photo Tourism: Exploring Photo Collections in 3D

Photo Tourism: Exploring Photo Collections in 3D Photo Tourism: Exploring Photo Collections in 3D SIGGRAPH 2006 Noah Snavely Steven M. Seitz University of Washington Richard Szeliski Microsoft Research 2006 2006 Noah Snavely Noah Snavely Reproduced with

More information

Blocks that Shout: Distinctive Parts for Scene Classification

Blocks that Shout: Distinctive Parts for Scene Classification Blocks that Shout: Distinctive Parts for Scene Classification Mayank Juneja 1 Andrea Vedaldi 2 C. V. Jawahar 1 Andrew Zisserman 2 1 Center for Visual Information Technology, International Institute of

More information

String distance for automatic image classification

String distance for automatic image classification String distance for automatic image classification Nguyen Hong Thinh*, Le Vu Ha*, Barat Cecile** and Ducottet Christophe** *University of Engineering and Technology, Vietnam National University of HaNoi,

More information

2 Related Work. 2.1 Logo Localization

2 Related Work. 2.1 Logo Localization 3rd International Conference on Multimedia Technology(ICMT 2013) Logo Localization and Recognition Based on Spatial Pyramid Matching with Color Proportion Wenya Feng 1, Zhiqian Chen, Yonggan Hou, Long

More information

Immediate, scalable object category detection

Immediate, scalable object category detection Immediate, scalable object category detection Yusuf Aytar Andrew Zisserman Visual Geometry Group, Department of Engineering Science, University of Oxford Abstract The objective of this work is object category

More information

Feature Based Registration - Image Alignment

Feature Based Registration - Image Alignment Feature Based Registration - Image Alignment Image Registration Image registration is the process of estimating an optimal transformation between two or more images. Many slides from Alexei Efros http://graphics.cs.cmu.edu/courses/15-463/2007_fall/463.html

More information

Part-Based Models for Object Class Recognition Part 2

Part-Based Models for Object Class Recognition Part 2 High Level Computer Vision Part-Based Models for Object Class Recognition Part 2 Bernt Schiele - schiele@mpi-inf.mpg.de Mario Fritz - mfritz@mpi-inf.mpg.de https://www.mpi-inf.mpg.de/hlcv Class of Object

More information