Representing 3D models with discriminative visual elements


1 Representing 3D models with discriminative visual elements. Mathieu Aubry, June 23rd, 2014. Josef Sivic (INRIA-ENS), Bryan Russell (Intel). Painting-to-3D Model Alignment Via Discriminative Visual Elements, M. Aubry, B. Russell and J. Sivic, TOG 2014 (to be presented at SIGGRAPH 2014).

2

3

4

5 Motivation: browsing visual content Intelligent visual memory: organize and search your visual record

6 Motivation: history / archeology New ways to access and analyse data for archeology, history or architecture [Russell, Sivic, Ponce, Dessales, 2011] Example: evolution of a particular place over time

7 Motivation: reasoning on non-realistic images. There are many non-photographic depictions of our world. Ultimate goal: to reason about these depictions.

8 From the beginning of computer vision. The first PhD in computer vision (MIT, 1963): Lawrence G. Roberts, "Machine perception of three-dimensional solids". [Slide images: photograph and 3D model.]

9 1980s: 2D-3D alignment. [Huttenlocher and Ullman IJCV 1990], [Lowe AI 1987], [Faugeras & Hebert 86], [Grimson & Lozano-Perez 86]. [Figure: object recognition example from the work of Lowe (1987), aligning 3D wire-frame models to an image.]

10 Difficulty. Limits of local feature matching using SIFT. Figure from [A. Shrivastava, T. Malisiewicz, A. Gupta, A. Efros, Data-driven Visual Similarity for Cross-domain Image Matching, SIGGRAPH Asia 2011]. Very large space of possible viewpoints. See also: [Chum & Matas CVPR 2006], [Shechtman & Irani CVPR 2007], [Russell, Sivic, Ponce, Dessales 2011], [Hauagge & Snavely CVPR 2012]

11 Related work: retrieval using a global descriptor [Russell, Sivic, Ponce, Dessales, 2011] used GIST [Oliva and Torralba 2001] [Shrivastava, Malisiewicz, Gupta, Efros, 2011]

12 Related work: retrieval using contours. [Baatz, Saurer, Köser, Pollefeys 12], [Russell, Sivic, Ponce, Dessales 2011]. See also: [Lowe 87], [Huttenlocher & Ullman 90], [Shotton, Blake, Cipolla 05], [Ferrari, Fevrier, Jurie, Schmid 08], [Opelt, Pinz, Zisserman 06], [Arandjelovic & Zisserman 11], [Baboud, Cadik, Eisemann, Seidel 2011]

13 Inspiration: category recognition based on HOG. Introduced in [Dalal and Triggs 2005]. Extension to object parts [Felzenszwalb et al. 2010]. Fast approximation [Hariharan, Malik, Ramanan 2012]

14 Inspiration: mid-level visual elements Learn a vocabulary of discriminative visual elements that characterize a city. [Doersch, Singh, Gupta, Sivic, Efros, What makes Paris look like Paris?, SIGGRAPH 2012] See also [Bourdev & Malik ICCV 2009], [Singh et al. ECCV 2012], [Juneja et al. CVPR 2013], [Jain et al. CVPR 2013],

15 High-level idea. Limitations of edge detection: soft edge representations (HOG, SIFT...), machine learning (DPM, CNN...). Limitations of keypoint detection: dense evaluation, RANSAC, grouping (bag of words...). These tools come from category-level recognition, 3D scene reconstruction, and place/instance recognition. Bring together these methods again!

16 This work: 3D discriminative visual elements. Summarize a 3D model with a set of discriminative elements: view-dependent, distinct 3D fragments.

17 Problem statement. Inputs: 3D model, painting. Output: camera parameters (camera center, rotation, principal point, focal length).

18 I. Synthesize representative views Synthesize ~10,000 viewpoints for an architectural site See also: [Irschara et al. CVPR 2009], [Baatz et al. ECCV 2012]

19 I. Synthesize representative views. Synthesize ~10,000 viewpoints for an architectural site. We will have views similar to the paintings, which makes the problem easier! A sketch of the sampling follows. See also: [Irschara et al. CVPR 2009], [Baatz et al. ECCV 2012]
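A minimal sketch of this sampling procedure, with hypothetical `render` and `visible_fraction` callbacks standing in for the actual renderer; the grid step, pitch angles, and visibility threshold are illustrative values, not the paper's:

```python
import numpy as np

def synthesize_views(ground_bbox, render, visible_fraction,
                     grid_step=1.0, n_yaw=24, pitches=(-10.0, 0.0, 10.0),
                     min_visible=0.1):
    """Sample camera poses on a ground-plane grid, 24 yaw angles per
    position and 2-3 pitch angles per yaw; keep only views where a
    significant portion of the 3D model is visible."""
    (x0, x1), (y0, y1) = ground_bbox
    views = []
    for x in np.arange(x0, x1, grid_step):
        for y in np.arange(y0, y1, grid_step):
            for yaw in np.linspace(0.0, 360.0, n_yaw, endpoint=False):
                for pitch in pitches:
                    pose = (float(x), float(y), yaw, pitch)
                    if visible_fraction(pose) >= min_visible:  # hypothetical callback
                        views.append(render(pose))             # hypothetical callback
    return views
```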

20 II. Select informative patches Idea: match the views using informative patches.

21 II. Select informative patches. Idea: match the views using informative patches. Problem: how to select the informative patches?

22 II. Select informative patches. Discriminability in patch space (candidate patches m, n).

23 II. Select informative patches. Discriminability in patch space: a feature map F sends each patch m to a descriptor F(m).

24 II. Select informative patches. Discriminability in patch space: F(m) for each patch m.

25 II. Select informative patches. Discriminability in patch space. [Slide overlays Fig. 3 of the paper: matching as classification; given a region and its HOG descriptor q in a rendered view, the aim is to find the corresponding region in a painting.]

26 II. Select informative patches. Discriminability in patch space. [Slide overlays paper text: Fig. 2, camera positions sampled on a ground-plane grid with 24 orientations each, cameras not viewing the model discarded, about 45,000 valid views for this model; and Section 4.2, image matching formulated as discriminative classification, with a closed-form classifier for a specific loss enabling efficient training for thousands of densely sampled candidate elements.]

27 II. Select informative patches. Discriminability in patch space. [Same overlaid paper text as the previous slide: viewpoint sampling (Fig. 2) and the discriminative-classification formulation of matching (Section 4.2).]

28 II. Select informative patches. Discriminability in patch space: patches q1, q2 map to descriptors F(q1), F(q2). Big ||F(q)|| = discriminative.

29 II. Select informative patches. We evaluate the whitened norm densely on the rendered view and perform non-max suppression. [Fig. 4 of the paper: first row, discriminability shown as a heat-map at three scales, red high, blue low; discriminability is inversely proportional to the training cost of a classifier learnt from a patch at that location; second row, example visual elements at the local maxima.] Notes: can be thought of as a generalization of local feature detection; equivalent to minimizing a least-squares classification loss.
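A minimal numpy sketch of this dense scan, assuming HOG cells are precomputed into an (H, W, D) array and that the mean and inverse covariance of flattened windows were estimated offline from generic negative data; the double loop is for clarity only (a real implementation would be vectorized):

```python
import numpy as np

def discriminability_map(feat, mu, inv_cov, win=(8, 8)):
    """Score every win-sized window of a HOG feature map feat (H, W, D)
    by the norm of its whitened descriptor; mu has length h*w*D and
    inv_cov is the (regularized) inverse covariance of flattened windows,
    both estimated offline from generic negatives."""
    H, W, D = feat.shape
    h, w = win
    scores = np.full((H - h + 1, W - w + 1), -np.inf)
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            q = feat[i:i+h, j:j+w].ravel() - mu
            scores[i, j] = np.sqrt(q @ inv_cov @ q)   # whitened norm ||F(q)||
    return scores

def non_max_suppression(scores, radius=4, topk=50):
    """Greedily pick score-map peaks, suppressing a square neighborhood."""
    s = scores.copy()
    peaks = []
    while len(peaks) < topk:
        i, j = np.unravel_index(np.argmax(s), s.shape)
        if not np.isfinite(s[i, j]):
            break
        peaks.append((i, j, float(s[i, j])))
        s[max(0, i - radius):i + radius + 1,
          max(0, j - radius):j + radius + 1] = -np.inf
    return peaks
```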

30 III. Select stable patches. Filter out elements that are unstable across viewpoints. The 3D model provides ground-truth matches in nearby views. Require elements to be reliably detectable in nearby views (a sketch follows). See also [Doersch et al. SIGGRAPH 2012], [Singh et al. ECCV 2012], [Juneja et al. CVPR 2013]
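A sketch of the stability test under two assumed callbacks: `project_element`, giving the element's ground-truth location in a nearby rendered view (known exactly since the 3D model is given), and `detect`, returning the classifier's best detection in that view; the tolerance and hit-rate threshold are illustrative:

```python
def is_stable(elem, w, nearby_views, project_element, detect,
              tol=8, min_hit_rate=0.8):
    """Keep an element only if its detector fires near the ground-truth
    location in a sufficient fraction of nearby rendered views."""
    hits = 0
    for view in nearby_views:
        gt = project_element(elem, view)   # hypothetical: from the 3D model
        if gt is None:                     # element not visible in this view
            continue
        x, y = detect(w, view)             # hypothetical: best detection
        if abs(x - gt[0]) <= tol and abs(y - gt[1]) <= tol:
            hits += 1
    return hits >= min_hit_rate * len(nearby_views)
```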

31 Summary: representation of the 3D model Back-project learnt discriminative elements onto the 3D model

32 IV. Matching patches. Query region q: 1. Represent the query region q using its HOG descriptor.

33 IV. Matching patches. Query region q: 1. Represent the query region q using its HOG descriptor. [Slide overlays the paper's matching-as-classification text: instead of the Euclidean distance between descriptors, train a linear classifier with q as a single positive example and a large number of negatives x_i; the best match x in the painting maximizes the classification score s(x) = w^T x + b; the learnt w weights components of x differently, unlike the standard Euclidean distance.] See also exemplar SVM [Malisiewicz et al., ICCV 11], [Shrivastava et al., 11]

34 IV. Matching patches. Query region q: 1. Represent the query region q using its HOG descriptor. 2. Train a linear classifier f(x) = w^T x + b using q as a positive example and a large number of negatives. [Same overlaid paper text as the previous slide; here the classifier is used for weighted matching, visualizing positive weights w+ and negative weights w-.] See also exemplar SVM [Malisiewicz et al., ICCV 11], [Shrivastava et al., 11]

35 Classifier training. Train a classifier for each candidate region q: {q, +1}, {x_i, -1} for i = 1..N (a set of generic negatives). Cost: E(w, b) = L(1, w^T q + b) + (1/N) sum_{i=1..N} L(-1, w^T x_i + b). Example: hinge loss (exemplar SVM). Example: square loss.

36 Classifier training. For the square loss, the cost E can be minimized in closed form [Bach & Harchaoui 2008; Gharbi et al. 2012; Hariharan et al. 2012].
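Concretely, with the negatives summarized by their mean mu and covariance Sigma, the square-loss solution is, up to scale, w = Sigma^{-1}(q - mu): the whitened-HOG / LDA-style classifier of the cited works. A minimal sketch; the bias uses the symmetric LDA choice, which is an assumption rather than the paper's exact expression:

```python
import numpy as np

def closed_form_classifier(q, mu_neg, cov_neg, lam=1e-2):
    """Closed-form square-loss classifier: w proportional to
    Sigma^{-1}(q - mu). lam regularizes the covariance; the bias b is the
    symmetric LDA choice (an assumption, not the paper's exact form)."""
    D = cov_neg.shape[0]
    w = np.linalg.solve(cov_neg + lam * np.eye(D), q - mu_neg)
    b = -0.5 * float(w @ (q + mu_neg))
    return w, b
```

Because mu and Sigma depend only on the generic negative set, they are computed once, after which training a classifier per candidate element reduces to a single linear solve.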

37 IV. Matching patches. Query region q: 1. Represent the query region q using its HOG descriptor. 2. Train a linear classifier f(x) = w^T x + b using q as a positive example and a large number of negatives. [Same overlaid paper text as slides 33-34.]

38 IV. Matching patches. Query region q: 1. Represent the query region q using its HOG descriptor. 2. Train a linear classifier f(x) = w^T x + b with q as a single positive and a large number of negatives. 3. Find the best match: the patch x in the painting maximizing the classification score f(x). [Slide overlays duplicated paper text on matching as classification and on selecting discriminative visual elements via least-squares regression (Section 4.2.2).]

39 IV. Matching patches. Query region q: 1. Represent the query region q using its HOG descriptor. 2. Train a linear classifier f(x) = w^T x + b. 3. Best match: the patch x in the painting maximizing the classification score f(x). [Same overlaid paper text as the previous slide.]
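Step 3 amounts to a sliding-window search over the painting's HOG features. A minimal single-scale sketch; in practice the search runs over a HOG pyramid and the scoring is implemented as a convolution:

```python
import numpy as np

def best_match(feat, w, b, win):
    """Exhaustively score every win-sized window of a HOG feature map
    feat (H, W, D) with s(x) = w^T x + b and return the argmax location."""
    H, W, D = feat.shape
    h, ww = win
    best_score, best_loc = -np.inf, None
    for i in range(H - h + 1):
        for j in range(W - ww + 1):
            s = float(w @ feat[i:i+h, j:j+ww].ravel()) + b
            if s > best_score:
                best_score, best_loc = s, (i, j)
    return best_loc, best_score
```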

40 V. Recover viewpoint. 1. Run all the element detectors and find the best matches. We need to compare the detector scores; this can be seen as a calibration problem. We use an affine calibration (a sketch follows). Results on the [Hauagge and Snavely 2012] dataset.
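One simple affine calibration consistent with the slide, though the paper's exact fitting procedure may differ: standardize each detector's scores on a shared negative set, which is itself an affine map s -> a*s + b:

```python
import numpy as np

def affine_calibration(neg_scores):
    """Fit a per-detector affine map s -> a*s + b from its scores on a
    shared negative set, so that scores of different detectors live on a
    common scale (here: zero mean, unit variance on the negatives)."""
    mu = float(np.mean(neg_scores))
    sigma = float(np.std(neg_scores)) + 1e-8
    return 1.0 / sigma, -mu / sigma   # calibrated score = a * s + b
```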

41 V. Recover viewpoint. 1. Run all the element detectors. 2. Run RANSAC on the top 25 matches.
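A sketch of this step using OpenCV's RANSAC PnP solver: each matched element supplies a 2D-3D correspondence (its 3D location on the model from back-projection, its 2D detection in the painting). For simplicity this assumes known intrinsics K, whereas the paper also recovers the focal length:

```python
import numpy as np
import cv2

def recover_viewpoint(pts3d, pts2d, K):
    """Estimate the camera pose by RANSAC + PnP over the 2D-3D
    correspondences given by the top matched elements. Assumes known
    intrinsics K; the paper additionally recovers the focal length."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(pts3d, dtype=np.float32),
        np.asarray(pts2d, dtype=np.float32),
        K, None, reprojectionError=8.0)
    return (rvec, tvec, inliers) if ok else None
```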

42 Historical photograph Results

43 Historical photograph Results

44 Drawings Results

45 Drawings Results

46 Engravings Results

47 Engravings Results

48 Paintings Results

49 Watercolors Results

50 Watercolors Results

51 Results

52 Quantitative Results On a database of 337 depictions with 4 3D models:

53 CVPR 2014 work. Oral session 9B, Friday 13:45. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. M. Aubry, B. Russell, A. Efros and J. Sivic, CVPR 2014 (oral).

54 CVPR 2014 work. Use of a large database of 3D models (1,300) to help recognition.

55 Approach. Use rendered views from the 3D models (62 views per model). [Figure axes: style, viewpoint.]

56 Main difference: calibration of the detectors. [Plot: density of detector scores before (initial score) and after calibration; annotations: 0.01% of the scores, -1.]

57

58 Conclusion. Mid-level visual elements and 2D-3D alignment can help: understanding non-photographic depictions; representing object categories. Oral session 9B, Friday 13:45.

59 Questions?


Recognition. Topics that we will try to cover: Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) Object classification (we did this one already) Neural Networks Object class detection Hough-voting techniques

More information

Learning to Match Images in Large-Scale Collections

Learning to Match Images in Large-Scale Collections Learning to Match Images in Large-Scale Collections Song Cao and Noah Snavely Cornell University Ithaca, NY, 14853 Abstract. Many computer vision applications require computing structure and feature correspondence

More information

Lecture 12 Recognition

Lecture 12 Recognition Institute of Informatics Institute of Neuroinformatics Lecture 12 Recognition Davide Scaramuzza 1 Lab exercise today replaced by Deep Learning Tutorial Room ETH HG E 1.1 from 13:15 to 15:00 Optional lab

More information

Adaptive Rendering for Large-Scale Skyline Characterization and Matching

Adaptive Rendering for Large-Scale Skyline Characterization and Matching Adaptive Rendering for Large-Scale Skyline Characterization and Matching Jiejie Zhu, Mayank Bansal, Nick Vander Valk, and Hui Cheng Vision Technologies Lab, SRI International, Princeton, NJ 08540, USA

More information

Local Image Features

Local Image Features Local Image Features Ali Borji UWM Many slides from James Hayes, Derek Hoiem and Grauman&Leibe 2008 AAAI Tutorial Overview of Keypoint Matching 1. Find a set of distinctive key- points A 1 A 2 A 3 B 3

More information

HOG-based Pedestriant Detector Training

HOG-based Pedestriant Detector Training HOG-based Pedestriant Detector Training evs embedded Vision Systems Srl c/o Computer Science Park, Strada Le Grazie, 15 Verona- Italy http: // www. embeddedvisionsystems. it Abstract This paper describes

More information

CS6670: Computer Vision

CS6670: Computer Vision CS6670: Computer Vision Noah Snavely Lecture 16: Bag-of-words models Object Bag of words Announcements Project 3: Eigenfaces due Wednesday, November 11 at 11:59pm solo project Final project presentations:

More information

CS229: Action Recognition in Tennis

CS229: Action Recognition in Tennis CS229: Action Recognition in Tennis Aman Sikka Stanford University Stanford, CA 94305 Rajbir Kataria Stanford University Stanford, CA 94305 asikka@stanford.edu rkataria@stanford.edu 1. Motivation As active

More information

OBJECT CATEGORIZATION

OBJECT CATEGORIZATION OBJECT CATEGORIZATION Ing. Lorenzo Seidenari e-mail: seidenari@dsi.unifi.it Slides: Ing. Lamberto Ballan November 18th, 2009 What is an Object? Merriam-Webster Definition: Something material that may be

More information

Supervised learning. y = f(x) function

Supervised learning. y = f(x) function Supervised learning y = f(x) output prediction function Image feature Training: given a training set of labeled examples {(x 1,y 1 ),, (x N,y N )}, estimate the prediction function f by minimizing the

More information

Local Features and Bag of Words Models

Local Features and Bag of Words Models 10/14/11 Local Features and Bag of Words Models Computer Vision CS 143, Brown James Hays Slides from Svetlana Lazebnik, Derek Hoiem, Antonio Torralba, David Lowe, Fei Fei Li and others Computer Engineering

More information

Recap Image Classification with Bags of Local Features

Recap Image Classification with Bags of Local Features Recap Image Classification with Bags of Local Features Bag of Feature models were the state of the art for image classification for a decade BoF may still be the state of the art for instance retrieval

More information

ROBUST SCENE CLASSIFICATION BY GIST WITH ANGULAR RADIAL PARTITIONING. Wei Liu, Serkan Kiranyaz and Moncef Gabbouj

ROBUST SCENE CLASSIFICATION BY GIST WITH ANGULAR RADIAL PARTITIONING. Wei Liu, Serkan Kiranyaz and Moncef Gabbouj Proceedings of the 5th International Symposium on Communications, Control and Signal Processing, ISCCSP 2012, Rome, Italy, 2-4 May 2012 ROBUST SCENE CLASSIFICATION BY GIST WITH ANGULAR RADIAL PARTITIONING

More information

Shape Anchors for Data-driven Multi-view Reconstruction

Shape Anchors for Data-driven Multi-view Reconstruction Shape Anchors for Data-driven Multi-view Reconstruction Andrew Owens MIT CSAIL Jianxiong Xiao Princeton University Antonio Torralba MIT CSAIL William Freeman MIT CSAIL andrewo@mit.edu xj@princeton.edu

More information

Selective Search for Object Recognition

Selective Search for Object Recognition Selective Search for Object Recognition Uijlings et al. Schuyler Smith Overview Introduction Object Recognition Selective Search Similarity Metrics Results Object Recognition Kitten Goal: Problem: Where

More information

Object Category Detection: Sliding Windows

Object Category Detection: Sliding Windows 04/10/12 Object Category Detection: Sliding Windows Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem Today s class: Object Category Detection Overview of object category detection Statistical

More information

6.819 / 6.869: Advances in Computer Vision

6.819 / 6.869: Advances in Computer Vision 6.819 / 6.869: Advances in Computer Vision Image Retrieval: Retrieval: Information, images, objects, large-scale Website: http://6.869.csail.mit.edu/fa15/ Instructor: Yusuf Aytar Lecture TR 9:30AM 11:00AM

More information

Modeling 3D viewpoint for part-based object recognition of rigid objects

Modeling 3D viewpoint for part-based object recognition of rigid objects Modeling 3D viewpoint for part-based object recognition of rigid objects Joshua Schwartz Department of Computer Science Cornell University jdvs@cs.cornell.edu Abstract Part-based object models based on

More information

Category vs. instance recognition

Category vs. instance recognition Category vs. instance recognition Category: Find all the people Find all the buildings Often within a single image Often sliding window Instance: Is this face James? Find this specific famous building

More information

Part Discovery from Partial Correspondence

Part Discovery from Partial Correspondence Part Discovery from Partial Correspondence Subhransu Maji Gregory Shakhnarovich Toyota Technological Institute at Chicago, IL, USA Abstract We study the problem of part discovery when partial correspondence

More information

Bag-of-features. Cordelia Schmid

Bag-of-features. Cordelia Schmid Bag-of-features for category classification Cordelia Schmid Visual search Particular objects and scenes, large databases Category recognition Image classification: assigning a class label to the image

More information

Three things everyone should know to improve object retrieval. Relja Arandjelović and Andrew Zisserman (CVPR 2012)

Three things everyone should know to improve object retrieval. Relja Arandjelović and Andrew Zisserman (CVPR 2012) Three things everyone should know to improve object retrieval Relja Arandjelović and Andrew Zisserman (CVPR 2012) University of Oxford 2 nd April 2012 Large scale object retrieval Find all instances of

More information

Selection of Scale-Invariant Parts for Object Class Recognition

Selection of Scale-Invariant Parts for Object Class Recognition Selection of Scale-Invariant Parts for Object Class Recognition Gy. Dorkó and C. Schmid INRIA Rhône-Alpes, GRAVIR-CNRS 655, av. de l Europe, 3833 Montbonnot, France fdorko,schmidg@inrialpes.fr Abstract

More information

Local features and image matching. Prof. Xin Yang HUST

Local features and image matching. Prof. Xin Yang HUST Local features and image matching Prof. Xin Yang HUST Last time RANSAC for robust geometric transformation estimation Translation, Affine, Homography Image warping Given a 2D transformation T and a source

More information

Evaluation of image features using a photorealistic virtual world

Evaluation of image features using a photorealistic virtual world Evaluation of image features using a photorealistic virtual world The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Instance-level recognition I. - Camera geometry and image alignment

Instance-level recognition I. - Camera geometry and image alignment Reconnaissance d objets et vision artificielle 2011 Instance-level recognition I. - Camera geometry and image alignment Josef Sivic http://www.di.ens.fr/~josef INRIA, WILLOW, ENS/INRIA/CNRS UMR 8548 Laboratoire

More information

What are we trying to achieve? Why are we doing this? What do we learn from past history? What will we talk about today?

What are we trying to achieve? Why are we doing this? What do we learn from past history? What will we talk about today? Introduction What are we trying to achieve? Why are we doing this? What do we learn from past history? What will we talk about today? What are we trying to achieve? Example from Scott Satkin 3D interpretation

More information

Fuzzy based Multiple Dictionary Bag of Words for Image Classification

Fuzzy based Multiple Dictionary Bag of Words for Image Classification Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2196 2206 International Conference on Modeling Optimisation and Computing Fuzzy based Multiple Dictionary Bag of Words for Image

More information

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking Feature descriptors Alain Pagani Prof. Didier Stricker Computer Vision: Object and People Tracking 1 Overview Previous lectures: Feature extraction Today: Gradiant/edge Points (Kanade-Tomasi + Harris)

More information

Local Features: Detection, Description & Matching

Local Features: Detection, Description & Matching Local Features: Detection, Description & Matching Lecture 08 Computer Vision Material Citations Dr George Stockman Professor Emeritus, Michigan State University Dr David Lowe Professor, University of British

More information

Bundling Features for Large Scale Partial-Duplicate Web Image Search

Bundling Features for Large Scale Partial-Duplicate Web Image Search Bundling Features for Large Scale Partial-Duplicate Web Image Search Zhong Wu, Qifa Ke, Michael Isard, and Jian Sun Microsoft Research Abstract In state-of-the-art image retrieval systems, an image is

More information

Local features: detection and description May 12 th, 2015

Local features: detection and description May 12 th, 2015 Local features: detection and description May 12 th, 2015 Yong Jae Lee UC Davis Announcements PS1 grades up on SmartSite PS1 stats: Mean: 83.26 Standard Dev: 28.51 PS2 deadline extended to Saturday, 11:59

More information

UNSUPERVISED OBJECT MATCHING AND CATEGORIZATION VIA AGGLOMERATIVE CORRESPONDENCE CLUSTERING

UNSUPERVISED OBJECT MATCHING AND CATEGORIZATION VIA AGGLOMERATIVE CORRESPONDENCE CLUSTERING UNSUPERVISED OBJECT MATCHING AND CATEGORIZATION VIA AGGLOMERATIVE CORRESPONDENCE CLUSTERING Md. Shafayat Hossain, Ahmedullah Aziz and Mohammad Wahidur Rahman Department of Electrical and Electronic Engineering,

More information

Midterm Wed. Local features: detection and description. Today. Last time. Local features: main components. Goal: interest operator repeatability

Midterm Wed. Local features: detection and description. Today. Last time. Local features: main components. Goal: interest operator repeatability Midterm Wed. Local features: detection and description Monday March 7 Prof. UT Austin Covers material up until 3/1 Solutions to practice eam handed out today Bring a 8.5 11 sheet of notes if you want Review

More information

3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry

3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry 3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry Bryan C. Russell et al. SIGGRAPH Asia 2013 Presented by YoungBin Kim 2014. 01. 10 Abstract Produce annotated 3D

More information

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Marc Pollefeys Joined work with Nikolay Savinov, Christian Haene, Lubor Ladicky 2 Comparison to Volumetric Fusion Higher-order ray

More information

Single Image Super-resolution. Slides from Libin Geoffrey Sun and James Hays

Single Image Super-resolution. Slides from Libin Geoffrey Sun and James Hays Single Image Super-resolution Slides from Libin Geoffrey Sun and James Hays Cs129 Computational Photography James Hays, Brown, fall 2012 Types of Super-resolution Multi-image (sub-pixel registration) Single-image

More information

Can Similar Scenes help Surface Layout Estimation?

Can Similar Scenes help Surface Layout Estimation? Can Similar Scenes help Surface Layout Estimation? Santosh K. Divvala, Alexei A. Efros, Martial Hebert Robotics Institute, Carnegie Mellon University. {santosh,efros,hebert}@cs.cmu.edu Abstract We describe

More information

A Novel Extreme Point Selection Algorithm in SIFT

A Novel Extreme Point Selection Algorithm in SIFT A Novel Extreme Point Selection Algorithm in SIFT Ding Zuchun School of Electronic and Communication, South China University of Technolog Guangzhou, China zucding@gmail.com Abstract. This paper proposes

More information

Object Category Detection. Slides mostly from Derek Hoiem

Object Category Detection. Slides mostly from Derek Hoiem Object Category Detection Slides mostly from Derek Hoiem Today s class: Object Category Detection Overview of object category detection Statistical template matching with sliding window Part-based Models

More information

Spatial Localization and Detection. Lecture 8-1

Spatial Localization and Detection. Lecture 8-1 Lecture 8: Spatial Localization and Detection Lecture 8-1 Administrative - Project Proposals were due on Saturday Homework 2 due Friday 2/5 Homework 1 grades out this week Midterm will be in-class on Wednesday

More information

Multiple Kernel Learning for Emotion Recognition in the Wild

Multiple Kernel Learning for Emotion Recognition in the Wild Multiple Kernel Learning for Emotion Recognition in the Wild Karan Sikka, Karmen Dykstra, Suchitra Sathyanarayana, Gwen Littlewort and Marian S. Bartlett Machine Perception Laboratory UCSD EmotiW Challenge,

More information

Recognition with Bag-ofWords. (Borrowing heavily from Tutorial Slides by Li Fei-fei)

Recognition with Bag-ofWords. (Borrowing heavily from Tutorial Slides by Li Fei-fei) Recognition with Bag-ofWords (Borrowing heavily from Tutorial Slides by Li Fei-fei) Recognition So far, we ve worked on recognizing edges Now, we ll work on recognizing objects We will use a bag-of-words

More information

A model for full local image interpretation

A model for full local image interpretation A model for full local image interpretation Guy Ben-Yosef 1 (guy.ben-yosef@weizmann.ac.il)) Liav Assif 1 (liav.assif@weizmann.ac.il) Daniel Harari 1,2 (hararid@mit.edu) Shimon Ullman 1,2 (shimon.ullman@weizmann.ac.il)

More information

Part-Based Models for Object Class Recognition Part 3

Part-Based Models for Object Class Recognition Part 3 High Level Computer Vision! Part-Based Models for Object Class Recognition Part 3 Bernt Schiele - schiele@mpi-inf.mpg.de Mario Fritz - mfritz@mpi-inf.mpg.de! http://www.d2.mpi-inf.mpg.de/cv ! State-of-the-Art

More information

Image classification Computer Vision Spring 2018, Lecture 18

Image classification Computer Vision Spring 2018, Lecture 18 Image classification http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 2018, Lecture 18 Course announcements Homework 5 has been posted and is due on April 6 th. - Dropbox link because course

More information

Representing 3D models for alignment and recognition

Representing 3D models for alignment and recognition Representing 3D models for alignment and recognition Mathieu Aubry To cite this version: Mathieu Aubry. Representing 3D models for alignment and recognition. Computer Vision and Pattern Recognition [cs.cv].

More information

CS 231A Computer Vision (Winter 2014) Problem Set 3

CS 231A Computer Vision (Winter 2014) Problem Set 3 CS 231A Computer Vision (Winter 2014) Problem Set 3 Due: Feb. 18 th, 2015 (11:59pm) 1 Single Object Recognition Via SIFT (45 points) In his 2004 SIFT paper, David Lowe demonstrates impressive object recognition

More information

Bag of Words Models. CS4670 / 5670: Computer Vision Noah Snavely. Bag-of-words models 11/26/2013

Bag of Words Models. CS4670 / 5670: Computer Vision Noah Snavely. Bag-of-words models 11/26/2013 CS4670 / 5670: Computer Vision Noah Snavely Bag-of-words models Object Bag of words Bag of Words Models Adapted from slides by Rob Fergus and Svetlana Lazebnik 1 Object Bag of words Origin 1: Texture Recognition

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

Combining Appearance and Topology for Wide

Combining Appearance and Topology for Wide Combining Appearance and Topology for Wide Baseline Matching Dennis Tell and Stefan Carlsson Presented by: Josh Wills Image Point Correspondences Critical foundation for many vision applications 3-D reconstruction,

More information

Determining Patch Saliency Using Low-Level Context

Determining Patch Saliency Using Low-Level Context Determining Patch Saliency Using Low-Level Context Devi Parikh 1, C. Lawrence Zitnick 2, and Tsuhan Chen 1 1 Carnegie Mellon University, Pittsburgh, PA, USA 2 Microsoft Research, Redmond, WA, USA Abstract.

More information

Large-scale visual recognition The bag-of-words representation

Large-scale visual recognition The bag-of-words representation Large-scale visual recognition The bag-of-words representation Florent Perronnin, XRCE Hervé Jégou, INRIA CVPR tutorial June 16, 2012 Outline Bag-of-words Large or small vocabularies? Extensions for instance-level

More information

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT SIFT: Scale Invariant Feature Transform; transform image

More information

Photo Tourism: Exploring Photo Collections in 3D

Photo Tourism: Exploring Photo Collections in 3D Photo Tourism: Exploring Photo Collections in 3D SIGGRAPH 2006 Noah Snavely Steven M. Seitz University of Washington Richard Szeliski Microsoft Research 2006 2006 Noah Snavely Noah Snavely Reproduced with

More information

Blocks that Shout: Distinctive Parts for Scene Classification

Blocks that Shout: Distinctive Parts for Scene Classification Blocks that Shout: Distinctive Parts for Scene Classification Mayank Juneja 1 Andrea Vedaldi 2 C. V. Jawahar 1 Andrew Zisserman 2 1 Center for Visual Information Technology, International Institute of

More information

String distance for automatic image classification

String distance for automatic image classification String distance for automatic image classification Nguyen Hong Thinh*, Le Vu Ha*, Barat Cecile** and Ducottet Christophe** *University of Engineering and Technology, Vietnam National University of HaNoi,

More information

2 Related Work. 2.1 Logo Localization

2 Related Work. 2.1 Logo Localization 3rd International Conference on Multimedia Technology(ICMT 2013) Logo Localization and Recognition Based on Spatial Pyramid Matching with Color Proportion Wenya Feng 1, Zhiqian Chen, Yonggan Hou, Long

More information

Immediate, scalable object category detection

Immediate, scalable object category detection Immediate, scalable object category detection Yusuf Aytar Andrew Zisserman Visual Geometry Group, Department of Engineering Science, University of Oxford Abstract The objective of this work is object category

More information

Feature Based Registration - Image Alignment

Feature Based Registration - Image Alignment Feature Based Registration - Image Alignment Image Registration Image registration is the process of estimating an optimal transformation between two or more images. Many slides from Alexei Efros http://graphics.cs.cmu.edu/courses/15-463/2007_fall/463.html

More information

Part-Based Models for Object Class Recognition Part 2

Part-Based Models for Object Class Recognition Part 2 High Level Computer Vision Part-Based Models for Object Class Recognition Part 2 Bernt Schiele - schiele@mpi-inf.mpg.de Mario Fritz - mfritz@mpi-inf.mpg.de https://www.mpi-inf.mpg.de/hlcv Class of Object

More information