Representing 3D models with discriminative visual elements
1 Representing 3D models with discriminative visual elements Mathieu Aubry June 23rd 2014 Josef Sivic (INRIA-ENS) Bryan Russell (Intel) Painting-to-3D Model Alignment Via Discriminative Visual Elements M. Aubry, B. Russell and J. Sivic, TOG 2014 (will be presented at SIGGRAPH 2014)
5 Motivation: browsing visual content Intelligent visual memory: organize and search your visual record
6 Motivation: history / archeology New ways to access and analyse data for archeology, history or architecture [Russell, Sivic, Ponce, Dessales, 2011] Example: evolution of a particular place over time
7 Motivation: reasoning on non-realistic images There are many non-photographic depictions of our world. Ultimate goal: to reason about these depictions
8 From the beginning of computer vision Lawrence G. Roberts, first PhD in computer vision (MIT, 1963): Machine Perception of Three-Dimensional Solids. Photograph to 3D model
9 1980s: 2D-3D Alignment [Huttenlocher and Ullman IJCV 1990], [Lowe AI 1987], [Faugeras & Hebert 86], [Grimson & Lozano-Perez 86]
10 Difficulty Limits of local feature matching using SIFT: very large space of possible viewpoints. Figure from [A. Shrivastava, T. Malisiewicz, A. Gupta, A. Efros, Data-driven Visual Similarity for Cross-domain Image Matching, SIGGRAPH Asia 2011]. See also: [Chum & Matas CVPR 2006], [Shechtman & Irani CVPR 2007], [Russell, Sivic, Ponce, Dessales 2011], [Hauagge & Snavely CVPR 2012]
11 Related work: retrieval using a global descriptor [Russell, Sivic, Ponce, Dessales, 2011] used GIST [Oliva and Torralba 2001] [Shrivastava, Malisiewicz, Gupta, Efros, 2011]
12 Related work: retrieval using contours [Baatz, Saurer, Köser, Pollefeys 12], [Russell, Sivic, Ponce, Dessales 2011]. See also [Lowe 87], [Huttenlocher & Ullman 90], [Shotton, Blake, Cipolla 05], [Ferrari, Fevrier, Jurie, Schmid 08], [Opelt, Pinz, Zisserman 06], [Arandjelovic & Zisserman 11], [Baboud, Cadik, Eisemann, Seidel 2011]
13 Inspiration: category recognition based on HOGs Introduced in [Dalal and Triggs 2005]. Extension to object parts [Felzenszwalb et al. 2010]. Fast approximation [Hariharan, Malik, Ramanan 2012]
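Since HOG descriptors underpin everything that follows, here is a toy sketch of the idea: per-cell histograms of gradient orientations, weighted by gradient magnitude. This is a simplified stand-in, not the paper's implementation (real HOG adds block-level contrast normalization); `toy_hog` and its parameters are illustrative.

```python
import numpy as np

def toy_hog(img, cell=8, nbins=9):
    """Toy HOG-style descriptor: one orientation histogram per cell,
    weighted by gradient magnitude (no block normalization)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * nbins).astype(int), nbins - 1)
    h, w = img.shape
    feats = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            hist = np.zeros(nbins)
            np.add.at(hist, bins[i:i+cell, j:j+cell].ravel(),
                      mag[i:i+cell, j:j+cell].ravel())
            feats.append(hist)
    f = np.concatenate(feats)
    return f / (np.linalg.norm(f) + 1e-9)            # L2-normalized descriptor

patch = np.outer(np.ones(32), np.arange(32))         # patch with vertical edges
desc = toy_hog(patch)
print(desc.shape)                                    # → (144,)  i.e. 4*4 cells * 9 bins
```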
14 Inspiration: mid-level visual elements Learn a vocabulary of discriminative visual elements that characterize a city. [Doersch, Singh, Gupta, Sivic, Efros, What makes Paris look like Paris?, SIGGRAPH 2012] See also [Bourdev & Malik ICCV 2009], [Singh et al. ECCV 2012], [Juneja et al. CVPR 2013], [Jain et al. CVPR 2013],
15 High-level idea Limitation of edge detection: soft edge representations (HOG, SIFT, ...) and machine learning (DPM, CNN, ...). Keypoint detection: dense evaluation. RANSAC: grouping (bag of words, ...). Category-level recognition, 3D scene reconstruction, place / instance recognition: bring these methods together again!
16 This work: 3D discriminative visual elements Summarize a 3D model with a set of discriminative elements: view-dependent, distinct 3D fragments
17 Problem statement Inputs: 3D model, painting. Output: camera parameters (camera center, rotation, principal point, focal length)
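To make the output concrete, a minimal sketch of how these camera parameters act, assuming a standard pinhole model (all numeric values are illustrative, not from the paper):

```python
import numpy as np

f = 800.0                                  # focal length (pixels)
px, py = 320.0, 240.0                      # principal point
K = np.array([[f, 0, px],                  # intrinsics: focal length + principal point
              [0, f, py],
              [0, 0, 1.0]])

R = np.eye(3)                              # camera rotation (world -> camera)
C = np.array([0.0, 0.0, -10.0])            # camera center in world coordinates

X = np.array([1.0, 2.0, 5.0])              # a 3D point on the model
x_cam = R @ (X - C)                        # point in camera coordinates
u, v, s = K @ x_cam                        # homogeneous image coordinates
print(u / s, v / s)                        # → about 373.33 346.67 (pixels)
```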
18 I. Synthesize representative views Synthesize ~10,000 viewpoints for an architectural site See also: [Irschara et al. CVPR 2009], [Baatz et al. ECCV 2012]
19 I. Synthesize representative views Synthesize ~10,000 viewpoints for an architectural site. We will have views similar to the paintings, which makes the problem easier! See also: [Irschara et al. CVPR 2009], [Baatz et al. ECCV 2012]
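The sampling stage can be sketched as follows. Grid spacing, pitch angles, field of view and the visibility test (model center inside the frustum) are illustrative assumptions, not the paper's exact parameters:

```python
import numpy as np

def sample_viewpoints(model_center, extent=100.0, step=10.0,
                      n_yaw=24, pitches_deg=(-10.0, 0.0, 10.0),
                      fov_deg=60.0):
    """Sample cameras on a ground-plane grid, 24 yaw orientations and a
    few pitches each; keep only cameras that can see the model (here
    approximated by checking the model center is inside the field of view)."""
    half_fov = np.deg2rad(fov_deg) / 2.0
    views = []
    for x in np.arange(-extent, extent + step, step):
        for y in np.arange(-extent, extent + step, step):
            pos = np.array([x, y, 1.7])             # eye height above ground
            to_model = model_center - pos
            to_model = to_model / np.linalg.norm(to_model)
            for yaw in np.linspace(0.0, 2*np.pi, n_yaw, endpoint=False):
                for pitch in np.deg2rad(pitches_deg):
                    d = np.array([np.cos(pitch)*np.cos(yaw),   # viewing direction
                                  np.cos(pitch)*np.sin(yaw),
                                  np.sin(pitch)])
                    if np.arccos(np.clip(d @ to_model, -1, 1)) < half_fov:
                        views.append((pos, yaw, pitch))
    return views

views = sample_viewpoints(np.array([0.0, 0.0, 20.0]))
print(len(views))                                   # number of kept (valid) views
```

Views that fail the visibility test are simply discarded, matching the idea that only cameras seeing a significant part of the model are kept.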
20 II. Select informative patches Idea: match the views using informative patches.
21 II. Select informative patches Idea: match the views using informative patches. Problem: how to select the informative patches?
22 II. Select informative patches Discriminability in patch space: patches m, n in a rendered view
23 II. Select informative patches Discriminability in patch space: the HOG feature extractor F maps patch m to its descriptor F(m)
28 II. Select informative patches Discriminability in patch space: comparing descriptors F(q1) and F(q2) of candidate patches, a big (whitened) F(q) means the patch q is discriminative
29 II. Select informative patches We evaluate the whitened norm densely on the rendered view and perform non-max suppression. The discriminability can be shown as a heat-map at several scales (red: high, blue: low); it is inversely proportional to the training cost of a classifier learnt from a patch at that image location, and example visual elements are taken at the local maxima of the discriminability score. This can be thought of as a generalization of local feature detection, and is equivalent to minimizing a least-squares classification loss.
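A compact sketch of this selection step, with random vectors standing in for dense HOG descriptors (in practice the mean μ and covariance Σ would be estimated from a large generic image collection; the NMS here is the simple greedy variant):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
mu = rng.normal(size=d)                              # negative-data mean (synthetic)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + 0.1 * np.eye(d)                    # negative-data covariance (synthetic)
Sigma_inv_sqrt = np.linalg.cholesky(np.linalg.inv(Sigma)).T   # whitening transform

H, W = 20, 20
descs = rng.normal(size=(H, W, d)) + mu              # dense descriptor grid over a view
# discriminability = whitened norm ||Sigma^(-1/2) (q - mu)|| at every location
scores = np.linalg.norm((descs - mu) @ Sigma_inv_sqrt.T, axis=-1)

def nms(scores, radius=2, k=10):
    """Greedy non-max suppression: repeatedly take the best remaining
    location and blank out its neighbourhood."""
    s = scores.copy()
    keep = []
    for _ in range(k):
        i, j = np.unravel_index(np.argmax(s), s.shape)
        keep.append((i, j))
        s[max(0, i-radius):i+radius+1, max(0, j-radius):j+radius+1] = -np.inf
    return keep

peaks = nms(scores)
print(len(peaks))                                    # → 10 selected locations
```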
30 III. Select stable patches Filter out elements unstable across viewpoints. The 3D model provides ground-truth matches in nearby views. Require elements to be reliably detectable in nearby views. See also [Doersch et al. SIGGRAPH 2012], [Singh et al. ECCV 2012], [Juneja et al. CVPR 2013]
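The stability test can be sketched as follows, with synthetic detection maps standing in for real detector responses in nearby rendered views. The 3D model tells us where the element should appear in each nearby view; we keep the element only if its best detection lands close to that ground-truth location (`is_stable` and `tol` are illustrative names/values):

```python
import numpy as np

def is_stable(det_maps, gt_positions, tol=3):
    """Keep an element only if, in every nearby view, its detector's
    best response is within `tol` pixels of the ground-truth location
    given by the 3D model."""
    for smap, (gi, gj) in zip(det_maps, gt_positions):
        i, j = np.unravel_index(np.argmax(smap), smap.shape)
        if max(abs(i - gi), abs(j - gj)) > tol:
            return False
    return True

rng = np.random.default_rng(2)
good = rng.random((30, 30)); good[10, 12] = 2.0   # strong peak at ground truth
bad  = rng.random((30, 30)); bad[25, 3]  = 2.0    # peak far from ground truth
print(is_stable([good], [(10, 12)]), is_stable([bad], [(10, 12)]))   # → True False
```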
31 Summary: representation of the 3D model Back-project learnt discriminative elements onto the 3D model
32 IV. Matching patches Query region q: 1. Represent query region q using HOG descriptor
33 IV. Matching patches Query region q: 1. Represent query region q using its HOG descriptor. Matching as classification: given a region and its HOG descriptor q in a rendered view, the aim is to find the corresponding region x in the painting. Instead of finding the best match by Euclidean distance between descriptors, train a linear classifier with q as a single positive example and a large number of negative examples; the match is the patch x in the painting with the highest classification score s(x) = w^T x + b. See also exemplar SVM [Malisiewicz et al. ICCV 11], [Shrivastava et al. 11]
34 IV. Matching patches Query region q: 1. Represent query region q using its HOG descriptor. 2. Train a linear classifier f(x) = w^T x + b using q as a positive example and a large number of negatives. See also exemplar SVM [Malisiewicz et al. ICCV 11], [Shrivastava et al. 11]
35 Classifier training Train a classifier for each candidate region q: {q, +1}, {x_i, -1} for i = 1..N (set of generic negatives), by minimizing E(w, b) = L(+1, w^T q + b) + (1/N) sum_{i=1..N} L(-1, w^T x_i + b). Example: hinge loss L(y, s) = max{0, 1 - y s} (exemplar SVM). Example: square loss L(y, s) = (y - s)^2
36 Classifier training For the square loss, the cost E can be minimized in closed form: the optimal w is proportional to Sigma^{-1}(q - mu), where mu and Sigma are the mean and covariance of the negative examples [Bach & Harchaoui 2008; Gharbi et al. 2012; Hariharan et al. 2012]
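A quick numeric check of this closed form, with random vectors standing in for HOG descriptors: the direction of the direct least-squares minimizer of E(w, b) = (1 - w^T q - b)^2 + (1/N) sum_i (1 + w^T x_i + b)^2 matches Sigma^{-1}(q - mu). The covariance uses the population normalization (bias=True) so the algebra matches exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 12, 2000
X = rng.normal(size=(N, d))                    # generic negative descriptors
q = rng.normal(size=d) + 1.0                   # one positive example

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)     # population covariance of the negatives
w_closed = np.linalg.solve(Sigma, q - mu)      # closed-form direction: Sigma^-1 (q - mu)

# Direct least-squares solve of the same square-loss objective, for comparison:
# row [q, 1] with target +1, rows [x_i, 1]/sqrt(N) with targets -1/sqrt(N)
A = np.vstack([np.append(q, 1.0),
               np.hstack([X, np.ones((N, 1))]) / np.sqrt(N)])
y = np.concatenate([[1.0], -np.ones(N) / np.sqrt(N)])
sol, *_ = np.linalg.lstsq(A, y, rcond=None)
w_ls = sol[:d]

cos = w_closed @ w_ls / (np.linalg.norm(w_closed) * np.linalg.norm(w_ls))
print(round(cos, 6))                           # → 1.0 (same direction, up to scale)
```

No iterative training is needed, which is what makes it feasible to train thousands of candidate element classifiers per rendered view.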
37 IV. Matching patches Query region q: 1. Represent query region q using its HOG descriptor. 2. Train a linear classifier f(x) = w^T x + b using q as a positive example and a large number of negatives. The learnt w is used here for weighted matching: it weights different components of x differently (positive weights w+ and negative weights w-), in contrast to the standard Euclidean distance, where all components have the same weight.
38 IV. Matching patches Query region q: 1. Represent query region q using its HOG descriptor. 2. Train a linear classifier f(x) = w^T x + b using q as a positive example and a large number of negatives. 3. Find the best match x in the painting by maximizing the classification score f(x).
Note Note that that a similar similar idea idea was was used used in in learning learning per-exemplar per-exemplar distances distances [Frome [Frome et et al. al. 2007] 2007] or or per-exemplar per-exemplar support support vector vector machine(svm) machine (SVM) classifiers classifiers [Malisiewicz [Malisiewicz et et al. al. 2011] 2011] for for object object recognition recognition and and cross-domain cross-domain image image retrieval retrieval [Shrivastava et al. 2011]. Here, we build on this work and apply it to image [Shrivastava et al. 2011]. Here, webuild on this work and apply it to image matching using local mid-level image structures. matching using local mid-level image structures. Parameters and are obtained by minimizing cost function Parameters w and b are obtained by minimizing a cost function of the following form of the following form E (w, b) = L 1,w T E (w, b) = L 1,w T q + b + 1 N X N N i =1 i =1 1,w T i b, (2) L 1,w T x i + b, (2)
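The closed-form training described in this excerpt can be sketched concretely. With a square loss in place of the hinge loss, the minimizer of cost (2) reduces, up to a scale factor, to a covariance-whitened difference between the query descriptor and the mean of the negative data, so no iterative training is needed. A minimal sketch with synthetic stand-in "HOG" vectors (the dimensions, data, and planted match below are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a "HOG descriptor" is just a D-dimensional vector.
D, N = 32, 5000
negatives = rng.normal(size=(N, D))   # descriptors of negative patches
q = rng.normal(size=D)                # query (positive) descriptor

# Statistics of the negative data, shared by all candidate elements.
mu = negatives.mean(axis=0)
sigma = np.cov(negatives, rowvar=False) + 1e-3 * np.eye(D)  # regularized

# With the square loss, the minimizer of cost (2) reduces, up to scale,
# to a whitened difference: w ~ inv(Sigma) (q - mu).  No iterative training.
w = np.linalg.solve(sigma, q - mu)

def score(x):
    # Classification score s(x) = w^T x (the bias only shifts the ranking).
    return w @ x

# Matching: among candidate patches in the painting, the best match is
# the one with the highest score.  Candidate 17 is planted as the true match.
candidates = rng.normal(size=(100, D))
candidates[17] = q
best = int(np.argmax([score(x) for x in candidates]))
```

Because mu and sigma depend only on the negative data, they are computed once and reused for every candidate element, which is what makes training thousands of candidates per rendered view cheap.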
39 IV. Matching patches
Query region q -> Best match: the patch x* in the painting with the highest classification score s(x).
40 V. Recover viewpoint
1. Run all the element detectors and find the best matches.
-> We need to compare the detector scores. This can be seen as a calibration problem; we use an affine calibration.
Results on the [Hauagge and Snavely 2012] dataset
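The affine calibration step can be sketched as fitting, per detector, a map s' = a*s + b from the detector's scores on a common pool of negative data. Standardizing to zero mean and unit variance is one illustrative choice of affine anchor, not necessarily the exact fit used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def affine_calibration(neg_scores):
    # Fit s' = a*s + b so this detector's negative scores have mean 0 and
    # std 1.  (Illustrative anchors; the paper's exact affine fit may differ.)
    a = 1.0 / neg_scores.std()
    b = -a * neg_scores.mean()
    return a, b

# Two hypothetical detectors with incomparable raw score ranges.
raw1 = rng.normal(0.0, 1.0, size=50_000)
raw2 = rng.normal(40.0, 8.0, size=50_000)

a1, b1 = affine_calibration(raw1)
a2, b2 = affine_calibration(raw2)

# After calibration the two detectors' scores live on a common scale,
# so their best matches can be ranked against each other.
cal1 = a1 * raw1 + b1
cal2 = a2 * raw2 + b2
```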
41 V. Recover viewpoint
1. Run all the element detectors.
2. Run RANSAC on the top 25 matches.
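The RANSAC step can be sketched as below. For simplicity the sketch fits a 2D affine transform from 3 of the top correspondences and counts inliers, whereas the actual system recovers a full camera viewpoint; all correspondences here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)

def ransac_affine(src, dst, n_iters=500, thresh=3.0):
    # RANSAC sketch: repeatedly fit a 2D affine map from 3 random
    # correspondences and keep the model with the most inliers.
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=3, replace=False)
        # Solve [x y 1] A = [x' y'] for the 3x2 affine matrix A.
        S = np.column_stack([src[idx], np.ones(3)])
        try:
            A = np.linalg.solve(S, dst[idx])
        except np.linalg.LinAlgError:
            continue  # degenerate (collinear) minimal sample
        pred = np.column_stack([src, np.ones(len(src))]) @ A
        inliers = np.linalg.norm(pred - dst, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

# 25 top matches: 18 follow a common transform, the last 7 are outliers.
true_A = np.array([[0.9, -0.2], [0.2, 0.9], [10.0, -5.0]])  # 3x2: linear part over translation
src = rng.uniform(0, 200, size=(25, 2))
dst = np.column_stack([src, np.ones(25)]) @ true_A + rng.normal(0, 0.5, (25, 2))
dst[18:] = rng.uniform(0, 200, size=(7, 2))  # corrupt the last 7 matches

inliers = ransac_affine(src, dst)
```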
42 Results: historical photographs
43 Results: historical photographs
44 Results: drawings
45 Results: drawings
46 Results: engravings
47 Results: engravings
48 Results: paintings
49 Results: watercolors
50 Results: watercolors
51 Results
52 Quantitative Results On a database of 337 depictions with 4 3D models:
53 CVPR 2014 work
Oral session 9B, Friday 13:45
Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models
M. Aubry, B. Russell, A. Efros and J. Sivic, CVPR 2014 (oral)
Mathieu AUBRY 58 INRIA -TUM
54 CVPR 2014 work
Use of a large database of 3D models (1,300) to help recognition.
55 Approach
Use rendered views from the 3D models (62 views per model).
[Figure: rendered views organized by style and viewpoint]
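The rendered-view sampling can be sketched as rings of cameras around each model. The split into 31 azimuth steps at 2 elevations below is an assumption chosen only to reproduce the 62-views-per-model count; the talk does not specify the exact scheme:

```python
import numpy as np

def sample_viewpoints(n_azimuth=31, elevations=(10.0, 30.0), radius=2.5):
    # Place cameras on rings around the object: n_azimuth azimuth steps at
    # each elevation angle (31 * 2 = 62 views; scheme and radius are
    # assumptions, not the talk's actual parameters).
    cams = []
    for elev in np.deg2rad(np.asarray(elevations)):
        for az in np.linspace(0.0, 2.0 * np.pi, n_azimuth, endpoint=False):
            cams.append((radius * np.cos(elev) * np.cos(az),
                         radius * np.cos(elev) * np.sin(az),
                         radius * np.sin(elev)))
    return np.array(cams)

views = sample_viewpoints()  # one camera position per rendered view
```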
56 Main difference
Calibration of the detectors.
[Plot: density of the initial detector scores, with the calibrated score axis anchored at -1 and at the value exceeded by 0.01% of the scores]
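The plot on this slide suggests anchoring each detector's calibration in the tail of its score distribution: map the mean of the negative scores to -1 and the value exceeded by only 0.01% of the scores to 0, so a fixed calibrated score corresponds to the same false-positive rate for every detector. The anchor values below are read off the slide's axis labels and are otherwise an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

def calibrate_at_quantile(neg_scores, tail=1e-4):
    # Affine map sending the negative-score mean to -1 and the score
    # exceeded by only `tail` (0.01%) of negatives to 0.
    mu = neg_scores.mean()
    q = np.quantile(neg_scores, 1.0 - tail)
    a = 1.0 / (q - mu)
    return lambda s: a * (s - q)

# Hypothetical raw scores of one detector on a large pool of negatives.
neg = rng.normal(5.0, 2.0, size=1_000_000)
cal = calibrate_at_quantile(neg)

# A calibrated score above 0 is rarer than 1 in 10,000 on negative data.
frac_above_zero = (cal(neg) > 0).mean()
```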
58 Conclusion
2D-3D alignment with mid-level visual elements can help:
- Understanding non-photographic depictions
- Representing object categories
Oral session 9B, Friday 13:45
59 Questions?
More informationLocal features: detection and description May 12 th, 2015
Local features: detection and description May 12 th, 2015 Yong Jae Lee UC Davis Announcements PS1 grades up on SmartSite PS1 stats: Mean: 83.26 Standard Dev: 28.51 PS2 deadline extended to Saturday, 11:59
More informationUNSUPERVISED OBJECT MATCHING AND CATEGORIZATION VIA AGGLOMERATIVE CORRESPONDENCE CLUSTERING
UNSUPERVISED OBJECT MATCHING AND CATEGORIZATION VIA AGGLOMERATIVE CORRESPONDENCE CLUSTERING Md. Shafayat Hossain, Ahmedullah Aziz and Mohammad Wahidur Rahman Department of Electrical and Electronic Engineering,
More informationMidterm Wed. Local features: detection and description. Today. Last time. Local features: main components. Goal: interest operator repeatability
Midterm Wed. Local features: detection and description Monday March 7 Prof. UT Austin Covers material up until 3/1 Solutions to practice eam handed out today Bring a 8.5 11 sheet of notes if you want Review
More information3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry
3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry Bryan C. Russell et al. SIGGRAPH Asia 2013 Presented by YoungBin Kim 2014. 01. 10 Abstract Produce annotated 3D
More informationDiscrete Optimization of Ray Potentials for Semantic 3D Reconstruction
Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Marc Pollefeys Joined work with Nikolay Savinov, Christian Haene, Lubor Ladicky 2 Comparison to Volumetric Fusion Higher-order ray
More informationSingle Image Super-resolution. Slides from Libin Geoffrey Sun and James Hays
Single Image Super-resolution Slides from Libin Geoffrey Sun and James Hays Cs129 Computational Photography James Hays, Brown, fall 2012 Types of Super-resolution Multi-image (sub-pixel registration) Single-image
More informationCan Similar Scenes help Surface Layout Estimation?
Can Similar Scenes help Surface Layout Estimation? Santosh K. Divvala, Alexei A. Efros, Martial Hebert Robotics Institute, Carnegie Mellon University. {santosh,efros,hebert}@cs.cmu.edu Abstract We describe
More informationA Novel Extreme Point Selection Algorithm in SIFT
A Novel Extreme Point Selection Algorithm in SIFT Ding Zuchun School of Electronic and Communication, South China University of Technolog Guangzhou, China zucding@gmail.com Abstract. This paper proposes
More informationObject Category Detection. Slides mostly from Derek Hoiem
Object Category Detection Slides mostly from Derek Hoiem Today s class: Object Category Detection Overview of object category detection Statistical template matching with sliding window Part-based Models
More informationSpatial Localization and Detection. Lecture 8-1
Lecture 8: Spatial Localization and Detection Lecture 8-1 Administrative - Project Proposals were due on Saturday Homework 2 due Friday 2/5 Homework 1 grades out this week Midterm will be in-class on Wednesday
More informationMultiple Kernel Learning for Emotion Recognition in the Wild
Multiple Kernel Learning for Emotion Recognition in the Wild Karan Sikka, Karmen Dykstra, Suchitra Sathyanarayana, Gwen Littlewort and Marian S. Bartlett Machine Perception Laboratory UCSD EmotiW Challenge,
More informationRecognition with Bag-ofWords. (Borrowing heavily from Tutorial Slides by Li Fei-fei)
Recognition with Bag-ofWords (Borrowing heavily from Tutorial Slides by Li Fei-fei) Recognition So far, we ve worked on recognizing edges Now, we ll work on recognizing objects We will use a bag-of-words
More informationA model for full local image interpretation
A model for full local image interpretation Guy Ben-Yosef 1 (guy.ben-yosef@weizmann.ac.il)) Liav Assif 1 (liav.assif@weizmann.ac.il) Daniel Harari 1,2 (hararid@mit.edu) Shimon Ullman 1,2 (shimon.ullman@weizmann.ac.il)
More informationPart-Based Models for Object Class Recognition Part 3
High Level Computer Vision! Part-Based Models for Object Class Recognition Part 3 Bernt Schiele - schiele@mpi-inf.mpg.de Mario Fritz - mfritz@mpi-inf.mpg.de! http://www.d2.mpi-inf.mpg.de/cv ! State-of-the-Art
More informationImage classification Computer Vision Spring 2018, Lecture 18
Image classification http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 2018, Lecture 18 Course announcements Homework 5 has been posted and is due on April 6 th. - Dropbox link because course
More informationRepresenting 3D models for alignment and recognition
Representing 3D models for alignment and recognition Mathieu Aubry To cite this version: Mathieu Aubry. Representing 3D models for alignment and recognition. Computer Vision and Pattern Recognition [cs.cv].
More informationCS 231A Computer Vision (Winter 2014) Problem Set 3
CS 231A Computer Vision (Winter 2014) Problem Set 3 Due: Feb. 18 th, 2015 (11:59pm) 1 Single Object Recognition Via SIFT (45 points) In his 2004 SIFT paper, David Lowe demonstrates impressive object recognition
More informationBag of Words Models. CS4670 / 5670: Computer Vision Noah Snavely. Bag-of-words models 11/26/2013
CS4670 / 5670: Computer Vision Noah Snavely Bag-of-words models Object Bag of words Bag of Words Models Adapted from slides by Rob Fergus and Svetlana Lazebnik 1 Object Bag of words Origin 1: Texture Recognition
More informationMobile Human Detection Systems based on Sliding Windows Approach-A Review
Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg
More informationCombining Appearance and Topology for Wide
Combining Appearance and Topology for Wide Baseline Matching Dennis Tell and Stefan Carlsson Presented by: Josh Wills Image Point Correspondences Critical foundation for many vision applications 3-D reconstruction,
More informationDetermining Patch Saliency Using Low-Level Context
Determining Patch Saliency Using Low-Level Context Devi Parikh 1, C. Lawrence Zitnick 2, and Tsuhan Chen 1 1 Carnegie Mellon University, Pittsburgh, PA, USA 2 Microsoft Research, Redmond, WA, USA Abstract.
More informationLarge-scale visual recognition The bag-of-words representation
Large-scale visual recognition The bag-of-words representation Florent Perronnin, XRCE Hervé Jégou, INRIA CVPR tutorial June 16, 2012 Outline Bag-of-words Large or small vocabularies? Extensions for instance-level
More informationSIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014
SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT SIFT: Scale Invariant Feature Transform; transform image
More informationPhoto Tourism: Exploring Photo Collections in 3D
Photo Tourism: Exploring Photo Collections in 3D SIGGRAPH 2006 Noah Snavely Steven M. Seitz University of Washington Richard Szeliski Microsoft Research 2006 2006 Noah Snavely Noah Snavely Reproduced with
More informationBlocks that Shout: Distinctive Parts for Scene Classification
Blocks that Shout: Distinctive Parts for Scene Classification Mayank Juneja 1 Andrea Vedaldi 2 C. V. Jawahar 1 Andrew Zisserman 2 1 Center for Visual Information Technology, International Institute of
More informationString distance for automatic image classification
String distance for automatic image classification Nguyen Hong Thinh*, Le Vu Ha*, Barat Cecile** and Ducottet Christophe** *University of Engineering and Technology, Vietnam National University of HaNoi,
More information2 Related Work. 2.1 Logo Localization
3rd International Conference on Multimedia Technology(ICMT 2013) Logo Localization and Recognition Based on Spatial Pyramid Matching with Color Proportion Wenya Feng 1, Zhiqian Chen, Yonggan Hou, Long
More informationImmediate, scalable object category detection
Immediate, scalable object category detection Yusuf Aytar Andrew Zisserman Visual Geometry Group, Department of Engineering Science, University of Oxford Abstract The objective of this work is object category
More informationFeature Based Registration - Image Alignment
Feature Based Registration - Image Alignment Image Registration Image registration is the process of estimating an optimal transformation between two or more images. Many slides from Alexei Efros http://graphics.cs.cmu.edu/courses/15-463/2007_fall/463.html
More informationPart-Based Models for Object Class Recognition Part 2
High Level Computer Vision Part-Based Models for Object Class Recognition Part 2 Bernt Schiele - schiele@mpi-inf.mpg.de Mario Fritz - mfritz@mpi-inf.mpg.de https://www.mpi-inf.mpg.de/hlcv Class of Object
More information