EECS 442 Computer vision. 3D Object Recognition and. Scene Understanding

Size: px

Start display at page:

Download "EECS 442 Computer vision. 3D Object Recognition and. Scene Understanding"

Prosper Lambert
6 years ago
Views:

1 EECS 442 Computer vision 3D Object Recognition and Scene Understanding

2 Interpreting the visual world Object: Building 8-10 meters away Object: Traffic light Object: Car, ¾ view 2-3 meters away

3 How can we achieve all of this? 3D modeling no semantic Semantic reasoning no 3D geometry Joint 3D modeling and semantic reasoning Chen & Medioni, 92 Debevec et al 96 Pollefeys et al 02 Nister 04 Hartley & Zisserman, 00 Levoy et al., 00 Brown & Lowe, 04 Schindler et al 08 Snavely et al 08 Agarwal et al 09 Etc

4 How can we achieve all of this? 3D modeling no semantic Semantic no 3D geometry Weber et al. 00 Felzenszwalb & Huttenlocher, 00 Ullman et al. 02 Fergus et al. 03 Torralba et al. 03 Leibe & Schiele, 04 Kumar & Hebert 04 Fei-Fei & Perona, 05 Sivic et al. 05 Shotton et al 05 Grauman et al. 05 Lazebnik et al, 06 Maji & Malik, 07 Vedaldi & Soatto 08 Zhu et al 08 Etc

range data disjoint 3D modeling and recognition Huber 01 Rusu et

5 Courtesy of Adan et al, 2011 How can we achieve all of this? 3D modeling no semantic Semantic no 3D geometry Semantic from range data disjoint 3D modeling and recognition Huber 01 Rusu et al. 08 Brostow et al. 08 Son & Kim 10 Tang et al. 10 Adan et al. 11 etc

6 How can we achieve all of this? 3D modeling no semantic Semantic no 3D geometry Semantic from range data disjoint 3D modeling and recognition Joint 3D modeling and semantic reasoning Hoiem et al Gould et al. 09 Hedau et al. 09 Gupta et al, 10 Ladick y et al, 10 Bao, Sun, Savarese 10 Sun, Bao, Savarese 10 Bao & Savarese, 11

7 Joint 3D modeling and recognition Objects and their geometrical attributes provide constraints for estimating the scene layout Given the scene the layout, objects can be detected more robustly

8 In this lecture. 3D Object detectors Robust to view point transformation Allow to estimate pose, scale and 3D shape Methods for coherent object detection and scene layout estimation single view multi-view videos

9 3D Object Detectors Azimuth, Zenith Viewing sphere Detect objects under generic view points Estimate object pose General and work for any object category

10 3D Object Detectors Detect objects under generic view points Estimate object pose General and work for any object category

11 3D Object Categorization

04 Leibe et al. 04 Kumar & Hebert 04 Sivic et al. 05 Shotton et al 05 Grauman et al.

12 Single view object categorization Leung et al 99 Weber et al. 00 Ullman et al. 02 Fergus et al. 03 Torralba et al. 03 Felzenszwalb & Huttenlocher 03 Fei-Fei et al. 04 Leibe et al. 04 Kumar & Hebert 04 Sivic et al. 05 Shotton et al 05 Grauman et al. 05 Sudderth et al 05 Torralba et al. 05 Lazebnik et al. 06 Todorovic et al. 06 Bosh et al 07 Vedaldi & Soatto 08

Single 3D object recognition Ballard, 81

13 Single 3D object recognition Ballard, 81 Grimson & L.-Perez, 87 Lowe, 87 Edelman et al. 91 Ullman & Barsi, 91 Rothwell 92 Linderberg, 94 Murase & Nayar 94 Zhang et al 95 Schmid & Mohr, 96 Schiele & Crowley, 96 Lowe, 99 Jacob & Barsi, 99 Rothganger et al., 04 Ferrari et al, 05 Brown & Lowe 05 Snavely et al 06 Yin & Collins, 07

14 3D Object Categorization

15 3D Object Categorization Mixture of 2D single view models Weber et al. 00 Schneiderman et al. 01 Bart et al. 04 Gu & Ren, 10 3D models - Explicit 3d models - Implicit 3d models Chiu et al. 07 Hoiem, et al., 07 Yan, et al. 07 Thomas et al. 06 Kushal, et al., 07 Savarese et al, 07, 08

16 Mixture of 2D models Weber et al. 00 Schneiderman et al. 01 Ullman et al. 02 Fergus et al. 03 Torralba et al. 03 Felzenszwalb & Huttenlocher 03 Leibe et al. 04 Shotton et al. 05 Grauman et al. 05 Savarese et al, 06 Todorovic et al. 06 Vedaldi & Soatto 08 Zhu et al 08 Gu & Ren, 10 Single view model Single view model 3D Category model CONS: Single view models are independent Non scalable to large number of categories/view-points Just b. boxes Cannot estimate 3D pose or 3D layout

17 3D Object Categorization Mixture of 2D single view models Weber et al. 00 Schneiderman et al. 01 Bart et al. 04 Gu & Ren, 10 3D models - Implicit 3d models - Explicit 3d models Chiu et al. 07 Hoiem, et al., 07 Yan, et al. 07. Xiang & Savarese 12 Thomas et al. 06 Kushal, et al., 07 Savarese et al, 07, 08 Sun et al. 09

18 Implicit 3D models 3D Category model Sparse set of interest points or parts of the objects are linked across views by implicit 3D transformations (H, F)

19 Linking features or parts across views: Perspective or affine transformation constraints x x x = H x

20 Linking features or parts across views: Epipolar Transformation Constraints x l = F T x x l = F T x x l

21 Implicit 3D models by ISM representations Thomas et al. 06 Leibe et al. 04 Multi-view model Sparse set of interest points or parts of the objects are linked across views.

22 Implicit 3D models by ISM representations Region tracks [Ferrari et al. 04, 06] Courtesy of Thomas et al. 06 Set of region-tracks connecting model views Each track is composed of image regions of a single physical surface patch along the model views in which it is visible.

23 Results Implicit 3D models by ISM representations

Implicit 3D models by graph-based representations Savarese, Fei-Fei, ICCV 07 Savarese, Fei-Fei, ECCV 08 Sun, et al, CVPR 2009, ICCV 09 Canonical parts captures view invariant diagnostic

24 Implicit 3D models by graph-based representations Savarese, Fei-Fei, ICCV 07 Savarese, Fei-Fei, ECCV 08 Sun, et al, CVPR 2009, ICCV 09 Canonical parts captures view invariant diagnostic appearance information 2d ½ structure linking parts via weak geometry Parts and relationship are modeled in a probabilistic fashion Parameters are learnt so as to maximize detection accuracy

25 Parameterization on view-sphere S T Model the object as collection of parts for any T and S on the viewing sphere

26 Multi-view generative part-based model T, S Yn=Codeword Xn=Location Yn=Codeword Xn=Location Image

27 Multi-view generative part-based model = Part Prop. Prior T, S K V = Part Appearance = Part Location/shape A Yn=Codeword Xn=Location Image

28 Multi-view generative part-based model Learning: estimate the latent variables and relevant parameters, given the observations Variational EM can be used Blei, ICML = Part Prop. Prior T, S K V = Part Appearance = Part Location/shape A Yn=Codeword Xn=Location Image

29 Incorporating geometrical constraints T i m M i j Within triangle constraints: M i j m i m j m j Encoded as a penalty term in variational EM

30 Incorporating geometrical constraints = Center T = Shape View morphing constraints: Seitz &Dyer SIGGRAPH 96 Xiao & Shah CVIU 04 Encoded as a penalty term in variational EM

31 Incorporating geometrical constraints S. M. Seitz and C. R. Dyer, Proc. SIGGRAPH 96, 1996, 21-30

32 h h h h h O P P P I,,, : k k k k k O P P P I,,, : Sequential ransac J-linkage [toldo et al 07] Defining initial parts and part correspondences Initializing the model

33 Semi-supervised Class label Object bounding box No part labels No pose labels [unlike Sun CVPR 09] No need to observe same object instance from multiple views [unlike Savarese & Fei-Fei, 07, 08]

34 Incremental learning T Enable unorganized and on-line collection training images Increase efficiency in learning (no need large storage space)

35 Incremental learning T Assign new training image to a triangle of the view sphere Evidence of training image is used to update model parameters Re-estimate sufficient statistics in a iterative fashion

36 Evolution of learnt parts

37 Examples of learnt part-based models Car 38

38 Examples of learnt part-based models Travel iron 39

39 Experimental results Object detection from any viewing angles Accurate estimation of the object pose PASCAL 2006 dataset 3D Object Dataset

40 Car 41

41 Travel Iron 42

42 Detection - 3D Object Dataset Car Bicycle ROC ROC Our model Sun et al, CVPR 09 Savarese & Fei-Fei ICCV 07

43 Viewpoint Classification 3D object dataset 0º 45º 90º V1 V2 V3 135º V4 180º V5 225º V6 270º V7 315º V8 Classification Accuracy Our model Savarese ICCV 07

44 Predicting object appearance from novel views Viewing sphere?

45 Predicting object appearance [For natural scenes, see Hoiem et al 07; Saxena et al 07] from novel views Thomas et al 08 Cremer et al 09

47 Predicting object appearance from novel views Affine transformation Our model

48 3D Object Categorization Mixture of 2D single view models Weber et al. 00 Schneiderman et al. 01 Bart et al. 04 Gu & Ren, 10 3D models - explicit 3d models - Implict 3d models Chiu et al. 07 Hoiem, et al., 07 Yan, et al. 07 Thomas et al. 06 Kushal, et al., 07 Savarese et al, 07, 08

49 Explicit 3D Models Chiu et al. 07 Hoiem, et al., 07 Yan, et al. 07 Xiang & Svarese, 12 H ij 3D Category model Part configuration is modeled as a conditional random fields with maximal margin parameter estimation Enable 6DOF object pose estimation 3D layout estimation of object parts

50 Explicit 3D Models Xiang & Savarese, CVPR 12 Mixture of 2D single view models Weber et al. 00 Schneiderman et al. 01 Bart et al. 04 Gu & Ren, 10 3D models - Explicit 3D models - Implicit 3D models Chiu et al. 07 Hoiem, et al., 07 Yan, et al. 07 Thomas et al. 06 Kushal, et al., 07 Savarese et al, 07, 08 [3D object dataset, 07]

51 In this lecture. 3D Object detectors Robust to view point transformation Allow to estimate pose, scale and 3D shape Methods for coherent object detection and scene layout estimation single view multi-view videos

supporting planes No assumptions on cameras Work both in indoors and outdoors Hoiem et

52 3D scene understanding from a single image Bao, Sun, Savarese, CVPR 2010; BMVC 2010; IJCV 2012 Coherent probabilistic model captures relationship between objects and supporting planes No assumptions on cameras Work both in indoors and outdoors Hoiem et al Gould et al. 09 Hedau et al. 09 Lee et al. 09, 10 Gupta et al, 10, 11 Tsai et al. 11

3D scene understanding from a single image Bao, Sun, Savarese, CVPR 2010; BMVC 2010; IJCV 2012 Coherent probabilistic model captures relationship between

53 3D scene understanding from a single image Bao, Sun, Savarese, CVPR 2010; BMVC 2010; IJCV 2012 Coherent probabilistic model captures relationship between objects and supporting planes No assumptions on cameras Work both in indoors and outdoors Hoiem et al Gould et al. 09 Hedau et al. 09 Gupta et al, 10, 11

54 In this lecture. 3D Object detectors Robust to view point transformation Allow to estimate pose, scale and 3D shape Methods for coherent object detection and scene layout estimation single view multi-view videos

55 3D scene understanding from multiple images Bao & S. Savarese, CVPR 2011 Bao, Bagra, Savarese. CORP ICCV 2011 Bao, Bagra, Chao, Savarese, CVPR 2012 Bao, Xiang, Savarese, ECCV 2012 Semantic Structure from Motion (SSFM) Measurements I Points (x,y,scale) Objects (x,y, scale, pose) Regions (x,y, pose) Model Parameters: Q = 3D points O = 3D objects B = 3D regions = cam. prm. K, R, T Huber 01 Rusu et al. 08 Brostow et al. 08 Son & Kim 10 Tang et al. 10 Adan et al. 11 etc

56 Semantic Structure from Motion (SSFM) Measurements I Points (x,y,scale) Objects (x,y, scale, pose) Regions (x,y, pose) Model Parameters: Q = 3D points O = 3D objects B = 3D regions = cam. prm. K, R, T

57 Factor graph Semantic Structure from Motion (SSFM) Y CO Y CQ Y CB Measurements I Points (x,y,scale) Objects (x,y, scale, pose) Regions (x,y, pose) Model Parameters: Q = 3D points O = 3D objects B = 3D regions = cam. prm. K, R, T

58 SSFM: point-level compatibility Y CQ Measurements I Points (x,y,scale) Objects (x,y, scale, pose) Regions (x,y, pose) Model Parameters: Q = 3D points O = 3D objects B = 3D regions = cam. prm. K, R, T

59 SSFM: point-level compatibility observation projection Point re-projection error Measurements I Points (x,y,scale) Objects (x,y, scale, pose) Regions (x,y, pose) Model Parameters: Q = 3D points O = 3D objects B = 3D regions = cam. prm. K, R, T Tomasi & Kanade 92 Triggs et al 99 Soatto & Perona 99 Hartley & Zisserman 00 Dellaert et al. 00 Pollefeys & V. Gool 02 Nister 04 Brown & Lowe 07 Snavely et al. 08

60 SSFM: Object-level compatibility Y CO Object re-projection error Measurements I Points (x,y,scale) Objects (x,y, scale, pose) Regions (x,y, pose) Model Parameters: Q = 3D points O = 3D objects B = 3D regions = cam. prm. K, R, T

61 SSFM: Object-level compatibility Camera 1 Camera 2 Agreement with measurements is computed using position, pose and scale

62 SSFM: Object-level compatibility Class = car scale=1 pose= back A 3D object detector returns the confidence value (probability) that an object class c with scale s and pose p is found at x,y Savarese, Fei-Fei, ICCV 07 Savarese, Fei-Fei, ECCV 08 Su et al, ICCV 2009 Sun, et al, CVPR 2009 Sun et al, ECCV 2010 Yu & Savarese, CVPR 2012

SSFM: Object-level compatibility Class = car scale=3

value (probability) that an object class c with scale s

Savarese, Fei-Fei, ECCV 08 Su et al, ICCV 2009 Sun, et

63 SSFM: Object-level compatibility Class = car scale=3 pose= 3/4 A 3D object detector returns the confidence value (probability) that an object class c with scale s and pose p is found at x,y Savarese, Fei-Fei, ICCV 07 Savarese, Fei-Fei, ECCV 08 Su et al, ICCV 2009 Sun, et al, CVPR 2009 Sun et al, ECCV 2010 Yu & Savarese, CVPR 2012

64 SSFM: Object-level compatibility Class = car scale=1 pose= back Class = car scale=1 pose= 3/4 Camera 1 Camera 2 Efficiently implemented using a parallel computing architecture

65 SSFM: Region-level compatibility Y CB Region re-projection error Measurements I Points (x,y,scale) Objects (x,y, scale, pose) Regions (x,y, pose) Model Parameters: Q = 3D points O = 3D objects B = 3D regions = cam. prm. K, R, T

66 SSFM with interactions Bao, Bagra, Chao, Savarese CVPR 2012 Y QO Y CO Y OB Y CQ Y CB Y QB Measurements I Points (x,y,scale) Objects (x,y, scale, pose) Regions (x,y, pose) Model Parameters: Q = 3D points O = 3D objects B = 3D regions = cam. prm. K, R, T

(x,y,scale) Objects (x,y, scale, pose) Regions (x,y, pose) x

67 SSFM with interactions Bao, Bagra, Chao, Savarese CVPR 2012 Object-Point Interactions: x x Measurements I Points (x,y,scale) Objects (x,y, scale, pose) Regions (x,y, pose) x Model Parameters: Q = 3D points O = 3D objects B = 3D regions = cam. prm. K, R, T

68 SSFM with interactions Point-Region Interactions: x x x Measurements I Points (x,y,scale) Objects (x,y, scale, pose) Regions (x,y, pose) Model Parameters: Q = 3D points O = 3D objects B = 3D regions = cam. prm. K, R, T

69 SSFM with interactions Object-Region Interactions: Measurements I Points (x,y,scale) Objects (x,y, scale, pose) Regions (x,y, pose) Model Parameters: Q = 3D points O = 3D objects B = 3D regions = cam. prm. K, R, T

70 SSFM with interactions Object-Region Interactions: Measurements I Points (x,y,scale) Objects (x,y, scale, pose) Regions (x,y, pose) Model Parameters: Q = 3D points O = 3D objects B = 3D regions = cam. prm. K, R, T

71 Solving the SSFM problem Modified Markov Chain Monte Carlo (MCMC) sampling algorithm F. Dellaert, S. Seitz, S. Thrun, and C. Thorpe. Feature correspondence: A markov chain monte carlo approach. In NIPS, 2000 Initialization of the cameras, objects, and points are critical for the sampling Initialize configuration of cameras using: SFM consistency of object/region properties across views

Results Public Ford Campus Vision and LiDAR

72 Results Public Ford Campus Vision and LiDAR Dataset [Pandey et al, International Journal of Robotics Research, 2011] Object categories: Cars Ground truth depth provided by LiDAR In-house Office dataset Object categories: mugs, mice, keyboards Ground truth depth provided by Kinect In-house Street dataset Object categories: humans No ground truth depth available

73 Segmentation View N Detections View 1 Results Observations Joint reconstruction & recognition

75 Segmentation View N Detections View 1 Results Observations Joint reconstruction & recognition

77 View N Detections View 1 Results Observations Joint reconstruction & recognition SSFM Source code available!

78 FORD CAMPUS (cars) Object detection results Average precision in detecting objects (cars) in the 2D image DPM [1] SSFM (2011) with 2 views SSFM (2012) with 2 views SSFM (2012) with 4 views 54.5% 61.3% 62.8% 66.5% Accuracy in localizing objects in the 3D space (AP) Hoiem [2] SSFM [2011] SSFM [2012] FORD CAMPUS cars 21.4% 32.7% 43.1% OFFICE keyboards, mice, monitors 15.5% 20.2% 21.6% [1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2009 [2] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 2008.

79 Camera estimation results Camera parameter reconstruction errors Camera translation error SFM [1] SSFM (2011) SSFM (2012) FORD CAMPUS OFFICE STREET Camera rotation error SFM [1] SSFM (2011) <1 <1 <1 SSFM (2012) [1] N. Snavely, S. M. Seitz, and R. S. Szeliski. Modeling the world from internet photo collections. IJCV, (2), Nov. 2008

80 Results

81 e T Results

82 Source code available!

83 In this lecture. 3D Object detectors Robust to view point transformation Allow to estimate pose, scale and 3D shape Methods for coherent object detection and scene layout estimation single view multi-view videos

84 Joint 3D modeling and recognition from videos Monocular cameras Un-calibrated cameras Arbitrary motion Highly cluttered scenes Occlusion Background clutter Moving targets Wu et al 07, Breitenstein et al 09, Zhao et al 04 Ess et al 09 Choi & Shahid & Savarese, WMC 2010 Choi & Savarese, ECCV 2010

in 3D Easily add additional evidence 3d depth IMU, etc 5 frames/second!

85 Joint tracking and camera estimation Interest Points in 3D Tracked Interest Points Camera Parameters Pedestrian Detections Target Location in 3D Easily add additional evidence 3d depth IMU, etc 5 frames/second! Code available on line soon! Ω : set of state variables Χ : set of observations

87 Safe Driving Applications

88 Autonomous navigation

89 Conclusions Intelligent vision requires joint reconstruction - recognition Geometry provides critical contextual cues for robust recognition High level semantics help establish robust geometrical constraints for reconstruction Within a single view Across views High level semantics help scalability in reconstruction problems Fewer images are needed with wider baseline

90 EECS 442 Computer vision Hope you have enjoyed this class! Good luck with your projects & presentations!

Lecture 15 Visual recognition

Lecture 15 Visual recognition 3D object detection Introduction Single instance 3D object detectors Generic 3D object detectors Silvio Savarese Lecture 15-26-Feb-14 3D object detection Object: Building