The Three R s of Vision

Size: px

Start display at page:

Download "The Three R s of Vision"

Janice Hutchinson
6 years ago
Views:

1 The Three R s of Vision Recognition Reconstruction Reorganization Jitendra Malik UC Berkeley

2 Recognition, Reconstruction & Reorganization Recognition Reconstruction Reorganization

3 Fifty years of computer vision s: Beginnings in artificial intelligence, image processing and pattern recognition 1970s: Foundational work on image formation: Horn, Koenderink, Longuet-Higgins 1980s: Vision as applied mathematics: geometry, multi-scale analysis, probabilistic modeling, control theory, optimization 1990s: Geometric analysis largely completed, vision meets graphics, statistical learning approaches resurface 2000s: Significant advances in visual recognition, range of practical applications

4 Different aspects of vision Perception: study the laws of seeing -predict what a human would perceive in an image. Neuroscience: understand the mechanisms in the retina and the brain Function: how laws of optics, and the statistics of the world we live in, make certain interpretations of an image more likely to be valid The match between human and computer vision is strongest at the level of function, but since typically the results of computer vision are meant to be conveyed to humans makes it useful to be consistent with human perception. Neuroscience is a source of ideas but being bio-mimetic is not a requirement.

5 The Three R s of Vision Recognition Reconstruction Reorganization

6 The Three R s of Vision Recognition Reconstruction Reorganization Each of the 6 directed arcs in this diagram is a useful direction of information flow

7 Review Reconstruction Feature matching + multiple view geometry has led to city scale point cloud reconstructions Recognition 2D problems such as handwriting recognition, face detection successfully fielded in applications. Partial progress on 3d object category recognition Reorganization Progress on bottom-up segmentation hitting diminishing returns Semantic segmentation is the key problem now

8 Image-based Modeling Façade (1996) Debevec, Taylor & Malik Acquire photographs Recover geometry (explicit or implicit) Texture map

9 Campus Model of UC Berkeley Campanile + 40 Buildings (Debevec et al, 1997)

10 Arc de Triomphe

11 The Taj Mahal Taj Mahal modeled from one photograph by G. Borshukov

12 State of the Art in Reconstruction Multiple photographs Range Sensors Agarwal et al (2010) Frahm et al, (2010) Kinect (PrimeSense) Velodyne Lidar Semantic Segmentation is needed to make this more useful

13 Shape, Albedo, and Illumination from Shading Jonathan Barron Jitendra Malik UC Berkeley

14 Forward Optics Far Near shape / depth

15 Forward Optics Far Near shape / depth illumination

16 Forward Optics Far Near shape / depth log-shading image of Z and L illumination

17 Forward Optics Far Near shape / depth log-shading image of Z and L illumination log-albedo / log-reflectance

18 Forward Optics Far Near shape / depth log-shading image of Z and L illumination log-albedo / log-reflectance Lambertian reflectance in log-intensity

19 Shape, Albedo, and Illumination from Shading SAIFS ( safes ) Far??? Near shape / depth log-shading image of Z and L illumination? log-albedo / log-reflectance Lambertian reflectance in log-intensity

20 Problem Formulation: Known Lighting??? Find the most likely explanation (shape Z and log-albedo A) that together exactly reconstructs log-image I, given rendering engine S() and known illumination L.

21 Demo!

22 What do we know about reflectance? 1) Piecewise smooth (variation is small and sparse) 2) Palette is small (distribution is low-entropy) 3) Some colors are common (maximize likelihood under density model)

23 Reflectance: Absolute Color

24 What do we know about shapes? 1) Piecewise smooth (variation in mean curvature is small and sparse) 2) Face outward at the occluding contour 3) Tend to be fronto-parallel (slant tends to be small)

25 Evaluation: Real World Images

26 Evaluation: Real World Images

27 Recognition helps reconstruction Blanz & Vetter (1999) Geometric Context (Hoiem, Efros, Hebert) for outdoor scenes; recent work on rooms (CMU, UIUC) is another example

28 The Three R s of Vision Recognition Reconstruction Reorganization

29 Caltech-101 [Fei-Fei et al. 04] 102 classes, images/class

30 Caltech 101 classification results (even better by combining cues..)

31 Texton Histogram Model for Recognition (Leung & Malik, 1999) cf. Bag of Words ough Plastic Pebbles Plaster-b Terrycloth ICCV '99, Corfu, Greece

32 Lazebnik, Schmid & Ponce (2006) They proposed using vector-quantized SIFT descriptors as words

33 PASCAL Visual Object Challenge (Everingham et al)

36 A good building block is a linear SVM trained on HOG features (Dalal & Triggs)

38 AP=0.23

40 Problems with current recognition approaches Performance is quite poor compared to that at 2d recognition tasks and the needs of many applications. Pose Estimation / Localization of parts or keypoints is even worse. We can t isolate decent stick figures from radiance images, making use of depth data necessary. Progress has slowed down. Variations of HOG/Deformable part models dominate.

41 Principal Component 2 PCA Results on APs of 20 VOC classes 1 BROOKES(11) 0.8 UoCTTI(09) UoCTTI(10) 0.6 UoCTTI(11) UMNECUIUC(10) MISSOURI(11) NUS(10) NLPR(10) NLPR(11) CORNELL(11) OXFORD(09) BONN_FGT(10) BONN_SVR(10) NECUIUC(09) UCLA(10) NUS(11) OXFORD(11) UCLA(11) -0.8 UVA(11) -1 FGBG(10) UVA(10) Principal Component 1

42 Next steps in recognition Richer features than SIFT/HOG (deep learning?) Incorporate the shape bias known from child development literature to improve generalization This requires monocular computation of shape, as once posited in the 2.5D sketch, and distinguishing albedo and illumination changes from geometric contours Top down templates should predict keypoint locations and image support, not just information about category Recognition and figure-ground inference need to coevolve. Occlusion is signal, not noise.

44 High-Level Computer Vision

45 High-Level Computer Vision person van person Object Recognition dog

46 High-Level Computer Vision person van person Object Recognition Semantic Segmentation dog

47 High-Level Computer Vision In a back view Facing the camera Object Recognition Semantic Segmentation Pose Estimation Facing back, head to the right

48 High-Level Computer Vision Walking away talking Object Recognition Semantic Segmentation Pose Estimation Action Recognition

49 High-Level Computer Vision Man with glasses and a coat blue GMC van Entlebucher mountain dog elderly white man with a baseball hat Object Recognition Semantic Segmentation Pose Estimation Action Recognition Attribute Classification

High-Level Computer Vision A man with glasses and a coat, facing back, walking away A blue GMC van parked, in a back view An elderly man with a hat and glasses, facing

50 High-Level Computer Vision A man with glasses and a coat, facing back, walking away A blue GMC van parked, in a back view An elderly man with a hat and glasses, facing the camera and talking Object Recognition Semantic Segmentation Pose Estimation Action Recognition Attribute Classification An entlebucher mountain dog sitting in a bag

51 Trying to extract stick figures is hard (and unnecessary!) Generalized cylinders (Marr & Nishihara, Binford) Pictorial Structures (Felszenswalb & Huttenlocher)

52 All the wrong limbs

53 Motivation

54 Face Detection Carnegie Mellon University ous images submitted to the CMU on-li

55 Examples of poselets (Bourdev & Malik, 2009) Patches are often far visually, but they are close semantically

56 How do we train a poselet for a given pose configuration?

57 Finding Correspondences Given part of a human pose How do we find a similar pose configuration in the training set?

58 Finding Correspondences Left Shoulder Left Hip We use keypoints to annotate the joints, eyes, nose, etc. of people

59 Finding Correspondences Residual Error

60 Training poselet classifiers Residual Error: Given a seed patch 2. Find the closest patch for every other person 3. Sort them by residual error 4. Threshold them

61 Male or female?

62 How do we train attribute classifiers in the wild? Effective prediction requires inferring the pose and camera view Pose reconstruction is itself a hard problem, but we don t need perfect solution. We train attribute classifiers for each poselet Poselets implicitly decompose the pose

63 Gender classifier per poselet is much easier to train

64 Is male

65 Has long hair

66 Wears a hat

67 Wears glasses

68 Wears long pants

69 Wears long sleeves

70 Some discriminative poselets (Maji et al)

71 Armlets (Gkioxari et al, CVPR 2013)

72 Yang & Ramanan Our method Multiple Instances Right Arm Left Arm

73 Results Results of Augmented Armlets and Comparison with baseline [1] PCP Yang & Ramanan [1] Our model R_UpperArm R_Lower Arm L_Upper Arm L_Lower Arm Average [1] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. CVPR, 2011

74 The Three R s of Vision Recognition Reconstruction Reorganization

75 Berkeley Segmentation DataSet [BSDS] D. Martin, C. Fowlkes, D. Tal, J. Malik. "A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics", ICCV,

(2001), Boykov, Veksler & Zabih(2001) Arbelaez et al (2009), Martin, Fowlkes,

76 State of the Art in Reorganization Interactive segmentation using graph cuts Berkeley gpb edges & regions Rother, Kolmogorov & Blake (2004), Boykov & Jolly (2001), Boykov, Veksler & Zabih(2001) Arbelaez et al (2009), Martin, Fowlkes, Malik (2004), Shi & Malik (2000) We may be hitting the limits of bottom-up segmentation

77 What boundaries do you see?

78 Motion Boundaries ndberg et al, CVPR 2011; Brox & Malik, ECCV 201

79 Recognition Helps Reorganization

80 The Three R s of Vision Recognition Superpixel assemblies as candidates Reconstruction Reorganization

81 Semantic Segmentation using Regions and Parts P. Arbeláez, B. Hariharan, S. Gupta, C. Gu, L. Bourdev and J. Malik

82 This Work Top-down Part/Object Detectors Bottom-up Region Segmentation Cat Segmenter

83 Results on PASCAL VOC

84 Perceptual Robotics Using RGBD images to semantically parse scenes S. Gupta, P. Arbeláez & J. Malik (CVPR 2013)

Semantic Segmentation Compute features on

Classifier Color Image Bottom Up Segmentation into

blue is close, orange is far Normal Image

85 Using RGBD Images to Semantically Parse Scenes Input From Kinect-like depth sensors Reorganization Semantic Segmentation Compute features on superpixels, classify using SVMs as classifiers SVM Classifier Color Image Bottom Up Segmentation into superpixels Depth Image visualized in pseudo color blue is close, orange is far Normal Image visualized in pseudo color blue are surfaces facing up Long Range Linking

86 Semantic Segmentation Super Pixel Classification Category Pr wall 0.90 Classifier IK SVM cabinet 0.05 window 0.05 chair 0.0 table

87 Semantic Segmentation Affordance Based Features Geocentric Pose Orientation Features Height above ground Size Features Spatial extent Surface Area Is clipped/occluded Shape Features Planarity Strength of local geometric gradients Use orientation with respect to gravity, heights above ground, actual sizes Category Specific Features Scores of one-versus-rest SVMs using histogram of Vector Quantized SIFT Geocentric Textons

88 Semantic Segmentation

89 Semantic Segmentation Aggregate Performance Category wise performance [NYU] Our [NYU] Our [NYU] Our wall picture floor counter cabinet blinds bed desk chair shelves sofa curtain table dresser door pillow window mirror bookshelf floor mat NYU [Silberman et al ECCV12] Indoor segmentation and support inference from RGBD images.

90 Semantic Segmentation Performance some more categories [NYU] Our [NYU] Our clothes ceiling books refrigerator television paper towel shower curtain box person night stand toilet sink lamp bathtub bag otherstructure otherfurniture whiteboard [NYU] Silberman et al, ECCV12, Indoor segmentation and support inference from RGBD images. otherprop

91 The Three R s of Vision Recognition Reconstruction Reorganization

92 Thank You

SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite Supplimentary Material

SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite Supplimentary Material Shuran Song Samuel P. Lichtenberg Jianxiong Xiao Princeton University http://rgbd.cs.princeton.edu. Segmetation Result wall