Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009


Traditional ways of inferring depth:
- Binocular disparity
- Structure from motion
- Defocus
Given a single monocular image, how can we infer absolute depth? Learn the relationship between image features and absolute depth.

From coarse to fine: mean depth (Paper #1), geometric stage (Paper #2), dense depth map (Paper #3).

A. Torralba and A. Oliva, Depth estimation from image structure, PAMI, 24(9):1-13, 2002. Goal: estimate the absolute mean depth of a scene based on scene structure. This paper inspired the invention of the GIST feature.

Scenes at different depths have different spatial structures:
- Panoramic views: uniform texture zones along horizontal layers
- Mid-range urban environments: dominant long horizontal and vertical contours and square patterns
- Close-up views of objects: large flat surfaces with no dominant orientation

Global spectral signature: defined as the amplitude spectrum of the entire image. Man-made and natural scenes have disparate global spectral signatures.

[Figure: specific example spectra, and the average amplitude spectra of man-made vs. natural scenes.]

Local spectral signature: defined as the magnitude of the output of Gabor wavelet filters. It not only differs between man-made and natural scenes, but is spatially non-stationary as well.

How do the global and local spectral signatures change with the mean depth of images? The answer to this question determines whether it is feasible to estimate mean depth from those spectral signatures.

[Figures: spectral signatures of man-made and natural scenes at increasing mean depths.]

Typical behavior of spectral signatures:
- Global roughness increases with depth for man-made structures
- Global roughness decreases with depth for natural structures
- The local spectral signature becomes increasingly non-stationary as depth gets large
- Orientations become more biased towards horizontal and vertical as depth increases

Global spectral signature: the amplitude spectrum is defined on a continuous 2-D frequency space. Discretize it by sampling at specific frequency magnitudes and orientations; in practice this is approximated by summing the energy of wavelet coefficients over the entire image at specific scales and orientations. The result is termed global energy, which encodes the dominant orientations and scales in the image.
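To make the global energy concrete, here is a minimal sketch, assuming numpy and simple Gabor-like frequency-domain masks; the paper's exact filter bank and sampling may differ:

```python
import numpy as np

def global_energy(image, n_scales=4, n_orients=6):
    """Sum spectral energy under Gabor-like frequency masks, one value per
    (scale, orientation) pair: a GIST-style global signature.
    Illustrative sketch only, not the paper's exact filters."""
    h, w = image.shape
    fy, fx = np.meshgrid(np.fft.fftfreq(h), np.fft.fftfreq(w), indexing="ij")
    radius = np.sqrt(fx**2 + fy**2) + 1e-12
    angle = np.arctan2(fy, fx)
    power = np.abs(np.fft.fft2(image))**2          # squared amplitude spectrum

    feats = []
    for s in range(n_scales):
        f0 = 0.25 / (2**s)                         # center frequency of this scale
        for o in range(n_orients):
            th = np.pi * o / n_orients             # orientation of this channel
            mask = np.exp(-(np.log(radius / f0))**2 / 0.5)   # radial band
            dth = np.angle(np.exp(1j * (angle - th)))
            mask *= np.exp(-dth**2 / 0.3)                    # angular band
            feats.append((power * mask).sum())
    return np.asarray(feats)                       # n_scales * n_orients values
```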

Global spectral signature (another variant): wavelet coefficients at different scales and orientations are correlated, so define the magnitude correlation between pairs of channels. Magnitude correlations encode the degree of clutter of edges and shapes.
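As a hedged sketch (the exact normalization used in the paper may differ), the magnitude correlation between two wavelet channels k and l can be written as:

```latex
% |v_k(x)|: magnitude of the k-th Gabor wavelet output at pixel x.
% One plausible form; the normalization is an assumption here.
C_{kl} = \sum_{x} \lvert v_k(x) \rvert \, \lvert v_l(x) \rvert
```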

Local spectral signature: the original local spectral signature has the same spatial resolution as the image, so downsample it to a coarse grid of pixels. The result is termed local energy, which encodes local scales and orientations.

The global energy, magnitude correlations, and local energy are concatenated into a high-dimensional feature vector; PCA is then performed to reduce the feature dimension.

Regress the mean depth on the feature vector. The regression function needs to be approximated, so take a generative approach: given a feature vector, first use a Gaussian Bayes classifier to determine whether the image belongs to the man-made or the natural scene group.

Given each scene group, model the joint distribution of the feature vector $\mathbf{x}$ and the mean depth $D$ as a two-level hierarchical mixture of Gaussians, of the form $p(\mathbf{x}, D) = \sum_k \pi_k \, \mathcal{N}([\mathbf{x}; D]; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ within each group.

Note that the mean of $D$ given $\mathbf{x}$ and cluster $k$ is a linear function of $\mathbf{x}$, suggesting that the ML estimation of the per-cluster regression parameters is equivalent to LS linear regression.

Use the EM algorithm to learn the mixture model: the E-step computes each cluster's posterior responsibility for every training image, and the M-step re-estimates the cluster parameters, which within each cluster reduces to weighted least-squares linear regression.
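For illustration, a simplified EM loop for a flat mixture of linear regressors; the paper uses a two-level hierarchical mixture, and all names here are illustrative:

```python
import numpy as np

def em_mixture_linear_regression(X, d, K=3, iters=50, seed=0):
    """EM for a K-component mixture of linear regressors (simplified
    stand-in for the paper's hierarchical mixture of Gaussians).
    X: (N, F) feature vectors; d: (N,) mean depths."""
    rng = np.random.default_rng(seed)
    N, F = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])           # append bias term
    W = rng.normal(scale=0.1, size=(K, F + 1))     # per-cluster regression weights
    sigma2 = np.full(K, d.var() + 1e-6)            # per-cluster noise variances
    pi = np.full(K, 1.0 / K)                       # mixing proportions

    for _ in range(iters):
        # E-step: responsibilities r[n, k] ~ pi_k * N(d_n | w_k^T x_n, sigma2_k)
        resid = d[:, None] - Xb @ W.T              # (N, K) residuals
        logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * sigma2)
                - 0.5 * resid**2 / sigma2)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: weighted least squares per cluster (the LS equivalence above)
        for k in range(K):
            w_n = r[:, k]
            A = Xb.T @ (w_n[:, None] * Xb) + 1e-6 * np.eye(F + 1)
            W[k] = np.linalg.solve(A, Xb.T @ (w_n * d))
            sigma2[k] = (w_n * (d - Xb @ W[k])**2).sum() / (w_n.sum() + 1e-12)
        pi = r.mean(axis=0)
    return W, sigma2, pi
```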

Having learned the model, the estimate of $D$ is given by a mixture of linear regressions, $\hat{D}(\mathbf{x}) = \sum_k p(k \mid \mathbf{x}) \, E[D \mid \mathbf{x}, k]$. The confidence of the estimate is given by the spread of the conditional distribution $p(D \mid \mathbf{x})$.

Ground truth generation:
- Divide the entire training data set into a man-made scene group and a natural scene group
- Sort the images in each group according to their mean depths
- Have humans estimate the mean depths of a portion of the images in each group
- Use a polynomial to fit the estimated mean depth as a function of the sorted rank

Using global energy features

Using local energy features

Comparison

Scene category recognition

Scale selection

V. Nedovic, A. W. M. Smeulders, A. Redert, and J. M. Geusebroek, Depth Information by Stage Classification, ICCV, 2007. Goal: estimate the geometric type of the global scene in an image.

Instead of directly regressing absolute depth, it may be helpful to first classify the image into one of relatively few 3D scene geometries (stages). This type of scene information narrows down the possible locations, scales, and identities of individual objects.

The image gradient has a different distribution at different depths, which makes it possible to use image gradient features to perform stage classification.

Dataset: keyframes of the 2006 TRECVID video benchmark. 1241 TRECVID keyframes are annotated into one of 15 stage categories; for each category, half are used for training and half for testing.

Evaluation: correct rate = (TP + TN) / (total number of images). For a class that occupies a fraction p of the dataset, the correct rate of a random guess is p/K + (1 - p)(K - 1)/K, where K is the total number of classes; for example, with K = 12 and p = 5%, a random guess already scores about 87%. When p is small, this measure is therefore overly optimistic.

[Diagram: when the class is rare, TN dominates, so (TP + TN) / (TP + TN + FP + FN) is pretty large even for a trivial classifier.]

Lower level of hierarchy (12 classes), chance correct rates per class: 86.42%, 85.75%, 84.42%, 85.50%, 82.71%, 86.33%, 87.08%, 84.17%, 82.71%, 85.50%, 80.75%, 85.50% (mean 84.74%).

My experiments: Stanford Range Image Dataset. Classify 271 images into 8 stage categories; for each stage, 2/3 of the images are used for training and 1/3 for testing.

Confusion matrix. Overall accuracy = (# all TP) / (# all images) = 24.71% (chance: 12.50%). Per-class recall:
Class   1       2       3       4       5      6       7       8
Recall  28.57%  40.00%  33.33%  18.18%  0.00%  26.32%  12.50%  0.00%

[Figure: some misclassified examples, shown with the stages they should have been classified as.]

What if we use multi-scale features? For each image, I extracted features from 6 levels of the corresponding pyramid and performed PCA to keep the feature dimension the same as in the paper (64). Result: the overall accuracy increases to 31.76%.
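A sketch of this pyramid-plus-PCA pipeline; `feat_fn` is a hypothetical stand-in for the actual per-level feature extractor, and the defaults are only illustrative:

```python
import numpy as np

def pyramid_features(image, levels=6, feat_fn=None):
    """Concatenate features from `levels` octaves of an image pyramid.
    `feat_fn` (illustrative) maps one pyramid level to a feature vector."""
    feat_fn = feat_fn or (lambda im: np.histogram(im, bins=16)[0].astype(float))
    feats = []
    for _ in range(levels):
        feats.append(feat_fn(image))
        h, w = (image.shape[0] // 2) * 2, (image.shape[1] // 2) * 2
        # 2x2 block averaging = simple downsample to the next octave
        image = image[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return np.concatenate(feats)

def pca_reduce(F, dim=64):
    """Project the feature matrix F (N, D) onto its top `dim` principal axes."""
    Fc = F - F.mean(axis=0)
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
    return Fc @ Vt[:dim].T
```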

What if we use Gist features? The Gist features also undergo PCA to keep their dimension at 64, still using the multi-class SVM. Result: the overall accuracy further increases to 40.00%.

Confusion matrix. Overall accuracy: 40.00% (chance: 12.50%). Per-class recall:
Class   1       2       3       4       5       6       7       8
Recall  28.57%  56.00%  66.67%  18.18%  22.22%  47.37%  37.50%  0.00%

What if we replace the multi-class SVM with simple k-NN? The accuracy turns out to be even better (42.35%). Fancy methods are not necessarily better.
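The k-NN baseline is trivially simple, which is part of the point; a minimal sketch on the PCA-reduced features:

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=5):
    """Plain k-nearest-neighbor classification by majority vote.
    train_y must be integer class labels."""
    # squared Euclidean distances between every test and training sample
    d2 = ((test_X[:, None, :] - train_X[None, :, :])**2).sum(axis=-1)
    nn = np.argsort(d2, axis=1)[:, :k]          # indices of the k nearest
    votes = train_y[nn]                         # (N_test, k) neighbor labels
    return np.array([np.bincount(v).argmax() for v in votes])
```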

A. Saxena, S. H. Chung, and A. Y. Ng, Learning Depth from Single Monocular Images, NIPS, 2005. Goal: recover an absolute depth value for each small patch of a single monocular image.

Texture intensity, edge directions, and haze look different at different depths, so they can be extracted as cues to infer depth. Inferring depth from the cues of an individual patch alone is not reliable, so incorporate the cues of nearby patches, using an MRF to enforce the constraints.

For each patch:
- Apply a filter bank on the intensity channel to extract texture energy
- Apply a low-pass filter on the two color channels to capture haze
- Apply another filter bank on the intensity channel to extract edge directions

The absolute and squared filter outputs are summed over all the pixels within the patch, so each filter yields two values. With 17 filters in total, the initial feature vector has dimension 34.
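A sketch of the texture-energy part of this feature vector, using the nine 3x3 Laws' masks; the full 17-filter bank also includes oriented edge filters and low-passed color channels, which are omitted here:

```python
import numpy as np
from scipy.signal import convolve2d

# 3x3 Laws' masks: outer products of the 1-D kernels
# L3 (average), E3 (edge), S3 (spot)
L3, E3, S3 = np.array([1, 2, 1.]), np.array([-1, 0, 1.]), np.array([-1, 2, -1.])
LAWS = [np.outer(a, b) for a in (L3, E3, S3) for b in (L3, E3, S3)]  # 9 masks

def texture_energy_features(intensity, patch):
    """Absolute-sum and squared-sum outputs of each mask over one patch,
    i.e. two values per filter. `patch` = (row0, row1, col0, col1)."""
    r0, r1, c0, c1 = patch
    feats = []
    for f in LAWS:
        out = convolve2d(intensity, f, mode="same")[r0:r1, c0:c1]
        feats += [np.abs(out).sum(), (out**2).sum()]
    return np.array(feats)   # 18 of the 34 dimensions
```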

To incorporate contextual information, cues from the four immediate neighboring patches are included, the process is repeated at three scales, and cues from the column the patch lies in are added. This gives 19 patch descriptors in total (the patch and its four neighbors at each of three scales, plus four column segments), for a final feature vector of 19 x 34 = 646 dimensions.

A second set of features is used to estimate how different the depths of two adjacent patches can be: measure the difference of the statistics of the two patches using the difference of their concatenated histograms of the 17 filter outputs, with 10 bins per histogram, yielding a dimension of 170.
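A sketch of this relative-depth feature, assuming the per-filter outputs of the two patches are already available; shared bin edges are used so the histogram difference is meaningful:

```python
import numpy as np

def relative_feature(outs_i, outs_j, bins=10):
    """Difference of concatenated per-filter histograms of two adjacent
    patches. outs_i, outs_j: lists of 17 filter-output arrays each."""
    feats = []
    for oi, oj in zip(outs_i, outs_j):
        lo = min(oi.min(), oj.min())
        hi = max(oi.max(), oj.max())
        h_i, _ = np.histogram(oi, bins=bins, range=(lo, hi))
        h_j, _ = np.histogram(oj, bins=bins, range=(lo, hi))
        feats.append(h_i - h_j)
    return np.concatenate(feats).astype(float)   # 17 * 10 = 170 dimensions
```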

The model, unary potential: a linear predictor maps the absolute depth features of each patch to its depth. Uncertainty is reflected by a variance parameter that depends on the absolute depth features of the patch. Different image rows have different model parameters (the camera is mounted horizontally).

The model, pairwise potential: encourages a smooth depth map. The strength of the smoothness constraint is controlled by a variance parameter determined by the appearance difference between neighboring patches (captured by the relative depth features). Again, different rows have different model parameters.

The model, pairwise potential, multi-scale: the pairwise terms are replicated at three spatial scales. At the smaller scale, appearances can be very different, allowing large discontinuities in depth; at the larger scale, appearances are similar, imposing a strong constraint on depth continuity.
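Putting the pieces together, a sketch of the Gaussian model in notation reconstructed from Saxena et al. (treat indexing details as assumptions):

```latex
% d_i(s): depth of patch i at scale s; x_i: absolute depth features;
% theta_r: weights for image row r; N_s(i): neighbors of patch i at scale s.
P_G(d \mid X; \theta, \sigma) = \frac{1}{Z_G} \exp\!\Bigg(
  - \sum_{i} \frac{\big(d_i(1) - x_i^{\top} \theta_r\big)^2}{2\sigma_{1r}^2}
  - \sum_{s=1}^{3} \sum_{i} \sum_{j \in N_s(i)}
      \frac{\big(d_i(s) - d_j(s)\big)^2}{2\sigma_{2rs}^2} \Bigg)
% The Laplacian variant replaces each squared term |.|^2 / (2 sigma^2)
% with an absolute term |.| / lambda.
```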

Learning parameters (Gaussian model): the unary weights pose a linear least-squares problem; the unary variance parameters are chosen to fit the expected value of the squared prediction error, and the pairwise variance parameters to fit the expected value of the squared depth difference between neighbors.

The Laplacian model: replace the squares with absolute values and the variance parameters with spread parameters. Learning: choose the unary weights to minimize the sum of absolute prediction errors; fit the unary spread parameters to the expected value of the absolute prediction error, and the pairwise spread parameters to the expected value of the absolute depth difference between neighbors.

Gaussian model: the log-likelihood is quadratic in the depths, so the MAP estimate is found in closed form. Laplacian model: use linear programming to find the MAP estimate.
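For the Gaussian model, the closed-form MAP estimate amounts to solving a linear system obtained by setting the gradient of the quadratic log-likelihood to zero; a minimal single-scale sketch with shared variances (all parameter names illustrative):

```python
import numpy as np

def gaussian_mrf_map(x_feats, theta, sigma1, sigma2, neighbors):
    """MAP depths for a single-scale Gaussian MRF with shared variances.
    x_feats: (n, F) patch features; theta: (F,) unary weights;
    neighbors: list of (i, j) index pairs."""
    n = x_feats.shape[0]
    A = np.eye(n) / sigma1**2            # unary precision terms
    b = (x_feats @ theta) / sigma1**2    # unary mean terms
    for i, j in neighbors:               # add graph-Laplacian smoothness terms
        A[i, i] += 1 / sigma2**2
        A[j, j] += 1 / sigma2**2
        A[i, j] -= 1 / sigma2**2
        A[j, i] -= 1 / sigma2**2
    # solves (I/sigma1^2 + L/sigma2^2) d = X theta / sigma1^2
    return np.linalg.solve(A, b)
```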

Ground truth depth maps were collected using a 3D laser scanner: 425 image + depth-map pairs, with 75% used for training and 25% for testing. All depths are transformed to a log scale before training.

Multiscale and column features help a lot, and the pairwise terms further improve performance.

The Laplacian model outperforms the Gaussian model: the empirical distribution of depths appears Laplacian, the heavier tail makes the model more robust to outliers and errors, and the Laplacian tends to model sharp transitions better.

Not only do we want to estimate the depth map, but we also want to infer the orientations of surfaces. Follow-up paper: A. Saxena, M. Sun, and A. Y. Ng, Make3D: Learning 3-D Scene Structure from a Single Still Image, PAMI, 2008.

How to relate surface orientation with depth? Plane equation: $\alpha^\top q = 1$, where $\alpha$ represents both the 3D location and orientation of the surface. Let $\hat{R}$ be the unit vector from the camera center to a point lying on the plane; this vector is known. The coordinates of the point are $q = d\hat{R}$; therefore $d = 1 / (\hat{R}^\top \alpha)$.

Modification of the model: replace image patches with superpixels. For each superpixel, extract the features mentioned in the previous paper, plus shape and location features. In the unary potential, relative depth error is used, weighted by the confidence of the linear predictor, which is itself learned as another linear function of the superpixel features.

Modification of the model: in the pairwise potential, a pair of surfaces is penalized for deviating from
- being connected,
- being co-planar,
- being co-linear.
The strength of each penalty differs under different conditions: when there is evidence that two superpixels are separated by an occlusion boundary, the first two penalties are relaxed accordingly; when there is evidence that two superpixels lie along a common long straight line, the third penalty is imposed accordingly.

Quantitative comparison: spatial support is important, and geometric constraints help, especially qualitatively.

Qualitative comparison

More impressive results