Combining Monocular and Stereo Depth Cues

Fraser Cameron
December 16, 2005

Abstract

A lot of work has been done extracting depth from image sequences, and relatively little using only single images. Very little has been done merging the two. This paper describes the fusing of depth estimates from two images with monocular cues. It provides an overview of the stereo algorithm and the details of fusing the stereo range data with monocular image features.

1 Introduction

Recent work has been done on depth estimation from single images [1]. While this is exciting, it could benefit from existing stereo depth algorithms; in a real control system there would, after all, be streams of incoming images. This paper discusses first the stereo algorithm, then how to fuse stereo depths with the monocular features, and finally the results obtained so far.

There has been a great deal of work on calculating depth from two stereo images. Here we assume only knowledge of the camera calibration matrix and no relative rotation; everything else is estimated from correspondences. We use feature-based depth estimation techniques because of the availability of MATLAB code [2][3] and a perceived speed advantage. This is explained more fully in section 2.

Recent work by Saxena et al. has focused on depth estimation from monocular image cues. They use supervised learning to train a Markov Random Field. This work has also been applied to obstacle depth estimation for controlling a remote-control car at speed [4]; this second application specifically requires fast computation. The intention of this work is to begin combining monocular and stereo image cues into a fast computation, allowing improved depth estimation for dynamic control applications. We lay the groundwork for this in section 3. Due to time constraints we were only able to implement and test the stereo system; these details are provided in section 4. Finally, the paper closes with conclusions and notes on work in progress in section 5, and acknowledgements in section 6.

2 Stereo Depth Estimation

Depth estimation from image sequences in this paper is broken into five stages: feature detection and correlation, estimating the fundamental matrix, additional guided matching, depth estimation, and error estimation.

2.1 Feature Detection and Correlation

For correlation we want a relatively small and evenly distributed set of feature points, which we obtain by detecting corners. In this paper we use Harris corner detection, which thresholds a corner response computed from the eigenvalues of the local gradient covariance matrix in small image windows. The Harris algorithm also applies spatial non-maximal suppression to prevent detecting multiple corners where only one exists.

Kovesi provides correlation scripts for matching through monogenic phase or direct intensity correlation. Each searches for high window correlations in a range around each feature, and only matches where each feature selects the other are kept. Kovesi notes typically better performance for matching through monogenic phase, and potentially comparable speed. I have used monogenic phase matching here and regular intensity matching later, to get the benefits of both. (A code sketch of this detection-and-matching step follows the footnotes below.)

[1] Ashutosh Saxena, Sung H. Chung, and Andrew Y. Ng, Learning Depth from Single Monocular Images.
[2] P. D. Kovesi, MATLAB and Octave Functions for Computer Vision and Image Processing, http://www.csse.uwa.edu.au/~pk/research/matlabfns/
[3] A. Zisserman, MATLAB Functions for Multiple View Geometry, http://www.robots.ox.ac.uk/~vgg/hzbook/code/
[4] Jeff Michels, Ashutosh Saxena, and Andrew Y. Ng, High Speed Obstacle Avoidance using Monocular Vision and Reinforcement Learning.
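The implementation in this paper relies on Kovesi's MATLAB routines; purely as an illustration, the following Python/OpenCV sketch does the same kind of thing with plain intensity correlation (not monogenic phase): detect Harris corners in both images, then keep only mutual-best normalized cross-correlation matches. The window radius, search range, and detector settings are assumed values, not the ones used in the paper.

```python
# Illustrative sketch of Section 2.1 (not the paper's MATLAB code): Harris
# corners in both images, then intensity (NCC) window correlation where only
# mutual-best matches are kept. All parameter values are assumptions.
import cv2
import numpy as np

def harris_corners(gray, max_corners=500):
    # goodFeaturesToTrack with useHarrisDetector=True thresholds the Harris
    # response and applies non-maximal suppression via minDistance.
    pts = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=8,
                                  useHarrisDetector=True, k=0.04)
    return (pts.reshape(-1, 2).astype(int)
            if pts is not None else np.zeros((0, 2), dtype=int))

def window(gray, pt, r):
    # Fixed-size patch around a corner, or None if it falls off the image.
    x, y = pt
    if y - r < 0 or x - r < 0 or y + r + 1 > gray.shape[0] or x + r + 1 > gray.shape[1]:
        return None
    return gray[y - r:y + r + 1, x - r:x + r + 1].astype(np.float32)

def ncc(a, b):
    # Normalized cross-correlation of two equal-size patches.
    a, b = a - a.mean(), b - b.mean()
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / d) if d > 0 else -1.0

def mutual_matches(gray_l, gray_r, pts_l, pts_r, r=7, search=50):
    # Score left/right corner pairs within `search` pixels by NCC, then keep
    # only pairs in which each feature selects the other ("dual" matching).
    if len(pts_l) == 0 or len(pts_r) == 0:
        return []
    scores = np.full((len(pts_l), len(pts_r)), -np.inf)
    for i, pl in enumerate(pts_l):
        wl = window(gray_l, pl, r)
        if wl is None:
            continue
        for j, pr in enumerate(pts_r):
            if np.linalg.norm(pl - pr) > search:
                continue
            wr = window(gray_r, pr, r)
            if wr is not None:
                scores[i, j] = ncc(wl, wr)
    matches = []
    for i in range(len(pts_l)):
        j = int(np.argmax(scores[i]))
        if np.isfinite(scores[i, j]) and int(np.argmax(scores[:, j])) == i:
            matches.append((pts_l[i], pts_r[j]))  # mutual best match only
    return matches
```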

2.2 Fundamental Matrix

From the correlated matches we can generate the fundamental matrix, F, for the two images. F encodes the pixel locations of the projections of each camera centre onto the other image plane (the epipoles), as well as the rotation between the two images. The defining equation is

    p_r^T F p_l = 0    (1)

where p_r and p_l are the pixel coordinates of a match in the right and left image respectively. Here we use the Random Sample Consensus (RANSAC) algorithm, which iteratively computes an F from a random sample of matches and evaluates how many matches agree; the eight-point algorithm is used to construct F. To evaluate whether or not a match fits the current estimate, RANSAC simply evaluates (1) and applies a pre-determined threshold, and it stores the details of the best candidate found. The algorithm terminates when it is 99% sure that it has chosen a random set of matches containing no outliers; the probability of picking an outlier is set from the fraction of matches deemed inliers for the best candidate. Since RANSAC classifies correlations into inliers and outliers, we simply discard the outlying matches.

The RANSAC distance function ignores the direction of the match, and it ignores large angular deviations from the epipolar line for matches with very similar pixel coordinates. While these checks could be included in the distance function, they are applied afterwards, to avoid high computation costs inside the iterative RANSAC loop. This implementation assumes no relative rotation, since most control algorithms using image streams will see small relative rotations and will be concerned primarily with nearby objects. This leads to faster and more consistent RANSAC convergence.

2.3 Additional Guided Matches

Now that we have determined F, we have much more information about where matching features should be; indeed the search reduces to a one-dimensional search along the epipolar line. Further, since we have only translation, we can determine whether a matching feature should be closer to or farther from the epipole. We use this information to perform another dual correlation search, and we perform several rounds of matching, removing successfully matched points at the end of each round, to obtain as many correlations as possible.
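The paper uses the Kovesi/Zisserman MATLAB code for this step; as a rough illustration of Sections 2.2 and 2.3, the sketch below estimates F with RANSAC and the eight-point algorithm via OpenCV, and computes the epipolar lines along which the guided matching can run its one-dimensional search. The inlier threshold and confidence level are assumed values.

```python
# Illustrative sketch of Sections 2.2-2.3 (OpenCV stand-in for the MATLAB code
# used in the paper): RANSAC + eight-point estimation of F, then epipolar
# lines for a guided, one-dimensional correlation search.
import cv2
import numpy as np

def estimate_fundamental(matches):
    # matches: list of (pt_left, pt_right) pixel coordinates from Section 2.1.
    pts_l = np.float32([m[0] for m in matches])
    pts_r = np.float32([m[1] for m in matches])
    # With (pts_l, pts_r) in this order, OpenCV returns F satisfying
    # p_r^T F p_l = 0, matching equation (1).
    F, inlier_mask = cv2.findFundamentalMat(pts_l, pts_r,
                                            method=cv2.FM_RANSAC,
                                            ransacReprojThreshold=1.0,
                                            confidence=0.99)
    inliers = inlier_mask.ravel().astype(bool)
    return F, pts_l[inliers], pts_r[inliers]

def epipolar_lines_in_right(F, pts_l):
    # For guided matching: each left point p_l constrains its right-image
    # match to the line F p_l (a*x + b*y + c = 0), so the search is 1-D.
    lines = cv2.computeCorrespondEpilines(pts_l.reshape(-1, 1, 2), 1, F)
    return lines.reshape(-1, 3)
```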
2.4 Depth Estimation

With the camera calibration matrix, M, and the F found above, one can obtain the essential matrix, E, which encodes the relative rotation and position of the two cameras. Not knowing how far or in what direction we have moved leaves a scale-factor ambiguity in E, so we work with the normalized

    E = \frac{M^T F M}{\sqrt{\operatorname{tr}\!\left((M^T F M)^T (M^T F M)\right)/2}}    (2)

Given that we have assumed pure translational motion, E takes the skew-symmetric form

    E = \begin{bmatrix} 0 & -\hat{T}_z & \hat{T}_y \\ \hat{T}_z & 0 & -\hat{T}_x \\ -\hat{T}_y & \hat{T}_x & 0 \end{bmatrix}    (3)

where \hat{T} is the normalization of the unknown translation vector. It is worth noting that [\hat{T}_1/\hat{T}_3,\ \hat{T}_2/\hat{T}_3,\ 1]^T is the projection of the epipole on the image plane.

Using M and \hat{T}, one can estimate the depth of points by triangulating their position using

    Z_l = \frac{\hat{T}_1 - x_r \hat{T}_3}{x_l - x_r}    (4)

Here x_r and x_l are the first coordinates of the matched points expressed in the right and left camera frames respectively, following the convention that the third component of these normalized coordinates is 1; they are obtained from the pixel coordinates via the calibration matrix, x_r = M^{-1} p_r (and likewise for the left image). Since we have assumed R = I_3, the only remaining ambiguity is the sign of \hat{T}, which simply flips the sign of Z_l. In cases where x_r - x_l is too small, we substitute y_r and y_l instead.
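A minimal NumPy sketch of equations (2)-(4) as reconstructed above, under the same pure-translation assumption; the calibration matrix M and the small-denominator threshold eps are assumed inputs.

```python
# Illustrative sketch of Section 2.4: normalize E, read off the translation
# direction from its skew-symmetric form, and triangulate depths.
# Assumes pure translation (R = I); M is the camera calibration matrix.
import numpy as np

def essential_from_fundamental(F, M):
    # Equation (2): E = M^T F M, scaled so that tr(E^T E) = 2.
    E = M.T @ F @ M
    return E / np.sqrt(np.trace(E.T @ E) / 2.0)

def translation_from_essential(E):
    # Equation (3): for pure translation E = [T]_x, so T_hat can be read off
    # the skew-symmetric entries (averaging the two copies of each component).
    tx = 0.5 * (E[2, 1] - E[1, 2])
    ty = 0.5 * (E[0, 2] - E[2, 0])
    tz = 0.5 * (E[1, 0] - E[0, 1])
    return np.array([tx, ty, tz])

def triangulate_depths(p_l, p_r, M, T_hat, eps=1e-3):
    # Equation (4): Z_l = (T1 - x_r*T3) / (x_l - x_r) in normalized
    # coordinates x = M^{-1} p; when x_l - x_r is too small, fall back to the
    # y coordinates, as described in the text.
    Minv = np.linalg.inv(M)
    depths = []
    for pl, pr in zip(p_l, p_r):
        xl = Minv @ np.array([pl[0], pl[1], 1.0])
        xr = Minv @ np.array([pr[0], pr[1], 1.0])
        xl, xr = xl / xl[2], xr / xr[2]
        if abs(xl[0] - xr[0]) > eps:
            z = (T_hat[0] - xr[0] * T_hat[2]) / (xl[0] - xr[0])
        else:
            z = (T_hat[1] - xr[1] * T_hat[2]) / (xl[1] - xr[1])
        depths.append(z)
    return np.array(depths)
```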

2.5 Error Estimation

To estimate the error of this stereo algorithm, a series of photographs and laser depth scans were taken around campus with 2-foot separations, and the depths calculated from the picture pairs are compared against the laser range depths. There are a number of sources of error in this comparison:

1. The laser depths are taken slightly misaligned from the camera. This misalignment can very easily cause large errors, because the detected features often sit on the edge of large step changes in depth. An effort has been made to select pictures with many features in regions of consistent depth.

2. Structured scenes offer relatively fewer corners than unstructured scenes, which increases the correlation accuracy, since there are simply fewer wrong choices to confuse the matching algorithm. In extreme cases, however, there were not enough matches to estimate F, leading to no depth data at all.

3. The two pictures of each pair were taken about 5 minutes apart, near sundown, so some scenes have different lighting conditions. We removed the scenes with very significant lighting changes, but some lighting differences remain. Also, some objects (people and bikes) moved between pictures.

4. The laser depths and stereo depths have different ranges. This leads to large errors when stereo depths are well beyond the maximum laser depth.

5. Metal and windows can create spurious features from reflected light. They can also corrupt laser depth readings by not reflecting the laser back.

3 Fusing with Monocular Cues

Saxena et al.'s Learning Depth from Single Monocular Images uses a Markov Random Field to model the depth relations among image patches; they then maximize the probability of the depths over several parameters given the dataset. While they use both Laplacian and Gaussian distributions, we concern ourselves only with the Gaussian distribution here. Further, they use multiple scales, leading to several layers of distance parameters; only the base-scale distances are independent, as the lower-resolution copies are constrained to be averages.

We extend their Gaussian model by adding a Gaussian penalty on the distance from the stereo depths. The model is

    P(d \mid X; \theta, \sigma) = \frac{1}{Z}
      \exp\!\left(-\sum_{i \in SFM} \frac{(d_i(1) - d_m - d_{SFM,i})^2}{2\sigma_{SFM}^2}\right)
      \exp\!\left(-\sum_{i=1}^{M} \frac{(d_i(1) - x_i^T \theta_r)^2}{2\sigma_{1r}^2}\right)
      \exp\!\left(-\sum_{s=1}^{3}\sum_{i=1}^{M}\sum_{j \in N_s(i)} \frac{(d_i(s) - d_j(s))^2}{2\sigma_{2rs}^2}\right)    (5)

Here d_i(s) is the distance of patch i at scale s; SFM is the set of points provided by the stereo algorithm; d_{SFM,i} is the distance provided by the stereo algorithm for point i; d_m = \frac{1}{|SFM|}\sum_{i \in SFM} d_i is the mean distance over those points (accounting for the scale ambiguity of the stereo depths); and \sigma_{SFM} is an empirically determined quantity expressing the error in the stereo measurements. All other parameters are as described in Saxena et al.'s work.

If we put this into the standard Gaussian form, \exp\!\left(-(d - \mu)^T \Sigma^{-1} (d - \mu)\right), the maximum likelihood estimate for d is \mu. Expanding and collecting the terms quadratic in d gives

    \Sigma^{-1} = \Sigma_{SFM}^{-1} + \Sigma_{1r}^{-1} + \Sigma_{2rs}^{-1}    (6)

where \Sigma_{SFM}, \Sigma_{1r}, and \Sigma_{2rs} are known empirically. \mu can then be found by iteratively fitting it to minimize the gap between the two terms of first order in d. This, combined with the current monocular techniques, gives a Gaussian model for each d_i, and thus the maximum likelihood estimate.
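Restricting to the base scale for brevity, the Gaussian fusion can be sketched as follows: every term in (5) is quadratic in d, so the precision matrices add as in (6) and the MAP depths solve one linear system. The helper below is illustrative only; in particular it approximates d_m by the mean monocular prediction over the stereo patches rather than fitting it iteratively as described above, and all inputs (features, learned weights, neighbor structure, sigmas) are assumed to come from the monocular model and the stereo algorithm.

```python
# Illustrative sketch of the Gaussian fusion in Section 3, base scale only:
# the monocular term, the pairwise smoothness term, and the added stereo (SFM)
# term are all quadratic in d, so their precisions add and the MAP depths
# solve a single linear system. Not the paper's iterative procedure.
import numpy as np

def fuse_depths(X, theta_r, neighbor_pairs, sfm_idx, d_sfm,
                sigma_1r, sigma_2r, sigma_sfm):
    # X: (M, k) patch features; theta_r: (k,) learned monocular weights.
    # neighbor_pairs: list of (i, j) adjacent patches at the base scale.
    # sfm_idx: indices of patches with stereo depths d_sfm (same length).
    M = X.shape[0]
    Lambda = np.zeros((M, M))          # precision matrix, Sigma^{-1}
    eta = np.zeros(M)                  # linear term, Sigma^{-1} mu

    # Monocular term: (d_i - x_i^T theta_r)^2 / (2 sigma_1r^2).
    mono = X @ theta_r
    Lambda[np.diag_indices(M)] += 1.0 / sigma_1r**2
    eta += mono / sigma_1r**2

    # Smoothness term: (d_i - d_j)^2 / (2 sigma_2r^2) for neighboring patches.
    for i, j in neighbor_pairs:
        w = 1.0 / sigma_2r**2
        Lambda[i, i] += w
        Lambda[j, j] += w
        Lambda[i, j] -= w
        Lambda[j, i] -= w

    # Stereo term: (d_i - d_m - d_SFM,i)^2 / (2 sigma_sfm^2). Here d_m is
    # approximated by the mean monocular prediction over the SFM patches
    # (a simplification of the paper's iterative fit for the scale offset).
    d_m = mono[sfm_idx].mean()
    for i, ds in zip(sfm_idx, d_sfm):
        Lambda[i, i] += 1.0 / sigma_sfm**2
        eta[i] += (d_m + ds) / sigma_sfm**2

    return np.linalg.solve(Lambda, eta)  # MAP / posterior-mean depths
```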
4 Results

Figure 1 shows an image in various stages of processing. For unstructured environments there are an enormous number of corners, leading to many correlations. These are then whittled down by estimating F, and converted into depths, shown below. Stereo depth algorithms will never assign a depth to the sky, as it has no corners.

Figure 1: (a) base left image; (b) potential matches from correlation; (c) right camera image with inlying matches (crosses indicate corners in the right image).

Figure 2: (a) estimated depth from features; (b) laser range scan data.

Figure 2 shows the estimated depth on the left and the true depth on the right.

Figure 3: (a) base left image; (b) potential matches from correlation; (c) right camera image with inlying matches (crosses indicate corners in the right image).

Figure 3 shows a structured scene. Here we see the effect of glass, which causes the algorithm to report the depth of a reflection in the top right, and we see the error inherent in extremely regular textures. For a selected set of structured images, in which a large proportion of the features lie on surfaces rather than on depth edges, this stereo algorithm achieves a correlation of 0.8606 between true log depth and calculated log depth, with a standard deviation of 0.4472. For reference, a constant guess of the mean produces a standard deviation of 0.8773. Due to the scale ambiguity in our range estimates, the calculated depths were shifted to have the same mean as the laser depths.
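For concreteness, the comparison just described amounts to something like the following (an illustrative sketch, not the actual evaluation script; it assumes the mean shift is applied in log-depth space):

```python
# Illustrative sketch of the evaluation in Section 4: align stereo and laser
# depths in log space to remove the global scale ambiguity, then report the
# correlation and residual standard deviation.
import numpy as np

def evaluate_log_depths(stereo_depths, laser_depths):
    log_s = np.log(np.asarray(stereo_depths, dtype=float))
    log_l = np.log(np.asarray(laser_depths, dtype=float))
    log_s += log_l.mean() - log_s.mean()         # shift to the same mean
    corr = np.corrcoef(log_s, log_l)[0, 1]       # compare with 0.8606 above
    resid_std = (log_s - log_l).std()            # compare with 0.4472 above
    baseline_std = (log_l - log_l.mean()).std()  # constant-guess reference
    return corr, resid_std, baseline_std
```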

5 Conclusion

Image manipulation and 3-D reconstruction are not trivial. Indeed, although some code existed, I needed to write a substantial amount of the process myself; in particular, there were numerous spurious matches that needed to be pruned. The stereo algorithm that resulted works, but would benefit from some fine tuning and speed enhancements. For structured scenes it is more accurate but, because it finds fewer features, less informative.

Comparing the calculated range data to laser range scans requires significant processing to ensure that the limited ranges and misalignments do not destroy the results; I could easily obtain near-zero correlations through the right choice of scene. Clearly more work needs to be done. The to-do list includes using the camera calibration to undistort the images, adaptive distance limits for the guided matches, automatically selecting features insensitive to laser scanner/camera misalignment for the error measures, algorithm speed improvements, and of course complete integration with the monocular features.

6 Acknowledgments

I would like to thank Prof. Ng for the loan of his copy of Introductory Techniques for 3-D Computer Vision by Trucco & Verri. Ashutosh Saxena was also quite generous with his time. Further, I used a large number of online resources, mainly the sample chapter from Andrew Zisserman's Multiple View Geometry in Computer Vision entitled "Epipolar Geometry and the Fundamental Matrix."