Dense 3-D Reconstruction of an Outdoor Scene by Hundreds-baseline Stereo Using a Hand-held Video Camera

Tomokazu Sato†, Masayuki Kanbara†, Naokazu Yokoya† and Haruo Takemura‡
†Graduate School of Information Science, Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara 630-0101, Japan
{tomoka-s, masay-ka, yokoya}@is.aist-nara.ac.jp
‡Cybermedia Center, Osaka University

Abstract

Three-dimensional (3-D) models of outdoor scenes are widely used for object recognition, navigation, mixed reality, and so on. Because such models are often made manually at high cost, automatic and dense 3-D reconstruction has been widely investigated. In related work, a dense 3-D model is generated by using a stereo method. However, these methods cannot use several hundred images together for dense depth estimation because it is difficult to calibrate a large number of cameras accurately. In this paper, we propose a dense 3-D reconstruction method that first estimates the extrinsic camera parameters of a hand-held video camera and then reconstructs a dense 3-D model of the scene. We can acquire an accurate model of the scene by using several hundred input images.

1. Introduction

Three-dimensional (3-D) models of outdoor scenes are widely used for object recognition, navigation, mixed reality, and so on. Because such models are often made manually at high cost, automatic and dense 3-D reconstruction is desired. In the field of computer vision, there has been much research on reconstructing 3-D models from multiple images [1]. One of the major approaches to 3-D reconstruction is to use static stereo. However, such methods cannot use a large number of images because it is difficult to calibrate a large number of cameras accurately; as a result, they become sensitive to noise. Although many researchers use a constraint of surface continuity to reduce noise, such an approach limits the target scene and may sometimes reduce the accuracy of the reconstruction.

Another approach is to use an image sequence. Methods that use an image sequence can automatically estimate camera parameters and the 3-D positions of natural features by tracking features in the captured images. The factorization algorithm [2] is one of the well-known methods that can estimate a rough 3-D scene stably and efficiently from an image sequence by assuming an affine camera model. However, when the underlying scene is not suitable for the affine camera model, the estimated camera parameters are not reliable; therefore, this method is not suitable for reconstructing a dense 3-D model by a stereo method. Although there exists another reconstruction method [3] that estimates camera parameters and a dense 3-D model from an image sequence, it uses only a small number of images. Additionally, it appears difficult for it to reconstruct a complex outdoor scene, because it relies on the constraint of surface continuity in dense depth estimation.

In order to reconstruct an outdoor scene densely and stably, we propose a new 3-D reconstruction method that first estimates the extrinsic camera parameters of an input image sequence and then reconstructs a dense model of the scene. In the first process, we use a camera parameter estimation method [4] based on tracking both markers and natural features. Next, dense depth maps are computed from hundreds of images by a multi-baseline stereo method. Finally, the depth maps are combined together in a voxel space.
The proposed method can reconstruct a complex outdoor scene densely and accurately by combining several hundred images of a long sequence without the constraint of surface continuity.

This paper is structured as follows. Section 2 describes a method of estimating the camera parameters of a hand-held video camera by tracking markers and natural features. In Section 3, we describe a method of dense depth estimation and the integration of the dense data in a voxel space. We then demonstrate experimental results of 3-D reconstruction from real image sequences in Section 4 to show the feasibility of the proposed method. Finally, Section 5 gives conclusions and future work.

2. Camera parameter estimation by tracking features

This section describes an extrinsic camera parameter estimation method based on tracking features (markers and natural features). Figure 1 shows the flow diagram of our algorithm.

[Figure 1: Flow diagram of camera parameter estimation — marker specification at the first frame; (A) initial camera parameter estimation: (a) marker and natural feature tracking, (b) extrinsic camera parameter estimation, (c) 3-D position estimation of natural features, (d) computing confidence of features, (e) adding and deleting natural features; (B) error minimization over the whole input.]

First, we must specify the positions of six or more markers in the first frame of the input sequence to estimate the extrinsic camera parameters in the first frame. Initial extrinsic camera parameters in all the frames are then determined by iterating the processes (A) at each frame. However, estimation errors accumulate; therefore, the extrinsic camera parameters are refined by minimizing the accumulated estimation error over the whole input (B). With this approach, we can estimate extrinsic camera parameters efficiently and accurately, automatically adding and deleting features regardless of the visibility of the initial markers. It should be noted that the intrinsic camera parameters must be estimated in advance.

2.1 Initial camera parameter estimation

By iterating the following processes from the first frame to the last frame, initial extrinsic camera parameters and 3-D positions of natural features are determined.

(a) Marker and natural feature tracking: Markers are tracked by using color and shape information. Natural features are tracked by a robust estimation approach that projects the 3-D positions of natural features estimated up to the previous frame. Harris's interest operator is used for robust tracking of natural features.

(b) Extrinsic camera parameter estimation: In this method, the re-projection error R_{fp} = |x_{fp} - \hat{x}_{fp}|^2 is used as a measure of estimation error, where x_{fp} is the tracked 2-D position of feature p and \hat{x}_{fp} is the re-projection of the estimated 3-D position S_p of feature p onto the image at the f-th frame using the camera parameter M_f. The camera parameter M_f at the f-th frame is then estimated by minimizing the estimation error E_f defined as follows:

    E_f = \sum_p W_{fp} R_{fp},    (1)

where W_{fp} is a weighting coefficient for feature p at the f-th frame, computed from the confidence of the feature.

(c) 3-D position estimation of natural features: The position S_p of natural feature p in the real world is estimated from the x_{fp} and M_f that have already been determined. The position S_p is computed by minimizing the sum of squared distances between S_p and the straight lines that connect the center of projection in each frame and the position x_{fp} of feature p (a sketch of this computation is given after Section 2.2).

(d) Computing confidence of features: We assume that the distribution of the tracking error can be approximated by a Gaussian probability density function. The confidence W_{fp} of a feature is then defined as the inverse of the variance of the re-projection error R_{fp} over the frames in which feature p has already been tracked.

(e) Adding and deleting natural features: Feature candidates that satisfy all of the following conditions are added to the set of natural features at every frame:

- The confidence is over a given threshold.
- The matching error is less than a given threshold.
- The output value of Harris's operator is more than a given threshold.
- The maximum angle between the lines that connect the centers of projection and the estimated 3-D position of the feature candidate is more than a given threshold.

On the other hand, natural features that satisfy at least one of the following conditions are deleted at every frame:

- The confidence is under a given threshold.
- The matching error is more than a given threshold.

2.2 Optimization over the whole input

In the final step, the accumulated estimation error is minimized over the whole input. The accumulated error is given by the sum of re-projection errors as in Eq. (2) and is minimized with respect to the camera parameters M_f and the natural feature positions S_p:

    E = \sum_f \sum_p W_p |x_{fp} - \hat{x}_{fp}|^2.    (2)

The camera parameters and feature positions that have already been estimated by the earlier process for each frame are used as initial values in the minimization. W_p is the weighting coefficient of feature p at the final frame of the image sequence.
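As an illustration of step (c) above, the 3-D position of a feature can be obtained in closed form as the point minimizing the sum of squared distances to the viewing rays. The following is a minimal sketch, assuming the ray origins (centers of projection) and unit directions have already been derived from the estimated M_f and the tracked positions x_{fp}; the function name and its inputs are illustrative, not from the paper.

```python
import numpy as np

def triangulate_feature(centers, directions):
    """Least-squares intersection of viewing rays.

    centers:    (F, 3) array, center of projection of each frame f.
    directions: (F, 3) array, ray direction from each center through the
                tracked image position x_fp.
    Returns the 3-D position S_p minimizing the sum of squared
    point-to-line distances (the criterion of step (c))."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)  # projector onto the plane normal to d
        A += P                          # accumulate the normal equations
        b += P @ c
    return np.linalg.solve(A, b)

# Two rays that intersect at (1, 1, 0):
centers = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
directions = np.array([[1.0, 1.0, 0.0], [-1.0, 1.0, 0.0]])
print(triangulate_feature(centers, directions))  # ~ [1. 1. 0.]
```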

3. Dense 3-D reconstruction of the scene

In this section, we describe a dense 3-D reconstruction method that uses the camera parameters estimated by the method described in Section 2. First, a dense depth map is computed for each image by a multi-baseline stereo method; then a dense 3-D model is reconstructed by combining the obtained dense depth maps in a voxel space.

3.1 Depth mapping by multi-baseline stereo

A depth map is computed for each frame by a multi-baseline stereo technique [5]. The depth value z of pixel (x, y) at the f-th frame is estimated by using the images from the (f-k)-th frame to the (f+k)-th frame. In the following, we assume the focal length to be 1 for simplicity. The 3-D position of the pixel (x, y) can then be expressed as (xz, yz, z), and the projected position (\hat{x}_j, \hat{y}_j) of the 3-D position (xz, yz, z) onto the j-th frame is defined as follows:

    \begin{pmatrix} a\hat{x}_j \\ a\hat{y}_j \\ a \end{pmatrix} = M_j M_f^{-1} \begin{pmatrix} xz \\ yz \\ z \end{pmatrix},    (3)

where a is a scale parameter. In the multi-baseline method, the SSD (Sum of Squared Differences) is employed as an error function; it is computed as the sum of squared differences between the window W in the f-th frame centered at (x, y) and that in the j-th frame centered at (\hat{x}_j, \hat{y}_j). We define the SSD function in Eq. (4) using the RGB components (I_R, I_G, I_B):

    SSD_{ij}(x, y, o_x, o_y) = \sum_{(u-o_x, v-o_y) \in W} \{ (I_{Ri}(x+u, y+v) - I_{Rj}(\hat{x}_j+u, \hat{y}_j+v))^2 + (I_{Gi}(x+u, y+v) - I_{Gj}(\hat{x}_j+u, \hat{y}_j+v))^2 + (I_{Bi}(x+u, y+v) - I_{Bj}(\hat{x}_j+u, \hat{y}_j+v))^2 \},    (4)

where o_x and o_y are offsets of the window W along the x and y axes, respectively. Because the template window W in the f-th frame may be occluded in other frames, we define a modified SSSD (Sum of SSD) in Eq. (5) using the median of the SSD:

    SSSD(x, y, o_x, o_y) = \sum_{i=f-k}^{f+k} \begin{cases} SSD_{fi}(x, y, o_x, o_y) & \text{if } SSD_{fi}(x, y, o_x, o_y) \le M \text{ and } |i - f| > C, \\ 0 & \text{otherwise}, \end{cases}    (5)

where

    M = \mathrm{median}(SSD_{f(f-k)}(x, y, o_x, o_y), \ldots, SSD_{f(f-C)}(x, y, o_x, o_y), SSD_{f(f+C)}(x, y, o_x, o_y), \ldots, SSD_{f(f+k)}(x, y, o_x, o_y)).    (6)

Note that the images from the (f-C)-th frame to the (f+C)-th frame are not used for computing the SSSD, because the baselines of these frames are not long enough to estimate depth stably. The multiple centered windows approach [6] is also used to reduce estimation error around occlusion boundaries. The SSSD is thus extended to the SSSDM as follows:

    SSSDM(x, y) = \min_{(u, v) \in W} SSSD(x, y, u, v).    (7)

We can estimate the depth value z correctly by minimizing the SSSDM unless the pixel (x, y) is occluded in more than k frames. Additionally, we avoid the local-minimum problem and achieve stable depth estimation by using a multiscale approach [7] (a simplified sketch of this depth search is given after Section 3.2). Note that we use linear interpolation to compute z in regions without informative textures, because the confidence of the estimated z is low in such regions.

3.2 3-D reconstruction in a voxel space

In this paper, a 3-D model is reconstructed in a voxel space by combining several hundred dense depth maps. In the voxel space, each voxel has two values, A and B, which are voted for by the estimated depth data and camera parameters. Both A and B receive votes when the voxel is projected onto a pixel (x, y) in a frame: the value A is voted for if the depth of the voxel in the camera coordinate system equals the z of (x, y), while the value B is voted for when the depth of the voxel is equal to or less than the z of (x, y). The 3-D model is reconstructed by selecting the voxels whose ratio A/B is more than a given threshold. The color of each voxel is determined as the mean color of the pixels that have voted for its value A.
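The following is a minimal sketch of the median-based depth search of Eqs. (4)-(7), simplified to grayscale images and a single window offset (the multiple centered windows of Eq. (7) and the multiscale search [7] are omitted, and windows are assumed to stay inside the images). The projection callback `proj` and all names are assumptions for illustration, not code from the paper.

```python
import numpy as np

def ssd(img_f, img_i, x, y, xi, yi, w=3):
    """Sum of squared differences between two windows of half-size w."""
    a = img_f[y - w:y + w + 1, x - w:x + w + 1].astype(np.float64)
    b = img_i[yi - w:yi + w + 1, xi - w:xi + w + 1].astype(np.float64)
    return np.sum((a - b) ** 2)

def estimate_depth(images, proj, f, x, y, z_candidates, k=100, C=15, w=3):
    """Pick the candidate depth z of pixel (x, y) in frame f that minimizes
    the SSSD over frames f-k..f+k, skipping frames with |i - f| <= C
    (too-short baselines) and frames whose SSD exceeds the median of the
    remaining SSDs (likely occlusions, Eqs. (5)-(6)).

    proj(i, f, x, y, z) -> (xi, yi): Eq. (3), projecting the 3-D point
    (xz, yz, z) of frame f into frame i."""
    scores = []
    for z in z_candidates:
        ssds = []
        for i in range(f - k, f + k + 1):
            if abs(i - f) <= C or not (0 <= i < len(images)):
                continue
            xi, yi = proj(i, f, x, y, z)
            ssds.append(ssd(images[f], images[i], x, y,
                            int(round(xi)), int(round(yi)), w))
        m = np.median(ssds)
        scores.append(sum(s for s in ssds if s <= m))  # Eq. (5)
    return z_candidates[int(np.argmin(scores))]        # depth minimizing SSSD
```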
4. Experiment

In the experiment, we captured an outdoor scene, as shown in Figure 2(a), with a hand-held CCD camera (Sony DSR-PD150) fitted with a wide conversion lens (Sony VCL-HG0758). The image sequence lasts 40 seconds and has 599 frames (720 x 480 pixels, progressive scan). The intrinsic camera parameters were estimated in advance by Tsai's method [8], and the extrinsic camera parameters were estimated by the method described in Section 2. The curved line in Figure 3 indicates the camera path, and the quadrilateral pyramids indicate the camera postures drawn every 50 frames.

The dense depth map of the f-th frame is estimated by using the odd frames from the (f-100)-th to the (f+100)-th frame, excluding the (f-15)-th to the (f+15)-th frames. Figure 2(b) shows the dense depth maps, in which depth values are coded in intensity. It is confirmed that correct depth values are obtained for most parts of the images, as shown in the figure. However, some incorrect depth values occur between a column and a wall of the building because there are no textures on the wall. Linear interpolation is used to determine the depth values in these areas.
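The model discussed next (Figure 3) is obtained by the voxel voting of Section 3.2. Below is a minimal sketch of that voting, assuming the depth maps, a projection function, and the thresholds are given; all names and numeric values are illustrative choices, not code from the paper.

```python
import numpy as np

def vote_voxels(voxels, depth_maps, project, ratio_threshold=0.5, eps=0.01):
    """voxels: (N, 3) voxel centers. project(f, X) -> (x, y, z_cam) maps a
    3-D point into frame f's image plane and returns its camera-space depth.
    eps is the depth agreement tolerance (e.g. on the order of a voxel)."""
    A = np.zeros(len(voxels))  # votes: voxel depth == measured depth z
    B = np.zeros(len(voxels))  # votes: voxel depth <= measured depth z
    for f, depth in enumerate(depth_maps):
        for n, X in enumerate(voxels):
            x, y, z_cam = project(f, X)
            if not (0 <= x < depth.shape[1] and 0 <= y < depth.shape[0]):
                continue                      # voxel falls outside frame f
            z = depth[int(y), int(x)]
            if abs(z_cam - z) < eps:
                A[n] += 1                     # voxel lies on the measured surface
            if z_cam <= z + eps:
                B[n] += 1                     # voxel is not behind the surface
    # Keep voxels whose ratio A/B exceeds the threshold (Section 3.2).
    seen = B > 0
    return voxels[seen][(A[seen] / B[seen]) > ratio_threshold]
```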

Figure 3 shows a 3-D model obtained by combining 399 dense depth maps in the voxel-voting manner described in Section 3.2. In this experiment, the voxel space consists of 1 cm cube voxels. A wall behind a column of the building is reconstructed even though the wall is occluded from time to time. We also observe holes at some positions, because the corresponding pixels are not visible in enough frames for sufficient precision.

5. Conclusion

In this paper, a dense 3-D reconstruction method from a monocular image sequence captured by a hand-held video camera has been proposed. In this method, extrinsic camera parameters are first estimated by tracking markers and natural features. Then, at each frame, a dense depth map is computed by using the already estimated camera parameters. Finally, a dense 3-D model is reconstructed by combining hundreds of dense depth maps in a voxel space. In the experiments, dense 3-D scene reconstruction was accomplished successfully for a long image sequence captured in a complex outdoor scene, with stable dense depth estimation. However, some parts of the reconstructed model have holes. In future work, more accurate model reconstruction will be explored using the confidence of the depth values. The integration of dense 3-D models from multiple image sequences will also be investigated to obtain a complete model.

References

[1] N. Yokoya, T. Shakunaga and M. Kanbara: Passive Range Sensing Techniques: Depth from Images, IEICE Trans. Inf. and Syst., Vol. E82-D, No. 3, pp. 523-533, 1999.
[2] C. Tomasi and T. Kanade: Shape and Motion from Image Streams under Orthography: A Factorization Method, Int. Journal of Computer Vision, Vol. 9, No. 2, pp. 137-154, 1992.
[3] M. Pollefeys, R. Koch, M. Vergauwen, A. A. Deknuydt and L. J. V. Gool: Three-dimensional Scene Reconstruction from Images, Proc. SPIE, Vol. 3958, pp. 215-226, 2000.
[4] T. Sato, M. Kanbara, H. Takemura and N. Yokoya: 3-D Reconstruction from a Monocular Image Sequence by Tracking Markers and Natural Features, Proc. 14th Int. Conf. on Vision Interface, pp. 157-164, 2001.
[5] M. Okutomi and T. Kanade: A Multiple-baseline Stereo, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 15, No. 4, pp. 353-363, 1993.
[6] R. Kumar, H. S. Sawhney, Y. Guo, S. Hsu and S. Samarasekera: 3D Manipulation of Motion Imagery, Proc. Int. Conf. on Image Processing, pp. 17-20, 2000.
[7] N. Yokoya: Surface Reconstruction Directly from Binocular Stereo Images by Multiscale-multistage Regularization, Proc. 11th Int. Conf. on Pattern Recognition, Vol. I, pp. 642-646, 1992.
[8] R. Y. Tsai: An Efficient and Accurate Camera Calibration Technique for 3D Machine Vision, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 364-374, 1986.

[Figure 2: Input images (a) and estimated dense depth maps (b) at the 100th, 180th, 260th, 340th, 420th and 499th frames.]

[Figure 3: Results of outdoor scene recovery together with the estimated camera positions and postures.]