STANFORD UNIVERSITY, COMPUTER SCIENCE, CS231A, SPRING 2016

StereoScan: Dense 3D Reconstruction in Real-time

Peirong Ji, pji@stanford.edu

June 7, 2016

1 INTRODUCTION

In this project, I replicate a published paper, "StereoScan: Dense 3D Reconstruction in Real-time" by Andreas Geiger et al. [1]. As a core subject in computer vision and robotics, 3D perception from stereo video is a practical problem and hence draws a lot of interest from researchers around the world. The paper proposes a new approach for building 3D maps from stereo sequences in real-time without expensive hardware. More specifically, the authors combine efficient stereo matching with a multi-view linking scheme to generate consistent 3D point clouds. Their experiments show that the visual odometry part of the algorithm runs at 25 fps and the depth maps at 3 to 4 fps, which is sufficient for online 3D reconstruction. The paper makes three significant contributions: real-time scene flow computation with multi-thousand feature matches, a robust visual odometry algorithm, and a greedy, accurate and efficient solution to the correspondence association problem in 3D reconstruction. Although stereo matching is part of the 3D reconstruction pipeline, it falls within the scope of another paper [2]. Throughout, we assume the stereo system moves smoothly, mimicking what a robot or human sees while moving. For this project, a stereo sequence was downloaded from the web [1]. The sequence is calibrated and rectified, which simplifies the system pipeline, and the data set includes the stereo system parameters. Even though I cannot achieve the same performance as the authors', I am able to reach 3-4 frames per second.

All the code in this project is written in C, and the only third-party library used is libpng.

2 RELATED WORK AND ACHIEVEMENT

2.1 RELATED WORK

Reconstructing a 3D scene from a stereo sequence is a core topic in computer vision, and there is a lot of related work. Here is a summary. The most naive approach is the one we explored in class, 3D voxel grids. The same idea can be applied to stereo 3D reconstruction; however, the complexity is high if we want good resolution. SLAM, short for Simultaneous Localisation and Mapping, is the process by which a mobile device builds a 3D map of its surroundings and uses that information to locate itself. However, for computational reasons, most proposed approaches handle only very sparse sets of landmarks, not the dense 3D reconstruction that is the topic of this paper. 3D reconstruction using classical Structure-from-Motion has also been explored by many researchers. This technique requires a collection of images of a scene, and it is not suitable for a mobile stereo system that moves continuously.

2.2 ACHIEVEMENT

In this paper, the resolution of the 3D reconstruction is retained without sacrificing performance. This is achieved by accurate egomotion estimation and a high-speed disparity map generator. The 3D reconstruction pipeline involves feature matching, egomotion estimation, stereo matching (the scope of another paper [2]) and 3D reconstruction. As the most important parts of this paper, feature matching and egomotion estimation are both fully implemented in C, and stereo matching is integrated into the pipeline, while the 3D reconstruction stage is not finished yet.

3 3D RECONSTRUCTION SYSTEM

As shown in Figure 3.1, the 3D reconstruction system is divided into 4 parts: feature matching, egomotion estimation, stereo matching and 3D reconstruction. As the first stage, feature matching obtains feature correspondences across four images, namely the left and right images of two consecutive frames. Based on these correspondences, the egomotion estimation stage produces the motion of the stereo system, consisting of the rotation matrix R and the translation vector T. The stereo matching stage is used to obtain dense disparity maps; I integrate the ELAS library into the system for this purpose. The last step is 3D reconstruction, which creates consistent point-based models from the large amount of data produced by the previous stages. The authors propose a greedy approach that solves the association problem by re-projecting reconstructed 3D points of the previous frame into the image plane of the current frame. Each pipeline stage is detailed in the following subsections.

Figure 3.1: 3D Reconstruction System Pipeline

3.1 FEATURE EXTRACTION AND FEATURE MATCHING

Assuming a calibrated stereo setup and rectified stereo images, the first stage of the pipeline, feature extraction and matching, takes 4 images (the left and right images of two consecutive frames), computes features for each image and finds matching key points among them.

Figure 3.2: Feature Extraction

Taking the grayscale stereo images as input, we first compute the Sobel, blob and corner responses of each image. All the filters are 5x5 and are shown in Figure 3.3. Using the output of the blob and corner filters, we run non-maximum suppression to reduce the number of candidate matching points. Typically, a 0.5-megapixel image yields around 9K points after the non-maximum suppression stage. These points are divided into 4 classes: blob max, blob min, corner max and corner min. The feature matching stage only matches points within the same class, which saves a lot of computation. The feature vector of each candidate point is obtained by concatenating Sobel responses around the point, forming a 32x1 feature vector.

Figure 3.3: Feature Extraction

Having detailed the feature extraction stage, we now find matching key points. We iterate over all feature points in the left image of the previous frame. For each feature point, we find the best matching point in the right image of the previous frame, and use that best match to find the next best matching point in the right image of the current frame. Continuing in this way, we chase the match around the loop formed by the four images of two consecutive frames. If the loop closes, i.e. we return to the point we started from, we declare a matching group across all four images. Since we assume the stereo system moves smoothly, we can restrict the search for each match to a 50x50 bounding box around the feature point. This procedure is depicted in Figure 3.4.

Figure 3.4: Loop Matching in Four Images
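To make the loop-matching procedure concrete, here is a minimal C sketch. It assumes a hypothetical helper match_in_box() that returns the index of the best descriptor match within the 50x50 search window around a feature, restricted to the same feature class; the Feature struct and all names are illustrative, not the project's actual code.

#include <stdint.h>

typedef struct {
    int16_t u, v;        /* pixel coordinates                        */
    int     cls;         /* blob max, blob min, corner max or min    */
    uint8_t desc[32];    /* concatenated Sobel responses (32x1)      */
} Feature;

/* Hypothetical helper: index of the best descriptor match for f in   */
/* list tgt, searching only a 50x50 window around (f->u, f->v) and    */
/* only features of the same class; returns -1 if none is found.      */
int match_in_box(const Feature *f, const Feature *tgt, int n_tgt);

/* Chase a candidate around the loop                                  */
/*   prev-left -> prev-right -> curr-right -> curr-left -> prev-left  */
/* and accept the match only if the loop closes on the start point.   */
int loop_match(const Feature *pl, int n_pl, int i_pl,
               const Feature *pr, int n_pr,
               const Feature *cr, int n_cr,
               const Feature *cl, int n_cl)
{
    int i_pr = match_in_box(&pl[i_pl], pr, n_pr);
    if (i_pr < 0) return -1;
    int i_cr = match_in_box(&pr[i_pr], cr, n_cr);
    if (i_cr < 0) return -1;
    int i_cl = match_in_box(&cr[i_cr], cl, n_cl);
    if (i_cl < 0) return -1;
    int back = match_in_box(&cl[i_cl], pl, n_pl);
    return (back == i_pl) ? i_cl : -1;   /* loop closed? */
}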

After feature extraction and matching, we obtain a list of matching points across the four images. These points serve as input to the egomotion estimation and the disparity map generator. Typically, we get around 2K key points per image.

3.2 EGOMOTION ESTIMATION

Given the matches of the previous frame, we can compute each matching point's 3D location in the previous stereo coordinate system via triangulation:

\[ d = \max(u_{pl} - u_{pr},\, 0.001) \tag{3.1} \]
\[ z_p = f B / d \tag{3.2} \]
\[ y_p = (v_{pl} - c_v)\, B / d \tag{3.3} \]
\[ x_p = (u_{pl} - c_u)\, B / d \tag{3.4} \]

In Eq. 3.1, d denotes the disparity of a matching pair in the previous left and right images, and (x_p, y_p, z_p) is the matching point's 3D location in the previous stereo coordinate system. Let R(r) be the rotation matrix and T the translation vector describing the stereo system's rotation and movement since the previous frame; f is the camera focal length, (c_u, c_v) is the principal point, and B (STEREO_BASE in the code) is the stereo baseline. Assuming square pixels and zero skew, (x_p, y_p, z_p) can be projected into the current image planes using Eq. 3.5:

\[
\begin{pmatrix} u_c \\ v_c \\ 1 \end{pmatrix} \sim
\begin{pmatrix} f & 0 & c_u \\ 0 & f & c_v \\ 0 & 0 & 1 \end{pmatrix}
\left[
\begin{pmatrix} r_{00} & r_{01} & r_{02} \\ r_{10} & r_{11} & r_{12} \\ r_{20} & r_{21} & r_{22} \end{pmatrix}
\begin{pmatrix} x_p \\ y_p \\ z_p \end{pmatrix}
+
\begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix}
-
\begin{pmatrix} s \\ 0 \\ 0 \end{pmatrix}
\right]
\tag{3.5}
\]

where s equals the baseline B when projecting into the right image and s = 0 for the left image.
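As a concrete illustration of Eqs. 3.1-3.5, the C sketch below triangulates a rectified stereo match and re-projects a 3D point under a candidate motion (R, T). The calibration constants (FOCAL, CU, CV, B) are placeholder values, not the actual parameters of the data set.

#include <math.h>

/* Placeholder calibration constants; the real values come from the   */
/* data set's calibration file.                                       */
#define FOCAL 645.2   /* focal length f, pixels   (assumed)           */
#define CU    635.9   /* principal point c_u      (assumed)           */
#define CV    194.1   /* principal point c_v      (assumed)           */
#define B     0.571   /* stereo baseline, meters  (assumed)           */

/* Eqs. 3.1-3.4: triangulate a rectified stereo match into the        */
/* previous frame's coordinate system.                                */
static void triangulate(double u_pl, double v_pl, double u_pr,
                        double *x_p, double *y_p, double *z_p)
{
    double d = fmax(u_pl - u_pr, 0.001);   /* Eq. 3.1: clamped disparity */
    *z_p = FOCAL * B / d;                  /* Eq. 3.2 */
    *y_p = (v_pl - CV) * B / d;            /* Eq. 3.3 */
    *x_p = (u_pl - CU) * B / d;            /* Eq. 3.4 */
}

/* Eq. 3.5: project a previous-frame 3D point into the current image  */
/* plane; R is row-major 3x3, s = B for the right image, 0 for left.  */
static void project(const double R[9], const double T[3], double s,
                    double x, double y, double z,
                    double *u_c, double *v_c)
{
    double xc = R[0]*x + R[1]*y + R[2]*z + T[0] - s;
    double yc = R[3]*x + R[4]*y + R[5]*z + T[1];
    double zc = R[6]*x + R[7]*y + R[8]*z + T[2];
    *u_c = FOCAL * xc / zc + CU;
    *v_c = FOCAL * yc / zc + CV;
}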

Let π denote the mapper that takes a point (x_p, y_p, z_p) and projects it into a 2D image plane under the parameters r_ij and (t_x, t_y, t_z). To estimate the motion of the stereo system over the past frame, we then seek the r_ij and (t_x, t_y, t_z) that minimize Eq. 3.6, where (u_i^l, v_i^l) and (u_i^r, v_i^r) denote the observed coordinates of the i-th matched point in the current left and right image planes:

\[
\sum_{i=1}^{3} \left\| \begin{pmatrix} u_i^l \\ v_i^l \end{pmatrix} - \pi^l\!\left(x_p^{(i)}, y_p^{(i)}, z_p^{(i)}; R, T\right) \right\|^2
+ \left\| \begin{pmatrix} u_i^r \\ v_i^r \end{pmatrix} - \pi^r\!\left(x_p^{(i)}, y_p^{(i)}, z_p^{(i)}; R, T\right) \right\|^2
\tag{3.6}
\]

This optimization problem can be solved using Gauss-Newton iteration. The residual vector is a 12x1 vector holding, for each of the 3 points, the x and y distances between the observed points in the current frame images and the re-projected coordinates. The optimization requires the Jacobian matrix of the residuals. Denote the residual vector by r(β) = (r_0, r_1, ..., r_11) and the parameter vector by β = (r_x, r_y, r_z, t_x, t_y, t_z). The Jacobian matrix is then given by Eq. 3.7:

\[
J = \begin{pmatrix}
\frac{\partial r_0}{\partial r_x} & \frac{\partial r_0}{\partial r_y} & \frac{\partial r_0}{\partial r_z} & \frac{\partial r_0}{\partial t_x} & \frac{\partial r_0}{\partial t_y} & \frac{\partial r_0}{\partial t_z} \\
\vdots & & & & & \vdots \\
\frac{\partial r_{11}}{\partial r_x} & \frac{\partial r_{11}}{\partial r_y} & \frac{\partial r_{11}}{\partial r_z} & \frac{\partial r_{11}}{\partial t_x} & \frac{\partial r_{11}}{\partial t_y} & \frac{\partial r_{11}}{\partial t_z}
\end{pmatrix}
\tag{3.7}
\]

The parameter update can be expressed as:

\[
\beta_i = \beta_{i-1} - (J^T J)^{-1} J^T r(\beta_{i-1})
\tag{3.8}
\]

A concrete sketch of this update step is given below.
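The following C sketch shows one Gauss-Newton update (Eq. 3.8) the way the conclusion describes the implementation: form the normal equations and solve them by Gaussian elimination. It is a minimal version under the dimensions above (12 residuals, 6 parameters), without rank or degeneracy checks.

#include <math.h>

#define NOBS 12   /* residuals: x and y for 3 points in 2 images */
#define NPAR 6    /* parameters: (r_x, r_y, r_z, t_x, t_y, t_z)  */

/* One Gauss-Newton step (Eq. 3.8): beta -= (J^T J)^{-1} J^T r.       */
/* J is row-major NOBS x NPAR; returns the step norm so the caller    */
/* can test convergence against delta.                                */
static double gauss_newton_step(const double J[NOBS][NPAR],
                                const double r[NOBS], double beta[NPAR])
{
    double A[NPAR][NPAR] = {{0}}, b[NPAR] = {0}, step[NPAR];

    /* Normal equations: A = J^T J, b = J^T r. */
    for (int k = 0; k < NOBS; k++)
        for (int i = 0; i < NPAR; i++) {
            b[i] += J[k][i] * r[k];
            for (int j = 0; j < NPAR; j++)
                A[i][j] += J[k][i] * J[k][j];
        }

    /* Solve A * step = b by Gaussian elimination, partial pivoting. */
    for (int c = 0; c < NPAR; c++) {
        int p = c;
        for (int i = c + 1; i < NPAR; i++)
            if (fabs(A[i][c]) > fabs(A[p][c])) p = i;
        for (int j = 0; j < NPAR; j++) {
            double tmp = A[c][j]; A[c][j] = A[p][j]; A[p][j] = tmp;
        }
        double tmp = b[c]; b[c] = b[p]; b[p] = tmp;
        for (int i = c + 1; i < NPAR; i++) {
            double m = A[i][c] / A[c][c];
            for (int j = c; j < NPAR; j++) A[i][j] -= m * A[c][j];
            b[i] -= m * b[c];
        }
    }

    /* Back substitution, then apply the update to beta. */
    double norm2 = 0.0;
    for (int i = NPAR - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < NPAR; j++) s -= A[i][j] * step[j];
        step[i] = s / A[i][i];
        beta[i] -= step[i];
        norm2 += step[i] * step[i];
    }
    return sqrt(norm2);
}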

The pseudo code below details the overall egomotion estimation procedure, including the RANSAC outer loop.

Listing 1: Egomotion Estimation

1. Calculate the matching points' 3D locations (x, y, z) in the
   previous stereo coordinate system.
2. for i = 1:50                    // RANSAC, 50 trials
       randomly pick 3 matching points from the feature matching stage
       initialize beta_0 = (0, 0, 0, 0, 0, 0)
       for j = 1:20
           calculate the Jacobian matrix J
           beta_j = beta_{j-1} - (J^T J)^{-1} J^T r(beta_{j-1})
           if ||(J^T J)^{-1} J^T r(beta_{j-1})|| < delta:   // converged
               break
       end
       calculate the inliers under beta and keep the best beta
   end

The accuracy of the egomotion matrix is critical, as it is used to re-project all the 3D cloud points into a common coordinate system. As shown in the experiments, we achieve fairly good accuracy.

3.3 STEREO MATCHING [2]

The stereo matching stage takes the output of the feature matching stage and builds a disparity map for each frame. In this project, the disparity map is built with a free library called ELAS [2]. ELAS is a novel approach to binocular stereo for fast matching of high-resolution imagery. It builds a prior on the disparities by forming a triangulation over a set of support points that can be robustly matched, reducing the matching ambiguity of the remaining points. This allows efficient exploitation of the disparity search space, yielding accurate and dense reconstructions without global optimization.

3.4 3D RECONSTRUCTION

The simplest point-based 3D reconstruction method maps all valid pixels to 3D and projects them into a common coordinate system using the egomotion matrix. However, if we keep storing the incoming points every frame, the storage requirement grows rapidly. In the paper, the authors propose a greedy approach that re-projects the reconstructed 3D points of the previous frame into the image plane of the current frame. If a point falls onto a valid disparity, the two 3D points are fused by taking their mean. This not only improves the accuracy of the reconstruction but also reduces the memory requirement considerably. A sketch of this fusion step is given below.
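A minimal C sketch of the greedy fusion step, under the same placeholder calibration constants as in the earlier sketches; the data layout (a dense per-pixel 3D map with a validity mask) is an illustrative assumption, not the paper's actual data structure.

/* Placeholder calibration constants, as in the earlier sketches. */
#define FOCAL 645.2
#define CU    635.9
#define CV    194.1

typedef struct { double x, y, z; } Point3D;

/* Greedy fusion: re-project each reconstructed 3D point of the       */
/* previous frame into the current left image; if it lands on a pixel */
/* with a valid disparity, fuse the two 3D points by taking the mean. */
void fuse_points(const Point3D *prev, int n_prev,
                 Point3D *curr_map, const int *valid,
                 int width, int height,
                 const double R[9], const double T[3])
{
    for (int i = 0; i < n_prev; i++) {
        /* Move the point into the current frame's coordinate system. */
        double xc = R[0]*prev[i].x + R[1]*prev[i].y + R[2]*prev[i].z + T[0];
        double yc = R[3]*prev[i].x + R[4]*prev[i].y + R[5]*prev[i].z + T[1];
        double zc = R[6]*prev[i].x + R[7]*prev[i].y + R[8]*prev[i].z + T[2];
        if (zc <= 0.0) continue;               /* behind the camera   */
        int u = (int)(FOCAL * xc / zc + CU + 0.5);
        int v = (int)(FOCAL * yc / zc + CV + 0.5);
        if (u < 0 || u >= width || v < 0 || v >= height) continue;
        int idx = v * width + u;
        if (!valid[idx]) continue;             /* no disparity here   */
        curr_map[idx].x = 0.5 * (curr_map[idx].x + xc);
        curr_map[idx].y = 0.5 * (curr_map[idx].y + yc);
        curr_map[idx].z = 0.5 * (curr_map[idx].z + zc);
    }
}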

4 EXPERIMENT

In this section, the results of the preceding stages are shown. Fig. 4.1 shows the result of feature matching: each dot is a matched point in the left image of the previous frame, and the line points to the location of the matched point in the left image of the current frame. Around 2K points are matched among the four consecutive images.

Figure 4.1: Feature Matching Among Four Consecutive Images

As for egomotion estimation, a video can be accessed on the web at https://www.youtube.com/watch?v=xvqmtmpk87e. As the video shows, the accumulated rotation and translation matrices follow the stereo system accurately. The disparity map can be viewed at https://www.youtube.com/watch?v=LJF0hh_rgvc. Unfortunately, I had neither a project partner nor enough time to merge all the stages together and output the final 3D reconstruction map. The performance of each pipeline stage is given below.

Stage                   Time
Feature Matching        90 ms
Egomotion Estimation    35 ms
Disparity Map           350 ms

5 CONCLUSION

In this project, I learned a lot about feature extraction, feature matching, egomotion estimation and many 3D reconstruction concepts. All the code is implemented in C, including image convolution, linear system solving by Gaussian elimination, Gauss-Newton iteration and RANSAC. Even though the performance is acceptable, it could be boosted further using SIMD instructions. Finally, my code is located at https://www.dropbox.com/s/qhmedzb4n3qqjrt/finalprojectcode.zip?dl=0

6 REFERENCES

[1] Andreas Geiger, Julius Ziegler and Christoph Stiller, "StereoScan: Dense 3D Reconstruction in Real-time", Intelligent Vehicles Symposium (IV), 2011.

[2] Andreas Geiger, Martin Roser and Raquel Urtasun, "Efficient Large-Scale Stereo Matching", Asian Conference on Computer Vision (ACCV), 2010.