Scene Segmentation by Color and Depth Information and its Applications

Carlo Dal Mutto, Pietro Zanuttigh, Guido M. Cortelazzo
Department of Information Engineering, University of Padova
Via Gradenigo 6/B, Padova, Italy
{dalmutto,pietro.zanuttigh,corte}@dei.unipd.it

Abstract

Scene segmentation is a classical problem in computer vision and image processing. Traditionally it has been tackled by exploiting only the color information of a single view of the scene. Recently, a great amount of research has been devoted to efficient solutions for depth estimation, and tools such as modern stereo vision techniques and ToF range cameras allow the extraction of accurate depth information. The availability of both color and depth information opens the way to new insights into scene segmentation algorithms. In this paper a novel scene segmentation framework is introduced. The goal of this approach is to capture the scene distribution, taking into account both color and depth cues, in order to provide a more accurate segmentation. Numerous applications can benefit from an effective scene segmentation method, e.g., 3D video, free-viewpoint video and game controlling. We also show how the proposed segmentation scheme can be used to improve the performance of depth data compression, by using it to identify and extract the edges and the main objects in the scene.

1. Introduction

Image segmentation is a very challenging task that has been the subject of a huge amount of research activity. Many different segmentation techniques exist, based on different tools. A first class of methods is based on graph theory, and in particular on the formulation of segmentation as a graph-cut problem [2]. Another group of methods is based on clustering algorithms; e.g., the method of [1] exploits the mean shift clustering algorithm for image segmentation. There are also methods based on region merging, level sets, watershed transforms, spectral methods and many other techniques. Despite the great effort spent on this issue, segmentation remains an unsolved problem. One of the main reasons is that the information content of the image is not always sufficient to properly recognize the objects in the framed scene (e.g., consider an object over a background of the same color).

In recent years, many different solutions have been proposed for the extraction of depth information from a real world scene. Various systems have been proposed to solve this task, each with its pros and cons. There are two main groups of methods, passive and active. Passive methods use only the information coming from two or more standard cameras to estimate depth. Among them, stereo vision systems, which exploit the localization of corresponding points in the different images, are perhaps the most widely used; a complete review of this family of methods can be found in [6]. Stereo vision systems have been greatly improved in recent years, but they cannot handle untextured regions, and the most effective methods are also computationally very expensive. Another possibility is given by active methods such as structured light, laser scanners and ToF sensors [3]. By projecting some form of light on the scene, such methods can obtain better results than passive stereo vision systems, but they are also usually more expensive.
Depth data can be segmented more easily than color images and makes it possible to distinguish objects of similar color, but at the same time close objects at similar depths cannot be easily separated (e.g., two people standing side by side). It is quite common to have both depth and color data for the same framed scene, and by exploiting both the geometry (depth) information and the color cues it is possible to better recognize the objects in the scene and thus improve segmentation performance. This paper follows this rationale and introduces a novel segmentation algorithm that exploits both depth and color information. In Section 2 we describe the joint depth and color representation of the samples and we introduce the proposed segmentation algorithm. Section 3 presents the experimental results and shows how the joint segmentation algorithm achieves better performance than using color or depth alone. In Section 4 we show how the proposed segmentation scheme can be used for depth data compression, and finally in Section 5 we draw the conclusions.

2. Proposed segmentation algorithm

Scene segmentation is the process of analyzing, understanding and labeling the various parts that compose a scene. The proposed segmentation algorithm assumes the availability of both geometrical and color information about the scene. Given a view of a generic scene $S$, for each point $p_i \in S$, $i = 1, \dots, N$ (i.e., each pixel in the acquired image), it is therefore possible to obtain its geometrical position (defined by its 3D coordinates $[x, y, z]^T$ in the camera reference frame) and its color value $[R, G, B]$. While the color value of a point comes directly from the color image of the framed scene, the full geometrical position can be derived from the depth value and the image coordinates of the point $p_i$ by simply inverting the camera projection matrix. The inputs of our segmentation algorithm are:

- the color image $C$ of the scene $S$, acquired by a standard camera;
- the depth map $D$ of the framed scene $S$ (it could have been acquired by any 3D reconstruction method, e.g., a ToF camera or a stereo vision system);
- the projection matrix relative to the acquired color image and depth map.

2.1. Color image

In order to exploit the color information in the best way, it is important to choose a color representation that is well suited for the segmentation task. In particular, we adopt a uniform color space, because this kind of color space preserves the distances between the perceived colors in the representation space. In this way the distances used in the clustering process of Section 2.3 are consistent with the perceived color differences. Note also that a uniform color space ensures that distances in each of the 3 color components are consistent, making it easier to cluster the 3-vectors representing the color information. In particular, the CIELab color space has been used in this work. For each point $p_i \in S$, a 3-vector is defined to account for the color information in the CIELab space:

$$p_i^c = \begin{bmatrix} L(p_i) \\ a(p_i) \\ b(p_i) \end{bmatrix} \qquad (1)$$
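As a minimal sketch of how the vectors of Equation (1) can be computed in practice (assuming an 8-bit BGR input image and OpenCV's color conversion; names are illustrative):

```python
import cv2
import numpy as np

def color_features(bgr):
    """Return an N x 3 array with one CIELab vector p_i^c (Eq. (1)) per pixel."""
    # With float32 input in [0, 1], OpenCV returns unscaled L*a*b* values:
    # L in [0, 100], a and b roughly in [-127, 127].
    lab = cv2.cvtColor(bgr.astype(np.float32) / 255.0, cv2.COLOR_BGR2Lab)
    return lab.reshape(-1, 3)
```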
2.2. Computing geometric information

While for the color image the extraction of the information is a well-known problem, the extraction of geometric information for the segmentation task is a rather new topic, for which only a few exploratory attempts have been made. A first simple approach could be to consider the depth map as a standard grayscale image and segment it with standard image segmentation techniques. However, a better way of handling the geometric information follows directly from classical 3D reconstruction. The proposed algorithm requires the three-dimensional position of each sample in the scene. To compute this information, the depth value corresponding to each pixel in the color image is needed together with the calibration parameters. Many passive and active methods (e.g., laser scanners or ToF sensors) directly provide depth information, while in order to get depth information from the standard output of a stereo vision system a further step is necessary. In fact, the output of a standard stereo vision algorithm is a disparity map $D_S$ that for each point $p_i$ represents the distance in image space between its position in the left image of the stereo pair $I_L$ and its position in the right image $I_R$.

Given the focal length $f$ of the rectified cameras and the baseline $b$ of the stereo pair, it is possible to obtain a depth map $D_T$ from the disparity map $D_S$ by applying the well-known equation:

$$D_T = \frac{b f}{D_S} \qquad (2)$$

Given the depth map $D_T$ (acquired directly or obtained from Equation (2)), it is now possible to obtain a description of the geometrical information: for each point $p_i \in S$, $i = 1, \dots, N$, we seek the geometric information vector

$$p_i^g = \begin{bmatrix} x(p_i) \\ y(p_i) \\ z(p_i) \end{bmatrix} \qquad (3)$$

For each point $p_i \in S$, given its depth value $D_T(p_i)$, its coordinates in the depth image $[u(p_i), v(p_i)]$ and the intrinsic camera matrix $K_T$, the 3D coordinates are obtained by simply inverting the projection equation:

$$\begin{bmatrix} x(p_i) \\ y(p_i) \\ z(p_i) \end{bmatrix} = D_T(p_i) \, K_T^{-1} \begin{bmatrix} u(p_i) \\ v(p_i) \\ 1 \end{bmatrix} \qquad (4)$$

Given this equation, for each point $p_i \in S$ the 3-vector that accounts for the geometrical information is:

$$p_i^g = \begin{bmatrix} x(p_i) \\ y(p_i) \\ z(p_i) \end{bmatrix} \qquad (5)$$
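A minimal sketch of Equations (2) and (4), assuming a rectified pair with focal length f (in pixels) and baseline b in consistent units, and the intrinsic matrix stored as a 3x3 numpy array; function and variable names are illustrative:

```python
import numpy as np

def depth_from_disparity(disparity, f, b):
    """Eq. (2): convert a disparity map D_S (pixels) into a depth map
    D_T = b*f / D_S. Zero disparities are mapped to invalid (inf) depth."""
    depth = np.full_like(disparity, np.inf, dtype=np.float64)
    valid = disparity > 0
    depth[valid] = (b * f) / disparity[valid]
    return depth

def backproject(depth, K):
    """Eq. (4): invert the projection to get the 3D coordinates
    [x, y, z]^T = D_T(p_i) * K^-1 [u, v, 1]^T for every pixel p_i."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    rays = np.linalg.inv(K) @ pix          # K^-1 [u, v, 1]^T for each pixel
    points = rays * depth.reshape(1, -1)   # scale each ray by its depth
    return points.T                        # N x 3 array of vectors p_i^g
```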

2.3. Combined color and depth segmentation

In order to balance the relevance of the two kinds of information in the merging process, both the color information vectors $p_i^c$, $i = 1, \dots, N$ and the geometric information vectors $p_i^g$, $i = 1, \dots, N$ are normalized using the variances of the $L$ and $z$ components respectively:

$$\sigma_L^2 = \operatorname{var}\{L(p_i)\}, \quad \sigma_z^2 = \operatorname{var}\{z(p_i)\}, \qquad i = 1, \dots, N$$

$$\begin{bmatrix} \bar{L}(p_i) \\ \bar{a}(p_i) \\ \bar{b}(p_i) \end{bmatrix} = \frac{1}{\sigma_L} \begin{bmatrix} L(p_i) \\ a(p_i) \\ b(p_i) \end{bmatrix}, \qquad \begin{bmatrix} \bar{x}(p_i) \\ \bar{y}(p_i) \\ \bar{z}(p_i) \end{bmatrix} = \frac{1}{\sigma_z} \begin{bmatrix} x(p_i) \\ y(p_i) \\ z(p_i) \end{bmatrix}$$

From these normalized color and geometric information vectors, the set of joint feature vectors $p_i^f$, $i = 1, \dots, N$ is obtained according to the formula:

$$p_i^f = \begin{bmatrix} \bar{L}(p_i) \\ \bar{a}(p_i) \\ \bar{b}(p_i) \\ \lambda \bar{x}(p_i) \\ \lambda \bar{y}(p_i) \\ \lambda \bar{z}(p_i) \end{bmatrix}, \qquad i = 1, \dots, N \qquad (6)$$

where $\lambda$ is a parameter that tunes the relative contribution of the color and the geometric information. The set of joint feature vectors $p_i^f$, $i = 1, \dots, N$ is a set of characteristic features that allows the synergic interaction of color and geometric information for the segmentation task; it is obtained in a very natural way and well reflects the real scene distribution in both color and geometry. Given the set of vectors $p_i^f$, a direct way to perform segmentation is to apply a clustering algorithm to this set. Numerous clustering techniques are available, with different trade-offs between accuracy and computation time. We apply the standard k-means clustering algorithm because of its convergence speed. We also introduce a final refinement step that removes the regions smaller than a threshold $t$ that are not connected to the main region of their cluster. Finally, note that the proposed segmentation algorithm needs only three parameters to provide effective scene segmentation results (a minimal implementation sketch follows the list):

- the parameter $\lambda$, which tunes the distribution of decisional power between the color and the geometric information;
- the parameter $k$, i.e., the number of segments produced by the segmentation algorithm (the number of clusters of the k-means algorithm);
- (optionally) the parameter $t$, i.e., the size of the smallest allowed unconnected region.
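A minimal sketch of Section 2.3, using scikit-learn's k-means and SciPy's connected-component analysis for the refinement step. How the removed regions are relabeled is not specified above, so marking them as unassigned is an assumption of this sketch:

```python
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def joint_segmentation(lab, xyz, shape, lam=5.0, k=15, t=50):
    """Build the joint feature vectors p_i^f of Eq. (6) from N x 3 arrays
    of CIELab values and 3D coordinates, then cluster them with k-means."""
    sigma_L = np.std(lab[:, 0])          # sqrt of var{L(p_i)}
    sigma_z = np.std(xyz[:, 2])          # sqrt of var{z(p_i)}
    feats = np.hstack([lab / sigma_L, lam * (xyz / sigma_z)])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats).reshape(shape)
    # Refinement: within each cluster, drop connected regions smaller than
    # t pixels that are detached from the cluster's main region.
    for c in range(k):
        comp, n = ndimage.label(labels == c)
        if n < 2:
            continue
        sizes = ndimage.sum(labels == c, comp, index=range(1, n + 1))
        main = 1 + int(np.argmax(sizes))
        for r in range(1, n + 1):
            if r != main and sizes[r - 1] < t:
                labels[comp == r] = -1   # assumption: mark removed pixels
    return labels
```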
3. Experimental Results

To evaluate the performance of the proposed algorithm we tested it on video sequences from standard multiview plus depth datasets. Figure 1 shows the performance of the proposed algorithm on the breakdancers sequence from Microsoft Research [8]. It refers to the first frame of view 0 and shows the results of the segmentation using only color, only geometry and both of them (the parameters used in this example are $\lambda = 5$ and $k = 15$). Note in particular how, by using only color information (Figure 1c), the noise in the camera image and some complex color patterns lead to an oversegmentation of the image and to some inconsistent region patterns, e.g., on the back of the breakdancer. Depth information is less noisy and leads to better performance (Figure 1d), but it is not able to capture some objects, e.g., the two spectators on the right, who are very close to the wall and do not have a significant depth difference from the background. Figure 1e shows the results of the proposed joint segmentation algorithm: it is able to locate all the objects in the scene, both the ones that have a significant depth difference and the ones that can be located only from the color data. The result is also less noisy than the one based on color information alone. Figure 2 instead shows the performance of the algorithm on a frame from the orbi sequence.

Again, notice how by using color and depth information together (Figure 2e) it is possible to recognize not only the foreground objects but also the board on the back, which is not represented in the depth data. By comparing Figures 2e and 2f it is also possible to see how the segmented image is modified by the final step that removes unconnected regions.

4. Application of the proposed scheme to depth compression

3D video and depth image-based rendering (DIBR) representations require the transmission of one or more views together with the corresponding depth maps. While color data can be compressed with standard video techniques [4], ad-hoc solutions for depth data achieve better performance than standard image or video compression algorithms. Depth maps are usually made of smooth regions divided by sharp edges, and a common approach in ad-hoc depth coders consists in locating the edges and objects in the scene through an initial segmentation step and then applying ad-hoc coding methods [7, 5]. A simple way to extract such edges from a segmentation is sketched below.
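For illustration, locating the edges from a segmentation label map reduces to marking the pixels whose label differs from a neighbor's; this sketch is not the exact edge extraction procedure of [7] or [5]:

```python
import numpy as np

def segment_edges(labels):
    """Boolean mask of pixels on a boundary between two different segments
    (4-connectivity): compare each pixel with its left and top neighbors."""
    edges = np.zeros(labels.shape, dtype=bool)
    edges[:, 1:] |= labels[:, 1:] != labels[:, :-1]
    edges[1:, :] |= labels[1:, :] != labels[:-1, :]
    return edges
```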

Figure 1. Segmentation of frame 0 (view 0) of the breakdancers sequence: a) color image; b) corresponding depth map; c) segmentation on the basis of color information only; d) segmentation on the basis of depth information only; e) proposed joint segmentation scheme.

Figure 2. Segmentation of frame 0 of the orbi sequence: a) color image; b) corresponding depth map; c) segmentation on the basis of color information only; d) segmentation on the basis of depth information only; e) proposed joint segmentation scheme before the final refinement; f) proposed scheme with the further refinement stage.

Figure 3. Architecture of the depth compression scheme of [7]: 1) segmentation; 2) mask processing; 3) arithmetic coding of the masks (lossless); 4) subsampling of the segmentation data; 5) subsampling of the depth maps; 6) lossless coding of the grid samples; 7) prediction; 8) lossy coding of the residuals.

Considering that color information is usually available together with the depth data, the proposed segmentation algorithm can be exploited to improve the performance of these methods. In this connection, we considered the scheme of [7] and replaced its initial segmentation step with the algorithm proposed in this paper. In [7] the segmentation step is used to identify and extract the edges and the main objects in the scene. The segmented data is then compressed. In the subsequent step, an ad-hoc algorithm is used to predict the surface shape from the segmented regions and a set of regularly spaced samples. Finally, the prediction residuals are compressed using standard image techniques. Figure 3 shows the architecture of this algorithm, while a complete description of the system can be found in [7].
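For illustration only: the prediction step can be emulated by fitting a smooth surface to a sparse grid of depth samples inside each segmented region and coding the residuals. [7] uses its own ad-hoc surface predictor, so the least-squares plane below is a simplified stand-in:

```python
import numpy as np

def plane_prediction_residuals(depth, labels, step=8):
    """Fit a plane z = a*u + b*v + c to a regular grid of samples inside
    each region, predict the full region, and return the residuals."""
    pred = np.zeros_like(depth, dtype=np.float64)
    v, u = np.mgrid[0:depth.shape[0], 0:depth.shape[1]]
    for c in np.unique(labels):
        mask = labels == c
        grid = mask & (u % step == 0) & (v % step == 0)  # sparse grid samples
        if grid.sum() < 3:          # not enough samples to fit a plane
            continue
        A = np.column_stack([u[grid], v[grid], np.ones(grid.sum())])
        coef, *_ = np.linalg.lstsq(A, depth[grid], rcond=None)
        pred[mask] = coef[0] * u[mask] + coef[1] * v[mask] + coef[2]
    return depth - pred             # residuals to be coded lossily
```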

By also exploiting the color information, the proposed algorithm performs a more accurate segmentation of the depth map, thus allowing better compression performance. Figure 4 refers to a preliminary test on the first frame of view 1 of the breakdancers sequence and shows a comparison between the results presented in [7], those of the same scheme with the proposed joint segmentation, and JPEG2000 (note that the scheme of [7] was targeted at high-bitrate or near-lossless compression). The proposed scheme obtains a small but noticeable gain (on average 0.5 dB, and up to 1 dB at around 0.12 bpp). The gain is limited by the fact that here the target is depth compression alone, for which depth-based segmentation is already quite satisfactory. The proposed scheme is instead particularly suited to joint depth and color compression schemes, where the redundancy between the two kinds of data must be exploited and the proposed segmentation is a very good solution to locate consistent regions in the two scene descriptions. This problem will be the subject of future research.

Figure 4. Depth compression performance on the breakdancers sequence: PSNR (dB) versus bitrate (bpp) for the scheme of [7] with its standard segmentation, the same scheme with the proposed joint segmentation, and JPEG2000.

5. Conclusions

In this paper a novel scene segmentation framework that accounts for both color and geometry has been presented. The goal of the paper is to represent both kinds of information in a unified way, and to provide a solid foundation for the application of various clustering techniques. In the preliminary work presented here, we introduced a unified vector representation of the samples, including color and depth data, and applied the k-means clustering algorithm. Experimental results show the effectiveness of the proposed scene segmentation method. The choice of the optimal clustering technique is beyond the purposes of this paper and will be the subject of future investigation. We also presented one particular application of the developed scene segmentation algorithm in depth compression for DIBR schemes. In this direction, another aspect that will be investigated is the exploitation of the proposed scheme in joint depth and color compression. The variety of possible applications is in any case not limited to the compression problem, and is in continuous expansion.

References

[1] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603-619, May 2002.
[2] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167-181, 2004.
[3] A. Frick, F. Kellner, B. Bartczak, and R. Koch. Generation of 3D-TV LDV-content with time-of-flight camera. In Proc. of 3DTV Conference, 2009.
[4] ISO/IEC MPEG & ITU-T VCEG. Joint Draft 8.0 on Multiview Video Coding, July 2008.
[5] S. Milani and G. Calvagno. A depth image coder based on progressive silhouettes. IEEE Signal Processing Letters, 17(8):711-714, Aug. 2010.
[6] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47:7-42, 2002.
[7] P. Zanuttigh and G. M. Cortelazzo.
Compression of depth information for 3D rendering. In Proc. of 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, pages 1-4, May 2009.
[8] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski. High-quality video view interpolation using a layered representation. ACM Transactions on Graphics, 23(3):600-608, 2004.