566 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 6, NO. 5, SEPTEMBER 2012

Noisy Depth Maps Fusion for Multiview Stereo Via Matrix Completion

Yue Deng, Yebin Liu, Qionghai Dai, Senior Member, IEEE, Zengke Zhang, and Yao Wang, Fellow, IEEE

Abstract: This paper introduces a general framework to fuse noisy point clouds from multiview images of the same object. We solve this classical vision problem using a newly emerging signal processing technique known as matrix completion. With this framework, we construct the initial incomplete matrix from the point clouds observed by all the cameras, with the points invisible to a camera denoted as unknown entries. The observed points corresponding to the same object point are put into the same row. When properly completed, the recovered matrix should have rank one, since all the columns describe the same object. Therefore, an intuitive approach to complete the matrix is to minimize its rank subject to consistency with the observed entries. In order to improve the fusion accuracy, we propose a general noisy matrix completion method called log-sum penalty completion (LPC), which is particularly effective in removing outliers. Based on the majorization-minimization (MM) algorithm, the non-convex LPC problem is effectively solved by a sequence of convex optimizations. Experimental results on both point cloud fusion and MVS reconstruction verify the effectiveness of the proposed framework and the LPC algorithm.

Index Terms: Compressive sensing, fusion, matrix completion, multiview stereo (MVS), point cloud.

I. INTRODUCTION

MULTIVIEW STEREO (MVS) reconstruction has drawn significant attention in a wide range of practical applications, e.g., Google Earth, cultural relics preservation, and 3-D games. One prevalent method for MVS reconstruction is the depth map merging approach.
It generates the whole geometry of the 3-D object by computing the depth maps, also known as the point clouds, from multiple views and then fusing these point clouds to obtain the entire model of the object. For depth-map-merging MVS, the task of computing the depth map from a pair of calibrated cameras is a relatively well-studied and mature subject. However, the rough depth maps generated via stereo matching have two prominent drawbacks: they are redundant and noisy. The depth maps are redundant since a point on the surface of the object may be visible and recovered from a number of views. Furthermore, the recovered 3-D positions of the same surface point by different stereo pairs are often different due to errors in depth-map generation. This is because recovering 3-D information from calibrated images involves multiple steps, including camera calibration, feature matching, and hole filling.

Manuscript received August 06, 2011; revised January 06, 2012; accepted March 25, . Date of publication April 19, 2012; date of current version August 10, . This work was supported in part by the National Basic Research Project (No. 2010CB731800) and the Project of NSFC (No. , , , and ). The work of Y. Deng was supported in part by a Microsoft fellowship. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Levent Onural. Y. Deng, Y. Liu, Q. Dai, and Z. Zhang are with the Automation Department, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China (e-mail: yuedeng.thu@gmail.com; liuyebin@tsinghua.edu.cn; qhdai@tsinghua.edu.cn; zzk@tsinghua.edu.cn). Y. Wang is with the Department of Electrical and Computer Engineering, Polytechnic Institute of NYU, Brooklyn, NY, USA (e-mail: yao@poly.edu). Color versions of one or more of the figures in this paper are available online. Digital Object Identifier /JSTSP
Any error in each step will disturb the accuracy of the generated depth map. In this work, we propose to reduce the noise and the redundancy among the multiview depth maps from the perspective of matrix completion. In the following, we will use the terms point cloud and depth map interchangeably, as both refer to the surface geometry recovered from a stereo pair of camera views. In addition, the word depth refers to the 3-D position of a point in a predefined world coordinate system. It is not difficult to understand the idea of using matrix completion techniques to accomplish fusion of multiview point clouds. From the point clouds generated from all the cameras, we first construct an incomplete fusion matrix. Each column vector in this matrix contains the points seen by one stereo pair of cameras, with the visible points regarded as known entries and invisible points as missing entries. The points from different clouds are ordered so that points corresponding to the same object point are put into the same row. If all the known entries are error-free, and the missing entries are filled properly, the completed matrix should have identical columns, i.e., a rank of one, since all the vectors describe the same set of surface points. However, because of the noise in the depth maps, the known entries in the same row are not identical. The problem is to try to recover a complete matrix that has rank one, based on the observed entries. Unfortunately, typical matrix completion algorithms [1], [2] can only handle matrices that are noise-free or have only small noise on the observed entries. In order to overcome the shortcomings of typical matrix completion algorithms, principal component pursuit (PCP) [3] was proposed to address the problem of recovering a low-rank matrix corrupted by outliers. It completes/recovers a low-rank matrix by penalizing the ℓ1 norm of the outliers and large noises on the observed entries.
Using the ℓ1 norm to describe sparse outliers is straightforward, but perhaps not the most effective choice. Previous works [4], [5] have indicated that the log-sum term approximates the ℓ0 norm much more closely than the ℓ1 norm does, and thus it can represent the sparsity of the signal much better. Therefore, inspired by the works on PCP and matrix completion, we propose a robust algorithm,

i.e., log-sum penalty completion (LPC), for matrix completion from noisy entries. The log-sum penalty can enhance the sparsity of the errors in the matrix, so that outliers in the known entries do not adversely affect the recovered matrix. Although such a penalty is effective, unfortunately, it makes the objective function nonconvex. In general, nonconvex problems can be extremely hard to solve. In order to make the problem tractable, we introduce an effective nonconvex optimization strategy based on majorization minimization (MM) [4], [6], which iteratively replaces the nonconvex component of the objective function with a convex upper bound. Then, it is possible to solve the convex surrogates by the alternating direction method [7] with the augmented Lagrange multiplier [8]. The proposed framework and algorithms will be verified from different perspectives using both synthetic and real datasets. All these experiments demonstrate the effectiveness of the proposed framework in improving the fusion accuracy. The main contributions of this paper consist of the following two aspects: 1) The proposed fusion framework using matrix completion solves an old problem from a fresh perspective and improves the fusion accuracy of MVS in spite of noise and outliers. The matrix completion approach may be regarded as a mathematical filter to reduce the noise levels of the point clouds for any available depth map estimation method. To the best of our knowledge, this is the first work to use a matrix completion method for depth map fusion. 2) The proposed LPC matrix completion algorithm can effectively complete a low-rank matrix from noisy observations, particularly when the observed entries suffer from non-Gaussian noise (such as outliers and bias).
LPC is not restricted to the scope discussed in this work, and we believe that it is a general algorithm for noisy matrix completion that can solve many other practical problems [9]. The remainder of this paper is structured as follows. Section II reviews related works on MVS reconstruction and matrix completion in order to highlight our contributions relative to previous works. The algorithms for point cloud extraction and initial fusion matrix construction are introduced in Section III. Then, we propose a robust matrix completion algorithm to complete the fusion matrix in Section IV. The experimental verifications are conducted in Sections V and VI. The former evaluates the point cloud fusion accuracy on both synthetic and practical data. The latter compares the MVS reconstruction results on a public domain data set, using the proposed method and prior approaches that do not use the matrix-completion framework. We conclude this paper in Section VII.

II. RELATED WORKS AND CONTRIBUTIONS

In this part, we first review previous works on MVS reconstruction, especially depth-maps-merging algorithms. This is followed by a review of related matrix completion algorithms. The differences of this paper from previous works and our contributions are then presented and discussed.

A. MVS Reconstruction and Depth Map Fusion

Following the taxonomy of Seitz et al. [10], stereo-based reconstruction algorithms are generally categorized into four kinds: 3-D volumetric approaches [11], [12], feature expansion [13]-[15], surface evolution techniques [16], [17], and depth-map-based methods [18]-[21]. Volumetric approaches compute a cost function on a 3-D volume (grid) and extract a complete surface from the discrete voxels. Furukawa et al. proposed the feature propagation method. They generate the 3-D information from the feature points on the 2-D images. Then, the whole 3-D model is expanded via these feature points.
Surface-evolution-based algorithms seek accurate reconstruction by optimizing the meshes on a rough 3-D surface. Depth-map-based reconstruction refers to the problem of generating and fusing multiple partial depth maps. The first step is to generate the depth maps from calibrated images. Campbell et al. [22] considered extracting depth maps via a discrete Markov random field (MRF). Gargallo et al. [23] proposed a Bayesian model for depth map generation. The depth information is optimized via the EM algorithm. Liu et al. [24] adopted the variational depth estimation approach and chose the best match from multiple candidates to extract high-quality depth maps. However, the point clouds generated via stereo matching over multiple camera pairs are generally not consistent. Therefore, a number of point cloud fusion strategies concentrate on how to reduce the redundancy and improve the accuracy of the fused point cloud. Merrell et al. proposed a fast algorithm for depth map fusion [19]. They select the most representative depth information by projecting it onto the original calibrated images with the least distance. Then, the redundancy among the rough point cloud is reduced by clustering. A similar idea is also adopted in Liu's work [18] and Bradley's work [25]. They took advantage of the normalized cross correlation (NCC) metric to filter the depth maps. Many previous works, including [22], [25], [26], use such physical filtering methods to reduce the noise levels in depth maps. These physical filtering methods are easy to implement and are effective in handling outliers under some desirable metric. In [26], the authors indicated that the NCC metric is effective in removing isolated patches from the surface. The fusion methods reviewed above can all be categorized as physical filtering since they take advantage of the physical properties of cameras and surface constraints to remove the noise in the generated depth maps.
However, physical filtering alone is often not enough. For example, in Fig. 1(a), the point clouds are generated via stereo matching algorithms and the noise is preprocessed by the NCC projection methods suggested in [26] and [24]. Obviously, there is still a large quantity of errors, e.g., the black dots indicating outliers, around the surface. In this paper, we introduce a mathematical filtering method to further improve the accuracy of the point cloud obtained after physical filtering. Some preliminary and naive mathematical filtering approaches have been used for this task. In [19], a clustering method is used to fuse nearby noisy points in different clouds. In this paper, we will introduce a fusion method based on matrix completion, which has been demonstrated to be a robust mathematical filter in a number of applications [3], [9], [27]. The proposed matrix fusion algorithm can further reduce the redundancy and improve the accuracy of the fused point cloud.
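To see why naive mathematical filtering such as per-row averaging suffers under outliers, consider the following toy experiment (all numbers are our own hypothetical illustration, not data from the paper): a rank-one fusion matrix is corrupted by small Gaussian noise plus a few large positive outliers, and the row-wise mean is compared with a robust alternative, the row-wise median.

```python
import numpy as np

rng = np.random.default_rng(0)
depth = rng.uniform(1.0, 2.0, 50)                 # true depths of 50 surface points
F = np.tile(depth[:, None], (1, 8))               # rank-1 ground-truth fusion matrix
F += rng.normal(0.0, 0.01, F.shape)               # small Gaussian measurement noise
outliers = rng.random(F.shape) < 0.08             # sparse outliers far from the surface
F[outliers] += rng.uniform(0.5, 1.0, outliers.sum())

avg_err = np.abs(F.mean(axis=1) - depth).mean()   # naive average fusion
med_err = np.abs(np.median(F, axis=1) - depth).mean()  # robust median fusion
```

Here the mean absorbs every outlier while the median largely ignores them, which is the motivation for the outlier-aware completion model developed in Section IV.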

Fig. 1. Point cloud fusion on practical depth maps with different mathematical filtering strategies. (a) Rough point cloud, (b) K-means fusion, (c) Average fusion, and (d) MC-fusion (LPC).

B. Matrix Completion and Its Applications

Low-rank matrix completion is a well-studied but still challenging problem in the field of signal processing. It considers how to complete a low-rank matrix from only a small portion of observations. We first introduce the typical matrix completion problem. Suppose we are given partial information about an unknown matrix M, and the only information available about M is a subset of its entries, indexed by Ω, where Ω indicates the locations of the known entries. Based on Ω, the sampling operator P_Ω is defined by

[P_Ω(M)]_{ij} = M_{ij} if (i, j) ∈ Ω, and 0 otherwise. (1)

The matrix completion problem attempts to recover the matrix M only from P_Ω(M). Recent progress in compressive sensing indicates that when the rank of the matrix is low, the matrix can be recovered from the incomplete matrix by solving the following optimization problem:

minimize rank(X) subject to P_Ω(X) = P_Ω(M) (2)

where X is the decision variable and should be recovered via optimization. Unfortunately, solving this problem is proven to be NP-hard, and all known algorithms for exactly solving it are doubly exponential in theory and in practice [28]. A modern approach is to optimize its convex envelope as a substitution and thus solve the problem via convex relaxation [28]. The nuclear norm, denoted as ‖X‖_*, is the convex envelope of rank(X). Assume matrix X has r nonzero singular values σ_1, ..., σ_r. The nuclear norm of X is defined as the summation of these singular values, i.e., ‖X‖_* = Σ_{i=1}^{r} σ_i. Therefore, the matrix completion problem can be effectively solved by

minimize ‖X‖_* subject to P_Ω(X) = P_Ω(M). (3)

The optimization in (3) can be addressed by relaxing the equality constraint into the objective function:

minimize ‖X‖_* + (μ/2) ‖P_Ω(X) − P_Ω(M)‖_F² (4)

where ‖·‖_F is the Frobenius norm.¹
The parameter μ starts from a small positive constant and is then iteratively increased following μ ← ρμ, where ρ > 1 is an ascending parameter whose value is fixed in our simulation. Equation (4) is a typical form of matrix completion, which can be solved by the fixed point continuation (FPC) method [1] or by the proximal gradient (PG) method [29]. The label TMC stands for the typical matrix completion problem. Although typical matrix completion algorithms are powerful for many practical applications, unfortunately, they cannot be directly applied to complete the fusion matrix, which tends to have outliers in the known entries due to large errors in depth estimation. In [2], Candès has proven that the formulation in (4) can only handle small Gaussian noise. When we applied this method to a fusion matrix generated by practical depth estimation algorithms, the algorithm often failed to converge to a correct solution. See the experiments in Subsection V-C for details. In order to overcome this dilemma of typical matrix completion algorithms, principal component pursuit (PCP) [3] was proposed to recover a low-rank matrix with sparse outliers on the observed entries. PCP was developed based on the assumption that outliers occupy only a small portion of the total number of observations. Therefore, it introduces another unknown sparse matrix E to describe the outliers and recovers both X and E by solving

minimize ‖X‖_* + λ‖E‖_1 subject to P_Ω(X + E) = P_Ω(M). (5)

In this paper, we refer to this method as ℓ1-comp. to reflect the fact that it minimizes the ℓ1 norm of E. The ℓ1 norm accumulates the absolute values of all the entries in a matrix, i.e., ‖E‖_1 = Σ_{ij} |E_{ij}|. As suggested in [3], the penalty parameter can be fixed as λ = 1/√(ρ_s n), where ρ_s is the portion of the observed entries and n is the larger of the row number and column number of the matrix. Theoretical discussions on why (5) can remove both outliers and Gaussian noise are provided in [30].
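The continuation scheme around (4) can be sketched numerically with singular value thresholding as the proximal step. This is a minimal illustration under our own assumptions (function names, toy matrix, and all parameter values are ours), not the FPC [1] or PG [29] implementations:

```python
import numpy as np

def svt(X, tau):
    """Singular value soft-thresholding: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete_fpc(M, mask, mu0=1e-2, rho=1.05, iters=300):
    """Continuation sketch of (4):  min ||X||_* + (mu/2) ||P_Omega(X - M)||_F^2,
    solved by iterated singular value thresholding while mu is increased by a
    factor rho each iteration (so the threshold 1/mu shrinks toward zero)."""
    X = np.zeros_like(M)
    mu = mu0
    for _ in range(iters):
        X = svt(X - mask * (X - M), 1.0 / mu)   # gradient step, then nuclear prox
        mu *= rho                               # continuation on mu
    return X
```

On a small synthetic rank-one matrix with most entries observed and no noise, the iterates reproduce the observed entries and fill the missing ones with values consistent with a low-rank model.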
Matrix completion arose from a number of practical applications, e.g., collaborative filtering [31], global positioning [32], and system identification [33]. With respect to computer vision, Tong et al. [34] made use of matrix completion

¹The Frobenius norm is equivalent to the ℓ2 norm. However, the ℓ2 norm is typically used for vectors, while the F-norm is mainly used for matrices.

for light transport. Based on the traditional light transport equation, they attempted to recover the light transport matrix from different light sources. Garg et al. [35] demonstrated that real-world scene appearances are distributed in some low-rank subspaces. They used the principal component analysis (PCA) method to show that the same scene under different lighting conditions exhibits a low-rank structure using 2-D images. Recently, low-rank matrix recovery models were successfully applied to occluded face completion [36] and face image alignment [37]. Fu et al. took advantage of matrix completion for adaptive compressive sensing image reconstruction [38]. In this paper, we will extend the power of matrix completion to the field of noisy point cloud fusion. In our model, matrix completion serves as a robust mathematical filter to remove the outliers and conflicts in the rough depth maps generated by computer vision algorithms. Besides, we will design a robust matrix completion algorithm called log-sum penalty completion (LPC). Owing to the power of the log-sum term and the reweighting approach, we can effectively complete the noisy matrix by a sequence of convex optimizations.

III. CONSTRUCTION OF INCOMPLETE FUSION MATRIX

In this part, we will briefly introduce our variational method for depth map generation [24]. All the practical depth maps used in this work are generated by this method. Then, based on the generated depth maps, the method for constructing the initial fusion matrix will be introduced. Finally, we will discuss the noises in the fusion matrix. A.
Depth Maps Generation

1) Variational Depth Estimation: Depth map estimation is a classic problem in computer vision and there are multiple approaches to this topic, but in this paper we will follow the material in our previous publication [24], which explores the consistency between neighboring images for depth estimation. Consider depth estimation using two views: a target image I_t and a reference image I_r. We use a disparity vector d(x) to denote the shift from the pixel position x on the target image to the optimal correspondence on the epipolar line in the reference image. We use Ω_t to denote the object area in the target image, and d the disparity field, which contains all the disparity vectors in Ω_t. Inspired by the works on optical flow optimization [39], we obtain the optimal d by minimizing the following energy function:

E(d) = E_data(d) + E_smooth(d). (6)

The first term represents the data consistency between corresponding pixels in the reference and target images. The second term measures the smoothness of the disparity field. The data consistency term is defined as

E_data(d) = ∫_{Ω_t} O(x) ψ(|I_r(x + d(x)) − I_t(x)|² + γ|∇I_r(x + d(x)) − ∇I_t(x)|²) dx (7)

where I_t(x) denotes the illuminance of the pixel x on the target image and I_r(x + d(x)) is the illuminance of the corresponding pixel on the reference image. In the integrand of (7), the first term describes the color consistency and the second term, i.e., the first-order gradient, guarantees the illumination robustness. Since the data constraint is not always accurate due to many uncertainties, e.g., noise, occlusions, and brightness changes, we apply the robust function ψ proposed in [39] to enhance the robustness of the optimization. O(x) is an occlusion map and we will explain its effectiveness later. The smoothness term is defined as

E_smooth(d) = ∫_{Ω_t} ψ(|∇d(x)|²) dx (8)

which avoids significant changes in the disparity vectors of neighboring pixels. Following [39], ψ is a robust function which measures the total variation (TV) of the disparity. The problem of minimizing the energy function in (6) can be converted to the Euler-Lagrange equation via the variational method. More details can be found in [24], [39].
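The discrete counterpart of the energy in (6) can be sketched as follows. This is a simplified illustration under our own assumptions (1-D horizontal disparity, a Charbonnier-style robust function, and illustrative parameter names), not the actual implementation of [24] or [39]:

```python
import numpy as np

def energy(disp, target, ref, occ, gamma=0.1, alpha=1.0, eps=1e-6):
    """Discrete sketch of (6): an occlusion-masked robust data term on color
    and gradient constancy, plus a TV-like smoothness term on the disparity.
    `disp` holds horizontal disparities; all names are illustrative."""
    h, w = target.shape
    # sample the reference image at x + d(x), clipped to the image border
    xs = np.clip(np.arange(w) + np.round(disp).astype(int), 0, w - 1)
    warped = ref[np.arange(h)[:, None], xs]
    gt = np.gradient(target, axis=1)          # first-order gradient terms of (7)
    gw = np.gradient(warped, axis=1)
    psi = lambda s: np.sqrt(s + eps)          # robust (Charbonnier) function
    data = np.sum(occ * psi((warped - target) ** 2 + gamma * (gw - gt) ** 2))
    dy, dx = np.gradient(disp)                # smoothness term of (8)
    smooth = np.sum(psi(dx ** 2 + dy ** 2))
    return data + alpha * smooth
```

As a sanity check, the energy is lower at the true disparity of a synthetically shifted image pair than at a wrong disparity, which is the property the variational minimization exploits.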
After obtaining the disparity between two views, it is not difficult to calculate the depth information at each pixel based on the known camera parameters. We apply the variational depth estimation with a coarse-to-fine strategy at five different scales and obtain five depth maps for each pair of views. From the five candidates, a refined depth map for each pair of views is then generated via the zero-mean normalized cross correlation (ZNCC) metric. See [24] for details. 2) Occlusion Detection: In order to improve the robustness of optical flow estimation, following the ideas in [40], an occlusion map is added in (7). The occlusion map contains only binary values, set to one at non-occluded positions and zero otherwise. The introduction of such an occlusion map guarantees that only the non-occluded pixels are involved in the optical flow optimization. We estimate the disparity field and the occlusion map iteratively, following [40]. Briefly, to estimate the occlusion map at a new iteration, we first calculate the raw disparity from the left image to the right image, and the disparity from the right to the left image, respectively. If the discrepancy between the two at a certain location is larger than a threshold (e.g., 1.5 pixels), the corresponding position is regarded as an occluded pixel. Since the disparity fields are optimized by variational iterations, the occlusion map is also updated in an online manner. Initially, we assume that no pixels are occluded. As the disparity optimization goes on, the occlusion map is iteratively updated. Meanwhile, the estimated occlusion map is incorporated into the disparity estimation of the next iteration. Note that we cannot calculate the depth information at occluded pixels. However, with the proposed matrix fusion framework, missing depth information at such pixels will not significantly affect the final result, since such depth values can be inferred from other views via matrix completion.
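The left-right consistency check described above can be sketched as follows (a simplified 1-D-disparity version under our own assumptions; function and variable names are illustrative):

```python
import numpy as np

def occlusion_map(disp_lr, disp_rl, thresh=1.5):
    """Left-right consistency check (1 = visible, 0 = occluded): for each left
    pixel, look up the right-to-left disparity at the position it maps to; the
    two disparities should cancel, and a mismatch above `thresh` pixels marks
    the pixel as occluded."""
    h, w = disp_lr.shape
    occ = np.ones((h, w), dtype=np.uint8)
    xs = np.arange(w)
    for y in range(h):
        # position in the right image that each left pixel maps to
        xr = np.clip(np.round(xs + disp_lr[y]).astype(int), 0, w - 1)
        diff = np.abs(disp_lr[y] + disp_rl[y, xr])   # zero when consistent
        occ[y, diff > thresh] = 0
    return occ
```

Consistent disparity fields yield an all-ones map, while regions where the two directions disagree are flagged as occluded, mirroring the iterative update described in the text.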
Section V-E1 contains more discussions on this issue. 3) Physical Filtering: The depth maps generated via the aforementioned approach may contain many errors and bad matches. Before proceeding to matrix fusion, we will first remove some

Fig. 2. Construction of the incomplete fusion matrix. (a) The visibility of cameras and (b) incomplete matrix entries clustering.

obvious errors by physical filtering. We mainly follow three steps in sequence to remove the outliers and bad matches.

NCC metric. As suggested in [24] and [25], we first use the NCC metric to select the best depth information for a pixel and abandon bad matches under a certain threshold.

Disconnected points. If a 3-D point in the world coordinate system has a small number of neighbors, the point is regarded as a disconnected one and should be filtered out. In our method, points with fewer than n̄ neighbors are removed, where n̄ is the average number of neighbors over all the points.

Inaccurate solutions. Moreover, we remove the points for which the angle between the normal and the camera axis is larger than 45 degrees. A large angle means that the point is not accurate and a better 3-D position can be obtained from another view.

B. Incomplete Fusion Matrix Construction

Before introducing the matrix construction algorithm, we first define some notation. In the following, we use C_i to denote the i-th camera and I_i the image captured by this camera, with x denoting a pixel in I_i. The corresponding 3-D point for pixel x is denoted by P_i(x). The set of such points constitutes the point cloud and will be denoted by V_i. In our simulations, V_i is generated by selecting I_i as the target image and its neighboring image as the reference image. Due to occlusions and camera positions, the visibility of each camera is quite limited, as illustrated in Fig. 2(a). The visible surface from camera C_i covers the area of the yellow region plus the red region. For view C_j, the visibility covers the area of the blue region plus the yellow region. The yellow region represents the joint/overlapping region of the two cameras.
By assuming a camera that can see the whole geometry of the 3-D object, one could denote all the 3-D surface points seen from it as a vector. The visible points, i.e., the arc BD, are known entries in this vector, while the invisible parts, i.e., the arc DB (counterclockwise), are unknown entries. When there are n cameras distributed around the object, we can form n incomplete vectors f_1, ..., f_n. We stack these vectors as a matrix and obtain

F = [f_1, f_2, ..., f_n]. (9)

²In Fig. 2(a) we use a curve segment to denote the visibility of cameras. In practice, the visibility of a camera is an area on the 3-D object.

In this incomplete matrix, each column consists of the points seen by one camera, whereas each row consists of the points seen by all the cameras that correspond to the same surface point of the object. Note that a point has three coordinates, x, y, and z. Therefore, there should be three matrices F_x, F_y, and F_z. However, for clarity of notation, we use F to represent any one of these three matrices. As will become clear, these matrices are generated simultaneously as we identify points detected from different views that correspond to the same object point. However, after their construction, each incomplete matrix is completed separately using the same algorithm. At this point, it may not be clear to the reader how to determine the number of rows of the matrix and how to determine which points from different views should be put in the same row. This will be discussed next. As shown in Fig. 2(b), let the estimated 3-D point corresponding to pixel x in image I_i be denoted by P_i(x). If we project P_i(x) to another image I_j, we will get its 2-D correspondence pixel x_j.³ Because of depth estimation error and calibration inaccuracy, this pixel is not necessarily the real correspondence. Therefore, pixels in the R-neighboring region⁴ of x_j, i.e., the pink region in Fig. 2(b), are first back-projected to 3-D space to obtain a point set in the j-th point cloud.
Then the nearest point is defined as

P* = arg min_{P ∈ S_j} ‖P − P_i(x)‖_2 (10)

where S_j is the back-projected point set described above. We denote the pixel corresponding to this 3-D point in image I_j by x*. If the distance between P* and P_i(x) is less than a specific threshold, these two points are regarded as describing the same surface point and will be placed in the same row. Otherwise, this entry for the j-th column is denoted as an unknown entry. For each pixel in image I_i and its corresponding 3-D point, we traverse the points of each image to search for the nearest point, and if, for example, only two cameras can detect this point, the row vector for this point will have known entries only in the two corresponding columns. For those cameras that cannot see this point, their corresponding entries

³If the projection is not at an integer pixel, we choose its nearest integer pixel. ⁴R is fixed at 10 pixels in our experiments.

Fig. 3. Illustrations of the noise and errors in the point cloud in MVS. (a) The Gaussian noise, (b) the bias noise, (c) the outliers among the point cloud, and (d) a toy matrix. In each sub-figure, the blue triangle represents the camera.

are unknown. For cameras that can detect the surface point, we mark the corresponding pixels as detected. The above projection and matching procedures construct only one row of the incomplete matrix. We repeat the clustering procedure until all the pixels in all the images are marked as detected. The detailed steps for constructing the incomplete fusion matrix are summarized in Algorithm 1. Although the algorithm involves many loops, the construction is fast due to the marking strategy. A previously detected pixel will not be traversed again in any loop, which means each pixel will be used only once.

C. Noises in the Incomplete Fusion Matrix

In the previous subsection, we explained the incompleteness of the fusion matrix. In this part, we will analyze another critical property of the fusion matrix, i.e., that its observed entries are noisy. The noise comes from the original depth maps and may be caused by various factors, but, for the sake of simplicity, we categorize it into three kinds. The first one is Gaussian noise, which is illustrated in Fig. 3(a). The recovered points (blue dots) all lie within a certain distance of the ground truth position [the distance is labeled in Fig. 3(a)]. The Gaussian noise can be modeled by a zero-mean normal distribution with variance σ², i.e., N(0, σ²). The second one is the bias noise. The bias noise is mainly caused by calibration errors. In Fig. 3(b), for instance, the real position of the camera is represented by the blue triangle. However, its extrinsic parameters are mistakenly calibrated to the position of the red triangle.
Therefore, the point clouds recovered by this camera (represented by the red dotted line) all have a bias Δ relative to the real position of the point cloud (represented by the blue line). The bias noise can usually be modeled by a Δ-mean normal distribution, i.e., N(Δ, σ²). Outliers also widely exist in the estimated point cloud. An outlier is a wrongly identified 3-D point that is far away from the real surface. For instance, in Fig. 3(c), the green dots are outliers, which are far away from the real surface point in red. Compared with the points disturbed by Gaussian noise, e.g., the blue dots, the outliers have a large distance to the ground truth. Fortunately, with a good depth estimation algorithm, the outliers occupy only a small portion of the points identified by all the cameras. Although the physical filtering method can pre-remove some significant outliers and isolated patches, there may still exist some outliers distributed close to the surface, which are hence not removed by physical filtering. Since the outliers do not follow any specific distribution, it is reasonable to model them by a uniform distribution U(a, b), where U denotes the uniform distribution and a, b are its boundaries. Fig. 3(d) illustrates an example of the incomplete matrix with missing entries and the above three kinds of noise.

IV. LOG-SUM PENALTY FOR MATRIX COMPLETION

Since all the column vectors in the matrix F describe the same surface, in the ideal situation that all depth maps are error-free, and when we fill all the missing entries properly, the recovered fusion matrix should have rank 1. However, what we can get from the depth maps is a noisy and incomplete matrix, due to the errors in the estimated depth maps and the visibility constraints of the cameras. The low-rank prior inspires us to complete the matrix from its observed entries by minimizing the rank

of the recovered matrix subject to consistency with the observed entries in F. In this part, in order to achieve a robust and accurate completion of the noisy matrix, the log-sum penalty completion (LPC) algorithm is proposed, and we introduce the majorization-minimization (MM) method to solve it.

A. Log-Sum Penalty Completion

Ideally, to identify the sparse outlier noise, one should minimize the ℓ0 norm of the noise matrix, instead of the ℓ1 norm as in (5). However, that will lead to a non-convex optimization problem that is hard to solve. PCP [3] chooses to use the ℓ1 norm, so that the problem can be solved efficiently. Nevertheless, a number of research works on sparse signal recovery have indicated the limitation of approximating sparsity with the ℓ1 norm, e.g., [5] and [41]. Some other approximations are better choices for sparse representation than the ℓ1 norm. It has been shown that the log-sum term lies between the ℓ0 norm and the ℓ1 norm [5]. Therefore, in this work, in order to make a more accurate completion of the fusion matrix, we propose the following formulation, called log-sum penalty completion (LPC):

minimize ‖A‖_* + λ Σ_{ij} log(|N_{ij}| + δ) subject to P_Ω(A + N) = P_Ω(F) (11)

where A is the low-rank matrix to be recovered, N collects the noise on the observed entries, and δ is a small constant. Although we have placed a powerful penalty to enhance the sparsity of the noise in the LPC formulation, unfortunately, it also causes non-convexity in the objective function. It is well known that the log function is concave over (0, ∞). In optimization theory, non-convex problems can be extremely difficult to solve, but in our LPC model, a convex upper bound of the concave component can be easily defined. Accordingly, we use the MM algorithm to minimize this convex upper bound in an iterative manner. The majorization-minimization algorithm replaces a hard problem by constructing a convex upper bound of the objective function at each iteration and then minimizing that upper bound. To see how the MM algorithm works for LPC, recall that a concave function is bounded above by its first-order Taylor expansion.
The log-function is concave over (0, ∞), so log(x) ≤ log(x0) + (x − x0)/x0 for any x0 > 0. Therefore, it is straightforward to find an upper bound of the objective function in LPC by

  g(A, N; N^k) = ||A||_* + λ Σ_{(i,j)∈Ω} [ log(|N^k_ij| + δ) + (|N_ij| − |N^k_ij|) / (|N^k_ij| + δ) ]     (12)

In (12), g(A, N; N^k)^5 is an upper bound of the LPC objective for any chosen N^k. It is worth noting that minimizing g(A, N; N^k) is equivalent to minimizing the nuclear norm term plus Σ_{(i,j)∈Ω} |N_ij| / (|N^k_ij| + δ), since the remaining terms in g are independent of (A, N) and can thus be dropped. In the k-th iteration of the MM algorithm, we set N^k according to the previously solved noise matrix. The only remaining difficulty is how to minimize this surrogate function. Rewriting it in matrix form, we get

  minimize_{A,N}  ||A||_* + λ ||W ∘ N||_1
  subject to      P_Ω(A + N) = P_Ω(D)                         (13)

In (13), the operator ∘ in the error term denotes the component-wise product of two matrices, i.e., (W ∘ N)_ij = W_ij N_ij. W is the weight matrix with entries W_ij = 1 / (|N^k_ij| + δ), where δ is a small constant that avoids a zero denominator. The objective in (13) differs from (5) only in the expression inside the ℓ1-norm, where (13) places a weight matrix to reduce the influence of large noises. Here, based on the MM algorithm, we have converted the non-convex LPC optimization into a series of convex reweighted problems. We call this the reweighted method (REW), since the updated W iteratively penalizes the large noises in the fusion matrix. Problem (13) is convex, and quite a number of methods can be used to solve it, e.g., the proximal gradient (PG) algorithm [42] or the augmented Lagrangian multiplier (ALM) method [7], [8]. We choose the ALM algorithm in our simulations because it is more effective and efficient. Specifically, we relax the equality in (13) and instead minimize

  L(A, N, Y) = ||A||_* + λ ||W ∘ N||_1 + ⟨Y, P_Ω(D − A − N)⟩ + (μ/2) ||P_Ω(D − A − N)||_F²     (14)

where μ > 0 and Y is the Lagrange multiplier matrix, which can be updated via the dual ascent method; ⟨·,·⟩ is the inner product of two matrices. Equation (14) contains the two variables A and N and is convex. Accordingly, it is possible to solve this bi-variable problem via distributed optimization. Here, we use the alternating direction method (ADM) to address the problem, the prominent reason being its nice convergence behavior.
The convergence of the ADM for convex problems has been widely discussed and proven in a number of works [7], [43]. By ADM, we minimize (14) in three steps: A-minimization, N-minimization, and dual ascent. Interested readers are referred to [8] for detailed discussions. We now give the whole framework for solving the LPC model in (11) via the reweighted scheme in Algorithm 2. A general form of the proposed LPC algorithm and its theoretical properties are discussed in [9]. 5 Here, the form g(A, N; N^k) means that the function varies only with respect to the variables (A, N), while N^k is known.
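To make the scheme concrete, the following is a minimal sketch of the full LPC loop, a hypothetical re-implementation rather than the authors' code: the inner ALM/ADM iterations alternate singular value thresholding for A, weighted soft thresholding for N, and dual ascent, while the outer MM iterations refresh the weight matrix W. The parameter choices (μ initialization, growth rate ρ, iteration counts, tolerances) and function names are our own assumptions.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: proximal operator of tau*||.||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(X, T):
    """Entry-wise soft thresholding with (matrix-valued) threshold T."""
    return np.sign(X) * np.maximum(np.abs(X) - T, 0.0)

def lpc(D, mask, lam=None, delta=1e-3, outer=3, inner=200, tol=1e-7):
    """Sketch of LPC: outer MM reweighting around an inner ALM/ADM
    solver for the weighted problem (13). D holds the observed values;
    entries outside the boolean mask are ignored."""
    m, n = D.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))      # PCP default for lambda
    A = np.zeros_like(D)
    N = np.zeros_like(D)
    W = np.ones_like(D)                     # first pass == l1-completion
    d_norm = max(np.linalg.norm(D[mask]), 1.0)
    for _ in range(outer):
        Y = np.zeros_like(D)
        mu, rho = 1.25 / max(np.linalg.norm(D, 2), 1.0), 1.3
        for _ in range(inner):
            # A-step: SVT on the observed residual; unobserved entries
            # of the argument are carried over from the current A.
            A = svt(np.where(mask, D - N + Y / mu, A), 1.0 / mu)
            # N-step: weighted soft thresholding on observed entries.
            N = np.where(mask, soft(D - A + Y / mu, (lam / mu) * W), 0.0)
            R = np.where(mask, D - A - N, 0.0)   # constraint residual
            Y += mu * R                          # dual ascent
            mu *= rho
            if np.linalg.norm(R) <= tol * d_norm:
                break
        W = 1.0 / (np.abs(N) + delta)       # MM reweighting (outer step)
    return A, N
```

Setting W to all ones in the first outer pass makes that pass exactly the ℓ1-completion of (5); the subsequent reweighting is what distinguishes LPC from PCP-style completion.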

Fig. 4. Toy example for a rank-one matrix completion with different methods.

The whole framework of LPC involves two parts: outer and inner iterations. In the inner iterations (lines 3 to 9 in Algorithm 2), it solves a PCP-like convex optimization, with each new iteration using the updated Lagrangian multiplier and parameter μ (see line 7 in Algorithm 2). In the outer iterations, LPC updates the weight matrix W based on the previously recovered outlier matrix and constructs the convex surrogate. We define the inner and outer convergence rates by the relative changes of the recovered matrices between successive inner and outer iterations, respectively, where the superscript denotes the count of outer iterations. We regard the inner and outer convergence as achieved when these rates fall below preset tolerances. Experimental discussions on convergence analysis are provided in Section V-D2.

B. LPC for Rank-One Matrix Completion

LPC is a general framework for low-rank matrix completion from noisy observations. In the fusion task, however, the goal is to extract a rank-one component from the matrix. Some simpler methods, e.g., averaging each column or simply random sampling, may also lead to a rank-one solution, and these methods are all computationally cheaper than matrix completion. However, one prominent property of the fusion matrix is that the observed entries of the incomplete matrix are corrupted or disturbed by all kinds of noise, not just Gaussian noise. We use a toy example to demonstrate the inadequacy of these methods in Fig. 4. In the example, we use a 1-D vector to construct a rank-one matrix by repeating it multiple times column-wise. Each element of the 1-D vector is generated by randomly selecting a number between 20 and 230, since RGB color values range from 0 to 255. From the rank-one matrix, we randomly sample out 20% of the entries as unknown entries. We calculate the average value and variance of all the entries of the 1-D vector.
The remaining entries are corrupted by three kinds of synthetic noise: all the observed entries are disturbed by Gaussian noise; 50% of the observed entries are disturbed by bias noise; and 20% of the observed entries are corrupted by outliers. Based on such a noisy and incomplete matrix, we complete it and extract the rank-one structure via three methods. First, we use the average method to complete the noisy matrix: the missing entries are filled with the average value of all the observed entries in the same row, and the rank-one component is extracted by averaging all the observed entries in each row. The completed matrix and the 1-D vector are shown in Fig. 4. The left part of the Average result is the completed matrix; the right part compares the ground-truth vector (left) with the rank-one component extracted by averaging (right). From the comparison, it is obvious that the result is not satisfactory. We then advocate the optimization strategy to complete this noisy matrix. We first try the typical matrix completion (TMC) method in (4). Unfortunately, it does not converge to a reasonable solution6 with either the FPC solver [44] or the APG solver [29]. Then, we use the ℓ1-comp. in (5) and LPC, respectively, to complete the noisy matrix. The results are provided in the middle and right parts of Fig. 4. After matrix completion, we extract the rank-one component from the matrix. The extraction strategy is based on singular value decomposition (SVD): we set the higher-order singular values to zero and keep only the rank-one component. From the visual comparison, some noises cannot be removed by ℓ1-comp., while LPC provides a much cleaner rank-one matrix.
Although the rank-one components generated by both methods appear to be identical to the ground-truth vector, the recovery accuracy of LPC is higher than that of ℓ1-comp. 6 The returned matrix has rank 0 with all entries equal to zero when using the default settings of the public-domain program in [1].
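The extraction strategies compared in this toy example can be sketched as follows; the function names are our own, and the relative-error form of the accuracy measure is our assumption (the paper's exact normalization in (15) may differ).

```python
import numpy as np

def average_completion(D, mask):
    """'Average' strategy: fill each row's missing entries with the
    mean of that row's observed entries; the row mean is also the
    fused 1-D component."""
    counts = np.maximum(mask.sum(axis=1), 1)
    row_mean = (D * mask).sum(axis=1) / counts
    return np.where(mask, D, row_mean[:, None]), row_mean

def rank_one_extract(M):
    """SVD-based extraction: zero out all but the leading singular
    value and return the rank-one matrix plus its first column as the
    fused vector (all columns of a rank-one fusion matrix coincide)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    R1 = s[0] * np.outer(U[:, 0], Vt[0])
    return R1, R1[:, 0]

def completion_error(x_hat, x_true):
    """Relative l2 error between the recovered and ground-truth
    vectors, in the spirit of the accuracy criterion (15)."""
    return float(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

On a clean rank-one matrix all three agree; the differences in the toy example arise only from how each method treats the bias and outlier corruptions.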

This verifies that LPC can complete the matrix with higher accuracy. Here, we define the matrix completion accuracy by

  err = ||x̂ − x||_2 / ||x||_2     (15)

where x is the ground-truth vector used to generate the matrix and x̂ is the vector obtained by matrix completion. From the toy example, we find that both matrix-completion-based methods can accurately recover a rank-one matrix, and that the LPC method generates the 1-D vector closest to the ground truth. The average method, although simple and efficient, cannot accurately complete a rank-one matrix due to the noises and outliers on the observed entries.

V. NOISY DEPTH MAP FUSION RESULTS USING MATRIX COMPLETION FRAMEWORK

In the preceding sections, we introduced the fusion framework and proposed LPC to complete the fusion matrix. In this section, we investigate whether the proposed framework is effective in removing the noise on the rough point cloud. In the first two experiments, we add synthetic noise to the ground-truth depth maps. Then, we use the practical noisy depth maps generated via our depth estimation method to verify the power of the fusion framework. The ground-truth model is Fountain-P11 [45] in the EPFL dataset [46].

A. Experimental Setup

Before starting the fusion experiments, we first specify the parameter choosing strategy and explain some experimental details. The proposed matrix-completion-based fusion framework is robust: the matrix construction (Algorithm 1) and completion (Algorithm 2) can be processed automatically with few parameters to tune. The two parameters that should be specified are the error tolerance ε for matrix construction and the error penalization weight λ for matrix completion. Parameter ε controls the error tolerance for identifying the same object point.
It directly affects the total number of rows and the portion of known entries in the constructed matrix. Empirically, in this paper, we set ε so that 50% to 55% of the entries in the fusion matrix are known, because we have found through simulations that such a choice always yields a high fusion accuracy. Parameter λ controls the balance between the low-rank expression (nuclear norm term) and noise removal. As stated before, the authors of [3] suggested using λ0 = 1/√max(m, n) for an m × n matrix. However, our experiments show that this default parameter does not always lead to the best fusion accuracy. The fusion accuracy can be further improved by searching for the best λ in the range [0.5λ0, 1.5λ0]. In our experiments, this selection strategy guarantees that the first singular value of the completed matrix is 1000 times larger than the second one. A detailed discussion of the selection of λ is given in Section V-E2. Another point that should be emphasized is the complexity of matrix completion for large point clouds. Theoretically, the proposed LPC method can complete an arbitrarily large matrix; in practice, however, a large matrix may require an extremely large memory space. Therefore, for practical usage, in cases of a large fusion matrix, we divide it into several small sub-matrices, each with a bounded number of rows, and complete the sub-matrices with the parallel computing facilities in Matlab.

B. Synthetic Noise to Fusion Matrix

To evaluate the effectiveness of the proposed LPC for completing a low-rank matrix given the initial incomplete matrix, in the first experiment we directly add noises to the entries of the fusion matrix obtained from ground-truth depth maps, to see whether the LPC model can robustly remove such noises. The ground-truth data we used here is the mesh of Fountain-P11 [45], which is acquired by a 3-D scanner.
Since the fountain model is extremely large, we only select its central part: the two sides of the fountain model are flat walls, while the central part contains all the details of the fountain. We down-sample the central part to a set of discrete points, denoted as P, with each point in the set denoted as p. For the matrix construction, we assume that there are 11 cameras placed around the point cloud.7 We further assume that each point in P can be seen by k cameras, where k is a random integer uniformly drawn from 5 to 10. The visibility of cameras is simulated by a sequence of randomly generated numbers: if the random number associated with the i-th camera is drawn, we assume that the current point can be seen by the i-th camera, and the corresponding entry in the fusion matrix is assigned the point's value plus an error term. According to the discussion in Section III-C, the noise can be categorized into three kinds: Gaussian, bias, and outliers. Since we suppose that one point can be seen by k cameras, the k corresponding entries (randomly selected) in that row of the fusion matrix are assigned known values, while the remaining entries are unknown. From the observed data in the constructed matrix, we calculate the mean and standard deviation. In this experiment, Gaussian noises are independently generated based on these statistics and added to all the observed entries in the fusion matrix. The bias noise is brought into the matrix by the inaccuracy of the estimated cameras' extrinsic parameters, so it can only affect some cameras. In the fusion matrix, columns correspond to the series of cameras; thus, we randomly select three columns in the fusion matrix and add bias noise, whose direction is controlled by a random sign operator. For the outliers, we randomly select 20% of all the observed entries and corrupt them with uniformly distributed values.
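The simulation protocol above can be sketched as follows; the noise magnitudes, the seed, and the helper name are our own illustrative assumptions.

```python
import numpy as np

def synthetic_fusion_matrix(points, n_cams=11, kmin=5, kmax=10,
                            sigma=0.5, bias=2.0, outlier_frac=0.2, seed=0):
    """Build the synthetic incomplete fusion matrix: each point is seen
    by k cameras (k uniform in [kmin, kmax]); observed entries carry
    Gaussian noise, three random columns carry a signed bias, and a
    fraction of the observed entries are replaced by uniform outliers."""
    rng = np.random.default_rng(seed)
    n = len(points)
    D = np.zeros((n, n_cams))
    mask = np.zeros((n, n_cams), dtype=bool)
    for i, p in enumerate(points):
        cams = rng.choice(n_cams, size=rng.integers(kmin, kmax + 1),
                          replace=False)
        mask[i, cams] = True
        D[i, cams] = p + rng.normal(0.0, sigma, size=len(cams))
    for c in rng.choice(n_cams, size=3, replace=False):   # bias noise
        D[:, c] += rng.choice([-1.0, 1.0]) * bias * mask[:, c]
    obs = np.argwhere(mask)                               # outliers
    pick = obs[rng.choice(len(obs), size=int(outlier_frac * len(obs)),
                          replace=False)]
    D[pick[:, 0], pick[:, 1]] = rng.uniform(points.min(), points.max(),
                                            size=len(pick))
    return D, mask
```

Each row of the returned matrix describes one object point across the 11 cameras, so the clean matrix is rank one by construction and only the three noise sources break that structure.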
From the noisy fusion matrix, we recover the low-rank matrix via three different methods. First, we use the proposed LPC method to recover the rank-one matrix; the corresponding fusion result is shown in Fig. 5(a). For comparison, from the 7 In the dataset, 2-D images from 11 views of the fountain are provided, so in this virtual simulation experiment we also fix the number of cameras at 11.

Fig. 5. The matrix completion results when synthetic noises are added to the ground-truth fusion matrix. (a) to (c) provide fusion results with LPC, ℓ1-completion, and 1-D average. (a) LPC (5.4e-3), (b) ℓ1-completion (2.1e-2), (c) 1-D vector average (8.9e-2).

same noisy matrix, we complete the matrix by ℓ1-completion and by averaging each row, respectively. The corresponding fusion results are shown in Fig. 5(b) and (c). Since the fusion matrix is virtually generated, we know the real value of each point and can therefore evaluate the matrix completion accuracy by the criterion in (15). The completion accuracies of the three methods are provided in Fig. 5. From both the visual comparisons and the completion accuracies, it is observed that the LPC method achieves the best completion result. In this subsection, we added noise to the incomplete fusion matrix whose known entries are obtained from ground-truth data. This experiment is designed to emphasize the power of the proposed LPC model for noise removal and low-rank matrix completion. However, it cannot demonstrate the effectiveness of the whole fusion framework, which contains two critical steps: 1) incomplete matrix construction and 2) LPC for noisy matrix completion; the results in this subsection only highlight the power of the second step. Therefore, in the next subsection, we add synthetic noise to the depth maps directly and then construct the fusion matrix from the noisy depth maps.

C. Synthetic Noise to the Depth Maps

In order to add noise to the depth maps, we first generate the ground-truth depth map of each view. According to the camera parameters, we project the ground-truth model to each view and generate 11 depth maps. One point on the i-th depth map is represented by an element that encodes the 3-D position of the point corresponding to a pixel in the i-th view.
However, the resolution of each image is quite high, which leads to a huge number of points beyond the computational resources of our computer. In this experiment, we down-sample the ground-truth depth maps at each view and generate a total of 1.17 million discrete points. We then add three kinds of noise to the ground-truth depth maps. Before adding the noise, we calculate the statistical expectation and standard deviation of all the points. First, the noises are added separately; that is, we generate only one kind of noise on the depth maps each time. The Gaussian noise follows a normal distribution based on the computed statistics. The bias noise is generated with a fixed magnitude; its direction operator is randomly assigned to each view and kept the same within the same view. The outliers are added to 30% of the pixels on all depth maps and follow a uniform distribution. Finally, we add all three kinds of noises simultaneously to the depth maps. On these noisy depth maps, we first follow Algorithm 1 to construct the initial fusion matrix. The parameter setup and some details of the constructed fusion matrices are listed in Table I.

TABLE I DETAILS ABOUT THE CONSTRUCTED INITIAL FUSION MATRIX WITH SYNTHETIC NOISE DATA

In the table, ε is the parameter that we used to construct the fusion matrix. The columns #rows and #columns denote the numbers of rows and columns of the fusion matrix, and a further column reports the portion of known entries of the matrix; recall that ε is chosen so that this portion is around 50%–55%. The first three rows are the matrices generated with a single type of noise, and the last row, All, means that the three kinds of noises are added simultaneously. After constructing the fusion matrices, the incomplete fusion matrices are completed by different methods. The fusion results are evaluated against the ground-truth point cloud; the point evaluation strategy is provided in Appendix A.
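A simplified version of this evaluation can be written as follows; it uses the nearest ground-truth point rather than the point-to-triangle distance of Appendix A, so it is only an approximation of the paper's criterion.

```python
import numpy as np

def mean_nearest_distance(recovered, ground_truth):
    """Mean distance from each recovered 3-D point to its nearest
    ground-truth point (stand-in for the point-to-surface evaluation;
    the paper measures distance to the closest triangle instead)."""
    d = np.linalg.norm(recovered[:, None, :] - ground_truth[None, :, :],
                       axis=2)
    return float(d.min(axis=1).mean())
```

For large point clouds this O(n·m) broadcast should be replaced by a spatial index (e.g., a k-d tree), but the quantity being computed is the same.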
The evaluation results with different types of noises are tabulated in Table II, where we report the average distance (in meters) between each recovered point and the closest triangular surface patch in the ground-truth model. We first evaluate the original rough noisy point clouds8 disturbed by the different noises; the corresponding evaluation results are provided in the first row of Table II. 8 We project all the noisy points from each view into the world coordinate system to get the rough point cloud.

TABLE II ACCURACY OF RECONSTRUCTED POINT CLOUD FROM SYNTHETIC NOISY POINTS BY DIFFERENT MATRIX COMPLETION METHODS

TABLE III PARAMETERS AND QUANTITATIVE EVALUATIONS FOR PRACTICAL POINT CLOUD FUSION

Then, we use different matrix completion methods to complete the incomplete matrix. Average: In average completion, the average value of all the observed entries in each row is used as the fusion result. Average fusion achieves good performance in removing the Gaussian disturbance; however, it is not effective in handling the bias noise and the outliers. TMC: TMC refers to the typical matrix completion method; here, we use FPC [1] to solve the TMC formulation in (4). The noise tolerance of TMC is quite limited: it cannot cope with matrices whose observed entries suffer large disturbances, and it therefore diverges on the matrix constructed from the point cloud with outliers, which we mark accordingly in the table. Nevertheless, compared with the average method, it achieves better performance on the point clouds with Gaussian and bias noises. ℓ1-completion: The experimental results here demonstrate that ℓ1-completion is effective in fusing point clouds with outliers. Generally, by using the ℓ1-norm to penalize large disturbances, it improves the fusion accuracy over the average and TMC methods when there are bias and outlier noises in the depth maps. However, the fusion accuracy can be further improved by LPC, which uses a better approximation to describe the sparse errors. LPC: LPC achieves the best fusion results under all conditions except pure Gaussian noise, where it is still better than ℓ1-completion. It is worth noting that when the three noises are simultaneously added to the point cloud, LPC significantly improves the fusion accuracy over all other methods. D.
Fusion of Practical Depth Maps

1) Fusion Results: In the previous two subsections, the experiments were conducted on point clouds with synthetic noises. In this part, we demonstrate the power of the fusion algorithm on point clouds generated by a practical computer vision algorithm. The 2-D images provided in the EPFL database are of extremely high resolution; to reduce the required computation, we first down-sample the images of the fountain model by a factor of four. These photos are taken from 11 views of the Fountain model. On these images, the depth maps of the fountain model are generated by the methods introduced in Section III-A1. The noises and outliers in the extracted point cloud are pre-removed by the physical filtering strategies (including the geometry constraint and the epipolar constraint) introduced in Subsection III-A3. The resulting rough point clouds are shown in Fig. 1(a), and their quantitative properties are reported in the first row of Table III. The most straightforward method to fuse these point clouds is based on K-means clustering [19], which selects the center of the k nearest points as the fused point. The K-means clustering result is shown in Fig. 1(b). Here, we choose k to guarantee that the number of points after K-means fusion is similar to the number of points recovered by matrix-based fusion, which equals the number of rows of the matrix. For the matrix-completion-based framework, we compare the average method, TMC, ℓ1-comp., and LPC. Unfortunately, TMC does not converge to a meaningful solution on these practical depth maps. The corresponding fusion results by the different methods are provided in Fig. 1. The parameters of the different point cloud fusion algorithms and their corresponding fusion accuracies are provided in Table III, and the number of points after fusion is reported in the #Points column of Table V.
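The K-means-style baseline can be approximated by a simple greedy grouping; this is a rough stand-in for the clustering of [19], not the exact algorithm, and the name and parameter are our own.

```python
import numpy as np

def greedy_knn_fusion(P, k=5):
    """Repeatedly pop a remaining point, group it with its k-1 nearest
    remaining neighbors, and emit the group's centroid as the fused
    point (a crude proxy for center-of-k-nearest fusion)."""
    P = np.asarray(P, dtype=float)
    alive = np.ones(len(P), dtype=bool)
    fused = []
    while alive.any():
        i = int(np.argmax(alive))                # first remaining point
        d = np.linalg.norm(P - P[i], axis=1)
        d[~alive] = np.inf
        group = np.argsort(d)[:k]
        group = group[np.isfinite(d[group])]
        fused.append(P[group].mean(axis=0))
        alive[group] = False
    return np.array(fused)
```

Like the table suggests, such clustering reduces the point count by roughly a factor of k, which is why k is chosen to match the row count of the fusion matrix.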
It is observed that both K-means and our matrix fusion methods are effective in reducing the number of rough points. However, the power of the matrix fusion approach is especially highlighted by the fusion accuracy: from the visual comparison, the fusion results by matrix completion generally outperform the K-means fusion. With respect to Fig. 1(c) and (d), the two results are obtained from the same incomplete matrix but completed by different methods. The fusion result by LPC has fewer outliers and errors on the surface. Moreover, from the quantitative evaluation, LPC increases the fusion accuracy by about 33% over average fusion. These evaluations verify that, even on the same incomplete matrix, the LPC method achieves much higher completion and fusion accuracy than the naive average completion method. We do not show the ℓ1-comp. result in Fig. 1, since its visual effect is quite similar to the LPC result. This is because ℓ1-comp. can be regarded as a special case of the LPC model obtained by setting all the known entries in the weight matrix to one, so the fusion results of the two methods have a similar visual appearance. However, LPC gains higher fusion accuracy, as indicated in Table III, owing to the use of the weighting matrix, which reduces the impact of grossly erroneous entries on the recovered matrix.

2) Convergence Analysis: In this subsection, we discuss the convergence and speed of the proposed LPC algorithm. We show the convergence of LPC for a sub-fusion matrix from the Fountain-P11 data. The LPC algorithm involves both inner and outer iterations. We have already discussed the convergence criteria for both inner and outer

TABLE IV COMPARISONS OF THE FUSION RESULTS ON THE RECTIFIED AND NON-RECTIFIED DEPTH MAPS

Fig. 6. Convergence speed of LPC as the Lagrangian parameter increases.

iterations in the last paragraph of Section IV-A. For most problems examined in this work, outer convergence is achieved within three iterations (under the convergence criterion stated earlier). The inner convergence rates under the increasing dual variables (i.e., the Lagrangian multiplier) are shown in Fig. 6, where the abscissa records the count of inner iterations, which also represents the ascending direction of the Lagrangian dual. From the results, we see that as the Lagrangian multiplier increases, each inner loop converges within a limited number of iterations. It is worth noting that the first outer iteration needs the most inner loops, and the inner convergence is accelerated in the second and third outer iterations due to the penalty effect of the weight matrix. Using the proposed LPC algorithm, we can obtain the fusion result for a large matrix (i.e., the Fountain-P11 fusion in Section V-D) within 7 minutes on a computer with a 2.4-GHz 4-core CPU and 16 GB of memory. We believe the convergence can be further accelerated by recently emerging fast algorithms for nuclear norm minimization [47], [48]. In a nutshell, these methods linearize the quadratic term in the augmented Lagrangian function [8] and add a proximal term to accelerate the convergence of ADM-type problems. Such linearized ADM shows promising results on nuclear norm minimization and low-rank representation [49]. We will consider generalizing such linearized ADM paradigms to speed up LPC in future work. E.
Robustness Verifications

1) Simulations on Occluded Pixels: Since self-occlusions and holes are very difficult for MVS to handle, in this part we discuss the robustness of the proposed framework against occlusions. As discussed in Subsection III-A2, we estimate an occlusion map as part of the disparity estimation process and exclude occluded pixels from the energy function for computing the disparity field. Although the occlusion map increases the estimation accuracy of the disparity fields, and consequently of the depth maps [40], it also causes the loss of depth information at occluded positions. However, since these occlusions occupy only a small portion of the total point cloud, the missing information does not significantly affect the final fusion result. To justify this point, we perform a simple simulation in which we randomly eliminate 5% of the points in each view of the Fountain-P11 model, which decreases the total number of points from 3.02 million to 2.85 million. Then, following the same steps introduced previously, we construct the fusion matrix for these points using the same ε (in meters). After LPC, we obtain about 0.51 million fused points in total with a fusion accuracy of about 1.31 cm, which is very close to the fusion result without randomly eliminating points. This simulation demonstrates the robustness of the proposed algorithm against a small portion of pixels whose depth values are missing: although we do not know the values of the occluded pixels from a certain view, they are still visible from other views, and their depth values can be inferred via low-rank completion. Another artifact on the 3-D model is holes. Holes refer to areas on the reconstructed surface that cannot be seen by any camera. The proposed matrix-completion-based framework relies on depth information from multiple views to infer the true depth value; accordingly, it cannot handle holes, for which no information is available.
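The elimination step of the simulation above can be sketched as follows; the function name and seed are our own illustrative choices.

```python
import numpy as np

def random_eliminate(views, frac=0.05, seed=0):
    """Randomly drop a fraction of the points in each per-view point
    array, mimicking the missing-depth simulation (5% in the text)."""
    rng = np.random.default_rng(seed)
    out = []
    for pts in views:
        keep = rng.random(len(pts)) >= frac
        out.append(pts[keep])
    return out
```

The surviving points still appear in rows observed by the other cameras, which is exactly why the low-rank completion can fill in the deleted observations.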
However, small holes on the reconstructed 3-D surface can be filled by post-processing procedures, e.g., with the Poisson reconstruction method [50].

2) Fusion From Non-Rectified Depth Maps: Previously, all the fusion experiments were conducted on rectified depth maps, which had been processed by the physical filtering steps introduced in Subsection III-A3. In this subsection, we show that the proposed framework is also capable of directly handling non-rectified depth maps obtained by our disparity estimation algorithm. To generalize the fusion ability to non-rectified data, two minor changes are needed in the matrix construction and LPC procedures, respectively. In the matrix construction step, we should use a larger value of the parameter ε for the non-rectified data. Recall that ε serves as a threshold on the maximal distance allowed between two points in the same row; for the non-rectified depth maps, we set a larger ε (in meters) than for the rectified data to account for the larger noises. Besides, after matrix construction, we discard the rows that have only one observed entry. This is because, in most cases, one point on the 3-D object can be seen by multiple cameras; points observed alone in a row are likely to be falsely recovered points far from the actual object surface. Such a one-point-per-row case hardly happens on rectified depth maps, because such points would have been removed by the physical filtering. The numbers of non-rectified points and of fused points (row numbers) are reported in Table IV.

Fig. 7. Fusion results with different λ.

In addition to using a larger ε, for LPC we decrease the parameter λ in (11) for the non-rectified data. This is because λ controls the balance between rank minimization and noise sparseness: the larger λ is, the less noise is removed from the matrix, and vice versa. In [3], the authors suggest using λ0 = 1/√max(m, n). Here, we set λ = c·λ0, vary c in the range [0.5, 1.5], and examine how λ affects the rank, the noise removal, and the fusion accuracy on the non-rectified data. In Fig. 7, the top subfigure reports the ranks of the completed matrices of the three dimensions after LPC; it is obvious that a large λ makes the final completed matrix exhibit a high rank. The middle subfigure records the portion of noise removed from the observed entries; from it, we know that the larger λ is, the less noise is removed from the observed entries. However, from the bottom subfigure, which records the fusion accuracy, it is apparent that neither a too large nor a too small λ is good. When λ is too small, too much "noise" is removed from the matrix, which may contain some useful information: albeit we get a rank-one matrix after LPC, the result is not acceptable because of this over-removal effect. On the contrary, if λ is too large, LPC may leave too much noise in the completed matrix, which leads to a higher rank. Accordingly, an ideal λ should be the largest one that exactly yields a rank-one completion. Unfortunately, in practice, it is impossible to search exhaustively for such a λ. Therefore, we relax the strict rank-one constraint and regard the fusion result as ideal if the completed matrix has a relatively low rank, e.g., rank 2 or 3, and meanwhile the ratio of its first and second singular values is larger than 1000. For the non-rectified depth maps here, the optimal parameter can be set accordingly, and in general the result is acceptable for a range of c. In Table IV, Acc.(R) and Acc.(F) respectively report the fusion accuracy of the rough point cloud and of the fused point cloud, from which we know that LPC can also significantly improve the fusion accuracy on non-rectified data. To summarize, processing non-rectified data needs only two minor changes: enlarging ε and decreasing λ. From the noise removal entries in Table IV, it is apparent that LPC removes much more noise from non-rectified depth maps than from rectified data, which further demonstrates the effectiveness and robustness of the proposed framework.

VI. MVS RECONSTRUCTIONS

In this section, we provide results on MVS reconstruction using the point clouds fused by the matrix completion approaches. We evaluate the reconstructions on the standard Middlebury dataset [51], which is a benchmark platform for evaluating different MVS algorithms. For rendering purposes, the fused point clouds were meshed using Poisson reconstruction [50].

TABLE V DETAILS ABOUT THE FUSION MATRIX (MIDDLEBURY)

From the data in Table V, it is obvious that the matrix fusion algorithm greatly reduces the redundancy of the rough point cloud. For comparison purposes, we compare our matrix-completion-based reconstructions (MCR) with other state-of-the-art reconstruction algorithms. As stated in Section II-A, there are four mainstream methods for MVS reconstruction. For the volumetric, surface, and feature-point-based reconstructions, we report the one that achieves the best performance in its category. Since our reconstruction relies on the depth-map-based approach, we compare with two representative depth-map-based reconstruction methods. The comparisons are reported in Table VI, where Acc. denotes the reconstruction accuracy (cm) and Comp. denotes the completeness; the time cost is recorded in minutes.
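The rank-one acceptance test used above when tuning λ can be expressed as follows; the 1000× singular-value ratio follows the text, while the effective-rank computation and function name are our own choices.

```python
import numpy as np

def completion_diagnostics(A, N, mask, ratio=1000.0):
    """Return the effective rank of the completed matrix, the portion
    of observed entries absorbed as noise, and whether the first two
    singular values satisfy the s1 > ratio * s2 rank-one test."""
    s = np.linalg.svd(A, compute_uv=False)
    eff_rank = int(np.sum(s > s[0] / ratio)) if s[0] > 0 else 0
    noise_portion = np.count_nonzero(N[mask]) / max(int(mask.sum()), 1)
    ratio_ok = len(s) < 2 or s[0] > ratio * s[1]
    return eff_rank, noise_portion, ratio_ok
```

Sweeping c over [0.5, 1.5] and keeping the largest λ for which this test passes reproduces the relaxed selection rule described in the text.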
In the table, blank entries indicate that the corresponding method does not report results on that reconstruction. From the results, we conclude that Furukawa's feature-point-based reconstruction algorithm achieves the best accuracy among all the reconstruction algorithms, although it also incurs the heaviest computational cost: its running time is about ten times that of our method. Among the depth-map-based fusion methods, Bradley's method achieves higher accuracy but scores lower in completeness. Zach's method is the fastest, but its reconstruction accuracy is relatively low. For visual comparison, we provide the ground truth and our reconstructions in Fig. 8. Bradley's reconstructions in [25] are selected for comparison since that work provides high-accuracy reconstructions of all four models. From the visual comparisons, our reconstruction method produces more complete surfaces; there are obvious holes in the DinoSparse and TempleSparse models reconstructed by Bradley's method.

TABLE VI: Quantitative evaluations on the benchmark Middlebury datasets.

Fig. 8. Reconstructions on the Middlebury datasets. Below each subfigure, the first number gives the accuracy and the second the completeness. (a) Dino Sparse Ring, (b) Dino Ring, (c) Temple Sparse Ring, (d) Temple Ring.

We would like to emphasize that the contribution of this paper is not the whole MVS reconstruction framework; the novelty lies mainly in the fusion of noisy depth maps. MVS reconstruction involves multiple steps, including depth information extraction, fusion, and meshing, and the reconstructions reported above are based on different methods at each step. Among the depth-map-based reconstructions, each competitor uses different methods to generate and fuse the depth information, and the point-to-mesh conversion algorithms also differ. Any of these steps may affect the final reconstruction accuracy. The MVS comparisons here only serve to demonstrate that the proposed matrix completion framework for depth map fusion is competitive with other state-of-the-art reconstruction algorithms in terms of both reconstruction quality and computational cost.

VII. CONCLUSIONS AND DISCUSSIONS

This paper takes advantage of compressed sensing and matrix completion to fuse noisy point clouds for multiview stereo. The proposed matrix completion framework solves a traditional computer vision problem from a fresh perspective. In addition, we propose a new formulation for matrix completion, called LPC, which achieves remarkable improvements in removing noise, especially outliers, from the rough point clouds.
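The log-sum penalty at the heart of LPC, and its majorization-minimization (MM) treatment, can be made concrete with a small sketch. The smoothing constant `eps` and the sample values below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def log_sum_penalty(x, eps=1e-3):
    # Concave surrogate for the l0-norm: it grows very slowly once an
    # entry is clearly nonzero, unlike the l1-norm, which keeps
    # growing linearly with magnitude.
    return float(np.sum(np.log(np.abs(x) + eps)))

def mm_weights(x, eps=1e-3):
    # Majorizing the concave log at the current iterate yields a
    # weighted l1 subproblem with these weights, so each LPC step
    # reduces to a convex optimization.
    return 1.0 / (np.abs(x) + eps)

# Small (noise-level) entries receive much larger weights than large
# outlier entries, which is what drives them to exactly zero.
w = mm_weights(np.array([5.0, 0.05]))
print(w[1] / w[0])  # roughly 98: the small entry is penalized far harder
```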
Compared to the PCP method, which minimizes the ℓ1-norm of the outliers, the LPC method minimizes their log-sum penalty, which is a closer approximation to the ℓ0-norm. Its power is not restricted to the fusion task discussed in this paper; rather, it can be applied to other practical applications of matrix completion with sparse outliers. Some challenges still deserve future research. One critical problem that may hinder the practical application of the proposed method is how to deal with extremely large point clouds. Although the proposed algorithm is efficient and effective enough to fuse depth maps with millions of points, its ability to handle large-scale fusion problems is still limited. The bottleneck lies mainly in the matrix completion algorithm, which requires a huge amount of memory to complete a very large matrix. Fortunately, to our knowledge, many efforts in applied mathematics, signal processing, and machine learning are now devoted to large-scale matrix completion. We hope and believe that this bottleneck will be overcome in the near future.

APPENDIX A
POINT CLOUD ACCURACY EVALUATION STRATEGY

In this Appendix, we describe how to evaluate the accuracy of a reconstructed point cloud against a ground truth model provided in the form of a triangular surface mesh. We evaluate the point cloud accuracy by calculating the distance from each point in the cloud to its closest triangular patch. The most straightforward method is to project each point onto all the triangles in the ground truth model and return the least distance. Such a method sounds reasonable but is impractical due to its intractable computational complexity when the number of triangles is very large (e.g., 26 million triangles in the Fountain P11 model). Therefore, in this paper, we introduce an efficient and accurate evaluation strategy. During the evaluation, for each input point we first retrieve a fixed number of nearest vertices from the vertex set of the ground truth model. It is then possible to gather all the triangles connected to these vertices, where a connected triangle is one that has at least one vertex in the nearest-vertex set. When the number of retrieved vertices is sufficiently large, this set of triangles is guaranteed to contain the one that is closest to the input point.
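As a sketch of this candidate search, the fragment below uses a brute-force nearest-vertex lookup in place of the paper's spatial grid; the function names, the choice of `k`, and the tiny example mesh are illustrative.

```python
import numpy as np

def point_triangle_distance(p, a, b, c):
    """Distance from point p to triangle (a, b, c). If the orthogonal
    projection of p falls outside the triangle, fall back to the least
    distance to the three vertices, as described in the appendix.
    Assumes a non-degenerate triangle."""
    n = np.cross(b - a, c - a)
    n = n / np.linalg.norm(n)
    foot = p - np.dot(p - a, n) * n           # projection onto the plane
    # Barycentric coordinates of the foot point.
    v0, v1, v2 = b - a, c - a, foot - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    u = 1.0 - v - w
    if min(u, v, w) >= 0.0:                   # foot lies inside the triangle
        return abs(np.dot(p - a, n))
    return min(np.linalg.norm(p - q) for q in (a, b, c))

def point_accuracy(p, vertices, triangles, k=8):
    """Least distance from p to the triangles connected to its k
    nearest ground-truth vertices (brute-force search for clarity)."""
    order = np.argsort(np.linalg.norm(vertices - p, axis=1))[:k]
    near = set(order.tolist())
    candidates = [t for t in triangles if near & set(t)]
    return min(point_triangle_distance(p, *(vertices[i] for i in t))
               for t in candidates)

verts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
tris = [(0, 1, 2)]
print(point_accuracy(np.array([0.2, 0.2, 1.0]), verts, tris))  # 1.0 (foot inside)
print(point_accuracy(np.array([2.0, 0.0, 0.0]), verts, tris))  # 1.0 (vertex fallback)
```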
It is computationally more efficient to project the input point only onto these connected triangles than onto all the triangles in the model. The pseudocode for this evaluation strategy is provided in Algorithm 3, which calls three functions in lines 3, 4, and 5, respectively. The method for finding the triangles connected to a particular vertex is straightforward and we omit it here. We provide details on the strategies for finding the nearest vertices (line 3) and calculating the least projection distance (line 5). To find the nearest vertices of a point, we first partition the ground truth model into a regular grid. For a query point, it is then simple to determine which cube of the grid it falls in, and the nearest vertices are searched in this cube and its neighboring cubes. The calculation of the projection distance from a point to a planar patch is also straightforward; however, the projection may fall outside the triangular patch, in which case we substitute the least distance between the point and the three vertices of the triangle.

ACKNOWLEDGMENT

The authors would like to thank D. Scharstein for his help in evaluating our Dino and Temple reconstruction results. They would also like to thank Y. Qian and K. Li for their useful discussions.

REFERENCES

[1] S. Ma, D. Goldfarb, and L. Chen, "Fixed point and Bregman iterative methods for matrix rank minimization," Math. Program., vol. 128, 2011.
[2] E. Candès and Y. Plan, "Matrix completion with noise," Proc. IEEE, vol. 98, no. 6, Jun. 2010.
[3] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?," J. ACM, vol. 59, no. 3, May 2011.
[4] M. Fazel, "Matrix rank minimization with applications," Ph.D. dissertation, Stanford Univ., Stanford, CA, Mar. 2002.
[5] E. J. Candès, M. Wakin, and S. Boyd, "Enhancing sparsity by reweighted ℓ1 minimization," J. Fourier Anal. Appl., vol. 14, 2008.
[6] K. Lange, D. R. Hunter, and I. Yang, "Optimization transfer using surrogate objective functions," J. Comput. Graph. Statist., vol. 9, pp. 1–59, 2000.
[7] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, 2011.
[8] Z. Lin, M. Chen, and Y. Ma, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," Tech. Rep., arXiv (v2), Mar.
[9] Y. Deng, Q. Dai, R. Liu, Z. Zhang, and S. Hu, "Low-rank structure learning via log-sum heuristic recovery," arXiv preprint, 2012.
[10] S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, "A comparison and evaluation of multi-view stereo reconstruction algorithms," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recogn. (CVPR), Jun. 2006, vol. 1.
[11] V. Kolmogorov and R. Zabih, "Multi-camera scene reconstruction via graph cuts," in Proc. Eur. Conf. Comput. Vis., 2002.
[12] G. Vogiatzis, P. Torr, and R. Cipolla, "Multi-view stereo via volumetric graph-cuts," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recogn. (CVPR), Jun. 2005, vol. 2.
[13] Y. Furukawa and J. Ponce, "Accurate, dense, and robust multiview stereopsis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 8, Aug. 2010.
[14] M. Goesele, N. Snavely, B. Curless, H. Hoppe, and S. Seitz, "Multi-view stereo for community photo collections," in Proc. IEEE 11th Int. Conf. Comput. Vis. (ICCV), Oct. 2007.
[15] M. Habbecke and L. Kobbelt, "A surface-growing approach to multi-view stereo reconstruction," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn. (CVPR), Jun. 2007, pp. 1–8.

[16] J.-P. Pons, R. Keriven, and O. Faugeras, "Modelling dynamic scenes by registering multi-view image sequences," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recogn. (CVPR), Jun. 2005, vol. 2.
[17] A. Zaharescu, E. Boyer, and R. Horaud, "Topology-adaptive mesh deformation for surface evolution, morphing, and multiview reconstruction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 4, Apr. 2011.
[18] Y. Liu, Q. Dai, and W. Xu, "A point cloud based multi-view stereo algorithm for free-viewpoint video," IEEE Trans. Vis. Comput. Graphics, vol. 16, no. 3, May–Jun. 2010.
[19] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J. M. Frahm, R. Yang, D. Nister, and M. Pollefeys, "Real-time visibility-based fusion of depth maps," in Proc. IEEE 11th Int. Conf. Comput. Vis. (ICCV), 2007.
[20] M. Goesele, B. Curless, and S. M. Seitz, "Multi-view stereo revisited," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recogn. (CVPR), Washington, DC, 2006, vol. 2.
[21] J. Li, E. Li, Y. Chen, L. Xu, and Y. Zhang, "Bundled depth-map merging for multi-view stereo," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn. (CVPR), Jun. 2010.
[22] N. Campbell, G. Vogiatzis, C. Hernandez, and R. Cipolla, "Using multiple hypotheses to improve depth-maps for multi-view stereo," in Proc. Eur. Conf. Comput. Vis., 2008, vol. 1.
[23] P. Gargallo and P. Sturm, "Bayesian 3D modeling from images using multiple depth maps," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recogn. (CVPR), 2005, vol. 2.
[24] Y. Liu, X. Cao, Q. Dai, and W. Xu, "Continuous depth estimation for multi-view stereo," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn. (CVPR), Jun. 2009.
[25] D. Bradley, T. Boubekeur, and W. Heidrich, "Accurate multi-view reconstruction using robust binocular stereo and surface meshing," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn. (CVPR), Jun. 2008.
[26] C. Zach, "Fast and high quality fusion of depth maps," in Proc. Int. Symp. 3D Data Process., Visualiz. Transmiss. (3DPVT).
[27] J. Rennie and N. Srebro, "Fast maximum margin matrix factorization for collaborative prediction," in Proc. 22nd Int. Conf. Mach. Learn. (ICML), 2005.
[28] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Found. Comput. Math., 2009.
[29] K. Toh and S. Yun, "An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems," Pacific J. Optimiz., vol. 6, 2010.
[30] Z. Zhou, X. Li, J. Wright, E. J. Candès, and Y. Ma, "Stable principal component pursuit," in Proc. Int. Symp. Inf. Theory, Jun. 2010.
[31] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry, "Using collaborative filtering to weave an information tapestry," Commun. ACM, vol. 35, no. 12, 1992.
[32] P. Biswas, T. C. Lian, T. C. Wang, and Y. Ye, "Semidefinite programming based algorithms for sensor network localization," ACM Trans. Sens. Netw., vol. 2, no. 2, 2006.
[33] K. Mohan and M. Fazel, "Reweighted nuclear norm minimization with application to system identification," in Proc. Amer. Control Conf., 2010.
[34] J. Wang, Y. Dong, X. Tong, Z. Lin, and B. Guo, "Kernel Nyström method for light transport," ACM Trans. Graph., vol. 28, pp. 29:1–29:10, Jul. 2009.
[35] R. Garg, H. Du, S. Seitz, and N. Snavely, "The dimensionality of scene appearance," in Proc. IEEE 12th Int. Conf. Comput. Vis. (ICCV), Oct. 2009.
[36] Y. Deng, Q. Dai, and Z. Zhang, "Graph Laplace for partially occluded face completion and recognition," IEEE Trans. Image Process., vol. 20, no. 8, Aug. 2011.
[37] Y. Peng, A. Ganesh, J. Wright, and Y. Ma, "RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn. (CVPR), 2010.
[38] C. Fu, X. Ji, and Q. Dai, "Adaptive compressed sensing recovery utilizing the property of signals autocorrelations," IEEE Trans. Image Process., vol. 21, no. 5, May 2012.
[39] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, High Accuracy Optical Flow Estimation Based on a Theory for Warping. New York: Springer, 2004.
[40] F. Huguet and F. Devernay, "A variational method for scene flow estimation from stereo sequences," in Proc. IEEE 11th Int. Conf. Comput. Vis. (ICCV), Oct. 2007.
[41] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. R. Statist. Soc. B, vol. 67, 2005.
[42] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, 2009.
[43] M. Lees, "Iterative reweighted least squares for matrix rank minimization," Math. Comput., vol. 16, no. 77.
[44] S. Ma, D. Goldfarb, and L. Chen, "Fixed point and Bregman iterative methods for matrix rank minimization," Math. Program., vol. 128, no. 1, 2011.
[45] [Online]. Available:
[46] C. Strecha, W. von Hansen, L. Van Gool, P. Fua, and U. Thoennessen, "On benchmarking camera calibration and multi-view stereo for high resolution imagery," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn. (CVPR), 2008.
[47] Z. Lin, R. Liu, and Z. Su, "Linearized alternating direction method with adaptive penalty for low-rank representation," in Proc. Adv. Neural Inf. Process. Syst., 2011.
[48] J. Yang and X. Yuan, "Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization," Math. Comp., 2012, to be published.
[49] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, "Robust recovery of subspace structures by low-rank representation," IEEE Trans. Pattern Anal. Mach. Intell., 2012, to be published.
[50] M. Kazhdan, M. Bolitho, and H. Hoppe, "Poisson surface reconstruction," in Proc. 4th Eurographics Symp. Geometry Process. (SGP), Aire-la-Ville, Switzerland: Eurographics Association, 2006.
[51] [Online]. Available:

Yue Deng received the B.E. degree (with honors) in automatic control from Southeast University, Nanjing, China. He is currently working towards the Ph.D. degree in the Department of Automation, Tsinghua University, Beijing, China. From September 2010 to September 2011, he was a Visiting Scholar in the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. His current research interests include signal processing, computer vision, and machine learning. Mr. Deng was a recipient of the Microsoft fellowship.

Yebin Liu received the B.E. degree from Beijing University of Posts and Telecommunications, Beijing, China, in 2002, and the Ph.D. degree from the Department of Automation, Tsinghua University, Beijing. He was with the Max Planck Institute for Informatics. Since 2011, he has been an Assistant Professor in the Department of Automation, Tsinghua University. His research interests include image-based modeling and rendering, markerless motion capture, and vision-based graphics applications.

Qionghai Dai (SM'05) received the B.S. degree in mathematics from Shanxi Normal University, Xi'an, China, in 1987, and the M.E. and Ph.D. degrees in computer science and automation from Northeastern University, Shenyang, China, in 1994 and 1996, respectively. He has been on the faculty of Tsinghua University, Beijing, China, where he is now a Cheung Kong Professor and Director of the Broadband Networks and Digital Media Laboratory. His current research interests include signal processing, computer vision, and graphics.

Zengke Zhang received the B.S. degree in industrial electrification and automation from Tsinghua University, Beijing, China. He is a Professor in the Department of Automation, Tsinghua University. His research scope is wide, including intelligent control, motion control, system integration, and image processing.

Yao Wang (M'90–SM'98–F'04) received the B.S. and M.S. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1983 and 1985, respectively, and the Ph.D. degree in electrical and computer engineering from the University of California at Santa Barbara. Since 1990, she has been with the Electrical and Computer Engineering faculty of Polytechnic University, Brooklyn, NY (now Polytechnic Institute of New York University). Her research interests include video coding and networked video applications, medical image processing, and pattern recognition. She is the leading author of the textbook Video Processing and Communications (Prentice-Hall, 2002). Dr. Wang has served as an Associate Editor for the IEEE TRANSACTIONS ON MULTIMEDIA and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY. She received the New York City Mayor's Award for Excellence in Science and Technology in the Young Investigator category, and was elected Fellow of the IEEE in 2004 for contributions to video processing and communications. She is a coauthor of two IEEE Communications Society best papers: the Leonard G. Abraham Prize Paper in the Field of Communications Systems in 2004, and a Multimedia Communication Technical Committee Best Paper. She was a keynote speaker at the 2010 International Packet Video Workshop. She received the Overseas Outstanding Young Investigator Award from the Natural Science Foundation of China in 2005 and was named a Yangtze River Lecture Scholar at Tsinghua University by the Ministry of Education of China in 2007.


More information

Segmentation and Tracking of Partial Planar Templates

Segmentation and Tracking of Partial Planar Templates Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract

More information

Using temporal seeding to constrain the disparity search range in stereo matching

Using temporal seeding to constrain the disparity search range in stereo matching Using temporal seeding to constrain the disparity search range in stereo matching Thulani Ndhlovu Mobile Intelligent Autonomous Systems CSIR South Africa Email: tndhlovu@csir.co.za Fred Nicolls Department

More information

What is Computer Vision?

What is Computer Vision? Perceptual Grouping in Computer Vision Gérard Medioni University of Southern California What is Computer Vision? Computer Vision Attempt to emulate Human Visual System Perceive visual stimuli with cameras

More information

Motion Estimation. There are three main types (or applications) of motion estimation:

Motion Estimation. There are three main types (or applications) of motion estimation: Members: D91922016 朱威達 R93922010 林聖凱 R93922044 謝俊瑋 Motion Estimation There are three main types (or applications) of motion estimation: Parametric motion (image alignment) The main idea of parametric motion

More information

Subpixel accurate refinement of disparity maps using stereo correspondences

Subpixel accurate refinement of disparity maps using stereo correspondences Subpixel accurate refinement of disparity maps using stereo correspondences Matthias Demant Lehrstuhl für Mustererkennung, Universität Freiburg Outline 1 Introduction and Overview 2 Refining the Cost Volume

More information

Overcompressing JPEG images with Evolution Algorithms

Overcompressing JPEG images with Evolution Algorithms Author manuscript, published in "EvoIASP2007, Valencia : Spain (2007)" Overcompressing JPEG images with Evolution Algorithms Jacques Lévy Véhel 1, Franklin Mendivil 2 and Evelyne Lutton 1 1 Inria, Complex

More information

Data-driven Depth Inference from a Single Still Image

Data-driven Depth Inference from a Single Still Image Data-driven Depth Inference from a Single Still Image Kyunghee Kim Computer Science Department Stanford University kyunghee.kim@stanford.edu Abstract Given an indoor image, how to recover its depth information

More information

Photometric Stereo with Auto-Radiometric Calibration

Photometric Stereo with Auto-Radiometric Calibration Photometric Stereo with Auto-Radiometric Calibration Wiennat Mongkulmann Takahiro Okabe Yoichi Sato Institute of Industrial Science, The University of Tokyo {wiennat,takahiro,ysato} @iis.u-tokyo.ac.jp

More information

MATLAB User Guide for Depth Reconstruction from Sparse Samples

MATLAB User Guide for Depth Reconstruction from Sparse Samples MATLAB User Guide for Depth Reconstruction from Sparse Samples Lee-Kang Liu, Stanley H. Chan and Truong Q. Nguyen This MATLAB user guide presents the instructions of how to use the MATLAB functions accompanied

More information

Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude

Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude A. Migukin *, V. atkovnik and J. Astola Department of Signal Processing, Tampere University of Technology,

More information

An Optimized Pixel-Wise Weighting Approach For Patch-Based Image Denoising

An Optimized Pixel-Wise Weighting Approach For Patch-Based Image Denoising An Optimized Pixel-Wise Weighting Approach For Patch-Based Image Denoising Dr. B. R.VIKRAM M.E.,Ph.D.,MIEEE.,LMISTE, Principal of Vijay Rural Engineering College, NIZAMABAD ( Dt.) G. Chaitanya M.Tech,

More information

Mixture Models and EM

Mixture Models and EM Mixture Models and EM Goal: Introduction to probabilistic mixture models and the expectationmaximization (EM) algorithm. Motivation: simultaneous fitting of multiple model instances unsupervised clustering

More information

Scene Segmentation by Color and Depth Information and its Applications

Scene Segmentation by Color and Depth Information and its Applications Scene Segmentation by Color and Depth Information and its Applications Carlo Dal Mutto Pietro Zanuttigh Guido M. Cortelazzo Department of Information Engineering University of Padova Via Gradenigo 6/B,

More information

A Modified Spline Interpolation Method for Function Reconstruction from Its Zero-Crossings

A Modified Spline Interpolation Method for Function Reconstruction from Its Zero-Crossings Scientific Papers, University of Latvia, 2010. Vol. 756 Computer Science and Information Technologies 207 220 P. A Modified Spline Interpolation Method for Function Reconstruction from Its Zero-Crossings

More information

Processing 3D Surface Data

Processing 3D Surface Data Processing 3D Surface Data Computer Animation and Visualisation Lecture 17 Institute for Perception, Action & Behaviour School of Informatics 3D Surfaces 1 3D surface data... where from? Iso-surfacing

More information

A Path Decomposition Approach for Computing Blocking Probabilities in Wavelength-Routing Networks

A Path Decomposition Approach for Computing Blocking Probabilities in Wavelength-Routing Networks IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 8, NO. 6, DECEMBER 2000 747 A Path Decomposition Approach for Computing Blocking Probabilities in Wavelength-Routing Networks Yuhong Zhu, George N. Rouskas, Member,

More information

Stereo Image Rectification for Simple Panoramic Image Generation

Stereo Image Rectification for Simple Panoramic Image Generation Stereo Image Rectification for Simple Panoramic Image Generation Yun-Suk Kang and Yo-Sung Ho Gwangju Institute of Science and Technology (GIST) 261 Cheomdan-gwagiro, Buk-gu, Gwangju 500-712 Korea Email:{yunsuk,

More information

Multi-View Stereo for Static and Dynamic Scenes

Multi-View Stereo for Static and Dynamic Scenes Multi-View Stereo for Static and Dynamic Scenes Wolfgang Burgard Jan 6, 2010 Main references Yasutaka Furukawa and Jean Ponce, Accurate, Dense and Robust Multi-View Stereopsis, 2007 C.L. Zitnick, S.B.

More information

Compressive Sensing for Multimedia. Communications in Wireless Sensor Networks

Compressive Sensing for Multimedia. Communications in Wireless Sensor Networks Compressive Sensing for Multimedia 1 Communications in Wireless Sensor Networks Wael Barakat & Rabih Saliba MDDSP Project Final Report Prof. Brian L. Evans May 9, 2008 Abstract Compressive Sensing is an

More information

Outlier Pursuit: Robust PCA and Collaborative Filtering

Outlier Pursuit: Robust PCA and Collaborative Filtering Outlier Pursuit: Robust PCA and Collaborative Filtering Huan Xu Dept. of Mechanical Engineering & Dept. of Mathematics National University of Singapore Joint w/ Constantine Caramanis, Yudong Chen, Sujay

More information

A Robust Two Feature Points Based Depth Estimation Method 1)

A Robust Two Feature Points Based Depth Estimation Method 1) Vol.31, No.5 ACTA AUTOMATICA SINICA September, 2005 A Robust Two Feature Points Based Depth Estimation Method 1) ZHONG Zhi-Guang YI Jian-Qiang ZHAO Dong-Bin (Laboratory of Complex Systems and Intelligence

More information

3D Shape Recovery of Smooth Surfaces: Dropping the Fixed Viewpoint Assumption

3D Shape Recovery of Smooth Surfaces: Dropping the Fixed Viewpoint Assumption IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL., NO., 1 3D Shape Recovery of Smooth Surfaces: Dropping the Fixed Viewpoint Assumption Yael Moses Member, IEEE and Ilan Shimshoni Member,

More information

3D Reconstruction of a Hopkins Landmark

3D Reconstruction of a Hopkins Landmark 3D Reconstruction of a Hopkins Landmark Ayushi Sinha (461), Hau Sze (461), Diane Duros (361) Abstract - This paper outlines a method for 3D reconstruction from two images. Our procedure is based on known

More information

A Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation

A Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation A Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation Alexander Andreopoulos, Hirak J. Kashyap, Tapan K. Nayak, Arnon Amir, Myron D. Flickner IBM Research March 25,

More information

Detecting motion by means of 2D and 3D information

Detecting motion by means of 2D and 3D information Detecting motion by means of 2D and 3D information Federico Tombari Stefano Mattoccia Luigi Di Stefano Fabio Tonelli Department of Electronics Computer Science and Systems (DEIS) Viale Risorgimento 2,

More information

Shape from Silhouettes I CV book Szelisky

Shape from Silhouettes I CV book Szelisky Shape from Silhouettes I CV book Szelisky 11.6.2 Guido Gerig CS 6320, Spring 2012 (slides modified from Marc Pollefeys UNC Chapel Hill, some of the figures and slides are adapted from M. Pollefeys, J.S.

More information

Feature Tracking and Optical Flow

Feature Tracking and Optical Flow Feature Tracking and Optical Flow Prof. D. Stricker Doz. G. Bleser Many slides adapted from James Hays, Derek Hoeim, Lana Lazebnik, Silvio Saverse, who 1 in turn adapted slides from Steve Seitz, Rick Szeliski,

More information

CRF Based Point Cloud Segmentation Jonathan Nation

CRF Based Point Cloud Segmentation Jonathan Nation CRF Based Point Cloud Segmentation Jonathan Nation jsnation@stanford.edu 1. INTRODUCTION The goal of the project is to use the recently proposed fully connected conditional random field (CRF) model to

More information

Driven Cavity Example

Driven Cavity Example BMAppendixI.qxd 11/14/12 6:55 PM Page I-1 I CFD Driven Cavity Example I.1 Problem One of the classic benchmarks in CFD is the driven cavity problem. Consider steady, incompressible, viscous flow in a square

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

Multi-stable Perception. Necker Cube

Multi-stable Perception. Necker Cube Multi-stable Perception Necker Cube Spinning dancer illusion, Nobuyuki Kayahara Multiple view geometry Stereo vision Epipolar geometry Lowe Hartley and Zisserman Depth map extraction Essential matrix

More information

Face Recognition via Sparse Representation

Face Recognition via Sparse Representation Face Recognition via Sparse Representation John Wright, Allen Y. Yang, Arvind, S. Shankar Sastry and Yi Ma IEEE Trans. PAMI, March 2008 Research About Face Face Detection Face Alignment Face Recognition

More information

Multi-view Stereo. Ivo Boyadzhiev CS7670: September 13, 2011

Multi-view Stereo. Ivo Boyadzhiev CS7670: September 13, 2011 Multi-view Stereo Ivo Boyadzhiev CS7670: September 13, 2011 What is stereo vision? Generic problem formulation: given several images of the same object or scene, compute a representation of its 3D shape

More information

Identifying Car Model from Photographs

Identifying Car Model from Photographs Identifying Car Model from Photographs Fine grained Classification using 3D Reconstruction and 3D Shape Registration Xinheng Li davidxli@stanford.edu Abstract Fine grained classification from photographs

More information

Multi-view stereo. Many slides adapted from S. Seitz

Multi-view stereo. Many slides adapted from S. Seitz Multi-view stereo Many slides adapted from S. Seitz Beyond two-view stereo The third eye can be used for verification Multiple-baseline stereo Pick a reference image, and slide the corresponding window

More information

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction Marc Pollefeys Joined work with Nikolay Savinov, Christian Haene, Lubor Ladicky 2 Comparison to Volumetric Fusion Higher-order ray

More information

arxiv: v1 [cs.cv] 28 Sep 2018

arxiv: v1 [cs.cv] 28 Sep 2018 Extrinsic camera calibration method and its performance evaluation Jacek Komorowski 1 and Przemyslaw Rokita 2 arxiv:1809.11073v1 [cs.cv] 28 Sep 2018 1 Maria Curie Sklodowska University Lublin, Poland jacek.komorowski@gmail.com

More information

Step-by-Step Model Buidling

Step-by-Step Model Buidling Step-by-Step Model Buidling Review Feature selection Feature selection Feature correspondence Camera Calibration Euclidean Reconstruction Landing Augmented Reality Vision Based Control Sparse Structure

More information

BIL Computer Vision Apr 16, 2014

BIL Computer Vision Apr 16, 2014 BIL 719 - Computer Vision Apr 16, 2014 Binocular Stereo (cont d.), Structure from Motion Aykut Erdem Dept. of Computer Engineering Hacettepe University Slide credit: S. Lazebnik Basic stereo matching algorithm

More information

ELEG Compressive Sensing and Sparse Signal Representations

ELEG Compressive Sensing and Sparse Signal Representations ELEG 867 - Compressive Sensing and Sparse Signal Representations Gonzalo R. Arce Depart. of Electrical and Computer Engineering University of Delaware Fall 211 Compressive Sensing G. Arce Fall, 211 1 /

More information

A Semi-Automatic 2D-to-3D Video Conversion with Adaptive Key-Frame Selection

A Semi-Automatic 2D-to-3D Video Conversion with Adaptive Key-Frame Selection A Semi-Automatic 2D-to-3D Video Conversion with Adaptive Key-Frame Selection Kuanyu Ju and Hongkai Xiong Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China ABSTRACT To

More information

Motivation. Technical Background

Motivation. Technical Background Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering

More information

Correcting User Guided Image Segmentation

Correcting User Guided Image Segmentation Correcting User Guided Image Segmentation Garrett Bernstein (gsb29) Karen Ho (ksh33) Advanced Machine Learning: CS 6780 Abstract We tackle the problem of segmenting an image into planes given user input.

More information

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra Pattern Recall Analysis of the Hopfield Neural Network with a Genetic Algorithm Susmita Mohapatra Department of Computer Science, Utkal University, India Abstract: This paper is focused on the implementation

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

JPEG compression of monochrome 2D-barcode images using DCT coefficient distributions

JPEG compression of monochrome 2D-barcode images using DCT coefficient distributions Edith Cowan University Research Online ECU Publications Pre. JPEG compression of monochrome D-barcode images using DCT coefficient distributions Keng Teong Tan Hong Kong Baptist University Douglas Chai

More information

Hidden Loop Recovery for Handwriting Recognition

Hidden Loop Recovery for Handwriting Recognition Hidden Loop Recovery for Handwriting Recognition David Doermann Institute of Advanced Computer Studies, University of Maryland, College Park, USA E-mail: doermann@cfar.umd.edu Nathan Intrator School of

More information

Variational Methods II

Variational Methods II Mathematical Foundations of Computer Graphics and Vision Variational Methods II Luca Ballan Institute of Visual Computing Last Lecture If we have a topological vector space with an inner product and functionals

More information

Image Processing. Image Features

Image Processing. Image Features Image Processing Image Features Preliminaries 2 What are Image Features? Anything. What they are used for? Some statements about image fragments (patches) recognition Search for similar patches matching

More information

A Generalized Method to Solve Text-Based CAPTCHAs

A Generalized Method to Solve Text-Based CAPTCHAs A Generalized Method to Solve Text-Based CAPTCHAs Jason Ma, Bilal Badaoui, Emile Chamoun December 11, 2009 1 Abstract We present work in progress on the automated solving of text-based CAPTCHAs. Our method

More information

Multi-View 3D-Reconstruction

Multi-View 3D-Reconstruction Multi-View 3D-Reconstruction Cedric Cagniart Computer Aided Medical Procedures (CAMP) Technische Universität München, Germany 1 Problem Statement Given several calibrated views of an object... can we automatically

More information

Fusion of Depth Maps for Multi-view Reconstruction

Fusion of Depth Maps for Multi-view Reconstruction POSTER 05, PRAGUE MAY 4 Fusion of Depth Maps for Multi-view Reconstruction Yu ZHAO, Ningqing QIAN Institute of Communications Engineering, RWTH Aachen University, 5056 Aachen, Germany yu.zhao@rwth-aachen.de,

More information

CHAPTER 3 DISPARITY AND DEPTH MAP COMPUTATION

CHAPTER 3 DISPARITY AND DEPTH MAP COMPUTATION CHAPTER 3 DISPARITY AND DEPTH MAP COMPUTATION In this chapter we will discuss the process of disparity computation. It plays an important role in our caricature system because all 3D coordinates of nodes

More information