A Fast Region-Level 3D-Warping Method for Depth-Image-Based Rendering


Jian Jin #1, Anhong Wang *2, Yao Zhao #3, Chunyu Lin #4
# Beijing Key Laboratory of Advanced Information Science and Network Technology, Institute of Information Science, Beijing Jiaotong University, China
1 JianJin@bjtu.edu 3 yzhao@bjtu.edu.cn (corresponding author) 4 cylin@bjtu.edu.cn
* Institute of Digital Media & Communication, Taiyuan University of Science and Technology, China
2 wah_ty@163.com

Abstract — In 3D video, depth-image-based rendering (DIBR) is widely employed in view synthesis to generate virtual views. However, this algorithm operates at the frame level, so the characteristics of different regions cannot be fully taken into account before rendering. This drawback causes unnecessary, redundant information in some regions to be processed, which adds extra computation. This paper proposes a region-level 3D-warping method for DIBR, in which regions are divided according to their characteristics. Only the necessary information in the important regions is then used in warping, so that redundant information is kept out of the computation. Experimental results show that our approach is almost 4 times faster than VSRS-1D-fast, at an average cost of 0.12 dB PSNR in the quality of the synthesized views. Hence, our method achieves a good trade-off between computation and view-synthesis quality, and will be especially useful for applications where computation is a concern.

I. INTRODUCTION
3D video (3DV) is getting increasingly popular since it offers an immersive experience. In 3DV, view synthesis is an important technology, which exploits Depth-Image-Based Rendering (DIBR) to generate the virtual views [1]. The core of DIBR is to map pixels from the known original views to the unknown virtual views by utilizing the depth information and the camera parameters.
Two main kinds of DIBR approaches have been proposed: the single-view approach and the multi-view approach [2]. An example of the single-view approach is proposed by Oliveira et al. [3], which utilizes a single original view plus its depth information to synthesize the virtual views. However, this approach often creates large holes in the virtual views due to the lack of corresponding background information, which is occluded by the foreground objects in the original view. An example of the multi-view approach is proposed by Min et al. [4], which uses multiple views from both sides of the virtual view to perform the warping, so that the holes can be filled using the other views. This approach was further developed into software called the View Synthesis Reference Software (VSRS) [5]. However, VSRS is hardly suitable for applications where computation is a concern. For this reason, many approaches have been proposed to speed up VSRS, such as [6] and [7]. They all focus on 3D-warping, since it is the most time-consuming operation in VSRS. The most representative one is proposed by Vijayanagar [8], which is fast and gives good objective quality. It has been further developed into the latest View Synthesis Reference Software 1D Fast (VSRS-1D-fast) algorithm [9], a variant of VSRS. This algorithm is optimized for the 1-D parallel model, in which the virtual view is aligned vertically with the original views, just like the 1D mode of VSRS; thus, only the horizontal disparity needs to be considered. Compared to the conventional VSRS, it calculates the disparity with a separable look-up-table-based technique that considerably speeds up the 3D-warping process. Additionally, a novel blending approach based on similarity maps further improves the quality of the virtual views [10]. However, VSRS-1D-fast still works at the frame level, just like [6] and [7].
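The look-up-table idea can be illustrated with a short sketch. This is not the VSRS-1D-fast code; it assumes the common MPEG 8-bit depth convention (depth range Z_near to Z_far, focal length f_x, baseline between original and virtual view), under which a 256-entry table replaces the per-pixel disparity computation:

```python
import numpy as np

# Sketch of a look-up-table-based 1-D disparity computation (an illustration
# of the idea, not the actual VSRS-1D-fast code). With 8-bit depth maps there
# are only 256 possible depth values, so the disparity for each value can be
# precomputed once per frame instead of once per pixel.

def build_disparity_lut(fx, baseline, z_near, z_far):
    """Precompute the horizontal disparity (pixels) for every 8-bit depth value."""
    d = np.arange(256, dtype=np.float64)
    inv_z = (d / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return fx * baseline * inv_z

def warp_row(depth_row, lut):
    """Map pixel x to x - disparity(d(x)) in the virtual view (1-D parallel setup)."""
    disparity = lut[depth_row]                      # one table look-up per pixel
    return np.arange(depth_row.size) - np.round(disparity).astype(int)

# Example with made-up camera values: nearer pixels (larger depth value) shift further.
lut = build_disparity_lut(fx=1000.0, baseline=0.05, z_near=1.0, z_far=10.0)
targets = warp_row(np.array([0, 128, 255], dtype=np.uint8), lut)
```

Because the table is separable from the per-pixel loop, the per-pixel work reduces to one array look-up and one subtraction.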
In these frame-level algorithms, all pixels in the whole frame are used in rendering, including unnecessary pixels that bring hardly any benefit to the final synthesized view. This is the main drawback of frame-level rendering in terms of computational complexity. For instance, pixels in some regions are visible in both original views; they need to be used only once during rendering, yet they are warped repeatedly. In another case, some pixels are visible in an original view but become invisible in the virtual view for some reason (such as occlusion); warping them should be avoided, yet they are still warped. Inspired by this, a fast region-level 3D-warping algorithm is proposed, in which pixels of unnecessary regions are excluded from warping while pixels of the important regions are retained. In our study, we choose VSRS-1D-fast (the state of the art) as the benchmark for comparison. The rest of the paper is organized as follows. Section II gives the complexity analysis of VSRS-1D-fast. Section III introduces the different regions in the depth images. Section IV proposes the region-level 3D-warping scheme. In Section V, experiments are conducted to evaluate the proposed method. Finally, Section VI concludes the paper.

II. COMPLEXITY ANALYSIS OF VSRS-1D-FAST
The framework of VSRS-1D-fast consists of two main steps: 1) warping all pixels of each original view to the virtual view based on the depth information and filling the holes, and 2) blending the warped views into one. The framework of VSRS-1D-fast is shown in Fig. 1.
Fig. 1. The framework of VSRS-1D-fast, where TR, TL, DR and DL are texture right, texture left, depth right, and depth left, respectively.
In Fig. 1, during the warping stage, three steps are carried out together, i.e., warping, interpolation, and hole-filling. To speed up the warping process, the traditional 3D-warping function [11] is replaced by an efficient and simple separable look-up-table-based technique. However, this stage has some drawbacks. Firstly, occlusion, dis-occlusion, etc. [5] are all judged by recording the warped positions of pixels, which means that part of the memory is used for recording these positions. Secondly, some pixels that are visible in the original view are occluded by other pixels or fall beyond the virtual view's range; they are invisible in the virtual view, yet they are still warped. Both drawbacks can be optimized in our method: using the depth differences between neighbouring pixels, we can predict occlusion, dis-occlusion, etc. before warping, and the occluded pixels are marked and excluded from warping. Details are given in Section III.
The blending stage is the other main process in VSRS-1D-fast; it merges the two warped views into a single virtual view. This stage contains four steps: 1) creating reliability maps, 2) enhancing similarity, 3) combining the warped views into one based on the reliability maps, and 4) decimating chroma. Details can be found in [9].
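The neighbour-pixel occlusion prediction just mentioned can be sketched as follows. This is a simplified illustration rather than the paper's exact procedure: `disp` is an assumed helper mapping a depth value to its horizontal disparity in pixels, and with the virtual view to the right of the original view, the foreground shifts further left than the background, so the background pixels just left of each left foreground boundary will be covered.

```python
def mark_occluded(depth_row, disp, theta):
    """Predict, before warping, which background pixels will be covered by the
    foreground in the virtual view, so they can be skipped.
    depth_row: one row of 8-bit depth values; disp: depth value -> disparity
    (pixels); theta: boundary threshold on the neighbour depth difference."""
    occluded = [False] * len(depth_row)
    for x in range(1, len(depth_row)):
        # Left foreground boundary: depth value jumps up from (x-1) to x.
        if depth_row[x] - depth_row[x - 1] >= theta:
            # The covered span is as wide as the disparity difference
            # between the foreground and the background.
            width = int(round(disp(depth_row[x]) - disp(depth_row[x - 1])))
            for k in range(max(0, x - width), x):
                occluded[k] = True
    return occluded

# Toy example: the three background pixels left of the foreground get covered.
flags = mark_occluded([50, 50, 50, 200, 200], disp=lambda d: d / 25.0, theta=30)
```

The point is that the test uses only depth differences available before warping, so no warped positions need to be recorded.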
Generally, two kinds of blending algorithms are included: 1) blending the two warped views with a linear weighting function, and 2) choosing one warped view as the dominant view and filling its holes from the other warped view. The main goal of blending is to fill the unreliable (invisible) regions of one warped view using the reliable (visible) regions of the other. In this paper, the unreliable regions are regarded as holes, and two kinds of holes are focused on: 1) dis-occlusion holes, and 2) boundary holes; these two kinds of holes are discussed in [12]. Generally speaking, the blending stage can thus be regarded as filling the dis-occlusion holes and boundary holes using regional information from the other original view. However, since these specific regions cannot easily be located accurately, frame-level 3D-warping is used in VSRS-1D-fast, meaning that all pixels in the original frames are warped; this causes redundant regional pixels (those visible in both original views) to be warped. In this paper, this disadvantage is handled by locating these specific regions, so that the redundant 3D-warping can be avoided and the blending stage can be saved.
From the entire process of the VSRS-1D-fast algorithm, we observe that traditional frame-level warping has many drawbacks, since different regions of the original view affect the rendered virtual view differently. Hence, a region-level 3D-warping is proposed in this paper, which considers the regions' characteristics and combines blending with warping. The method consists of two main steps: 1) region dividing, and 2) region-level 3D-warping (without a blending stage).
III. REGION DIVIDING IN THE DEPTH IMAGE
To analyse the different regions in the rendered virtual view, a vertical (top-down) view of the scene and the cameras, with a foreground object in front of a flat background plane, is shown in Fig. 2.
View1 and View3 stand for the two original views. View2 is the virtual view to be synthesized from View1 and View3. MN and AR are the foreground and the background, respectively. For simplicity, we place View2 at the midpoint between View1 and View3. Besides, the optical axes of the views are parallel to each other, and there is only horizontal disparity among the views, which is known as the 1-D parallel arrangement. The distance between View1 and View3 is L. According to the regions in the virtual view, regions in the image planes of View1 and View3 are classified into the following types:
Fig. 2. Illustration of different regions in 3D-warping
1) Boundary non-effective region (BO-NER). When synthesizing View2 from View1, we find that Region AB is beyond the boundary of the virtual view. Therefore, this

region plays no role in the virtual view. Thus, in the image plane of View1, the region corresponding to AB has no effect on the virtual view; we call this region a BO-NER. Similarly, the region corresponding to KR in View3 is a BO-NER. In VSRS-1D-fast, the pixels of a BO-NER are still mapped to the virtual view, which costs computation but brings no benefit to the synthesized virtual view.
2) Boundary single effective region (BO-SER). Region BC can be seen by only one view (i.e., View1) and is called a BO-SER. Similarly, Region JK in View3 is also a BO-SER. In other words, there will be holes if we use only View1 to synthesize View2. Generally, blending simply uses the BO-SER of one original view to fill the holes left by the other original view, so the information in this type of region is the key to filling the boundary holes.
3) Double effective region (DER). Regions CD, MN, and IJ can be obtained from both original views; a region of this type is called a DER.
4) Background single effective region (BA-SER). Region DE is a single effective region in View1, but it appears at the boundary between the foreground and the background. Such a region is defined as a BA-SER. Similarly, Region HI is a BA-SER in View3. The characteristic of such regions is similar to that of a BO-SER, i.e., the information in these regions is the key to filling the dis-occlusion holes, which are the main holes in the virtual view.
5) Background non-effective region (BA-NER). Region EF is the area where mapping competition occurs, since the background EF and a part of the foreground MN are mapped to the same area of the virtual view during warping, and the background pixels are then replaced by the foreground pixels. We define this region as a BA-NER. Pixels of such areas are still mapped in VSRS-1D-fast. Just as Region EF is a BA-NER for View1, Region GH is a BA-NER for View3.
IV.
3D-WARPING BASED ON REGION DIVISION
Based on the observations in Section III, we can apply different processing to different regions: the pixels of non-effective regions and redundant regions need not be warped, which saves computation. The proposed region-level 3D-warping consists of the following steps. First, one depth image is chosen as the dominant view. Second, we extract the foreground boundaries in the dominant view. Third, we divide the regions and calculate their sizes. Finally, we perform effective-region warping, as described in more detail in the following subsections.
A. Selecting One Depth Image as the Dominant View
Depth information represents critical spatial information. To divide the images into regions, we start by choosing one original view as the dominant view to perform the warping, and then use the other view to provide the information for filling the holes. Although for many regions the information is available from both views, the texture may have slightly different values in different views, so it is advantageous to use the information from the dominant view as much as possible to keep the texture of the rendered view consistent. In this paper, we simply select one of the original views, e.g., View1, as the dominant view.
B. Extracting the Foreground Boundaries
To extract the foreground boundaries, we take the difference between neighbouring pixels, as discussed in [13]:

F(x,y) = 1,  if Δ_L = d(x,y) − d(x−1,y) ≥ θ
F(x,y) = −1, if Δ_R = d(x,y) − d(x+1,y) ≥ θ
F(x,y) = 0,  otherwise    (1)

where d(x,y) stands for the depth value of the pixel at (x,y) and F(x,y) is the boundary function. If F(x,y) equals 1, there is a left boundary between (x−1,y) and (x,y): (x,y) belongs to the foreground and (x−1,y) to the background. If F(x,y) equals −1, there is a right boundary between (x,y) and (x+1,y): (x,y) belongs to the foreground and (x+1,y) to the background.
Otherwise, F(x,y) equals 0, which indicates that no boundary appears between the neighbouring pixels. θ is a threshold determining how large a depth change qualifies as a boundary; it is strongly related to the sequence content and the camera parameters, and is given by

θ = 510 / ( L · f_x · ( 1/Z_near − 1/Z_far ) )    (2)

where Z_near and Z_far are the depth range of the physical scene, and L and f_x are the baseline and the focal length, respectively.
C. Calculating the Region Size
In Section III, we described the concept of the different regions. Now we calculate the region sizes based on the foreground boundaries. Consider the simple example in Fig. 2, where a foreground object is in front of a flat background plane parallel to the image plane.
Fig. 3. Region-level rendering
Denote the points in the 3D real world as A, B, C, …, the corresponding points in the depth image of View1 as a1, b1, c1, …, and the corresponding points in the depth image in

View2 as a2, b2, c2, …, etc. Referring to Fig. 3, a commonly used depth formula is

1/Z_A = ( d(x_a, y_a) / 255 ) · ( 1/Z_near − 1/Z_far ) + 1/Z_far    (3)

where d(x_a, y_a) is the pixel value of the depth image, with a range between 0 and 255, at coordinate (x_a, y_a), and Z_A is the physical depth for pixel value d(x_a, y_a). As shown in Fig. 3, to generate View2 we only need to warp its regions from the corresponding regions in the image planes of View1 and View3. We generate View2 starting from the left, using the information in View1. Since Region AB will not appear in View2, we should skip Region a1b1 in View1. Let |a1b1| denote the size of Region a1b1 (similarly, |AB| denotes the size of Region AB, etc.). For Region AB we have

|a1b1| / |AB| = f_x / Z_A    (4)

Since View2 is at the midpoint between View1 and View3, |AB| = L/2, and the focal length of the camera is f_x. Obtaining Z_A from formula (3), |a1b1| can be calculated as

|a1b1| = ( f_x · L / 2 ) · [ ( d(x_a1, y_a1) / 255 ) · ( 1/Z_near − 1/Z_far ) + 1/Z_far ]    (5)

Hence, given d(x_a1, y_a1), the depth value of the boundary point a1, we can calculate the size of a1b1; Region a1b1 in View1 will be skipped before warping. Next, we need to calculate |b1e1| in View1 so that we can render Region b2e2 in View2 from Region b1e1 in View1. Since point M appears as a left boundary point in View1, we can locate m1 in View1 easily. Since the triangle MEF is similar to the triangle MV2V1, and |V2V1| = L/2,

|EF| = ( L / 2 ) · ( Z_F − Z_M ) / Z_M    (6)

And so,

|e1f1| = ( f_x / Z_F ) · |EF| = ( f_x · L / 2 ) · ( 1/Z_M − 1/Z_F )
       = ( f_x · L / 510 ) · ( 1/Z_near − 1/Z_far ) · ( d(x_m1, y_m1) − d(x_f1, y_f1) )    (7)

From the point m1, which is the point in View1 corresponding to the points M and F, and |e1f1|, we can locate the point e1 in View1, and use Region b1e1 in View1 to warp Region b2e2 in View2. Next, we need to render Region e2h2 in View2 from Region m1n1 in View1. Since N is a right boundary point, we can locate Region m1n1 in View1 and render it into Region e2h2 easily.
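Equations (3)–(7) can be sanity-checked numerically. The sketch below uses made-up camera values; f_x, L, Z_near and Z_far are the quantities defined above. It computes |a1b1| from the depth value at the boundary point a1, and |e1f1| from the depth values on either side of the left foreground boundary at m1:

```python
def size_a1b1(d_a1, fx, L, z_near, z_far):
    """|a1b1| via Eqs. (3)-(5): the span of View1 to skip because Region AB
    (of physical width L/2) falls outside the virtual view."""
    inv_z_a = (d_a1 / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far  # Eq. (3)
    return (fx * L / 2.0) * inv_z_a                                        # Eq. (5)

def size_e1f1(d_m1, d_f1, fx, L, z_near, z_far):
    """|e1f1| via Eq. (7): the occluded background span, from the depth
    difference across the left foreground boundary at m1."""
    return (fx * L / 510.0) * (1.0 / z_near - 1.0 / z_far) * (d_m1 - d_f1)

# Consistency check: with d_m1 = 255 and d_f1 = 0, Eq. (7) must equal the
# disparity difference between the nearest and farthest planes at baseline L/2.
ab = size_a1b1(0, fx=1000.0, L=0.1, z_near=1.0, z_far=10.0)       # 5.0 pixels
ef = size_e1f1(255, 0, fx=1000.0, L=0.1, z_near=1.0, z_far=10.0)  # 45.0 pixels
```

Note that both sizes depend only on depth values at boundary pixels, which is why the region division costs far less than warping every pixel.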
Next, since Region HI is occluded in View1, we need to warp Region h2i2 in View2 from Region h3i3 in View3. To do this, we first locate the point h3 in View3, for which we first find |HI|. Since the triangle NHI is similar to the triangle NV2V1, and |V2V1| = L/2,

|HI| = ( L / 2 ) · ( Z_I − Z_N ) / Z_N    (8)

And so,

|h3i3| = ( f_x / Z_I ) · |HI| = ( f_x · L / 2 ) · ( 1/Z_N − 1/Z_I )
       = ( f_x · L / 510 ) · ( 1/Z_near − 1/Z_far ) · ( d(x_n1, y_n1) − d(x_i1, y_i1) )    (9)

We can then warp i1 to i3 and locate the point h3 by subtracting |h3i3|. After that, we can warp Region h3i3 in View3 to Region h2i2 in View2. Then we render Region i2j2 from the information of i1j1, since View1 is the dominant view, although in this case the information is also available from View3. It should be noted that i1j1 is right next to m1n1 in View1 and the point j1 is the end of the image plane of View1; thus the region m1n1 can be processed as a whole in View1, and we need not determine the size of Region i1j1. The next step is to find the information of Region j2k2 from Region j3k3 in View3, since this information is not available from View1. Before doing that, we need to locate the point j3 in View3, which we do by warping the point j1 to j3. Then, since |JK| = L/2 and |j3k3| / |JK| = f_x / Z_J,

|j3k3| = ( f_x · L / 2 ) · [ ( d(x_j1, y_j1) / 255 ) · ( 1/Z_near − 1/Z_far ) + 1/Z_far ]    (10)

In summary, we perform the rendering of View2 as follows. We start from the left, using View1 to perform the rendering; whenever we encounter a left or right boundary, we use the process discussed above to take care of the boundary pixels, until the end of the image plane is reached. Alternatively, we can start the rendering of View2 from the right using View3, taking View3 as the dominant view; the process is simply the mirror of the above.
V. EXPERIMENTAL RESULTS
Two groups of experiments are performed to evaluate the proposed approach.
One is the comparison of time complexity; the other is the comparison of synthesis performance. To be more convincing, the first 100 frames of the left view (View1) and the right view (View3) are used to render the virtual view (View2); hence the experimental results (e.g., TABLE I, TABLE II, etc.) are averages over 100 frames. Furthermore, three rendering precisions (full-pixel, half-pixel, and quarter-pixel) are tested. Several standard sequences, including GIST's Café [14], HHI's Book Arrival [15], and Nagoya University's Balloons [16], are used in the simulations. VSRS-1D-fast is run according to the 3DV common test conditions (CTC) [17], and our method is run on the same experimental platform and under the same conditions.

TABLE I
TOTAL TIME COMPARISON
Sequences (views): Café 3(2,4), Book Arrival 9(8,10), Balloons 2(1,3). For each sequence, the total run time (sec) of 1D-fast and of the proposed method is reported at full-, half-, and quarter-pixel precision, together with the ratio (1D-fast/proposed) and the overall average.

TABLE II
STEP-WISE EXECUTION TIME COMPARISON
Sequences (views): Café 3(2,4), Book Arrival 9(8,10), Balloons 2(1,3). For each sequence, the execution time of each process stage is reported: region dividing (proposed only), warping (both methods), and blending (1D-fast only).

A. Evaluation of Computation Complexity
First, the computation complexity of VSRS-1D-fast and of our method is compared. From the data in TABLE I, the average total time of VSRS-1D-fast is almost 4 times ours. A step-wise execution-time comparison is then given in TABLE II. As reviewed in Section II, VSRS-1D-fast has two main stages: 1) the warping stage, and 2) the blending stage, while our algorithm consists of: 1) a region-dividing stage (including foreground-boundary extraction and region-size calculation), and 2) a region-level 3D-warping stage. In our approach, since the ratio of foreground-boundary pixels to the whole image is negligible, the computational cost of the region-dividing stage is quite limited. Besides, during the 3D-warping stage, pixels in the BO-NERs and BA-NERs are spared from warping and pixels in the DERs are warped only once, which greatly reduces the 3D-warping burden; we also avoid the judgment of occlusion and dis-occlusion. Hence, for the warping stage, the computational complexity of our method is only about a quarter of that of VSRS-1D-fast.
Finally, since the purpose of the blending stage has already been taken into account during the region-dividing stage, the blending stage is saved as well.
B. Evaluation of Synthesis Performance
The evaluation of view-synthesis performance is given in this subsection. First, the objective performance (PSNR) against the ground truth is measured for the virtual views generated by VSRS-1D-fast and by our approach; the results are shown in TABLE III. Our approach is 0.12 dB lower than the VSRS-1D-fast algorithm on average, which is still comparable. Second, subjective comparisons are shown in Fig. 4, where (a), (b), and (c) are the ground truth, the virtual view generated by VSRS-1D-fast, and ours, respectively; all pictures are the first frames of the sequences at full-pixel rendering precision. However, some inconspicuous artifacts appear around the foreground in Fig. 4(c), highlighted in the red rectangles. This is because our model assumes that the background is flat, while the background of the real world is generally complicated; this makes the region dividing not perfectly accurate and affects the quality of the virtual view. Overall, the subjective performance is acceptable. In future work, we will focus on the general background case and further improve the quality of the virtual view.

TABLE III
OBJECTIVE PERFORMANCE (PSNR) COMPARISON
Sequences (views): Café 3(2,4), Book Arrival 9(8,10), Balloons 2(1,3). For each sequence, the PSNR (dB) of 1D-fast and of the proposed method is reported at full-, half-, and quarter-pixel precision, together with ΔPSNR (1D-fast − proposed) and the overall average.

VI. CONCLUSION
In this paper, we proposed a fast region-level 3D-warping method for DIBR, which shows comparable subjective and objective quality of the synthesized virtual views while saving computation time.
For the computation saving, our method is based on region division, which makes good use of the characteristics of the different regions so as to avoid unnecessary 3D-warping and blending. Considering this, our method is competitive and will be especially useful for applications where computation is a concern.
ACKNOWLEDGMENT
The authors would like to thank Prof. Ming-Ting Sun of the University of Washington for his valuable suggestions and kind help. This work has been supported by the National Natural Science Foundation of China (No. , No. , No. , No. , No. and No. ),

by the Beijing Natural Science Foundation ( ) and SRFDP ( ), by the Fundamental Research Funds for the Central Universities (2015JBM032), by the Cooperative Program of Shanxi Province (No. ), the Scientific and Technological Project of Shanxi Province ( ), a Research Project Supported by the Shanxi Scholarship Council of China ( ), and the Program for New Century Excellent Talents in Universities (NCET ).
REFERENCES
[1] K. Pulli, "Surface Reconstruction and Display from Range and Color Data," Ph.D. dissertation, University of Washington.
[2] P. Merkle, A. Smolic, K. Müller, and T. Wiegand, "Multi-view Video plus Depth Representation and Coding," in Proc. IEEE International Conference on Image Processing (ICIP'07), Sep.
[3] M. M. Oliveira, G. Bishop, and D. McAllister, "Relief texture mapping," in Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques.
[4] D. Min, D. Kim, and K. Sohn, "Virtual view rendering system for 3DTV," in 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, May.
[5] D. Tian, P. Lai, P. Lopez, and C. Gomila, "View synthesis techniques for 3D video," in Proc. SPIE Applications of Digital Image Processing XXXII, vol. 7443, Aug.
[6] P. K. Tsung, P. C. Lin, L. F. Ding, S. Y. Chien, and L. G. Chen, "Single iteration view interpolation for multiview video applications," in Proc. of 3DTV Conference, pp. 1-4, May.
[7] K. H. Chen, "Reducing computation redundancy for high-efficiency view synthesis," in Proc. of IEEE Int. Symp. on VLSI Design, Automation, and Test (VLSI-DAT), pp. 1-4, April.
[8] K. R. Vijayanagar, J. Kim, Y. Lee, and J. B. Kim, "Efficient view synthesis for multi-view video plus depth," in Proc. of IEEE Int. Conf. on Image Processing, Sep.
[9] 3D-HEVC Test Model 1, Doc. JCT3V-A1005_d0, ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, July.
[10] Analysis of View Synthesis Methods (VSRS 1D fast and VSRS 3.5), Doc.
JCT3V-B0124, ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, Oct.
[11] Y. Mori, N. Fukushima, T. Fujii, and M. Tanimoto, "View generation with 3D warping using depth information for FTV," Signal Processing: Image Communication, vol. 24, no. 1.
[12] K. Y. Chen, P. K. Tsung, P. C. Lin, H. J. Yang, and L. G. Chen, "Hybrid motion/depth-oriented inpainting for virtual view synthesis in multiview applications," in Proc. of 3DTV Conference, pp. 1-4.
[13] X. Xu, L. M. Po, K. W. Cheung, K. H. Ng, K. M. Wong, and C. W. Ting, "A foreground biased depth map refinement," in Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr.
[14] Electronics and Telecommunications Research Institute and Gwangju Institute of Science and Technology (April 2008), 3DV Sequences of ETRI and GIST [Online]. Available: ftp://
[15] Fraunhofer Heinrich Hertz Institute (Sept. 2013), 3DV Sequences of HHI [Online]. Available: ftp://ftp.hhi.de/impeg3dv
[16] Nagoya University (March 2008), 3DV Sequences of Nagoya University [Online]. Available:
[17] Common test conditions for 3DV experimentation, ISO/IEC JTC1/SC29/WG11 MPEG2012/N12560, Feb.
Fig. 4. (a) is the ground truth; (b) and (c) are generated by the VSRS-1D-fast algorithm and by ours, respectively. The inconspicuous artifacts are highlighted in the red rectangles.


More information

View Synthesis Prediction for Rate-Overhead Reduction in FTV

View Synthesis Prediction for Rate-Overhead Reduction in FTV MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com View Synthesis Prediction for Rate-Overhead Reduction in FTV Sehoon Yea, Anthony Vetro TR2008-016 June 2008 Abstract This paper proposes the

More information

LBP-GUIDED DEPTH IMAGE FILTER. Rui Zhong, Ruimin Hu

LBP-GUIDED DEPTH IMAGE FILTER. Rui Zhong, Ruimin Hu LBP-GUIDED DEPTH IMAGE FILTER Rui Zhong, Ruimin Hu National Engineering Research Center for Multimedia Software,School of Computer, Wuhan University,Wuhan, 430072, China zhongrui0824@126.com, hrm1964@163.com

More information

ARCHITECTURES OF INCORPORATING MPEG-4 AVC INTO THREE-DIMENSIONAL WAVELET VIDEO CODING

ARCHITECTURES OF INCORPORATING MPEG-4 AVC INTO THREE-DIMENSIONAL WAVELET VIDEO CODING ARCHITECTURES OF INCORPORATING MPEG-4 AVC INTO THREE-DIMENSIONAL WAVELET VIDEO CODING ABSTRACT Xiangyang Ji *1, Jizheng Xu 2, Debin Zhao 1, Feng Wu 2 1 Institute of Computing Technology, Chinese Academy

More information

EXPANSION HOLE FILLING IN DEPTH-IMAGE-BASED RENDERING USING GRAPH-BASED INTERPOLATION

EXPANSION HOLE FILLING IN DEPTH-IMAGE-BASED RENDERING USING GRAPH-BASED INTERPOLATION EXPANSION HOLE FILLING IN DEPTH-IMAGE-BASED RENDERING USING GRAPH-BASED INTERPOLATION Yu Mao, Gene Cheung #, Antonio Ortega $ and Yusheng Ji # The Graduate University for Advanced Studies, # National Institute

More information

View Synthesis for Multiview Video Compression

View Synthesis for Multiview Video Compression View Synthesis for Multiview Video Compression Emin Martinian, Alexander Behrens, Jun Xin, and Anthony Vetro email:{martinian,jxin,avetro}@merl.com, behrens@tnt.uni-hannover.de Mitsubishi Electric Research

More information

FILTER BASED ALPHA MATTING FOR DEPTH IMAGE BASED RENDERING. Naoki Kodera, Norishige Fukushima and Yutaka Ishibashi

FILTER BASED ALPHA MATTING FOR DEPTH IMAGE BASED RENDERING. Naoki Kodera, Norishige Fukushima and Yutaka Ishibashi FILTER BASED ALPHA MATTING FOR DEPTH IMAGE BASED RENDERING Naoki Kodera, Norishige Fukushima and Yutaka Ishibashi Graduate School of Engineering, Nagoya Institute of Technology ABSTRACT In this paper,

More information

Low-Complexity, Near-Lossless Coding of Depth Maps from Kinect-Like Depth Cameras

Low-Complexity, Near-Lossless Coding of Depth Maps from Kinect-Like Depth Cameras Low-Complexity, Near-Lossless Coding of Depth Maps from Kinect-Like Depth Cameras Sanjeev Mehrotra, Zhengyou Zhang, Qin Cai, Cha Zhang, Philip A. Chou Microsoft Research Redmond, WA, USA {sanjeevm,zhang,qincai,chazhang,pachou}@microsoft.com

More information

Analysis of Depth Map Resampling Filters for Depth-based 3D Video Coding

Analysis of Depth Map Resampling Filters for Depth-based 3D Video Coding MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Analysis of Depth Map Resampling Filters for Depth-based 3D Video Coding Graziosi, D.B.; Rodrigues, N.M.M.; de Faria, S.M.M.; Tian, D.; Vetro,

More information

Fast and Effective Interpolation Using Median Filter

Fast and Effective Interpolation Using Median Filter Fast and Effective Interpolation Using Median Filter Jian Zhang 1, *, Siwei Ma 2, Yongbing Zhang 1, and Debin Zhao 1 1 Department of Computer Science, Harbin Institute of Technology, Harbin 150001, P.R.

More information

A Semi-Automatic 2D-to-3D Video Conversion with Adaptive Key-Frame Selection

A Semi-Automatic 2D-to-3D Video Conversion with Adaptive Key-Frame Selection A Semi-Automatic 2D-to-3D Video Conversion with Adaptive Key-Frame Selection Kuanyu Ju and Hongkai Xiong Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China ABSTRACT To

More information

5LSH0 Advanced Topics Video & Analysis

5LSH0 Advanced Topics Video & Analysis 1 Multiview 3D video / Outline 2 Advanced Topics Multimedia Video (5LSH0), Module 02 3D Geometry, 3D Multiview Video Coding & Rendering Peter H.N. de With, Sveta Zinger & Y. Morvan ( p.h.n.de.with@tue.nl

More information

Optimizing the Deblocking Algorithm for. H.264 Decoder Implementation

Optimizing the Deblocking Algorithm for. H.264 Decoder Implementation Optimizing the Deblocking Algorithm for H.264 Decoder Implementation Ken Kin-Hung Lam Abstract In the emerging H.264 video coding standard, a deblocking/loop filter is required for improving the visual

More information

View Synthesis for Multiview Video Compression

View Synthesis for Multiview Video Compression MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com View Synthesis for Multiview Video Compression Emin Martinian, Alexander Behrens, Jun Xin, and Anthony Vetro TR2006-035 April 2006 Abstract

More information

Quality improving techniques in DIBR for free-viewpoint video Do, Q.L.; Zinger, S.; Morvan, Y.; de With, P.H.N.

Quality improving techniques in DIBR for free-viewpoint video Do, Q.L.; Zinger, S.; Morvan, Y.; de With, P.H.N. Quality improving techniques in DIBR for free-viewpoint video Do, Q.L.; Zinger, S.; Morvan, Y.; de With, P.H.N. Published in: Proceedings of the 3DTV Conference : The True Vision - Capture, Transmission

More information

A SXGA 3D Display Processor with Reduced Rendering Data and Enhanced Precision

A SXGA 3D Display Processor with Reduced Rendering Data and Enhanced Precision A SXGA 3D Display Processor with Reduced Rendering Data and Enhanced Precision Seok-Hoon Kim KAIST, Daejeon, Republic of Korea I. INTRODUCTION Recently, there has been tremendous progress in 3D graphics

More information

A PANORAMIC 3D VIDEO CODING WITH DIRECTIONAL DEPTH AIDED INPAINTING. Muhammad Shahid Farid, Maurizio Lucenteforte, Marco Grangetto

A PANORAMIC 3D VIDEO CODING WITH DIRECTIONAL DEPTH AIDED INPAINTING. Muhammad Shahid Farid, Maurizio Lucenteforte, Marco Grangetto A PANORAMIC 3D VIDEO CODING WITH DIRECTIONAL DEPTH AIDED INPAINTING Muhammad Shahid Farid, Maurizio Lucenteforte, Marco Grangetto Dipartimento di Informatica, Università degli Studi di Torino Corso Svizzera

More information

FOR compressed video, due to motion prediction and

FOR compressed video, due to motion prediction and 1390 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 24, NO. 8, AUGUST 2014 Multiple Description Video Coding Based on Human Visual System Characteristics Huihui Bai, Weisi Lin, Senior

More information

Depth Estimation for View Synthesis in Multiview Video Coding

Depth Estimation for View Synthesis in Multiview Video Coding MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Depth Estimation for View Synthesis in Multiview Video Coding Serdar Ince, Emin Martinian, Sehoon Yea, Anthony Vetro TR2007-025 June 2007 Abstract

More information

A reversible data hiding based on adaptive prediction technique and histogram shifting

A reversible data hiding based on adaptive prediction technique and histogram shifting A reversible data hiding based on adaptive prediction technique and histogram shifting Rui Liu, Rongrong Ni, Yao Zhao Institute of Information Science Beijing Jiaotong University E-mail: rrni@bjtu.edu.cn

More information

2.1 Image Segmentation

2.1 Image Segmentation 3rd International Conference on Multimedia Technology(ICMT 2013) Fast Adaptive Depth Estimation Algorithm Based on K-means Segmentation Xin Dong 1, Guozhong Wang, Tao Fan, Guoping Li, Haiwu Zhao, Guowei

More information

Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform

Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform Torsten Palfner, Alexander Mali and Erika Müller Institute of Telecommunications and Information Technology, University of

More information

Next-Generation 3D Formats with Depth Map Support

Next-Generation 3D Formats with Depth Map Support MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Next-Generation 3D Formats with Depth Map Support Chen, Y.; Vetro, A. TR2014-016 April 2014 Abstract This article reviews the most recent extensions

More information

DEPTH LESS 3D RENDERING. Mashhour Solh and Ghassan AlRegib

DEPTH LESS 3D RENDERING. Mashhour Solh and Ghassan AlRegib DEPTH LESS 3D RENDERING Mashhour Solh and Ghassan AlRegib School of Electrical and Computer Engineering Georgia Institute of Technology { msolh,alregib } @gatech.edu ABSTRACT We propose a new view synthesis

More information

Graph-based representation for multiview images with complex camera configurations

Graph-based representation for multiview images with complex camera configurations Graph-based representation for multiview images with complex camera configurations Xin Su, Thomas Maugey, Christine Guillemot To cite this version: Xin Su, Thomas Maugey, Christine Guillemot. Graph-based

More information

A New Fast Motion Estimation Algorithm. - Literature Survey. Instructor: Brian L. Evans. Authors: Yue Chen, Yu Wang, Ying Lu.

A New Fast Motion Estimation Algorithm. - Literature Survey. Instructor: Brian L. Evans. Authors: Yue Chen, Yu Wang, Ying Lu. A New Fast Motion Estimation Algorithm - Literature Survey Instructor: Brian L. Evans Authors: Yue Chen, Yu Wang, Ying Lu Date: 10/19/1998 A New Fast Motion Estimation Algorithm 1. Abstract Video compression

More information

Subpixel Corner Detection Using Spatial Moment 1)

Subpixel Corner Detection Using Spatial Moment 1) Vol.31, No.5 ACTA AUTOMATICA SINICA September, 25 Subpixel Corner Detection Using Spatial Moment 1) WANG She-Yang SONG Shen-Min QIANG Wen-Yi CHEN Xing-Lin (Department of Control Engineering, Harbin Institute

More information

Light Field Occlusion Removal

Light Field Occlusion Removal Light Field Occlusion Removal Shannon Kao Stanford University kaos@stanford.edu Figure 1: Occlusion removal pipeline. The input image (left) is part of a focal stack representing a light field. Each image

More information

Auto-focusing Technique in a Projector-Camera System

Auto-focusing Technique in a Projector-Camera System 2008 10th Intl. Conf. on Control, Automation, Robotics and Vision Hanoi, Vietnam, 17 20 December 2008 Auto-focusing Technique in a Projector-Camera System Lam Bui Quang, Daesik Kim and Sukhan Lee School

More information

Multi-directional Hole Filling Method for Virtual View Synthesis

Multi-directional Hole Filling Method for Virtual View Synthesis DOI 10.1007/s11265-015-1069-2 Multi-directional Hole Filling Method for Virtual View Synthesis Ji-Hun Mun 1 & Yo-Sung Ho 1 Received: 30 March 2015 /Revised: 14 October 2015 /Accepted: 19 October 2015 #

More information

Accurate 3D Face and Body Modeling from a Single Fixed Kinect

Accurate 3D Face and Body Modeling from a Single Fixed Kinect Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this

More information

Focus on visual rendering quality through content-based depth map coding

Focus on visual rendering quality through content-based depth map coding Focus on visual rendering quality through content-based depth map coding Emilie Bosc, Muriel Pressigout, Luce Morin To cite this version: Emilie Bosc, Muriel Pressigout, Luce Morin. Focus on visual rendering

More information

FAST MOTION ESTIMATION DISCARDING LOW-IMPACT FRACTIONAL BLOCKS. Saverio G. Blasi, Ivan Zupancic and Ebroul Izquierdo

FAST MOTION ESTIMATION DISCARDING LOW-IMPACT FRACTIONAL BLOCKS. Saverio G. Blasi, Ivan Zupancic and Ebroul Izquierdo FAST MOTION ESTIMATION DISCARDING LOW-IMPACT FRACTIONAL BLOCKS Saverio G. Blasi, Ivan Zupancic and Ebroul Izquierdo School of Electronic Engineering and Computer Science, Queen Mary University of London

More information

360 degree test image with depth Dominika Łosiewicz, Tomasz Grajek, Krzysztof Wegner, Adam Grzelka, Olgierd Stankiewicz, Marek Domański

360 degree test image with depth Dominika Łosiewicz, Tomasz Grajek, Krzysztof Wegner, Adam Grzelka, Olgierd Stankiewicz, Marek Domański INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND AUDIO ISO/IEC JTC1/SC29/WG11 MPEG2017/m41991 January 2018,

More information

A NOVEL SCANNING SCHEME FOR DIRECTIONAL SPATIAL PREDICTION OF AVS INTRA CODING

A NOVEL SCANNING SCHEME FOR DIRECTIONAL SPATIAL PREDICTION OF AVS INTRA CODING A NOVEL SCANNING SCHEME FOR DIRECTIONAL SPATIAL PREDICTION OF AVS INTRA CODING Md. Salah Uddin Yusuf 1, Mohiuddin Ahmad 2 Assistant Professor, Dept. of EEE, Khulna University of Engineering & Technology

More information

Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction

Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction Yongying Gao and Hayder Radha Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48823 email:

More information

An Algorithm for Seamless Image Stitching and Its Application

An Algorithm for Seamless Image Stitching and Its Application An Algorithm for Seamless Image Stitching and Its Application Jing Xing, Zhenjiang Miao, and Jing Chen Institute of Information Science, Beijing JiaoTong University, Beijing 100044, P.R. China Abstract.

More information

Multiview Depth-Image Compression Using an Extended H.264 Encoder Morvan, Y.; Farin, D.S.; de With, P.H.N.

Multiview Depth-Image Compression Using an Extended H.264 Encoder Morvan, Y.; Farin, D.S.; de With, P.H.N. Multiview Depth-Image Compression Using an Extended H.264 Encoder Morvan, Y.; Farin, D.S.; de With, P.H.N. Published in: Proceedings of the 9th international conference on Advanced Concepts for Intelligent

More information

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 5, MAY

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 5, MAY IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 5, MAY 2015 1573 Graph-Based Representation for Multiview Image Geometry Thomas Maugey, Member, IEEE, Antonio Ortega, Fellow Member, IEEE, and Pascal

More information

Analysis of 3D and Multiview Extensions of the Emerging HEVC Standard

Analysis of 3D and Multiview Extensions of the Emerging HEVC Standard MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Analysis of 3D and Multiview Extensions of the Emerging HEVC Standard Vetro, A.; Tian, D. TR2012-068 August 2012 Abstract Standardization of

More information

Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision

Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision What Happened Last Time? Human 3D perception (3D cinema) Computational stereo Intuitive explanation of what is meant by disparity Stereo matching

More information

FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION

FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION Yen-Chieh Wang( 王彥傑 ), Zong-Yi Chen( 陳宗毅 ), Pao-Chi Chang( 張寶基 ) Dept. of Communication Engineering, National Central

More information

Multiview Image Compression using Algebraic Constraints

Multiview Image Compression using Algebraic Constraints Multiview Image Compression using Algebraic Constraints Chaitanya Kamisetty and C. V. Jawahar Centre for Visual Information Technology, International Institute of Information Technology, Hyderabad, INDIA-500019

More information

Bilateral Depth-Discontinuity Filter for Novel View Synthesis

Bilateral Depth-Discontinuity Filter for Novel View Synthesis Bilateral Depth-Discontinuity Filter for Novel View Synthesis Ismaël Daribo and Hideo Saito Department of Information and Computer Science, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 3-85, Japan

More information

Image Quality Assessment Techniques: An Overview

Image Quality Assessment Techniques: An Overview Image Quality Assessment Techniques: An Overview Shruti Sonawane A. M. Deshpande Department of E&TC Department of E&TC TSSM s BSCOER, Pune, TSSM s BSCOER, Pune, Pune University, Maharashtra, India Pune

More information

A Survey of Light Source Detection Methods

A Survey of Light Source Detection Methods A Survey of Light Source Detection Methods Nathan Funk University of Alberta Mini-Project for CMPUT 603 November 30, 2003 Abstract This paper provides an overview of the most prominent techniques for light

More information

Segmentation Based Stereo. Michael Bleyer LVA Stereo Vision

Segmentation Based Stereo. Michael Bleyer LVA Stereo Vision Segmentation Based Stereo Michael Bleyer LVA Stereo Vision What happened last time? Once again, we have looked at our energy function: E ( D) = m( p, dp) + p I < p, q > We have investigated the matching

More information

Performance analysis of Integer DCT of different block sizes.

Performance analysis of Integer DCT of different block sizes. Performance analysis of Integer DCT of different block sizes. Aim: To investigate performance analysis of integer DCT of different block sizes. Abstract: Discrete cosine transform (DCT) has been serving

More information

Compression-Induced Rendering Distortion Analysis for Texture/Depth Rate Allocation in 3D Video Compression

Compression-Induced Rendering Distortion Analysis for Texture/Depth Rate Allocation in 3D Video Compression 2009 Data Compression Conference Compression-Induced Rendering Distortion Analysis for Texture/Depth Rate Allocation in 3D Video Compression Yanwei Liu, Siwei Ma, Qingming Huang, Debin Zhao, Wen Gao, Nan

More information

QUAD-TREE PARTITIONED COMPRESSED SENSING FOR DEPTH MAP CODING. Ying Liu, Krishna Rao Vijayanagar, and Joohee Kim

QUAD-TREE PARTITIONED COMPRESSED SENSING FOR DEPTH MAP CODING. Ying Liu, Krishna Rao Vijayanagar, and Joohee Kim QUAD-TREE PARTITIONED COMPRESSED SENSING FOR DEPTH MAP CODING Ying Liu, Krishna Rao Vijayanagar, and Joohee Kim Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago,

More information

732 IEEE TRANSACTIONS ON BROADCASTING, VOL. 54, NO. 4, DECEMBER /$ IEEE

732 IEEE TRANSACTIONS ON BROADCASTING, VOL. 54, NO. 4, DECEMBER /$ IEEE 732 IEEE TRANSACTIONS ON BROADCASTING, VOL. 54, NO. 4, DECEMBER 2008 Generation of ROI Enhanced Depth Maps Using Stereoscopic Cameras and a Depth Camera Sung-Yeol Kim, Student Member, IEEE, Eun-Kyung Lee,

More information

Coding of 3D Videos based on Visual Discomfort

Coding of 3D Videos based on Visual Discomfort Coding of 3D Videos based on Visual Discomfort Dogancan Temel and Ghassan AlRegib School of Electrical and Computer Engineering, Georgia Institute of Technology Atlanta, GA, 30332-0250 USA {cantemel, alregib}@gatech.edu

More information

Joint Texture-Depth Pixel Inpainting of Disocclusion Holes in Virtual View Synthesis

Joint Texture-Depth Pixel Inpainting of Disocclusion Holes in Virtual View Synthesis Joint Texture-Depth Pixel Inpainting of Disocclusion Holes in Virtual View Synthesis Smarti Reel, Gene Cheung, Patrick Wong and Laurence S. Dooley Department of Computing and Communications, The Open University,

More information

Redundancy and Correlation: Temporal

Redundancy and Correlation: Temporal Redundancy and Correlation: Temporal Mother and Daughter CIF 352 x 288 Frame 60 Frame 61 Time Copyright 2007 by Lina J. Karam 1 Motion Estimation and Compensation Video is a sequence of frames (images)

More information

IBM Research Report. Inter Mode Selection for H.264/AVC Using Time-Efficient Learning-Theoretic Algorithms

IBM Research Report. Inter Mode Selection for H.264/AVC Using Time-Efficient Learning-Theoretic Algorithms RC24748 (W0902-063) February 12, 2009 Electrical Engineering IBM Research Report Inter Mode Selection for H.264/AVC Using Time-Efficient Learning-Theoretic Algorithms Yuri Vatis Institut für Informationsverarbeitung

More information

Optical Flow-Based Person Tracking by Multiple Cameras

Optical Flow-Based Person Tracking by Multiple Cameras Proc. IEEE Int. Conf. on Multisensor Fusion and Integration in Intelligent Systems, Baden-Baden, Germany, Aug. 2001. Optical Flow-Based Person Tracking by Multiple Cameras Hideki Tsutsui, Jun Miura, and

More information

A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION

A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION Yi-Hau Chen, Tzu-Der Chuang, Chuan-Yung Tsai, Yu-Jen Chen, and Liang-Gee Chen DSP/IC Design Lab., Graduate Institute

More information

Sample Based Texture extraction for Model based coding

Sample Based Texture extraction for Model based coding DEPARTMENT OF APPLIED PHYSICS AND ELECTRONICS UMEÅ UNIVERISTY, SWEDEN DIGITAL MEDIA LAB Sample Based Texture extraction for Model based coding Zhengrong Yao 1 Dept. Applied Physics and Electronics Umeå

More information

Video Analysis for Browsing and Printing

Video Analysis for Browsing and Printing Video Analysis for Browsing and Printing Qian Lin, Tong Zhang, Mei Chen, Yining Deng, Brian Atkins HP Laboratories HPL-2008-215 Keyword(s): video mining, video printing, user intent, video panorama, video

More information

One-pass bitrate control for MPEG-4 Scalable Video Coding using ρ-domain

One-pass bitrate control for MPEG-4 Scalable Video Coding using ρ-domain Author manuscript, published in "International Symposium on Broadband Multimedia Systems and Broadcasting, Bilbao : Spain (2009)" One-pass bitrate control for MPEG-4 Scalable Video Coding using ρ-domain

More information

Automatic 2D-to-3D Video Conversion Techniques for 3DTV

Automatic 2D-to-3D Video Conversion Techniques for 3DTV Automatic 2D-to-3D Video Conversion Techniques for 3DTV Dr. Lai-Man Po Email: eelmpo@cityu.edu.hk Department of Electronic Engineering City University of Hong Kong Date: 13 April 2010 Content Why 2D-to-3D

More information

Final report on coding algorithms for mobile 3DTV. Gerhard Tech Karsten Müller Philipp Merkle Heribert Brust Lina Jin

Final report on coding algorithms for mobile 3DTV. Gerhard Tech Karsten Müller Philipp Merkle Heribert Brust Lina Jin Final report on coding algorithms for mobile 3DTV Gerhard Tech Karsten Müller Philipp Merkle Heribert Brust Lina Jin MOBILE3DTV Project No. 216503 Final report on coding algorithms for mobile 3DTV Gerhard

More information

ISSN: (Online) Volume 2, Issue 5, May 2014 International Journal of Advance Research in Computer Science and Management Studies

ISSN: (Online) Volume 2, Issue 5, May 2014 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 2, Issue 5, May 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at:

More information

Segment-based Stereo Matching Using Graph Cuts

Segment-based Stereo Matching Using Graph Cuts Segment-based Stereo Matching Using Graph Cuts Li Hong George Chen Advanced System Technology San Diego Lab, STMicroelectronics, Inc. li.hong@st.com george-qian.chen@st.com Abstract In this paper, we present

More information

ABSTRACT

ABSTRACT Joint Collaborative Team on 3D Video Coding Extension Development of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11 3rd Meeting: Geneva, CH, 17 23 Jan. 2013 Document: JCT3V- C1005_d0 Title: Test Model

More information

Client Driven System of Depth Image Based Rendering

Client Driven System of Depth Image Based Rendering 52 ECTI TRANSACTIONS ON COMPUTER AND INFORMATION TECHNOLOGY VOL.5, NO.2 NOVEMBER 2011 Client Driven System of Depth Image Based Rendering Norishige Fukushima 1 and Yutaka Ishibashi 2, Non-members ABSTRACT

More information

Advanced De-Interlacing techniques with the use of Zonal Based Algorithms

Advanced De-Interlacing techniques with the use of Zonal Based Algorithms Advanced De-Interlacing techniques with the use of Zonal Based Algorithms Alexis M. Tourapis 1, Oscar C. Au 2, Ming L. Liou Department of Electrical and Electronic Engineering, The Hong Kong University

More information

A High Quality/Low Computational Cost Technique for Block Matching Motion Estimation

A High Quality/Low Computational Cost Technique for Block Matching Motion Estimation A High Quality/Low Computational Cost Technique for Block Matching Motion Estimation S. López, G.M. Callicó, J.F. López and R. Sarmiento Research Institute for Applied Microelectronics (IUMA) Department

More information

Open Research Online The Open University s repository of research publications and other research outputs

Open Research Online The Open University s repository of research publications and other research outputs Open Research Online The Open University s repository of research publications and other research outputs Multiview System for Tracking a Fast Moving Object Against Complex Backgrounds Conference or Workshop

More information

Tracking of video objects using a backward projection technique

Tracking of video objects using a backward projection technique Tracking of video objects using a backward projection technique Stéphane Pateux IRISA/INRIA, Temics Project Campus Universitaire de Beaulieu 35042 Rennes Cedex, FRANCE ABSTRACT In this paper, we present

More information

Homogeneous Transcoding of HEVC for bit rate reduction

Homogeneous Transcoding of HEVC for bit rate reduction Homogeneous of HEVC for bit rate reduction Ninad Gorey Dept. of Electrical Engineering University of Texas at Arlington Arlington 7619, United States ninad.gorey@mavs.uta.edu Dr. K. R. Rao Fellow, IEEE

More information

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation 2009 Third International Conference on Multimedia and Ubiquitous Engineering A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation Yuan Li, Ning Han, Chen Chen Department of Automation,

More information

TREE-STRUCTURED ALGORITHM FOR EFFICIENT SHEARLET-DOMAIN LIGHT FIELD RECONSTRUCTION. Suren Vagharshakyan, Robert Bregovic, Atanas Gotchev

TREE-STRUCTURED ALGORITHM FOR EFFICIENT SHEARLET-DOMAIN LIGHT FIELD RECONSTRUCTION. Suren Vagharshakyan, Robert Bregovic, Atanas Gotchev TREE-STRUCTURED ALGORITHM FOR EFFICIENT SHEARLET-DOMAIN LIGHT FIELD RECONSTRUCTION Suren Vagharshakyan, Robert Bregovic, Atanas Gotchev Department of Signal Processing, Tampere University of Technology,

More information

A Fast Depth Intra Mode Selection Algorithm

A Fast Depth Intra Mode Selection Algorithm 2nd International Conference on Artificial Intelligence and Industrial Engineering (AIIE2016) A Fast Depth Intra Mode Selection Algorithm Jieling Fan*, Qiang Li and Jianlin Song (Chongqing Key Laboratory

More information

HEVC based Stereo Video codec

HEVC based Stereo Video codec based Stereo Video B Mallik*, A Sheikh Akbari*, P Bagheri Zadeh *School of Computing, Creative Technology & Engineering, Faculty of Arts, Environment & Technology, Leeds Beckett University, U.K. b.mallik6347@student.leedsbeckett.ac.uk,

More information

Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding

Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding Jung-Ah Choi and Yo-Sung Ho Gwangju Institute of Science and Technology (GIST) 261 Cheomdan-gwagiro, Buk-gu, Gwangju, 500-712, Korea

More information

A deblocking filter with two separate modes in block-based video coding

A deblocking filter with two separate modes in block-based video coding A deblocing filter with two separate modes in bloc-based video coding Sung Deu Kim Jaeyoun Yi and Jong Beom Ra Dept. of Electrical Engineering Korea Advanced Institute of Science and Technology 7- Kusongdong

More information

Local Image Registration: An Adaptive Filtering Framework

Local Image Registration: An Adaptive Filtering Framework Local Image Registration: An Adaptive Filtering Framework Gulcin Caner a,a.murattekalp a,b, Gaurav Sharma a and Wendi Heinzelman a a Electrical and Computer Engineering Dept.,University of Rochester, Rochester,

More information

Image Gap Interpolation for Color Images Using Discrete Cosine Transform

Image Gap Interpolation for Color Images Using Discrete Cosine Transform Image Gap Interpolation for Color Images Using Discrete Cosine Transform Viji M M, Prof. Ujwal Harode Electronics Dept., Pillai College of Engineering, Navi Mumbai, India Email address: vijisubhash10[at]gmail.com

More information

Reducing/eliminating visual artifacts in HEVC by the deblocking filter.

Reducing/eliminating visual artifacts in HEVC by the deblocking filter. 1 Reducing/eliminating visual artifacts in HEVC by the deblocking filter. EE5359 Multimedia Processing Project Proposal Spring 2014 The University of Texas at Arlington Department of Electrical Engineering

More information

Prediction Mode Based Reference Line Synthesis for Intra Prediction of Video Coding

Prediction Mode Based Reference Line Synthesis for Intra Prediction of Video Coding Prediction Mode Based Reference Line Synthesis for Intra Prediction of Video Coding Qiang Yao Fujimino, Saitama, Japan Email: qi-yao@kddi-research.jp Kei Kawamura Fujimino, Saitama, Japan Email: kei@kddi-research.jp

More information

BI-DIRECTIONAL AFFINE MOTION COMPENSATION USING A CONTENT-BASED, NON-CONNECTED, TRIANGULAR MESH

BI-DIRECTIONAL AFFINE MOTION COMPENSATION USING A CONTENT-BASED, NON-CONNECTED, TRIANGULAR MESH BI-DIRECTIONAL AFFINE MOTION COMPENSATION USING A CONTENT-BASED, NON-CONNECTED, TRIANGULAR MESH Marc Servais, Theo Vlachos and Thomas Davies University of Surrey, UK; and BBC Research and Development,

More information