3D Video Generation and Service based on a TOF Depth Sensor in MPEG-4 Multimedia Framework


Sung-Yeol Kim, Ji-Ho Cho, Andreas Koschan, Member, IEEE, and Mongi A. Abidi, Member, IEEE

Abstract — In this paper, we present a new method to generate and serve 3D video represented by video-plus-depth using a time-of-flight (TOF) depth sensor. In practice, depth images captured by a depth sensor suffer from critical problems, such as optical noise, boundaries that do not match the corresponding color images, and depth flickering artifacts in the temporal domain. In this work, we enhance the noisy depth images through a series of processing steps: joint bilateral filtering with inner-edge selection, outer-boundary refinement by a robust image matting method, and temporal consistency based on motion estimation. The resulting high-quality video-plus-depth is then combined with computer graphics models in the MPEG-4 multimedia framework. Finally, the immersive video content is streamed to consumers for 3D viewing. Experimental results show that our method significantly reduces the inherent problems of depth images and successfully serves 3D video in the MPEG-4 multimedia framework.

Index Terms — 3D video generation and service, time-of-flight depth sensor, MPEG-4 multimedia framework.

I. INTRODUCTION

As immersive multimedia services are expected to become available in the near future through high-speed optical networks, three-dimensional (3D) video is recognized as an essential part of next-generation multimedia applications. As a 3D video representation, it is widely accepted that a sequence of synchronized color and depth images, often called video-plus-depth [1], provides the basis for future 3D video applications. For a practical video-plus-depth service to the potential consumers of 3D video applications, such as 3D TV [2], we need to consider two important questions: 1) how can we obtain high-quality video-plus-depth? 2) how can we stream 3D video content that includes video-plus-depth together with interactive multimedia data?

With respect to the first question, a variety of depth estimation algorithms have been presented in the fields of computer vision and image processing [3]. However, accurate measurement of depth information from a natural scene remains problematic because depth estimation is difficult on textureless regions and at depth discontinuities.

This work was supported by DOE-URPR (Grant-DOE-DEFG02-86NE37968) and US Air Force Grant (FA ) in the USA, and in part by the National Research Foundation of Korea (NRF D00277). S.-Y. Kim, A. Koschan, and M. A. Abidi are with the Department of Electrical Engineering and Computer Science, The University of Tennessee, Knoxville, TN, USA (e-mail: {sykim, akoschan, abidi}@utk.edu). J.-H. Cho is with the Department of Mechatronics, Gwangju Institute of Science and Technology, Gwangju, South Korea (e-mail: jhcho@gist.ac.kr).

With respect to the second question, traditional multimedia frameworks, such as the MPEG-1 and MPEG-2 systems, deal only with efficient coding and with synchronization between conventional video and audio, and do not provide 3D video functionality. Unlike previous audio-visual standards, the MPEG-4 multimedia framework [4] supports streaming of various media objects, such as computer graphics models and interactive information.
However, the framework does not provide the functionality to stream natural 3D video represented by video-plus-depth.

As sensor technologies for acquiring distance data advance, we can now capture more accurate per-pixel depth data of a real scene using active time-of-flight (TOF) depth sensors [5, 6]. These sensors directly provide color and depth information from a natural scene by integrating an infrared light source with a conventional video sensor, and they produce more accurate depth data on textureless regions and at depth discontinuities than conventional passive depth estimation methods. Figure 1(a) and Figure 1(b) show a color image and its corresponding depth image captured by a TOF depth sensor, respectively.

Fig. 1. A frame of video-plus-depth captured by a TOF depth sensor: (a) color image, (b) depth image.

However, the depth data from such a sensor cannot be used directly because of inherent problems [7]. To use the depth data properly, we need to resolve the spatial and temporal problems present in the raw depth images: 1) optical noise, 2) unmatched boundaries between a depth image and its corresponding color image, 3) lost depth data on shiny and dark surfaces, and 4) temporal depth flickering artifacts on stationary objects. Optical noise, as shown in Fig. 2(a), usually occurs inside objects in a scene as a result of differences in infrared reflectivity with color variation. Moreover, as shown in Fig. 2(b), the depth information is not well registered with its corresponding color information, for instance around the shoulder of the person in Fig. 1. The problem of unmatched boundaries arises because the TOF depth sensor behaves inaccurately at very close and very far target distances. In addition, as shown in Fig. 2(c), the TOF depth sensor does not capture depth data well on shiny and dark surfaces, such as a black hair region, because the light reflected from these surfaces is very weak or scattered. In particular, as shown in Fig. 2(d), these spatial problems generate depth flickering artifacts on stationary objects, such as the table region in Fig. 1, in the temporal domain.

Fig. 2. Inherent problems of a TOF depth sensor: (a) optical noise, (b) unmatched boundary, (c) lost depth data (black hair region), (d) temporal flickering artifacts across consecutive frames.

These inherent spatial-temporal problems limit the use of TOF depth sensors in applications involving motion detection and motion tracking. The goal of this work is to provide a solution that improves the quality of depth images captured by a TOF depth sensor for the generation of high-quality video-plus-depth, and to show that its application can be extended to reconstructing a realistic and dynamic 3D scene. The contributions of this paper are: (a) a new method to minimize optical noise in depth images using a newly designed joint bilateral filter based on selected inner edges, (b) boundary refinement using robust matting and iterative threshold selection, (c) temporal consistency based on motion estimation, and (d) the design of a framework to stream video-plus-depth data in the MPEG-4 multimedia framework, providing a practical solution for serving 3D video content.

This paper is organized as follows. In Section II, we briefly present the proposed 3D video service system. Section III explains the generation of high-quality video-plus-depth using a TOF depth sensor, and Section IV describes the MPEG-4 system for video-plus-depth streaming. After providing experimental results in Section V, we conclude in Section VI.

II. SYSTEM ARCHITECTURE

A. Related Works

Over the past years, a variety of solutions have been developed to enhance depth images captured by TOF depth sensors. To reduce optical noise in the depth image, a method using adaptive sampling and Gaussian smoothing was developed [8]. For lost region recovery, a method was proposed to regenerate the lost hair region of a human actor using face detection and quadratic Bézier curves [9]. Recently, hybrid camera systems that combine a high-resolution video camera with a TOF depth sensor were introduced to provide high-quality depth images [10, 11]. These previous works mainly concentrated on handling optical noise in depth images in the spatial domain and focused on the generation of a static 3D scene rather than a dynamic one.

With respect to video-plus-depth services based on a TOF depth sensor, the ATTEST project demonstrated the feasibility of 3D video services based on video-plus-depth [12]. The ATTEST system transmitted video-plus-depth through one channel and then synthesized 3D virtual views from it using depth-image-based rendering [13]. Subsequently, the 3DTV project developed core technologies for future 3D video services [14]. These previous 3D video service systems were based on the MPEG-2 system for streaming video-plus-depth. As a result, they limited not only the composition of 3D video content with other multimedia data, such as computer graphics models, but also user-friendly interaction, such as free viewpoint changing.

In this paper, we present a spatial-temporal enhancement method for depth images captured by a TOF depth sensor to generate a dynamic 3D scene with high-quality video-plus-depth.
In addition, we introduce a framework to stream video-plus-depth and various multimedia data at the same time in the MPEG-4 system while supporting free viewpoint changing.

B. Proposed System Architecture

Figure 3 shows the overall architecture of the proposed 3D video generation and service system. At the sender side, we capture video-plus-depth using a TOF depth sensor. Then, we apply a robust matting algorithm to the color images, with trimaps generated from the depth images, and perform an iterative threshold-selection method to compensate the depth information with exact object boundaries. Next, joint bilateral filtering with inner-edge selection is applied to the depth images to reduce optical noise. After recovering depth information in the regions where depth data were lost, based on a quadratic Bézier curve [9], we enforce temporal consistency based on motion estimation. Thereafter, the color and depth images are independently encoded by a video coder, such as an H.264/AVC coder [15]. The compressed video-plus-depth data are spatio-temporally combined with other multimedia, such as audio sources and computer graphics models, using the MPEG-4 Binary Format for Scene (BIFS). MPEG-4 BIFS is a scene descriptor that contains the spatio-temporal relationships between the multimedia objects as well as interactive information [4]. The other multimedia data and the scene description information are encoded by their respective coders. Finally, all encoded bitstreams are multiplexed into one bitstream in the MPEG-4 system.

At the client side, we decode and extract video-plus-depth, MPEG-4 BIFS, and other multimedia information from the bitstream transmitted through a channel. The MPEG-4 synchronizer then distributes the video-plus-depth and computer graphics models to a graphics renderer and the audio data to an audio player. From the transmitted video-plus-depth, we construct 3D surfaces from the depth information by applying a mesh triangulation method [16], and the constructed 3D surfaces are overlaid with the color information to represent a dynamic 3D scene. Moreover, the other multimedia data are combined with the dynamic 3D scene by referring to the transmitted scene description information, the MPEG-4 BIFS. As a result, consumers can experience the immersive 3D video content through a 3D display device and a speaker system.
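To make the processing order at the sender side concrete, the following minimal Python sketch strings the four enhancement stages together for one frame. Every helper name here is a hypothetical placeholder rather than code from the paper; only the ordering of the stages follows Sections III.A–III.D, and possible bodies for several of the helpers are sketched in the sections below.

```python
def enhance_depth_frame(color, depth, prev_color, prev_stationary):
    """One sender-side enhancement pass over a video-plus-depth frame.

    All helpers are hypothetical placeholders for the stages of
    Sections III.A-III.D; only their ordering follows the paper.
    """
    # III.A: trimap from depth, alpha matting on color, boundary compensation.
    trimap = make_trimap(depth)
    alpha = closed_form_matting(color, trimap)
    depth = compensate_boundary(depth, alpha)

    # III.B: joint bilateral filter guided by a color image that keeps
    # only valid inner edges.
    guide = merge_valid_inner_edges(color, depth)
    depth = joint_bilateral_filter(depth, guide)

    # III.C: quadratic Bezier recovery of lost regions (e.g., dark hair).
    depth = recover_lost_regions(depth)

    # III.D: keep the depth of stationary blocks fixed across frames.
    depth, stationary = enforce_temporal_consistency(
        depth, color, prev_color, prev_stationary)
    return depth, stationary
```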

Fig. 3. Proposed 3D video generation and service system (block diagram: data acquisition with a TOF depth sensor; video-plus-depth generation with outer boundary matching, inner edge selection, joint bilateral filtering, loss region recovery, and temporal consistency; 3D video transmission through the MPEG-4 framework; and 3D video reception and display).

III. VIDEO-PLUS-DEPTH GENERATION

A. Outer Boundary Matching

When we render a scene using color and depth images, the color image is used as the texture of the scene. Since the TOF depth sensor does not capture the exact boundaries of objects, visual artifacts appear in the 3D video display. In this paper, we find the outer boundaries of objects in a scene by applying a robust matting algorithm to the color image. The depth image is used to generate a trimap automatically, which forms the input of image matting together with the color image. For automatic trimap generation, we first convert a depth image D into a binary image M_D by global thresholding. A foreground area T_F is obtained by eroding M_D, and a background area T_B by inverting the dilation of M_D; the unknown region of the trimap is the inversion of the union of T_F and T_B. Figure 4(b) shows a trimap generated from the depth image in Fig. 4(a).

Exact outer boundaries are estimated by applying an alpha matting algorithm to the color image. We employ closed-form matting [17], which finds a globally optimal alpha matte using a quadratic cost function under a local smoothness assumption on foreground and background colors. Figure 4(c) shows an alpha map generated by closed-form matting. The alpha map is used to compensate the depth information in a depth image with the extracted outer boundary by Eq. (1):

$$D_i(x, y) = \frac{A_i(x, y)}{255}\, D_i(x - n,\; y - m) \qquad (1)$$

where D_i(x, y) is the intensity at pixel position (x, y) of the i-th depth image D_i, and A_i(x, y) is the alpha value at the same position of the i-th alpha map A_i. The term D_i(x-n, y-m) is the valid intensity nearest to D_i(x, y), found by a spiral search.

Finally, the outer boundary is refined by local iterative threshold selection. To this end, the depth image is partitioned into blocks of a fixed size, and the blocks containing the outer boundary are selected. For each selected block, we find a threshold that separates the block into foreground and background regions using iterative average equivalence on the normalized histogram. Figure 4(d) shows a depth image compensated with the exact outer boundary.

Fig. 4. Outer boundary matching using robust alpha matting: (a) depth image, (b) trimap, (c) alpha map, (d) compensated depth image.
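As an illustration, the trimap generation and the compensation of Eq. (1) might be sketched as follows in Python with OpenCV. The threshold, kernel size, and search radius are illustrative assumptions, and the spiral search is approximated by scanning square rings of growing radius; the depth and alpha images are assumed to be 8-bit.

```python
import cv2
import numpy as np

def make_trimap(depth, thresh=30, ksize=9):
    """Automatic trimap from an 8-bit depth image: foreground T_F is the
    eroded binary mask, background T_B the inverted dilated mask, and the
    remaining band is marked unknown (128). Values are illustrative."""
    _, m_d = cv2.threshold(depth, thresh, 255, cv2.THRESH_BINARY)
    kernel = np.ones((ksize, ksize), np.uint8)
    t_f = cv2.erode(m_d, kernel)                      # foreground T_F
    t_b = cv2.bitwise_not(cv2.dilate(m_d, kernel))    # background T_B
    trimap = np.full_like(m_d, 128)                   # unknown band
    trimap[t_f > 0] = 255
    trimap[t_b > 0] = 0
    return trimap

def compensate_boundary(depth, alpha, search_radius=5):
    """Eq. (1): D(x, y) = A(x, y)/255 * D(x-n, y-m). The spiral search is
    approximated by scanning square rings of growing radius for the
    nearest fully-opaque depth sample."""
    out = depth.astype(np.float32)                    # working copy
    h, w = depth.shape
    ys, xs = np.nonzero((alpha > 0) & (alpha < 255))  # mixed boundary band
    for y, x in zip(ys, xs):
        for r in range(1, search_radius + 1):
            ring = [(y + dy, x + dx)
                    for dy in range(-r, r + 1) for dx in range(-r, r + 1)
                    if max(abs(dy), abs(dx)) == r]
            vals = [depth[p] for p in ring
                    if 0 <= p[0] < h and 0 <= p[1] < w and alpha[p] == 255]
            if vals:
                out[y, x] = alpha[y, x] / 255.0 * float(vals[0])
                break
    return out.astype(depth.dtype)
```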

B. Inner Optical Noise Minimization

A general bilateral filter reduces noise in an image while preserving important sharp edges [18]. To reduce optical noise inside objects in depth images, joint bilateral filtering has been used in previous works [9, 19, 20]. Joint bilateral filtering assumes that regions of depth discontinuity in a depth image usually correspond to edges in the color image. Formally, for a pixel position p in a depth image D with corresponding color image C, the jointly filtered value J_p is given by Eq. (2):

$$J_p = \frac{1}{k_p} \sum_{q \in \Omega} G_s(\|p - q\|)\, G_{r1}(\|C_p - C_q\|)\, G_{r2}(\|D_p - D_q\|)\, D_q \qquad (2)$$

where G_s is the spatial weight, and G_r1 and G_r2 are the weights for the color difference and the depth difference, respectively. The weights are derived from Gaussian distributions. Ω is the spatial support of the weight G_s, k_p is a normalizing factor, and q is a pixel position in Ω.

Note that some edges in a color image do not lie on depth discontinuities in the depth image. Such edges have an undesirable effect on the depth image during joint bilateral filtering [20]: although the original depth information is smooth, these color edges make the filtered depth discontinuous. In this paper, we present joint bilateral filtering based on valid inner-edge selection. Figure 5 shows the procedure for optical noise minimization.

Fig. 5. Procedure of optical noise minimization: iterative Gaussian filtering and Canny edge detection on the color image; global binarization, erosion, and Canny edge detection on the depth image; outer edge removal, edge labeling, inner edge selection, and image merging; finally, joint bilateral filtering of the depth image guided by the merged color image.

First, we extract an edge map E_C from the color image C and another edge map E_D from its depth image D using Canny edge detection. Figure 6(b) shows the edge map extracted from the color image in Fig. 6(a), and Fig. 6(c) shows the edge map extracted from the depth image. Since TOF depth sensors usually do not capture edges in the background, we remove background edges from E_C. In addition, we remove edges in the outer-boundary region from E_C, because the outer boundary was already matched in Section III.A. For outer edge removal, we convert the depth image D into a binary mask image M_D by global thresholding and erode M_D to define a foreground area T_F. The edge map E_O after outer edge removal is then the intersection of E_C and T_F. Figure 6(d) shows an edge map after outer edge removal.

Thereafter, we select the valid inner edges in E_O using a traditional labeling method based on equivalence-table updating [21]. After assigning a label to each edge in E_O, we search for the labels whose positions coincide with pixels on edges in the edge map E_D. We then gather the edges carrying these labels from E_O, as shown in Fig. 6(e). Finally, we create a new color image by merging the selected inner edges with the color image heavily smoothed by iterative Gaussian filtering, as shown in Fig. 6(f).

Fig. 6. Valid inner edge selection from color and depth images: (a) color image, (b) edges from the color image, (c) edges from the depth image, (d) outer edge removal, (e) selected inner edges, (f) modified color image.
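A possible realization of the edge-selection procedure just described is sketched below, with illustrative Canny thresholds and kernel sizes. Note that OpenCV's connected-component labeling stands in here for the equivalence-table labeling of [21].

```python
import cv2
import numpy as np

def merge_valid_inner_edges(color, depth, blur_iters=5):
    """Build the modified guide image of Fig. 5: keep only color edges that
    coincide with a depth edge, then merge them into a heavily smoothed
    color image. Thresholds and kernel sizes are illustrative."""
    gray = cv2.cvtColor(color, cv2.COLOR_BGR2GRAY)
    e_c = cv2.Canny(gray, 50, 150)            # edge map E_C from color
    e_d = cv2.Canny(depth, 30, 90)            # edge map E_D from depth

    # Outer edge removal: restrict E_C to the eroded foreground T_F.
    _, m_d = cv2.threshold(depth, 30, 255, cv2.THRESH_BINARY)
    t_f = cv2.erode(m_d, np.ones((9, 9), np.uint8))
    e_o = cv2.bitwise_and(e_c, t_f)

    # Label color edges; keep labels touched by at least one depth edge.
    _, labels = cv2.connectedComponents((e_o > 0).astype(np.uint8))
    hit = np.unique(labels[(e_d > 0) & (labels > 0)])
    valid = np.isin(labels, hit) & (labels > 0)

    # Merge the selected edges into an iteratively smoothed color image.
    smooth = gray.copy()
    for _ in range(blur_iters):
        smooth = cv2.GaussianBlur(smooth, (5, 5), 0)
    smooth[valid] = 0                          # burn edges into the guide
    return smooth
```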
The new color image M is then used, instead of the original color image, as the guide for joint bilateral filtering to minimize optical noise in the depth image. Formally, for a pixel position p of a depth image D with the modified color image M, the jointly filtered value J_p is given by Eq. (3):

$$J_p = \frac{1}{k_p} \sum_{q \in \Omega} G_s(\|p - q\|)\, G_{r1}(\|M_p - M_q\|)\, G_{r2}(\|D_p - D_q\|)\, D_q \qquad (3)$$

where G_r1 is now the weight of the intensity difference in the modified color image M.
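Eq. (3) translates almost directly into code. The sketch below uses, by default, the 3×3 window and the standard deviations (3, 0.1, 0.1) reported later in Section V; it assumes 8-bit images normalized internally to [0, 1] and, for brevity, wraps at the image borders.

```python
import numpy as np

def joint_bilateral_filter(depth, guide, radius=1,
                           sigma_s=3.0, sigma_r1=0.1, sigma_r2=0.1):
    """Eq. (3): J_p = (1/k_p) sum_q G_s(|p-q|) G_r1(|M_p-M_q|)
    G_r2(|D_p-D_q|) D_q with Gaussian weights. Borders wrap via np.roll,
    a simplification for the sake of a short sketch."""
    d = depth.astype(np.float32) / 255.0   # depth D, normalized
    m = guide.astype(np.float32) / 255.0   # modified color image M
    out = np.zeros_like(d)
    norm = np.zeros_like(d)                # normalizing factor k_p
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            q_d = np.roll(np.roll(d, dy, axis=0), dx, axis=1)
            q_m = np.roll(np.roll(m, dy, axis=0), dx, axis=1)
            g_s = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))
            g_r1 = np.exp(-((m - q_m) ** 2) / (2 * sigma_r1 ** 2))
            g_r2 = np.exp(-((d - q_d) ** 2) / (2 * sigma_r2 ** 2))
            wgt = g_s * g_r1 * g_r2
            out += wgt * q_d
            norm += wgt
    return (255.0 * out / norm).astype(depth.dtype)
```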

C. Recovery of Lost Depth Data

In this work, we employ a previous method based on quadratic Bézier curves to recover regions of lost depth data [9]. The depth recovery algorithm consists of three steps: detection of the lost depth data regions, recovery of their boundaries, and estimation of the lost depth data. A region growing algorithm with multiple seeds is applied to detect the lost regions, and the boundary of each detected region is recovered by boundary tracing. Finally, each lost region is filled with depth information interpolated by a quadratic Bézier curve from the neighboring depth data. Figure 7(b) shows a depth image recovered by the quadratic Bézier curve method from the depth image in Fig. 7(a).

Fig. 7. Depth image after lost depth data recovery: (a) before, (b) after.

D. Temporal Consistency

Temporal consistency reduces temporal depth flickering artifacts on stationary objects in a scene. We first detect the stationary regions: the stationary regions of the t-th frame color image C_t are estimated from the (t-1)-th frame color image C_(t-1) using block matching. Block matching predicts the movement of objects in a scene by measuring the similarity between blocks in the temporal domain; we use the mean absolute difference (MAD) as the similarity measure. For an M×N block at position (k, l) in the t-th frame color image C_t, the MAD is computed between this block and the M×N block at position (k+x, l+y) in the (t-1)-th frame color image C_(t-1). The motion vector v_t(x, y) of the block, which determines whether motion exists, is given by Eq. (4):

$$v_t = \arg\min_{(x,\,y)} \mathrm{MAD}\big(B_t(k, l),\; B_{t-1}(k + x,\; l + y)\big) \qquad (4)$$

where B_t(k, l) denotes the M×N block at position (k, l) of C_t. In block matching, we assume that block regions whose motion vectors are zero in both the x- and y-directions are stationary. The motion image M_t generated from the motion vector data is given by Eq. (5):

$$M_t(x, y) = \begin{cases} 0, & \text{if } |v_t(x, y)_x| > 0 \ \text{or}\ |v_t(x, y)_y| > 0 \\ 255, & \text{otherwise} \end{cases} \qquad (5)$$

where v_t(x, y)_x and v_t(x, y)_y denote the x- and y-direction components of the motion vector, respectively. A stationary region image S_t is then extracted from the t-th frame depth image D_t by Eq. (6):

$$S_t = D_t \,\&\, M_t \qquad (6)$$

where the operator & denotes the bitwise AND operation. Finally, the enhanced depth image D'_t with temporal consistency is calculated by Eq. (7):

$$D'_t(x, y) = \begin{cases} S_{t-1}(x, y), & \text{if } S_{t-1}(x, y) > 0 \ \text{and}\ S_t(x, y) > 0 \\ D_t(x, y), & \text{otherwise} \end{cases} \qquad (7)$$
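Section III.D can be condensed into the following sketch of Eqs. (4)–(7). The block size and search range are illustrative assumptions, the color frames are assumed grayscale, and the stationary image S_(t-1) of the first frame would be initialized to zeros.

```python
import numpy as np

def enforce_temporal_consistency(depth_t, color_t, color_prev, s_prev,
                                 block=16, search=4):
    """Eqs. (4)-(7): blocks whose MAD-minimizing motion vector is (0, 0)
    are marked stationary; where both S_{t-1} and S_t are nonzero, the
    previous depth is kept to suppress flickering."""
    h, w = depth_t.shape
    m_t = np.zeros((h, w), np.uint8)           # motion image, Eq. (5)
    c_t = color_t.astype(np.float32)
    c_p = color_prev.astype(np.float32)
    for k in range(0, h - block + 1, block):
        for l in range(0, w - block + 1, block):
            blk = c_t[k:k + block, l:l + block]
            best, v = np.inf, (0, 0)
            for x in range(-search, search + 1):
                for y in range(-search, search + 1):
                    if not (0 <= k + x <= h - block
                            and 0 <= l + y <= w - block):
                        continue
                    cand = c_p[k + x:k + x + block, l + y:l + y + block]
                    mad = np.mean(np.abs(blk - cand))   # Eq. (4) criterion
                    if mad < best:
                        best, v = mad, (x, y)
            if v == (0, 0):                    # zero motion in x and y
                m_t[k:k + block, l:l + block] = 255
    s_t = np.where(m_t > 0, depth_t, 0)        # Eq. (6) as a mask
    out = depth_t.copy()
    keep = (s_prev > 0) & (s_t > 0)
    out[keep] = s_prev[keep]                   # Eq. (7)
    return out, s_t
```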
IV. MPEG-4-BASED 3D VIDEO CONTENTS

In order to deliver multimedia content, a multimedia framework is needed. In this work, we direct our attention to the MPEG-4 multimedia framework [4], which supports streaming of various media objects and provides flexible interactivity. An MPEG-4-based scene is built from individual objects that are related in space and time. Based on these relationships, the MPEG-4 framework allows us to combine a variety of media objects, such as conventional 2D video, audio sources, and computer graphics models, with a 3D scene represented by video-plus-depth.

The MPEG-4 system defines a scene description, referred to as BIFS, which specifies how the objects are spatio-temporally combined for presentation. All visible objects in the 3D scene are described within the Shape node of MPEG-4 BIFS [22]. Recently, a node representing video-plus-depth data inside the Shape node, referred to as the DepthMovie node, has been proposed [23]. We employ the DepthMovie node to combine video-plus-depth with the other multimedia data. Computer graphics models are described by predefined nodes in MPEG-4 BIFS. The BIFS data, including the scene description and the computer graphics model data, are coded by the BIFS encoder provided by the current MPEG-4 system. Finally, the compressed color video, depth video, and MPEG-4 BIFS streams are multiplexed into one MP4 file, which is designed to contain the media data of an MPEG-4 presentation. The MP4 file can be played from a local hard disk or transmitted to consumers by a streaming server through existing networks.

V. EXPERIMENTAL RESULTS

We tested the performance of our method using two test image sequences, ACTOR1 and ACTOR2, obtained from a TOF depth sensor [5, 24]. The ACTOR1 and ACTOR2 sequences consist of 200 and 100 frames, respectively, captured at the same image resolution. Figure 8 shows some frames of the two sequences.

Fig. 8. Test image sequences: (a) ACTOR1 sequence, (b) ACTOR2 sequence.

Figure 9 shows the result of noise minimization for the first frame of the ACTOR1 and ACTOR2 sequences. In Fig. 9, the rectangular regions in the first column are enlarged and shown in the second column. In this experiment, we used a 3×3 joint bilateral filter and set the standard deviations of the Gaussian kernels for the weights G_s, G_r1, and G_r2 in Eq. (3) to 3, 0.1, and 0.1, respectively. Conventional bilateral filtering [18] and previous joint bilateral filtering [9, 20] were used for comparison with our method.

As shown in Fig. 9(a), the original depth image contains serious optical noise on the objects in the scene. Looking at the circled regions in the second column, it is easy to see that the proposed joint bilateral filter with inner-edge selection reduces the optical noise efficiently while preserving important sharp features. With bilateral filtering, the crease of the cloth covering the table in the ACTOR1 scene almost disappears, as shown in Fig. 9(b). With joint bilateral filtering, the crease is largely retained, as shown in Fig. 9(c), but some distortion occurs because of unrelated edges in the color image affecting the filtering. In contrast, the proposed method maintains the crease well and minimizes visual distortion, as shown in Fig. 9(d), because only valid edges from the color image are used during joint bilateral filtering. The same behavior can be observed in the woman's hair region in the ACTOR2 scene.

Fig. 9. Result of noise minimization: (a) original depth image, (b) bilateral filtering, (c) joint bilateral filtering, (d) joint bilateral filtering with inner edge selection.

Fig. 10. Result of boundary matching: (a) color image, (b) original depth image, (c) outer boundary matching.

Figure 10 shows the result of outer boundary matching for the first frame of the ACTOR1 and ACTOR2 sequences. The rectangular regions of the color image in Fig. 10(a) are enlarged and shown overlaid onto the corresponding depth image. As shown in Fig. 10(b), the original depth image is not well registered with its color image, for example in the region of the man's calf in ACTOR1 and the region of the woman's finger in ACTOR2. In contrast, as shown in Fig. 10(c), the boundary mismatch is minimized because the proposed method, using a robust matting algorithm and iterative threshold selection, traces the exact boundary and compensates the mismatched region with neighboring depth information.

Figure 11 shows the result of temporal consistency for the 1st, 10th, 20th, 30th, and 40th frames of the ACTOR1 sequence when reconstructing a 3D scene from video-plus-depth. The table and the man's knee are stationary regions in the temporal domain. As shown in Fig. 11(a), the optical noise and temporal inconsistency of the original depth images cause serious distortions. As shown in Fig. 11(b), even after enhancing the depth images in the spatial domain with joint bilateral filtering and boundary matching, some flickering artifacts remain in the table and knee regions. In contrast, as shown in Fig. 11(c), the proposed method significantly reduces both the spatial optical noise, through joint bilateral filtering with edge selection, and the temporal distortion, through the motion-estimation-based temporal consistency.

Fig. 11. Result of temporal consistency for the 1st, 10th, 20th, 30th, and 40th frames: (a) original depth image sequence, (b) joint bilateral filtering and boundary matching, (c) joint bilateral filtering with edge selection, boundary matching, and temporal consistency.

Figures 12(a) and 12(b) show the results of 3D scene reconstruction from the 1st, 10th, 20th, 30th, and 40th frames of ACTOR1 and ACTOR2. Natural and dynamic 3D scenes were successfully generated from the enhanced video-plus-depth using a 3D mesh structure [16].

Fig. 12. Reconstruction of dynamic 3D scenes: (a) 3D scene from the ACTOR1 sequence, (b) 3D scene from the ACTOR2 sequence.

To assess the improvement in depth accuracy achieved by the proposed method, we applied Gaussian filtering [8], joint bilateral filtering (BF) [9, 20], and our joint bilateral filtering with edge selection to noisy depth images. The noisy depth images were generated by adding Gaussian noise with a standard deviation of 20 to ground-truth data from the Middlebury stereo dataset [3] (see the sketch at the end of this section). Figure 13 shows the result for the Bowling depth image: Figs. 13(a), 13(b), and 13(c) show the ground-truth depth image, its corresponding color image, and the artificially noised depth image, respectively. As shown in Fig. 13(f), our method reduced the noise on the noisy depth image more effectively than Gaussian filtering, shown in Fig. 13(d). Furthermore, our method was less affected by edges in the color image than joint bilateral filtering, shown in Fig. 13(e). The peak signal-to-noise ratio (PSNR) against the known ground truth was used as the quality measure. Table 1 reports the PSNR for the other Middlebury test depth images; our method achieved the highest PSNR among the compared methods in this experiment.

Fig. 13. Depth quality evaluation: (a) ground truth, (b) color image, (c) noisy depth image, (d) Gaussian filtering, (e) joint bilateral filtering, (f) our method.

TABLE 1
DEPTH QUALITY EVALUATION (PSNR)

Test data   Noisy depth data   Gaussian filtering   Joint BF   Our method
Bowling     22.5 dB            29.8 dB              30.5 dB    31.5 dB
Cloth       22.3 dB            32.9 dB              37.1 dB    38.6 dB
Aloe        22.3 dB            30.2 dB              29.7 dB    31.4 dB
Baby        22.4 dB            33.0 dB              33.9 dB    34.1 dB
Wood        22.8 dB            32.2 dB              32.9 dB    33.1 dB

Figure 14(a) shows the 3D video content played by an MPEG-4 player with ACTOR1. The 3D scene was rendered successfully using video-plus-depth data expressed by a DepthMovie node in the MPEG-4 system. As shown in Fig. 14(a), MPEG-4-based 3D video content can display video-plus-depth, computer graphics models, and 2D images together, unlike MPEG-2-based 3D video content [12]. Furthermore, the viewpoint can be changed freely within the 3D video content, as shown in Fig. 14(b): the 3D scene was viewed successfully while the viewing angle was changed from +15 degrees to -15 degrees. However, since depth information for the side views was not captured in this experiment, extreme viewpoint changes caused unexpected results from the video-plus-depth data, as shown in Fig. 14(a).

Fig. 14. MPEG-4-based 3D video contents: (a) 3D video contents combining video-plus-depth, a CG model, and a 2D image, (b) free viewpoint changing (-15 degrees, 0 degrees (center view), +15 degrees).

In addition, Table 2 shows the average computation time for depth image enhancement on the test sequences. On average, 3.47 s/frame and 2.79 s/frame were needed for spatial and temporal depth image enhancement of the ACTOR1 and ACTOR2 sequences, respectively. We expect the computation time to decrease once our depth enhancement algorithm is optimized and fast rendering techniques based on the graphics processing unit are employed.

TABLE 2
COMPUTATION TIME FOR DEPTH IMAGE ENHANCEMENT

Processing             ACTOR1          ACTOR2
Noise reduction        0.76 sec/frame  0.47 sec/frame
Loss region recovery   0.55 sec/frame  0.53 sec/frame
Boundary matching      1.45 sec/frame  1.12 sec/frame
Temporal consistency   0.71 sec/frame  0.67 sec/frame
Total                  3.47 sec/frame  2.79 sec/frame
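For reference, the noisy test data and the PSNR figures of Table 1 follow the protocol sketched below; the random seed is an illustrative assumption.

```python
import numpy as np

def add_gaussian_noise(depth, sigma=20.0, seed=0):
    """Synthesize a noisy test depth image as in the evaluation above:
    Gaussian noise with standard deviation 20 added to the ground truth."""
    rng = np.random.default_rng(seed)
    noisy = depth.astype(np.float32) + rng.normal(0.0, sigma, depth.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def psnr(ref, test):
    """Peak signal-to-noise ratio in dB against 8-bit ground truth."""
    mse = np.mean((ref.astype(np.float32) - test.astype(np.float32)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```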

VI. CONCLUSIONS

In this paper, we have proposed a new method to enhance depth images captured by a TOF depth sensor both spatially and temporally. As shown in the experimental results, the proposed depth enhancement method significantly reduces the inherent problems of such depth images. Furthermore, we demonstrated the feasibility of 3D video services based on video-plus-depth data in the MPEG-4 multimedia framework. We expect the proposed 3D video service system to be useful in future 3D multimedia applications.

REFERENCES
[1] C. Fehn, "A 3D-TV system based on video plus depth information," Proc. Asilomar Conference on Signals, Systems and Computers, 2003.
[2] C. Fehn, R. Barré, and S. Pastoor, "Interactive 3-D TV - concepts and key technologies," Proceedings of the IEEE, vol. 94, no. 3, 2006.
[3] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," International Journal of Computer Vision, vol. 47, no. 1-3, pp. 7-42, 2002.
[4] F. Pereira, "MPEG-4: why, what, how and when?," Signal Processing: Image Communication, vol. 15, 2000.
[5] G. J. Iddan and G. Yahav, "3D imaging in the studio and elsewhere," Proc. Videometrics and Optical Methods for 3D Shape Measurement, 2001.
[6] M. Kawakita, T. Kurita, H. Kikuchi, and S. Inoue, "HDTV Axi-vision camera," Proc. International Broadcasting Conference.
[7] S. Hu, S. S. Young, T. Hong, J. P. Reynolds, K. Krapels, B. Miller, J. Thomas, and O. Nguyen, "Super-resolution for flash ladar imagery," Applied Optics, vol. 49, no. 5, 2010.
[8] S. M. Kim, J. Cha, J. Ryu, and K. H. Lee, "Depth video enhancement for haptic interaction using a smooth surface reconstruction," IEICE Transactions on Information and Systems, vol. E89-D, 2006.
[9] J. Cho, S.-Y. Kim, Y.-S. Ho, and K. H. Lee, "Dynamic 3D human actor generation method using a time-of-flight depth camera," IEEE Transactions on Consumer Electronics, vol. 54, no. 4, 2008.
[10] J. Diebel and S. Thrun, "An application of Markov random fields to range sensing," Proc. Advances in Neural Information Processing Systems, 2005.
[11] B. Huhle, S. Fleck, and A. Schilling, "Integrating 3D time-of-flight camera data and high resolution images for 3DTV applications," Proc. 3DTV Conference, pp. 1-4, 2007.
[12] A. Redert, M. op de Beeck, C. Fehn, W. IJsselsteijn, M. Pollefeys, L. Van Gool, E. Ofek, I. Sexton, and P. Surman, "ATTEST: advanced three-dimensional television system technologies," Proc. International Symposium on 3D Data Processing, 2002.
[13] H.-Y. Shum, S. B. Kang, and S.-C. Chan, "Survey of image-based representations and compression techniques," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 11, 2003.
[14] L. Onural, "Television in 3-D: what are the prospects?," Proceedings of the IEEE, vol. 95, no. 6, 2007.
[15] T. Wiegand, M. Lightstone, D. Mukherjee, T. G. Campbell, and S. K. Mitra, "Rate-distortion optimized mode selection for very low bit rate video coding and the emerging H.263 standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, 1996.
[16] S.-Y. Kim, S.-B. Lee, and Y.-S. Ho, "Three-dimensional natural video system based on layered representation of depth maps," IEEE Transactions on Consumer Electronics, vol. 52, no. 3, 2006.
[17] A. Levin, D. Lischinski, and Y. Weiss, "A closed-form solution to natural image matting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, 2008.
[18] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," Proc. International Conference on Computer Vision, 1998.
[19] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, "Joint bilateral upsampling," ACM Transactions on Graphics, vol. 26, no. 3, pp. 1-6, 2007.
[20] O. P. Gangwal and R.-P. Berretty, "Depth map post-processing for 3D-TV," Proc. International Conference on Consumer Electronics, pp. 1-2, 2009.
[21] A. Rosenfeld and J. L. Pfaltz, "Sequential operations in digital picture processing," Journal of the ACM, vol. 13, no. 4, 1966.
[22] L. Levkovich-Maslyuk, A. Ignatenko, A. Zhirkov, A. Konushin, I. Park, M. Han, and Y. Bayakovski, "Depth image-based representation and compression for static and animated 3-D objects," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 7, 2004.
[23] J. Cha, Y.-S. Ho, Y. Kim, J. Ryu, and I. Oakley, "A framework of haptic broadcasting," IEEE MultiMedia, vol. 16, no. 3, 2009.
[24] S.-Y. Kim, E.-K. Lee, and Y.-S. Ho, "Generation of ROI enhanced depth maps using stereoscopic cameras and a depth camera," IEEE Transactions on Broadcasting, vol. 54, no. 4, 2008.

Biographies

Sung-Yeol Kim received his M.S. and Ph.D. degrees in Information and Communication Engineering from the Gwangju Institute of Science and Technology (GIST), Korea, in 2003 and 2008, respectively. He is currently working as a research associate with the Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, USA. His research interests include 3D video representation and processing, 3D mesh representation and processing, 3D TV, and realistic broadcasting.

Ji-Ho Cho received his M.S. degree in Information and Communications in 2005 and his Ph.D. degree in Mechatronics Engineering from the Gwangju Institute of Science and Technology (GIST), Korea, and worked as an academic guest at the Swiss Federal Institute of Technology (ETH) in Zürich. He currently works at the Intelligent Design and Graphics Laboratory at GIST as a postdoctoral research associate. His main interests lie in 3D video and computational photography.

Andreas Koschan (M'90) received the Diploma (M.S.) degree in computer science and the Dr.-Ing. (Ph.D.) degree in computer engineering from the Technical University Berlin, Berlin, Germany, in 1985 and 1991, respectively. He is currently a Research Associate Professor with the Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville. His research work has primarily focused on color image processing and 3-D computer vision, including stereo vision and laser range finding techniques. He is a coauthor of two textbooks on 3-D image processing. Dr. Koschan is a member of the Society for Imaging Science and Technology.

Mongi A. Abidi (S'83-M'85) received his M.S. and Ph.D. degrees in electrical engineering from the University of Tennessee, Knoxville, in 1985 and 1987, respectively. He is currently a Professor with the Department of Electrical Engineering and Computer Science, University of Tennessee, directing research activities at the Imaging, Robotics, and Intelligent Systems Laboratory. He has published more than 300 papers and edited or written four books in the areas of imaging and robotics. He received The Most Cited Paper Award in Computer Vision and Image Understanding for 2006 and 2007.


Temporal Filtering of Depth Images using Optical Flow Temporal Filtering of Depth Images using Optical Flow Razmik Avetisyan Christian Rosenke Martin Luboschik Oliver Staadt Visual Computing Lab, Institute for Computer Science University of Rostock 18059

More information

Outdoor Scene Reconstruction from Multiple Image Sequences Captured by a Hand-held Video Camera

Outdoor Scene Reconstruction from Multiple Image Sequences Captured by a Hand-held Video Camera Outdoor Scene Reconstruction from Multiple Image Sequences Captured by a Hand-held Video Camera Tomokazu Sato, Masayuki Kanbara and Naokazu Yokoya Graduate School of Information Science, Nara Institute

More information

Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging

Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging Florin C. Ghesu 1, Thomas Köhler 1,2, Sven Haase 1, Joachim Hornegger 1,2 04.09.2014 1 Pattern

More information

DEPTH PIXEL CLUSTERING FOR CONSISTENCY TESTING OF MULTIVIEW DEPTH. Pravin Kumar Rana and Markus Flierl

DEPTH PIXEL CLUSTERING FOR CONSISTENCY TESTING OF MULTIVIEW DEPTH. Pravin Kumar Rana and Markus Flierl DEPTH PIXEL CLUSTERING FOR CONSISTENCY TESTING OF MULTIVIEW DEPTH Pravin Kumar Rana and Markus Flierl ACCESS Linnaeus Center, School of Electrical Engineering KTH Royal Institute of Technology, Stockholm,

More information

Multiframe Blocking-Artifact Reduction for Transform-Coded Video

Multiframe Blocking-Artifact Reduction for Transform-Coded Video 276 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 12, NO. 4, APRIL 2002 Multiframe Blocking-Artifact Reduction for Transform-Coded Video Bahadir K. Gunturk, Yucel Altunbasak, and

More information

Efficient Stereo Image Rectification Method Using Horizontal Baseline

Efficient Stereo Image Rectification Method Using Horizontal Baseline Efficient Stereo Image Rectification Method Using Horizontal Baseline Yun-Suk Kang and Yo-Sung Ho School of Information and Communicatitions Gwangju Institute of Science and Technology (GIST) 261 Cheomdan-gwagiro,

More information

Analysis of Depth Map Resampling Filters for Depth-based 3D Video Coding

Analysis of Depth Map Resampling Filters for Depth-based 3D Video Coding MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Analysis of Depth Map Resampling Filters for Depth-based 3D Video Coding Graziosi, D.B.; Rodrigues, N.M.M.; de Faria, S.M.M.; Tian, D.; Vetro,

More information

Filter Flow: Supplemental Material

Filter Flow: Supplemental Material Filter Flow: Supplemental Material Steven M. Seitz University of Washington Simon Baker Microsoft Research We include larger images and a number of additional results obtained using Filter Flow [5]. 1

More information

Multimedia Technology CHAPTER 4. Video and Animation

Multimedia Technology CHAPTER 4. Video and Animation CHAPTER 4 Video and Animation - Both video and animation give us a sense of motion. They exploit some properties of human eye s ability of viewing pictures. - Motion video is the element of multimedia

More information

Stereo and structured light

Stereo and structured light Stereo and structured light http://graphics.cs.cmu.edu/courses/15-463 15-463, 15-663, 15-862 Computational Photography Fall 2018, Lecture 20 Course announcements Homework 5 is still ongoing. - Make sure

More information

Rate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations

Rate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations Rate-distortion Optimized Streaming of Compressed Light Fields with Multiple Representations Prashant Ramanathan and Bernd Girod Department of Electrical Engineering Stanford University Stanford CA 945

More information

Photometric Stereo with Auto-Radiometric Calibration

Photometric Stereo with Auto-Radiometric Calibration Photometric Stereo with Auto-Radiometric Calibration Wiennat Mongkulmann Takahiro Okabe Yoichi Sato Institute of Industrial Science, The University of Tokyo {wiennat,takahiro,ysato} @iis.u-tokyo.ac.jp

More information

STRUCTURAL EDGE LEARNING FOR 3-D RECONSTRUCTION FROM A SINGLE STILL IMAGE. Nan Hu. Stanford University Electrical Engineering

STRUCTURAL EDGE LEARNING FOR 3-D RECONSTRUCTION FROM A SINGLE STILL IMAGE. Nan Hu. Stanford University Electrical Engineering STRUCTURAL EDGE LEARNING FOR 3-D RECONSTRUCTION FROM A SINGLE STILL IMAGE Nan Hu Stanford University Electrical Engineering nanhu@stanford.edu ABSTRACT Learning 3-D scene structure from a single still

More information

Image Segmentation Techniques for Object-Based Coding

Image Segmentation Techniques for Object-Based Coding Image Techniques for Object-Based Coding Junaid Ahmed, Joseph Bosworth, and Scott T. Acton The Oklahoma Imaging Laboratory School of Electrical and Computer Engineering Oklahoma State University {ajunaid,bosworj,sacton}@okstate.edu

More information

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.11, November 2013 1 Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial

More information

Asymmetric 2 1 pass stereo matching algorithm for real images

Asymmetric 2 1 pass stereo matching algorithm for real images 455, 057004 May 2006 Asymmetric 21 pass stereo matching algorithm for real images Chi Chu National Chiao Tung University Department of Computer Science Hsinchu, Taiwan 300 Chin-Chen Chang National United

More information

Auto-focusing Technique in a Projector-Camera System

Auto-focusing Technique in a Projector-Camera System 2008 10th Intl. Conf. on Control, Automation, Robotics and Vision Hanoi, Vietnam, 17 20 December 2008 Auto-focusing Technique in a Projector-Camera System Lam Bui Quang, Daesik Kim and Sukhan Lee School

More information

Fingerprint Image Enhancement Algorithm and Performance Evaluation

Fingerprint Image Enhancement Algorithm and Performance Evaluation Fingerprint Image Enhancement Algorithm and Performance Evaluation Naja M I, Rajesh R M Tech Student, College of Engineering, Perumon, Perinad, Kerala, India Project Manager, NEST GROUP, Techno Park, TVM,

More information

Channel-Adaptive Error Protection for Scalable Audio Streaming over Wireless Internet

Channel-Adaptive Error Protection for Scalable Audio Streaming over Wireless Internet Channel-Adaptive Error Protection for Scalable Audio Streaming over Wireless Internet GuiJin Wang Qian Zhang Wenwu Zhu Jianping Zhou Department of Electronic Engineering, Tsinghua University, Beijing,

More information

Reconstruction PSNR [db]

Reconstruction PSNR [db] Proc. Vision, Modeling, and Visualization VMV-2000 Saarbrücken, Germany, pp. 199-203, November 2000 Progressive Compression and Rendering of Light Fields Marcus Magnor, Andreas Endmann Telecommunications

More information

Computer Vision 2. SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung. Computer Vision 2 Dr. Benjamin Guthier

Computer Vision 2. SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung. Computer Vision 2 Dr. Benjamin Guthier Computer Vision 2 SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung Computer Vision 2 Dr. Benjamin Guthier 1. IMAGE PROCESSING Computer Vision 2 Dr. Benjamin Guthier Content of this Chapter Non-linear

More information

High Performance GPU-Based Preprocessing for Time-of-Flight Imaging in Medical Applications

High Performance GPU-Based Preprocessing for Time-of-Flight Imaging in Medical Applications High Performance GPU-Based Preprocessing for Time-of-Flight Imaging in Medical Applications Jakob Wasza 1, Sebastian Bauer 1, Joachim Hornegger 1,2 1 Pattern Recognition Lab, Friedrich-Alexander University

More information

Unit-level Optimization for SVC Extractor

Unit-level Optimization for SVC Extractor Unit-level Optimization for SVC Extractor Chang-Ming Lee, Chia-Ying Lee, Bo-Yao Huang, and Kang-Chih Chang Department of Communications Engineering National Chung Cheng University Chiayi, Taiwan changminglee@ee.ccu.edu.tw,

More information

Implementation and analysis of Directional DCT in H.264

Implementation and analysis of Directional DCT in H.264 Implementation and analysis of Directional DCT in H.264 EE 5359 Multimedia Processing Guidance: Dr K R Rao Priyadarshini Anjanappa UTA ID: 1000730236 priyadarshini.anjanappa@mavs.uta.edu Introduction A

More information

Local Readjustment for High-Resolution 3D Reconstruction: Supplementary Material

Local Readjustment for High-Resolution 3D Reconstruction: Supplementary Material Local Readjustment for High-Resolution 3D Reconstruction: Supplementary Material Siyu Zhu 1, Tian Fang 2, Jianxiong Xiao 3, and Long Quan 4 1,2,4 The Hong Kong University of Science and Technology 3 Princeton

More information

Complex Sensors: Cameras, Visual Sensing. The Robotics Primer (Ch. 9) ECE 497: Introduction to Mobile Robotics -Visual Sensors

Complex Sensors: Cameras, Visual Sensing. The Robotics Primer (Ch. 9) ECE 497: Introduction to Mobile Robotics -Visual Sensors Complex Sensors: Cameras, Visual Sensing The Robotics Primer (Ch. 9) Bring your laptop and robot everyday DO NOT unplug the network cables from the desktop computers or the walls Tuesday s Quiz is on Visual

More information

Topics to be Covered in the Rest of the Semester. CSci 4968 and 6270 Computational Vision Lecture 15 Overview of Remainder of the Semester

Topics to be Covered in the Rest of the Semester. CSci 4968 and 6270 Computational Vision Lecture 15 Overview of Remainder of the Semester Topics to be Covered in the Rest of the Semester CSci 4968 and 6270 Computational Vision Lecture 15 Overview of Remainder of the Semester Charles Stewart Department of Computer Science Rensselaer Polytechnic

More information

A Video Watermarking Algorithm Based on the Human Visual System Properties

A Video Watermarking Algorithm Based on the Human Visual System Properties A Video Watermarking Algorithm Based on the Human Visual System Properties Ji-Young Moon 1 and Yo-Sung Ho 2 1 Samsung Electronics Co., LTD 416, Maetan3-dong, Paldal-gu, Suwon-si, Gyenggi-do, Korea jiyoung.moon@samsung.com

More information

Impact of Intensity Edge Map on Segmentation of Noisy Range Images

Impact of Intensity Edge Map on Segmentation of Noisy Range Images Impact of Intensity Edge Map on Segmentation of Noisy Range Images Yan Zhang 1, Yiyong Sun 1, Hamed Sari-Sarraf, Mongi A. Abidi 1 1 IRIS Lab, Dept. of ECE, University of Tennessee, Knoxville, TN 37996-100,

More information

Video Communication Ecosystems. Research Challenges for Immersive. over Future Internet. Converged Networks & Services (CONES) Research Group

Video Communication Ecosystems. Research Challenges for Immersive. over Future Internet. Converged Networks & Services (CONES) Research Group Research Challenges for Immersive Video Communication Ecosystems over Future Internet Tasos Dagiuklas, Ph.D., SMIEEE Assistant Professor Converged Networks & Services (CONES) Research Group Hellenic Open

More information

Chapter 11.3 MPEG-2. MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications:

Chapter 11.3 MPEG-2. MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications: Chapter 11.3 MPEG-2 MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications: Simple, Main, SNR scalable, Spatially scalable, High, 4:2:2,

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational

More information

Storage Efficient NL-Means Burst Denoising for Programmable Cameras

Storage Efficient NL-Means Burst Denoising for Programmable Cameras Storage Efficient NL-Means Burst Denoising for Programmable Cameras Brendan Duncan Stanford University brendand@stanford.edu Miroslav Kukla Stanford University mkukla@stanford.edu Abstract An effective

More information

Express Letters. A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation. Jianhua Lu and Ming L. Liou

Express Letters. A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation. Jianhua Lu and Ming L. Liou IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 7, NO. 2, APRIL 1997 429 Express Letters A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation Jianhua Lu and

More information

STUDY AND IMPLEMENTATION OF VIDEO COMPRESSION STANDARDS (H.264/AVC, DIRAC)

STUDY AND IMPLEMENTATION OF VIDEO COMPRESSION STANDARDS (H.264/AVC, DIRAC) STUDY AND IMPLEMENTATION OF VIDEO COMPRESSION STANDARDS (H.264/AVC, DIRAC) EE 5359-Multimedia Processing Spring 2012 Dr. K.R Rao By: Sumedha Phatak(1000731131) OBJECTIVE A study, implementation and comparison

More information

A reversible data hiding based on adaptive prediction technique and histogram shifting

A reversible data hiding based on adaptive prediction technique and histogram shifting A reversible data hiding based on adaptive prediction technique and histogram shifting Rui Liu, Rongrong Ni, Yao Zhao Institute of Information Science Beijing Jiaotong University E-mail: rrni@bjtu.edu.cn

More information

Perceptual Grouping from Motion Cues Using Tensor Voting

Perceptual Grouping from Motion Cues Using Tensor Voting Perceptual Grouping from Motion Cues Using Tensor Voting 1. Research Team Project Leader: Graduate Students: Prof. Gérard Medioni, Computer Science Mircea Nicolescu, Changki Min 2. Statement of Project

More information