3D Video Processing Algorithms Part I. Sergey Smirnov Atanas Gotchev Sumeet Sen Gerhard Tech Heribert Brust



3D Video Processing Algorithms Part I

Sergey Smirnov, Atanas Gotchev, Sumeet Sen, Gerhard Tech, Heribert Brust

Abstract: This report describes algorithms developed to enhance the quality of 3D video. On the pre-processing side, we address the following scenarios: stereo video of higher resolution to be downscaled to meet the resolution of a mobile 3D display; stereo video captured in noisy conditions (e.g. user-created content) to be denoised; and depth maps in the view-plus-depth format to be further refined. On the post-processing side, we address the problem of depth maps impaired by blocky artifacts resulting from block-transform-based encoders, such as H.264. For all these cases, we investigate advanced algorithms and present experimental results illustrating their performance.

Keywords: mobile 3D video resolution, mixed resolution coding, down-sampling, up-sampling, denoising, 3D grouping and transform-domain collaborative filtering, local polynomial approximation, bilateral filtering, hypothesis filtering, time consistency, depth map filtering.

Executive Summary

We present the first part of pre- and post-processing methods for 3D video represented in different formats. In this report we concentrate on sampling rate conversion for stereo video, stereo-video denoising, and refinement of depth maps in the view-plus-depth representation.

Sampling rate conversion is required when higher-definition video is to be downscaled to mobile resolution. It also appears in mixed resolution stereo representation schemes, where one of the channels is deliberately downscaled for the sake of more effective compression and then upscaled back for visualization. For this task, standard up- and down-sampling methods, as well as an alternative simple FIR down-sampling filter with variable cutoff frequency, have been presented and evaluated. Coding experiments demonstrate that the simple FIR filter with a cutoff frequency of approximately 0.6 outperforms the standard methods: PSNR gains of up to 1 dB at a constant bit rate, or bit rate savings of up to 30% at a constant PSNR, can be achieved.

Denoising of stereo video may be needed when the content to be delivered to the mobile device has been created under low-light conditions. Noisy channels are problematic not only for creating pleasant stereo perception but also for compression, depth estimation and view synthesis. One of the most competitive video denoising methods, abbreviated VBM3D (video block-matching in 3D), has been evaluated for its applicability and performance on stereo video. Experiments demonstrate that the denoised left and right video channels are of very high quality, with all 3D visual cues well preserved and in fact even enhanced. From an implementation point of view, the results show equal performance whether the algorithm is applied to the two channels independently or jointly. A marginal improvement can be expected only for content with a high amount of motion.
Deblocking of depth maps is perhaps one of the most important pre- and post-processing tasks for the view-plus-depth representation format, since practitioners tend to employ standard, i.e. block-transform-based, compression methods. A set of five filtering approaches has been tested, varying from simple Gaussian smoothing, through standard H.264 deblocking, to more sophisticated methods utilizing structural and colour constraints from the accompanying colour video channel. The methods have been optimized with respect to the quantization parameter of the H.264 compression used, and the experiments have ranked the methods by their performance. For the best-performing method, we suggest practical modifications leading to a faster and more memory-efficient implementation. We have also extended the same method to video and to more general types of depth impairment (e.g. resulting from fast depth estimation or noise). Our approach yields highly time-consistent depth sequences that adequately restore the depth properties of the 3D scenes.

Table of Contents

1 Introduction
2 Evaluation of down-sampling methods for Mixed Resolution Coding
2.1 Sampling Methods
2.1.1 Standard anti-aliasing filters
2.1.2 Standard interpolation filters
2.1.3 FIR anti-aliasing filter with variable cutoff frequency (VCF)
2.2 Coding Experiments
2.2.1 Setup
2.2.2 Results
3 Filtering of color stereo video sequences
3.1 Introduction
3.2 Denoising of stereo video by VBM3D
3.3 Experiments
4 Restoration of block transform compressed depth maps
4.1 Introduction
4.2 Problem Formulation
4.3 Depth maps filtering approaches
4.3.1 Gaussian Filtering
4.3.2 Adaptive H.264 Loop-Filtering
4.3.3 Local Polynomial Approximation approach
4.3.4 Bilateral Filter
4.3.5 Hypothesis filtering approach
4.4 Quality measures
4.5 Experimental results
5 Temporally-consistent filtering of depth map sequences
5.1 Introduction
5.2 Problem formulation
5.3 Extending the filtering approach to video
5.4 Experiments
5.5 Results
6 Conclusions

1 Introduction

This deliverable consists of four parts. The first part deals with down-sampling and up-sampling of stereo video in the mixed resolution stereo representation. The second part deals with color channel filtering, particularly with denoising, in order to increase the quality of subsequent depth estimation and view synthesis. The third part describes methods for deblocking of depth maps impaired by compression artifacts. In the fourth part we extend the most effective filtering approach from the previous part to depth map sequences and to more general types of depth map distortion. We especially target better time-consistency, to avoid flickering and some other 3D artifacts in the synthesized views [37]. The first part is authored by Gerhard Tech and Heribert Brust from Fraunhofer HHI, the second part by Sumeet Sen and Atanas Gotchev, and the third and fourth parts by Sergey Smirnov and Atanas Gotchev from TTY.

2 Evaluation of down-sampling methods for Mixed Resolution Coding

The mixed resolution approach is based on the transmission of one full and one down-sampled view. In a pre-processing step one view of a stereoscopic sequence is decimated. The decimated and the full view are coded and transmitted. At the receiver side the decimated view is up-sampled again ([1], [2]). Although decimation and interpolation are theoretically solved problems, in practice a great variety of up- and down-sampling methods exists, differing in the design of the anti-aliasing and interpolation filters. In this scope, additional design factors that affect the performance of up- and down-sampling have to be regarded: the distortions introduced by coding, and the low resolution of content suitable for display on mobile devices. To achieve the best overall quality using the mixed resolution approach, two standard methods previously used by the VCEG/MPEG Joint Video Team (JVT) are analyzed and evaluated in this section. Moreover, an approach using a down-sampling filter with variable cutoff frequency is optimized and evaluated.

2.1 Sampling Methods

The standard sampling methods discussed in sections 2.1.1 and 2.1.2 are implemented in the resample tool downconvert provided with the Reference Software for Scalable Video Coding, JSVM [3]. An implementation of the filter with variable cutoff frequency presented in section 2.1.3 is part of the Mathworks Matlab software [4]. All filters are applied separately in the vertical and horizontal directions.

2.1.1 Standard anti-aliasing filters

Sine-windowed sinc (SWS). For down-sampling the filter given in [5] is used. Filter coefficients are given by a sine-windowed sinc function of the form

    f(n) = sinc(n/2) · sin(π(n + N/2)/N)  for |n| < N/2,  f(n) = 0 otherwise,   (1)

where N determines the window length. For decimation by a factor of 2 this leads to a 14-tap filter. In [5] this filter is collapsed to a 12-tap filter, whereas the software implementation clips it to an 8-tap filter. The magnitude and impulse responses of this filter are shown in Figure 2.1.
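As a rough illustration of the windowed-sinc principle (a sketch only, not the actual JSVM coefficients; the tap count and exact window shape are our assumptions):

```python
import numpy as np

def sine_windowed_sinc(num_taps=15):
    """Sketch of a sine-windowed sinc anti-aliasing filter for decimation
    by 2 (cutoff at half Nyquist). Not the exact JSVM coefficients."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    ideal = np.sinc(n / 2.0)                                   # ideal half-band low-pass
    window = np.sin(np.pi * (np.arange(num_taps) + 0.5) / num_taps)  # sine window
    h = ideal * window
    return h / h.sum()                                         # unity gain at DC

def decimate_by_2(x, h):
    """Anti-alias filter, then drop every second sample."""
    return np.convolve(x, h, mode='same')[::2]
```

Down-sampling a constant 64-sample signal then yields 32 samples that remain constant away from the borders, confirming the unity DC gain.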

Figure 2.1: Impulse and magnitude response of the sine-windowed sinc down-sampling filter

Dyadic down-sampling filter (DDS). For down-sampling the dyadic filter presented in [6] is used. For the mixed resolution approach decimation by a factor of two is sufficient; therefore the filter has to be applied only once. The impulse and magnitude responses of the filter are shown in Figure 2.2.

Figure 2.2: Impulse and magnitude response of the dyadic down-sampling filter

2.1.2 Standard interpolation filters

SVC normative up-sampling (SNU). Interpolation is based on a set of 4-tap filters. These integer-based 4-tap filters are originally derived from the Lanczos-3 filter. For a detailed description of the complex interpolation process please refer to [7].

Dyadic up-sampling filter (DUS). After doubling of the sampling rate, the AVC 6-tap half-pel filter presented in [6] is applied for interpolation. The impulse and magnitude responses are shown in Figure 2.3.

Figure 2.3: Impulse and magnitude response of the dyadic interpolation filter

2.1.3 FIR anti-aliasing filter with variable cutoff frequency (VCF)

In addition to the down-sampling filters provided with the JSVM reference software, a Hamming-windowed FIR filter with varying cutoff frequency has been evaluated. For a detailed description of the filter design please refer to [8]. Figure 2.4 and Figure 2.5 show the impulse and magnitude responses for cutoff frequencies of 0.4 and 0.6. The order of the filter has been set to 10.

Figure 2.4: Impulse and magnitude response of the VCF filter with normalized cutoff frequency 0.4

Figure 2.5: Impulse and magnitude response of the VCF filter with normalized cutoff frequency 0.6
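The effect of varying the cutoff can be illustrated with a sketch of such a filter (mirroring a MATLAB fir1-style windowed design, which is our assumption about the referenced implementation):

```python
import numpy as np

def vcf_filter(order=10, cutoff=0.6):
    """Sketch of the variable-cutoff (VCF) anti-aliasing filter: an
    order-10 Hamming-windowed FIR low-pass, cutoff normalized to the
    Nyquist frequency."""
    n = np.arange(order + 1) - order / 2.0
    h = cutoff * np.sinc(cutoff * n) * np.hamming(order + 1)
    return h / h.sum()                       # normalize to unity DC gain

def magnitude_response(h, nfft=512):
    """Magnitude response on nfft/2 + 1 bins from DC to Nyquist."""
    return np.abs(np.fft.rfft(h, nfft))

# A higher cutoff retains more high-frequency detail before decimation:
H04 = magnitude_response(vcf_filter(cutoff=0.4))
H06 = magnitude_response(vcf_filter(cutoff=0.6))
```

Comparing the two responses at mid-band (normalized frequency 0.5) shows the 0.6-cutoff filter passing noticeably more energy than the 0.4-cutoff one, which is exactly the detail-vs-aliasing trade-off discussed in the results below.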

2.2 Coding Experiments

2.2.1 Setup

For the evaluation of the down-sampling filters, one view of each sequence has been down-sampled, encoded and up-sampled again. Codec parameters are given in Table 2.1.

Table 2.1: Codec settings

  Profile                  Baseline
  GOP Size                 1 (IPPP)
  Symbol Mode              CAVLC
  8x8 Transform            Disabled
  Search Range             48
  Intra Period             16
  Quantization Parameter   24, 28, 32, 36, 40

The filter combinations shown in Table 2.2 have been examined. The first two combinations are the standard filters provided with the JSVM software. Note that the SWS and SNU filters introduce and remove a half-pel shift, hence a combination with the other filters is not possible. The last nine combinations utilize the VCF filter with cutoff frequencies from 0.1 to 0.9 for anti-aliasing and the DUS filter for interpolation.

Table 2.2: Combinations of evaluated up- and down-sampling methods and cutoff frequencies

  Down     DDS    SWS    VCF   VCF   VCF   VCF   VCF   VCF   VCF   VCF   VCF
  Cutoff   ~0.4   ~      0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
  Up       DUS    SNU    DUS   DUS   DUS   DUS   DUS   DUS   DUS   DUS   DUS

The six sequences from the coding test set of the stereo video database [9] have been used for evaluation. This leads to a total of 6 (sequences) x 11 (up/down combinations) x 5 (QPs) = 330 coded sequences.
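The size of the experiment follows directly from the grid of test conditions; it can be enumerated mechanically (the sequence names below are placeholders, since the test-set names are not listed here):

```python
from itertools import product

# Enumerate the coding runs: 6 sequences x 11 filter combinations x 5 QPs.
sequences = [f"seq{i}" for i in range(1, 7)]            # placeholder names
combinations = [("DDS", "DUS"), ("SWS", "SNU")] + [
    (f"VCF-{c / 10:.1f}", "DUS") for c in range(1, 10)  # cutoffs 0.1 ... 0.9
]
qps = [24, 28, 32, 36, 40]

runs = list(product(sequences, combinations, qps))      # 330 coded sequences
```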

2.2.2 Results

Results of the coding experiments are presented in Figure 2.6. The curves depict the PSNR vs. the bit rate of the down-sampled, coded and re-up-sampled right view. The uncoded full right view has been used as reference. The solid curves show results for the VCF down-sampling filter in combination with the dyadic up-sampling filter. For each QP the cutoff frequency was varied from 0.1 to 0.9 with a step size of 0.1; in Figure 2.6 the corresponding nine points are marked with crosses for each QP. With an increased cutoff frequency more detail is retained in the smoothed picture, hence coding leads to an increased bit rate. Therefore the leftmost rate-distortion point for each QP corresponds to a cutoff frequency of 0.1 and the rightmost point to a frequency of 0.9. The envelope of the solid curves is depicted as a yellow dashed line in Figure 2.6 and gives the rate-distortion points with optimal cutoff frequency. Regarding the PSNR measure, for most sequences and rate points the optimal cutoff frequency is around 0.6. Lower frequencies lead to over-smoothing and strongly reduced image quality. Higher frequencies result not only in a further increased bit rate but also in a reduced PSNR through the introduction of aliasing artifacts.

Results obtained using the standard methods provided with the JSVM software are presented as black and magenta dashed lines in Figure 2.6. Since the cutoff frequency is fixed, only the QP was varied here. It can be seen that the dyadic up- and down-sampling approach performs slightly better than the combination of the SWS down-sampling filter and the SVC normative up-sampling filter. A comparison with the optimized VCF filter shows that both methods are outperformed: the VCF filter leads to PSNR gains of up to 1 dB at a constant bit rate, or bit rate savings of up to 30% at a constant PSNR.

Figure 2.6: The solid lines show PSNR vs. bit rate for the down-sampled, coded and re-up-sampled view using the VCF filter; each curve represents a fixed QP, with the cutoff frequency increasing from left (0.1) to right (0.9). The dashed yellow curve is the envelope of the solid lines and shows the optimal cutoff frequencies; the dashed magenta and black curves show the results for varying QPs obtained with the JSVM tools.

3 Filtering of color stereo video sequences

3.1 Introduction

In recent years, denoising of still images and video has received high interest due to the availability of mobile imaging platforms and the recent trends in user-created content. Capture of images and video has become quite popular with the use of consumer and compact cameras, and content created by users with non-professional equipment has been spreading through content-sharing platforms. In many cases such content is created in low-illumination conditions and is quite noisy. This has driven the research interest in developing high-performance denoising methods. The state-of-the-art denoising approaches seek similarities between non-local patches within images or video frames and utilize them to obtain highly overcomplete and sparse representations, usually in a transform domain, where the noise can be effectively separated from the information signal and subsequently suppressed [11], [12], [13], [14]. Methods based on non-local means [11] and collaborative non-local transform-domain filtering [13] are considered the most powerful denoising approaches. We refer to the review paper [12] for a nice overview of the topic.

In our development, we consider a scenario where the input stereo video is impaired by noise. We evaluate the importance of having more information, as in stereo, for achieving better denoising results. Similar problems have been addressed in [15], [16], [17], where non-local means have been applied on multiple frames or along with a given depth map in a noisy multi-view setting. In our setting, we adopt the collaborative transform-domain filtering approach known as 3D Block-Matching (BM3D) [13] and its video version VBM3D [14], as they have shown superior performance for conventional 2D video.
We aim at quantifying the performance of this algorithm for stereo video and at investigating the advantage stereo video would bring to the approach while using sparse 3D transform-domain collaborative filtering.

3.2 Denoising of stereo video by VBM3D

We have applied the VBM3D algorithm as in [14]. The algorithm operates by identifying similar blocks in the spatial and temporal neighborhood of a reference block. The similarity is measured by Euclidean distance and the similar blocks are collected in a stack (3D block). This step is called grouping. The advantage of grouping is that highly similar signal fragments are put and processed together. The noise is then suppressed using collaborative filtering in the DCT domain, which takes advantage of the increased correlation between the grouped blocks. For video, the denoising is performed in two steps: predictive-search block-matching is combined with collaborative hard-thresholding in the first step and with collaborative Wiener filtering in the second step. Figure 7 shows a pictorial representation of the algorithm. The predictive-search block-matching is performed for successive video frames: assuming that the intra-frame search has identified the blocks similar to the reference one, these blocks are used to find new similar ones in positions close to their spatial positions (predictive search). Thus, the similarity search is extended along the temporal dimension with no need for explicit motion estimation. The algorithm essentially depends on the search range within the current video frame with respect to the reference block (intra-frame search) and on the search range for each similar block along the temporal axis. For stereo, it is straightforward to extend the algorithm to search for similar blocks in the other view as well. In practice this requires some knowledge about the disparity range so as to adjust the inter-view search range.
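The grouping and collaborative hard-thresholding steps can be illustrated with a toy single-frame sketch (a simplification on our part: no predictive temporal search, no Wiener second stage, and no aggregation weights, all of which the real VBM3D has):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (stand-in for the codec's fast DCT)."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n)) * np.sqrt(2.0 / n)
    C[0] /= np.sqrt(2)
    return C

def group_similar(frame, ref, bsize, max_blocks=8):
    """Grouping: collect the blocks most similar to `ref` (Euclidean distance)."""
    H, W = frame.shape
    cands = []
    for y in range(H - bsize + 1):
        for x in range(W - bsize + 1):
            blk = frame[y:y + bsize, x:x + bsize]
            cands.append((float(np.sum((blk - ref) ** 2)), y, x))
    cands.sort(key=lambda t: t[0])
    return np.stack([frame[y:y + bsize, x:x + bsize] for _, y, x in cands[:max_blocks]])

def collaborative_filter(group, threshold):
    """Hard-threshold the stacked blocks in a separable 3D DCT domain."""
    C2 = dct_matrix(group.shape[-1])            # 2D transform within each block
    C1 = dct_matrix(group.shape[0])             # 1D transform across the stack
    spec = np.stack([C2 @ b @ C2.T for b in group])
    spec = np.tensordot(C1, spec, axes=(1, 0))
    spec[np.abs(spec) < threshold] = 0.0        # suppress small (noise) coefficients
    spec = np.tensordot(C1.T, spec, axes=(1, 0))
    return np.stack([C2.T @ b @ C2 for b in spec])
```

On a flat noisy patch, grouping followed by hard thresholding at about three noise standard deviations removes most of the noise while the (large) DC coefficient of the group survives, which is the essence of the collaborative-filtering idea.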
In this study we are interested in two cases: in the first case the two noisy video channels are denoised independently using VBM3D, and in the second case they are denoised jointly using the modified approach.

3.3 Experiments

Figure 7. Video 3D block-matching (VBM3D) denoising approach

In the first experiment, we add white Gaussian noise to ground-truth stereo video sequences, denoise them either jointly or individually, and then measure the denoising performance in terms of frame-wise PSNR between the ground-truth and denoised channels. The test sequences Horse and Car, of resolution 640x360, were used in the experiments. Figure 8 illustrates the experimental setting: the left and right channels are either passed to VBM3D independently, or interleaved into a single sequence and denoised jointly, and in each case the frame-wise PSNR against the original sequence is plotted.

Figure 8. Experimental setting for denoising of stereo video

The results for the Horse sequence are given in Figure 9 and the results for the Car sequence are given in Figure 10.

Figure 9. Denoising results for 'Horse' for the left and right channels. Red: noisy vs. noise-free; light blue: jointly denoised vs. noise-free; blue: individually denoised vs. noise-free

Figure 10. Denoising results for 'Car' for the left and right channels. Red: noisy vs. noise-free; light blue: jointly denoised vs. noise-free; blue: individually denoised vs. noise-free

As can be seen in the figures, stereo adds little to the denoising performance. The jointly denoised and individually denoised channels stay close to each other, with a small preference for the individual denoising.

In the second experiment, we put the noisy and denoised sequences to tasks such as depth estimation and view synthesis. The test involved the same noise-free and noisy sequences. Using the FhG depth estimator [18], the depth was estimated for the following stereo sequences: a) noise-free (i.e. ground-truth) sequences; b) noisy sequences; c) individually denoised sequences (denoised data 1); d) jointly denoised sequences (denoised data 2). The obtained depths were used to render the right channel using the corresponding left (noise-free, noisy, or denoised) channel. The resulting right-channel video sequences were compared with the original ones in terms of PSNR. The results are presented in Figure 11 and Figure 12.

Figure 11. PSNR of ground-truth vs. synthesized right channels obtained from different depth maps (noise-free data, noisy data, denoised data 1, denoised data 2), plotted per frame index for the Horse and Car sequences

Figure 12. Zoomed version of Figure 11

The denoising plays a substantial role in improving the quality of the synthesized view. The views synthesized from denoised data are even better than those rendered using depth estimated from the noise-free data. This suggests that, besides the simulated added noise, the original videos contained a small amount of inherent noise which impeded the depth estimation but was suppressed by the denoising technique. Improving the depth estimation quality and, subsequently, the quality of the synthesized view is a nice property of the VBM3D algorithm. In terms of depth estimation and view rendering, for the Horse test sequence the approach of individually denoising the left and right channels showed better performance, while for the Car sequence the approach of joint denoising was superior. This is caused by differences in the content. The Horse data contains a small amount of motion, with a static background dominating the scene, and there are also luminance differences between the left and right channels. Correspondingly, VBM3D gains more from finding similar blocks along the temporal dimension than between views, and individual processing of the views turns out to be more successful. In contrast, Car contains more motion, i.e. more changes along the temporal axis. The similarity search finds more similar blocks between views and filters them collaboratively in a successful manner.
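All comparisons in this section use frame-wise PSNR. As a concrete reference, a minimal implementation of the measure (the peak value 255 assumes 8-bit video):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Frame-wise PSNR in dB between a reference and a test frame."""
    mse = np.mean((np.asarray(ref, dtype=np.float64)
                   - np.asarray(test, dtype=np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```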

4 Restoration of block transform compressed depth maps

4.1 Introduction

One of the 3D video formats studied within the Mobile3DTV project is informally called video plus depth, where the 2D video frames are augmented with per-pixel depth information. The 2D color video is represented in its ordinary form (e.g. in a luminance-chrominance space) while the associated depth is represented as a quantized (gray-scale) map ranging from the minimum to the maximum distance with respect to the assumed camera position. Figure 13 illustrates the concept of view-plus-depth 3D representation for the popular test sequence Ballet dancer.

Figure 13. Illustration of the 'view+depth' format concept

Such a representation has a number of benefits: it ensures backward compatibility for legacy devices and offers easy rendering of virtual views for 3DTV and free-viewpoint TV applications, while also being compression-friendly. The latter feature is based on the observation that the depth channel is more compressible than any color video channel delivering the same 3D scene geometry information. We refer to Deliverables D2.2, D2.3, and D2.5 for more details about the compressibility of depth maps.

The depth image has two notable peculiarities. First, it is an image never seen by the viewer: it is used only for rendering new views (so-called depth-image-based rendering, DIBR). Second, being a range map, it exhibits smooth regions representing objects at the same distance, delineated by sharp transitions (object boundaries). Thus, it is quite different from the color texture images that block-transform-based compression methods are designed for. This peculiarity has been addressed in compression schemes especially tailored for depth images [19], [20]. Nevertheless, block-transform-based video coding schemes have been favored in rate-allocation studies because of the existing standardized encoders, such as H.264 and MPEG [21], [22].
In these studies, two rate-allocation approaches have been adopted. In the first, the bit allocation is optimized jointly for the video and depth to minimize the rendering distortion of the desired virtual view [21]. In the second, the video quality is maximized for the sake of backward compatibility, while the depth is encoded with a small fraction (10-15%) of the total bit rate [22]. The H.264 coding scheme has also been adopted within the project, where the total bit budget between color video and depth has been carefully jointly optimized [23] (see also D2.2 and D2.5). In the above rate-allocation approaches, especially at low bit rates, depth is compressed by enforcing strong quantization of the DCT coefficients. This creates the well-known blocking artifacts which are generic for block-transform-based compression schemes. For depth images, blocking leads to distorted depth discontinuities and therefore to distorted geometrical properties and object boundaries in the rendered view. The problem is illustrated by

Figure 14. The problem can be partially addressed by simple (e.g. Gaussian) smoothing, an approach also used for mitigating occlusion effects. While simple, this approach is also weak, as it destroys true sharp boundaries and impedes faithful virtual view rendering. We study the problem of restoring compressed depth maps affected by blocky artifacts from two points of view. Our first aim is to adapt and compare state-of-the-art methods originally designed to handle similar problems. We are interested in two groups of methods: methods from the first group regard the depth image as it is, i.e. they process it independently of the available color video; methods from the second group utilize structural information from the video channel in order to improve the depth map restoration. Our second aim is to identify appropriate quality measures to quantify the distortions in the depth image and their effect on the rendered virtual view.

Figure 14. Teddy dataset. (a) ground-truth depth; (b) rendered view using (a) (without occlusion filling); (c) ground-truth depth compressed as an H.264 I-frame with QP=51; (d) rendered view using (c)

4.2 Problem Formulation

Consider an individual colour video frame in some colour space. For the sake of clarity we consider the YUV colour space; however, most of the developments can be done in RGB as well. We denote the colour frame as a three-component vector C(x) = [Y(x), U(x), V(x)]^T, where

x ∈ Ω is a spatial variable, with Ω being the image domain. Along with the video frame, we consider the associated per-pixel depth z(x). A new, virtual view can be synthesized out of the given (reference) color frame and depth by applying projective geometry and knowledge about the reference-view camera [24]. The synthesized view is composed of two parts: the pixels visible from the position of the virtual-view camera and the pixels of occluded areas, with their corresponding domains. We consider the case where both the colour frame and the depth are coded as H.264 intra frames with some QPs, this leading to their quantized versions C_q(x) and z_q(x). We model the effect of quantization as quantization noise added to the uncompressed signal, namely

    C_q(x) = C(x) + n_C(x),   (2)
    z_q(x) = z(x) + n_z(x).   (3)

The quantization noise terms added to the color channels and the depth channel are considered independent white Gaussian processes, n_C ~ N(0, σ_C²) and n_z ~ N(0, σ_z²). While this model is simple, it has proven quite effective for mitigating the blocking artifacts arising from quantization of transform coefficients. In particular, it allows establishing a direct link between the quantization parameter (QP) and the quantization noise variance, to be used for tuning deblocking filtering algorithms [25]. Unnatural discontinuities at the boundaries of the transform blocks (the blocking artifacts) in the quantized depth image cause geometrical distortions and distorted object boundaries in the view rendered from the quantized depth and quantized reference view. The goal of the restoration of compressed depth maps is to mitigate the blocking effects in the depth image domain, i.e. to obtain a deblocked depth estimate which is closer to the original, uncompressed depth and which improves the quality of the rendered view.

4.3 Depth maps filtering approaches

We have implemented and compared five methods, which can be grouped into two groups.
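The QP-to-noise-variance link mentioned above can be sketched numerically. The step-size rule (the quantizer step doubling every 6 QP) is the standard H.264 one; modeling the quantization error as uniform over one step, so that its variance is the squared step divided by 12, is our simplifying assumption:

```python
import math

def h264_qstep(qp):
    """H.264 quantizer step size: doubles every 6 QP, with Qstep(QP=4) = 1."""
    return 2.0 ** ((qp - 4) / 6.0)

def quantization_noise_std(qp):
    """Assumption: quantization error ~ uniform over one step, so its
    standard deviation is Qstep / sqrt(12)."""
    return h264_qstep(qp) / math.sqrt(12.0)
```

A filter tuned this way automatically smooths more aggressively at higher QPs, which is how the QP-dependent parameter optimizations below can be read.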
The first two methods work directly on the depth image, making no use of the given reference color video frame. These methods are simple, and by choosing them we wanted to check the effect of simple or adaptive smoothing of the depth image on the rendered view. The second group comprises methods which essentially utilize structural information from the video channel(s). The assumption here is that the video channel is coded with better quality and as such can provide trustworthy information about objects at different depths, to be used for restoring the true depth discontinuities. We aim at utilizing structural information, such as pixel neighborhood or color (dis-)similarity from the given video frame, to infer the true depth values.

4.3.1 Gaussian Filtering

Gaussian smoothing is a popular technique for removing predominantly high-frequency contamination. The method consists in convolving the noisy image with a 2D discrete smoothing kernel of the form

    g(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²)),   (4)

where the standard deviation σ is a free parameter which can be used to control the imposed smoothness. For our experiments we have tuned it as a function of the H.264 quantization parameter. The main drawback of Gaussian filtering is that it applies a fixed-size

rectangular window across true object boundaries, and thus smoothes out true image features together with the noise.

4.3.2 Adaptive H.264 Loop-Filtering

The H.264 video compression standard has a built-in deblocking algorithm addressing the problem of adaptive smoothing. It works adaptively on block boundaries, trying to avoid smoothing real signal discontinuities. To achieve this, two adaptive threshold functions have been experimentally defined to determine whether or not to apply smoothing across block boundaries. The functions depend on the QP as well as on two encoder-selectable offsets, which are included and transmitted in the slice header. These two offsets are the only user-tunable parameters allowing some adjustment of the smoothing for a specific application. For more details on H.264 deblocking we refer to [26].

4.3.3 Local Polynomial Approximation approach

The anisotropic local polynomial approximation (LPA) is a point-wise method for adaptive estimation in noisy conditions [27]. For every point of the image, local polynomial sectorial-neighborhood estimates are fitted for different directions. In the simpler case, instead of sectors, 1D directional estimates along four (by 90 degrees) or eight (by 45 degrees) different directions are used. The length of each estimate, denoted as its scale, is adjusted so as to meet the compromise between an exact polynomial model (low bias) and enough smoothing (low variance). A statistical criterion, the Intersection of Confidence Intervals (ICI) rule, is used to find this compromise [28], [29], i.e. the optimal scale for each direction. These optimal scales in each direction determine an anisotropic polygonal neighborhood for every point of the image, well adapted to the structure of the image. This neighborhood has been successfully utilized for shape-adaptive transform-based color image denoising and deblurring [25]. In the spirit of [25], we use the quantized luminance channel as the source of structural information.
The image is convolved with a set of 1D directional polynomial kernels g_{h,θ}, where the scale h (the kernel length) is taken from a set of different scales and θ is the direction, thus obtaining the estimates ŷ_{h,θ}(x). In order to find the optimal scale for each direction (hereafter the notation of direction is omitted), so-called confidence intervals

    D_h(x) = [ŷ_h(x) − Γ σ_h(x), ŷ_h(x) + Γ σ_h(x)]

are formed first, where σ_h(x) is the standard deviation of the estimate ŷ_h(x) and Γ > 0 is a threshold parameter [28], [29]. The optimal scale is the largest scale (in number of pixels) which still ensures a non-empty intersection of the confidence intervals. Figure 15a illustrates the optimal scale for each pixel (encoded with a different gray value) for a particular direction. The optimal scales for all directions form an adaptive polygonal neighborhood with the current pixel at its centre, as illustrated in Figure 15b. After finding the optimal neighborhood in the luminance image domain, the same neighborhood is used for smoothing the depth image (cf. Figure 15c). The smoothing is done by fitting a plane within the neighborhood. Since LPA is a point-wise procedure, the neighborhoods of different pixels overlap; correspondingly, depth pixels are estimated multiple times, depending on how many neighborhoods they fall into. The final estimate for each depth pixel is obtained by averaging the aggregated planar estimates for that pixel. Figure 15e illustrates the result of LPA-ICI filtering. Note that the scheme depends on two parameters: the noise variance of the luminance channel and the positive threshold parameter Γ. The former depends on the quantization of the color video; we assume low quantization noise. The latter can be adjusted so as to favor a higher amount

of smoothing. We have optimized it with respect to the quantization parameter of the depth channel.

Figure 15. LPA-ICI filtering of depth maps. a) Optimal scales for one of the directions; b) luminance channel with some of the found optimal neighborhoods; c) compressed depth with the same neighborhoods overlaid; d) input (compressed) depth; e) depth filtered by LPA-ICI.

4.3.4 Bilateral Filter

The goal of bilateral filtering is to smooth the image while preserving edges [30]. It utilizes information from all color channels to specify suitable weights for local (non-linear) neighborhood filtering. For grayscale images, the local weights of neighbors are calculated based on both their spatial distance and their photometric similarity, favoring nearby and similar values over distant and dissimilar ones in both the spatial domain and the intensity range. For color images, bilateral filtering uses color distance to quantify the photometric similarity between pixels, thus reducing phantom colors in the filtered image. Figure 16 a-e illustrates how the filtering window is formed. Such a collaboratively-weighted neighborhood defined by the color image is applicable also to the depth channel. The approach is similar to the one used in depth estimation, where contour color information has been used for finding correspondences [31]. In our setting, we have adopted a version of the bilateral filter as in [32]:

22 where, and. The two parameters and determine the spatial extent and the range extent of the weighting functions correspondingly. We have optimized them with respect to the QP:. A result of bilateral filtering is given in Figure 16 f,g. (5) a) b) c) d) e) f) g) Figure 16. Bilateral filtering of depth maps. a) color frame with reference pixel (in red); b) spatial proximity; c) colour similarity; d) colour window; e) combined spatial-colour window; f) blocky depth; g) bilaterally filtered depth 21

4.3.5 Hypothesis filtering approach

Originally, the considered method has been developed for increasing the resolution of low-resolution depth images, utilizing information from the high-resolution color image [32]. This method is perfectly applicable to our problem of suppressing compression artefacts and restoring real discontinuities in the depth map. In the original approach, a 3D cost volume is constructed frame-wise out of several depth hypotheses, and the hypothesis with the lowest cost is selected as the refined depth value at the current iteration. More specifically, the cost volume at the i-th iteration is formed as a truncated quadratic difference

C_i(x,d) = min( (d − z_i(x))², ηL ),    (6)

where d is the potential depth candidate, z_i(x) is the current depth estimate at coordinates x, and L is the search range controlled by a constant η. The obtained slices of the cost volume for different values of d largely keep the degraded pattern of z, as illustrated in Figure 17 left. Therefore, each slice of the cost volume undergoes joint bilateral filtering, i.e. each pixel of the cost slice is obtained as a weighted average of neighboring pixels, where the weights are also modified by the color similarity measured as the l1 distance between the corresponding pixel of the color video frame and the neighboring ones:

C̄_i(x,d) = (1/W(x)) Σ_{v∈N(x)} exp(−||x−v||²/(2σ_s²)) exp(−||Y(x)−Y(v)||₁/(2σ_r²)) C_i(v,d),    (7)

where W(x) is the normalizing sum of the weights and N(x) is the neighborhood of coordinate x. The reason for applying bilateral filtering is two-fold: it assumes that the depth reflects the piecewise smoothness of the surfaces of the given 3D scene and that the depth is correlated with the local scene color (the same local color corresponds to constant depth). Our experimental tests demonstrated that filtering of the cost volume (6) is more effective than directly filtering the noisy depth. After bilateral filtering, the slices get smoothed (Figure 17 right) and the depth for the next iteration is obtained as

z_{i+1}(x) = argmin_d C̄_i(x,d).    (8)

Figure 17. Result of filtering of the cost volume. Left: unfiltered cost function; right: bilaterally-filtered cost function.
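To make the cost-volume construction and the hypothesis selection concrete, the following sketch (our own naming; the product ηL is collapsed into a single truncation constant) builds the truncated-quadratic cost volume. Without the bilateral filtering step, picking the minimum-cost slice simply snaps each pixel to its nearest hypothesis:

```python
import numpy as np

def cost_volume(depth, d_candidates, trunc=400.0):
    """Truncated quadratic cost volume: one slice min((d - z(x))^2, trunc)
    per depth hypothesis d."""
    z = depth[np.newaxis, ...].astype(np.float64)               # 1 x H x W
    d = np.asarray(d_candidates, dtype=np.float64)[:, None, None]
    return np.minimum((d - z) ** 2, trunc)                      # slices x H x W

# Without any slice filtering, the minimum-cost hypothesis just returns
# the candidate nearest to the input depth:
z = np.array([[10.0, 37.0], [64.0, 90.0]])
cands = np.arange(0, 100, 5)                                    # hypotheses d
restored = cands[np.argmin(cost_volume(z, cands), axis=0)]
# restored == [[10, 35], [65, 90]]
```

The benefit of the method comes from smoothing each slice with the colour-guided weights before taking the argmin, which pulls the minimum away from the degraded pattern of z.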
The hypothesis filtering approach is illustrated in Figure 18. The approach methodologically assumes three steps: (1) form a cost volume, (2) filter the cost volume, (3) pick the minimum-cost hypothesis. In the original approach [32], a further refinement of the depth is suggested: instead of selecting the depth giving the minimum cost, as in Eq. (8), a quadratic function is fitted around that minimum and the position of the minimum of that function is selected instead.
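The quadratic refinement around the discrete minimum can be sketched as a three-point parabola fit (function name is ours):

```python
def parabolic_refine(costs, k):
    """Fit a parabola through the cost at the discrete minimum index k and
    its two neighbours; return the sub-sample position of its minimum."""
    c0, c1, c2 = costs[k - 1], costs[k], costs[k + 1]
    denom = c0 - 2.0 * c1 + c2
    if denom == 0.0:                # flat triple: keep the discrete minimum
        return float(k)
    return k + 0.5 * (c0 - c2) / denom
```

For a cost that is exactly quadratic around the minimum, the refinement recovers the continuous minimum exactly, which is what makes the coarse (rescaled) hypothesis grid described below affordable.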

Figure 18. Block diagram of hypothesis filtering.

We suggest several modifications to the original approach to make it more memory-efficient and to improve its speed. It is straightforward to see that there is no need to form the full cost volume in order to obtain the depth estimate for a given coordinate x at the i-th iteration. Instead, the cost function is formed and filtered over the required neighbourhood only, i.e.

z_{i+1}(x) = argmin_d (1/W(x)) Σ_{v∈N(x)} w(x,v) min( (z_i(v) − d)², ηL ),    (9)

where w(x,v) are the joint-bilateral weights as in Eq. (7). Furthermore, the computational cost is reduced by assuming that not all depth hypotheses are applicable for the current pixel: a safe assumption is that only depths within a limited range around the current estimate z_i(x) have to be checked.

Figure 19. Histograms of a non-compressed and a compressed depth map.

Additionally, the depth range is rescaled with the purpose of further reducing the number of hypotheses. This step is especially efficient against compression (blocky) artifacts: for compressed depth maps, the depth range appears sparse due to the quantization effect. Figure 19 illustrates histograms of depth values before and after compression, confirming the use of a rescaled search range of depth hypotheses. This modification speeds up the procedure and relies on the subsequent quadratic interpolation to find the true minimum. A pseudo-code of the suggested procedure in Eq. (9) is given in Table 1.
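The modified, windowed per-pixel search can be sketched in NumPy as follows. Function name and parameter values are our own, and for simplicity the candidate set is here restricted to the depth values present in the local window rather than the rescaled global range:

```python
import numpy as np

def hypothesis_filter(depth, color, radius=2, sigma_s=2.0, sigma_r=15.0, trunc=100.0):
    """Windowed hypothesis filtering (no full cost volume): for each pixel,
    weigh the truncated quadratic cost over a small neighbourhood with
    joint-bilateral weights and keep the cheapest depth hypothesis."""
    h, w = depth.shape
    out = depth.astype(np.float64).copy()
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    w_s = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))
    dpad = np.pad(depth.astype(np.float64), radius, mode='edge')
    cpad = np.pad(color.astype(np.float64),
                  ((radius, radius), (radius, radius), (0, 0)), mode='edge')
    for i in range(h):
        for j in range(w):
            D = dpad[i:i + 2*radius + 1, j:j + 2*radius + 1]
            C = cpad[i:i + 2*radius + 1, j:j + 2*radius + 1, :]
            cdist = np.abs(C - cpad[i + radius, j + radius]).sum(axis=2)
            W = w_s * np.exp(-cdist**2 / (2 * sigma_r**2))
            best_cost, best_d = np.inf, out[i, j]
            for d in np.unique(D):                  # reduced hypothesis set
                cost = (W * np.minimum((D - d)**2, trunc)).sum() / W.sum()
                if cost < best_cost:
                    best_cost, best_d = cost, d
            out[i, j] = best_d
    return out
```

An isolated depth outlier is more expensive than the surrounding consensus value under any weighting, so it is snapped back to the dominant local depth.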

Table 1. Pseudo-code of the modified hypothesis filtering

    Rescale the range of the noisy depth image
    For every (x,y) in the noisy depth image
        D = window of the depth frame around (x,y)
        C = window of the color frame around (x,y)
        W = bilateral weights calculated from C
        Cmin = infinity
        For d = min(D) to max(D)
            cost = sum( W .* min((D - d)^2, threshold) ) / sum(W)
            If cost < Cmin
                Depth_new(x,y) = d
                Cmin = cost
            End
        End
    End
    Rescale the range of the filtered depth image

Figure 20. Execution time of different implementations of the filtering approach (bilateral filter directly on depth; original approach; proposed approach without cost volume) versus the number of slices.

Figure 20 illustrates the achievements in terms of speed. The figure shows experiments with depth filtering of a scene using different implementations of the filtering procedure. All implementations have been written in C and compiled into MEX files to be run from the Matlab environment. The vertical axis shows the execution time in seconds and the horizontal axis shows the number of slices employed (and thus the dynamic range assumed). In the figure, the dotted curve shows single-pass bilateral filtering; it does not depend on the dynamic range but on the window size, and is thus constant. The red curve shows the computation time of the original approach implemented as a three-step procedure over the full dynamic range; naturally, it is a linear function of the number of slices to be filtered. Our implementation (blue

curve) applying the reduced dynamic range also depends linearly on the number of slices, but with dramatically reduced steepness.

4.4 Quality measures

We have considered two groups of quality measures: the first group operates directly on the depth images (true and processed) and the second group operates on the rendered views (true and restored). While the measures of the first group are simpler and faster to calculate, the measures of the second group are closer to subjective perception.

PSNR of Restored Depth compares the compressed or restored depth ẑ against the ground-truth depth z:

PSNR = 10 log10( 255² / ((1/|X|) Σ_{x∈X} (z(x) − ẑ(x))²) ),    (10)

where |X| is the number of pixels of the depth image.

Percentage of bad pixels is a measure originally used to compare estimated depths from stereo [34]. It counts the percentage of pixels differing by more than a pre-specified threshold δ:

BAD = (100/|X|) Σ_{x∈X} [ |z(x) − ẑ(x)| > δ ].    (11)

Consider the gradient of the difference between the true depth and the approximated depth. By Depth Consistency we denote the percentage of pixels having a magnitude of that gradient higher than a pre-specified threshold:

CONSIST = (100/|X|) Σ_{x∈X} [ ||∇(z(x) − ẑ(x))|| > δ_g ].    (12)

The measure targets non-smooth areas in the restored depth, considered the main source of geometrical distortion, as illustrated in Figure 21.

Figure 21. Results of thresholding.

PSNR of Rendered View is calculated analogously to formula (10), but over the rendered view.

Gradient-normalized RMSE has been suggested in [36] as a performance metric for optical flow estimation algorithms, making it more robust to local intensity variations in textured areas. In our implementation we calculate it over the luminance channel of the rendered image, excluding truly occluded areas.
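The first-group measures (PSNR of restored depth, percentage of bad pixels, depth consistency) can be sketched directly in NumPy. Thresholds and function names here are illustrative, not those used in the report:

```python
import numpy as np

def depth_psnr(z_true, z_test):
    """PSNR of the restored depth against ground truth, 8-bit range."""
    mse = np.mean((z_true.astype(np.float64) - z_test.astype(np.float64))**2)
    return 10.0 * np.log10(255.0**2 / mse)

def bad_pixels(z_true, z_test, thr=1.0):
    """Percentage of pixels whose absolute error exceeds a threshold."""
    diff = np.abs(z_true.astype(np.float64) - z_test.astype(np.float64))
    return 100.0 * np.mean(diff > thr)

def depth_consistency(z_true, z_test, thr=5.0):
    """Percentage of pixels where the gradient magnitude of the depth
    difference exceeds a threshold (flags non-smooth restoration errors)."""
    diff = z_true.astype(np.float64) - z_test.astype(np.float64)
    gy, gx = np.gradient(diff)
    return 100.0 * np.mean(np.hypot(gx, gy) > thr)
```

Note how a single wrong pixel contributes once to the bad-pixel count but inflates the consistency measure over its whole gradient footprint, which is why the latter is more sensitive to edge damage.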

Discontinuity Falses accounts for the percentage of wrong occlusions in the rendered channel: either new occlusions of initially non-occluded pixels or falsely disoccluded pixels, counted relative to the cardinality (number of elements) of the corresponding domains.

4.5 Experimental results

We present two experiments. In the first experiment, we compare the performance of all depth restoration algorithms assuming the true color channel is given (it has also been used in the optimization of the tunable parameters). In the second experiment we study the effect of depth restoration in the case of mild quantization of the color channel. Figure 22 illustrates the performance of some of the filtering techniques. Rendering of the right channel has been accomplished using the original left channel and either the compressed or the filtered depth. No occlusion filling has been applied. Results of the first experiment are presented in Figure 23. Along the x-axis of all plots, the H.264 QPs are given; the area of interest is between 30 and 50. All measures but the BAD one distinguish the methods in a consistent way. The group of structurally-constrained methods clearly outperforms the simple methods working on the depth image only. The two PSNR-based measures seem to be less reliable in characterizing the performance of the methods. The three remaining measures, namely Depth Consistency, Discontinuity Falses and Gradient-normalized RMSE, behave in a consistent manner. While NRMSE is perhaps the measure closest to subjective perception, we also favor the other two measures of this group, as they are relatively simple and do not require calculation of the warped (rendered) image. To characterize the consistency of our optimized parameters, in Figure 23g we show the trend of CONSIST calculated for the algorithms with parameters optimized for NRMSE. One can see that the trend is quite consistent with that of Figure 23e (where the methods are both optimized and compared with respect to CONSIST).
The same can be seen when comparing Figure 23h with Figure 23f. In the former, the NRMSE is calculated over the test set while the algorithm parameters are optimized over the training set with respect to CONSIST. The measure shows the same trend as in the case when the algorithms are optimized with respect to the same measure. So far, we have been working with an uncompressed color channel, which has been involved in the optimizations and comparisons; our aim was to characterize the pure influence of the depth restoration only. In the second experiment we work with a quantized color channel. We assume mild quantization of the color image, e.g. by QP=35, and two QPs, 35 and 45, for the depth. For our test imagery, the first depth QP corresponds to about 10% of the total bitrate. The NRMSE of the rendered channel is calculated with respect to the channel rendered from uncompressed color and depth. The results are given in Figure 24. One can see that the depth post-processing clearly makes a difference, allowing stronger quantization of the depth channel while still achieving good quality.

Figure 22. Filtering of compressed depth maps. a) decompressed depth map; b) right channel rendered using the original left channel and the depth from a); c) depth filtered by the bilateral filter; d) right channel rendered using c); e) depth filtered by the hypothesis filter; f) right channel rendered using e).

Figure 23. Experiment 1. Horizontal axes show the H.264 QP. Panel metrics: Bad Pixels Percentage, Discontinuity Falses, PSNR of Restored Depth, PSNR of Rendered Channel, Depth Consistency and Normalized RMSE; compared methods: No Filtering, H.264 Loop Filter, Gaussian Smooth, LPA-ICI Filtering, Bilateral Filtering, Super Resolution. (a)-(f) Performance of the selected algorithms optimized for and compared by the same measure. (g) Performance measured by CONSIST for algorithms optimized for NRMSE. (h) Performance measured by NRMSE for algorithms optimized for CONSIST.

[Figure 24, first four panels] True Color, True Depth; Color QP=35, True Depth (NRMSE=10); True Color, Depth QP=35 (NRMSE=23); Color QP=35, Depth QP=35 (NRMSE=24).

[Figure 24, last four panels] True Color, Depth QP=45 (NRMSE=31); Color QP=35, Depth QP=45 (NRMSE=32); True Color, Filtered Depth from QP=45 (NRMSE=21); Color QP=35, Filtered Depth from QP=45 (NRMSE=22).

Figure 24. Experiment 2. Effect of compressed color and of compressed and filtered depth on the quality of the rendered view.

5 Temporally-consistent filtering of depth map sequences

5.1 Introduction

In the previous section we addressed the problem of refining depth maps impaired by compression artefacts. The quality of depth maps also depends on the way they have been generated: either through depth-from-stereo or depth-from-multiview types of algorithms, or using special depth sensors based on time-of-flight (ToF) principles, laser scanners or structured light. When accompanying video sequences, the consistency of successive depth maps in the sequence becomes an issue. Time-inconsistent depth sequences might cause flickering in the synthesized views as well as other 3D-specific artifacts [37]. The time-consistency issue has been addressed mainly at the stage of depth estimation, either by adding a smoothing constraint along the temporal dimension in the global optimization procedure of the depth estimation, or by simple median filtering along successive depth frames [38], [39]. In this section, we address the problem of filtering depth map sequences impaired by inaccurate depth estimation, noise or compression artifacts. We extend the approach from Section 4 toward video to tackle the time-consistency issue.

5.2 Problem formulation

We extend the formulation in Sub-section 4.2 to add the temporal dimension. Consider a color video sequence Y(x,t) in the YUV color space, accompanied by the associated per-pixel depth z(x,t), where x ∈ X is a spatial variable, X being the image domain, and t is the frame index. The virtual view to be synthesized out of the given (reference) color frame and depth at time t is denoted by V(x,t). It is composed of two parts, V = {V_v, V_o}, where V_v denotes the pixels visible from the position of the virtual-view camera and V_o denotes the pixels of occluded areas; the corresponding domains are denoted by X_v and X_o correspondingly, X = X_v ∪ X_o.
We consider the case where the depth sequence has been degraded by some impairment added to the true depth: z'(x,t) = z(x,t) + ε(x,t). Finally, we denote by V'(x,t) the virtual view synthesized out of the degraded depth, and by V̂(x,t) the virtual view synthesized out of the processed depth ẑ(x,t) and the given reference view. The goal of the depth filtering is to get an estimate of the depth sequence that is closer to the ground-truth depth sequence and provides a synthesized virtual view with improved quality.

5.3 Extending the filtering approach to video

In Section 4, we found that the hypothesis filter gives superior performance when applied to individual depth frames impaired by compression artifacts. Here, we extend the same approach to video and to more general types of depth artifacts. Eq. (8) is extended to video sequences as follows:

ẑ_{i+1}(x,t) = argmin_d Σ_{(v,s)∈N(x,t)} w_s(x,v) w_c(x,t,v,s) w_t(t,s) min( (z_i(v,s) − d)², ηL ),    (15)

where w_s, w_c and w_t are spatial-proximity, color-similarity and temporal-distance weights, and N(x,t) is a spatio-temporal neighborhood of the voxel (x,t).
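The separable spatial, colour and temporal penalties can be sketched as a weight function (function name and σ values are ours for illustration; in the report the parameters are tuned separately):

```python
import numpy as np

def st_weights(color_win, center_rgb, dxy, dt, sigma_s=2.0, sigma_r=15.0, sigma_t=1.0):
    """Spatio-temporal weights for the video hypothesis filter: colour
    similarity (l1 distance) times spatial proximity times a separate
    temporal-distance penalty -- no motion estimation is involved."""
    cdist = np.abs(color_win - center_rgb).sum(axis=-1)
    w_c = np.exp(-cdist ** 2 / (2 * sigma_r ** 2))   # colour similarity
    w_s = np.exp(-dxy ** 2 / (2 * sigma_s ** 2))     # spatial proximity
    w_t = np.exp(-dt ** 2 / (2 * sigma_t ** 2))      # temporal penalty
    return w_c * w_s * w_t
```

Keeping the temporal penalty separate means that a neighbouring frame two steps away is damped by a fixed factor regardless of the spatial window, which is the tuning flexibility mentioned below; pixels displaced by motion are suppressed by the colour term instead.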

Eq. (15) essentially means that the depth hypotheses are checked within a spatio-temporal window around the current depth voxel with coordinates (x,t). While the neighbouring voxels are weighted by their color similarity to the central one, the temporal distance is penalized separately from the spatial one to enable better flexibility in tuning the filter parameters. Note that the video filtering uses no explicit motion information: no motion estimation/compensation is applied. We rely on the color (dis-)similarity weights to sufficiently suppress depth voxels changed considerably by motion. The hypothesis filtering procedure for video is illustrated in Figure 25.

Figure 25. Extension of hypothesis filtering to video.

Experiments

We present two experiments. In the first experiment, we consider the depth sequence as estimated from noisy stereo sequences. Namely, a given stereo sequence is used to estimate the depth sequence. Then, white noise is added to the stereo video, and the noisy stereo video is used to estimate the impaired depth sequence. The latter is filtered by the suggested video hypothesis filtering. For comparison, median filtering is applied to the noisy depth sequence and to the per-frame hypothesis-filtered data. In our practical setting, we have used a stereo pair from the Cones test data of the Middlebury evaluation test bench [40]. For that stereo pair we have the ground-truth depth, and we also estimated the depth by the method in [41]. To simulate a stereo video, we repeated the stereo pair 40 times to form 40 successive video frames, then added a different amount of noise to each frame and estimated the depth from each so-obtained noisy stereo frame. The results of the different filtering techniques applied to the noisy depth sequence are given in Figure 26. The results are consistent over all measures and show considerable improvement along the temporal dimension when the video extension of the hypothesis filtering is applied.
The video hypothesis filtering not only manages to equalize the quality along the time axis but also improves the depth estimates compared to the ones obtained from noise-free data by the method from [41]. In the second experiment we simulate blocky artifacts in the depth channel. To create a ground-truth video-plus-depth sequence, we circularly shifted the same Cones sequence with a radius of 10 pixels,

also adding some noise to the shifting vectors, and then cropped the central parts of the so-obtained frames. Thus, we got a sequence simulating circular motion of the camera plus a small amount of shaking. The sequence was compressed by an H.264 encoder in IPIPIP mode, varying the quantization parameter (QP) slightly from frame to frame to simulate different amounts of blockiness in successive frames. The filtering results are presented in Figure 27. We kept the following filters: the single-frame hypothesis filter, the same followed by median filtering along time, and the video hypothesis filtering. As can be seen in the figure, the video version of the hypothesis filtering has the most consistent performance. It performs especially well around edges. The rendered frames are of similar quality, thus providing a smooth and flicker-free experience. The only exception is the BAD metric, where the compressed depth seems to be the best. This metric, originally introduced to measure the performance of depth estimation algorithms, simply counts pixels that differ between the ground truth and the processed image, no matter how big or small (above a threshold) the differences are. While all filtering algorithms introduce small changes over the whole image, these small changes amount to a higher percentage than that of the differing pixels in the quantized depth image. However, what really matters are the bigger differences appearing around edges. These are well tackled by the filtering, as seen in the other metrics. Especially informative is the NRMSE, which measures the quality of the rendered channel and is closer to human perception. There, the new filtering approach truly excels. Finally, we provide some visual illustrations of the performance of the algorithm. We use the Book Arrival sequence provided by Fraunhofer HHI, where the depth is estimated by the MPEG depth estimation software [42].
While the latter incorporates rather powerful techniques and yields high-quality and time-consistent depth maps, our technique still adds some improvements. Figure 28 shows the result of filtering for frame 20. From left to right, the figure shows the originally-estimated depth, the depth obtained after median filtering along time, and the depth resulting from the proposed method. The depth estimation has failed around the face of the person entering the room and in the floor area. Median filtering manages to correct the depth of the floor but fails to correct the face of the person. The proposed method restores both the floor and the face. The same sequence has been compressed/decompressed with H.264 intra-frame coding and then filtered. The result of decompression and filtering is shown in Figure 29. Again, despite the substantial blocking artefacts, details such as human faces have been successfully restored.

5.4 Results

Figure 26. Comparative results of the filtering approaches in Experiment 1 (Cones sequence; panel metrics, plotted per frame: PSNR, PSNR of virtual channel, CONSIST, normalized RMSE, BAD, BAD near discontinuities; compared methods: noise-free estimate; noisy estimate; noisy estimate + median (5 frames); noisy + hypothesis + median (5 frames); noisy + hypothesis; noisy + video hypothesis (3 frames); noisy + video hypothesis (5 frames)).

Figure 27. Comparative results of the filtering approaches in Experiment 2 (Cones sequence; same panel metrics as Figure 26; compared methods: noisy estimate; noisy + hypothesis; noisy + hypothesis + median (7 frames); noisy + video hypothesis (7 frames)).

Figure 28. Results of filtering of the 'Book Arrival' depth sequence. From left to right: originally-estimated depth; median-filtered depth; depth filtered by the proposed approach.

Figure 29. Filtering of a compressed depth sequence. From left to right: decompressed depth map; decompressed depth map filtered by the proposed approach.


More information

MULTIVIEW 3D VIDEO DENOISING IN SLIDING 3D DCT DOMAIN

MULTIVIEW 3D VIDEO DENOISING IN SLIDING 3D DCT DOMAIN 20th European Signal Processing Conference (EUSIPCO 2012) Bucharest, Romania, August 27-31, 2012 MULTIVIEW 3D VIDEO DENOISING IN SLIDING 3D DCT DOMAIN 1 Michal Joachimiak, 2 Dmytro Rusanovskyy 1 Dept.

More information

Correcting User Guided Image Segmentation

Correcting User Guided Image Segmentation Correcting User Guided Image Segmentation Garrett Bernstein (gsb29) Karen Ho (ksh33) Advanced Machine Learning: CS 6780 Abstract We tackle the problem of segmenting an image into planes given user input.

More information

Scene Segmentation by Color and Depth Information and its Applications

Scene Segmentation by Color and Depth Information and its Applications Scene Segmentation by Color and Depth Information and its Applications Carlo Dal Mutto Pietro Zanuttigh Guido M. Cortelazzo Department of Information Engineering University of Padova Via Gradenigo 6/B,

More information

FAST MOTION ESTIMATION WITH DUAL SEARCH WINDOW FOR STEREO 3D VIDEO ENCODING

FAST MOTION ESTIMATION WITH DUAL SEARCH WINDOW FOR STEREO 3D VIDEO ENCODING FAST MOTION ESTIMATION WITH DUAL SEARCH WINDOW FOR STEREO 3D VIDEO ENCODING 1 Michal Joachimiak, 2 Kemal Ugur 1 Dept. of Signal Processing, Tampere University of Technology, Tampere, Finland 2 Jani Lainema,

More information

Artifacts and Textured Region Detection

Artifacts and Textured Region Detection Artifacts and Textured Region Detection 1 Vishal Bangard ECE 738 - Spring 2003 I. INTRODUCTION A lot of transformations, when applied to images, lead to the development of various artifacts in them. In

More information

View Synthesis for Multiview Video Compression

View Synthesis for Multiview Video Compression View Synthesis for Multiview Video Compression Emin Martinian, Alexander Behrens, Jun Xin, and Anthony Vetro email:{martinian,jxin,avetro}@merl.com, behrens@tnt.uni-hannover.de Mitsubishi Electric Research

More information

NEW CONCEPT FOR JOINT DISPARITY ESTIMATION AND SEGMENTATION FOR REAL-TIME VIDEO PROCESSING

NEW CONCEPT FOR JOINT DISPARITY ESTIMATION AND SEGMENTATION FOR REAL-TIME VIDEO PROCESSING NEW CONCEPT FOR JOINT DISPARITY ESTIMATION AND SEGMENTATION FOR REAL-TIME VIDEO PROCESSING Nicole Atzpadin 1, Serap Askar, Peter Kauff, Oliver Schreer Fraunhofer Institut für Nachrichtentechnik, Heinrich-Hertz-Institut,

More information

Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude

Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude A. Migukin *, V. atkovnik and J. Astola Department of Signal Processing, Tampere University of Technology,

More information

Recent, Current and Future Developments in Video Coding

Recent, Current and Future Developments in Video Coding Recent, Current and Future Developments in Video Coding Jens-Rainer Ohm Inst. of Commun. Engineering Outline Recent and current activities in MPEG Video and JVT Scalable Video Coding Multiview Video Coding

More information

MRT based Adaptive Transform Coder with Classified Vector Quantization (MATC-CVQ)

MRT based Adaptive Transform Coder with Classified Vector Quantization (MATC-CVQ) 5 MRT based Adaptive Transform Coder with Classified Vector Quantization (MATC-CVQ) Contents 5.1 Introduction.128 5.2 Vector Quantization in MRT Domain Using Isometric Transformations and Scaling.130 5.2.1

More information

Patch-Based Color Image Denoising using efficient Pixel-Wise Weighting Techniques

Patch-Based Color Image Denoising using efficient Pixel-Wise Weighting Techniques Patch-Based Color Image Denoising using efficient Pixel-Wise Weighting Techniques Syed Gilani Pasha Assistant Professor, Dept. of ECE, School of Engineering, Central University of Karnataka, Gulbarga,

More information

On the Adoption of Multiview Video Coding in Wireless Multimedia Sensor Networks

On the Adoption of Multiview Video Coding in Wireless Multimedia Sensor Networks 2011 Wireless Advanced On the Adoption of Multiview Video Coding in Wireless Multimedia Sensor Networks S. Colonnese, F. Cuomo, O. Damiano, V. De Pascalis and T. Melodia University of Rome, Sapienza, DIET,

More information

A Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation

A Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation A Low Power, High Throughput, Fully Event-Based Stereo System: Supplementary Documentation Alexander Andreopoulos, Hirak J. Kashyap, Tapan K. Nayak, Arnon Amir, Myron D. Flickner IBM Research March 25,

More information

Segmentation Based Stereo. Michael Bleyer LVA Stereo Vision

Segmentation Based Stereo. Michael Bleyer LVA Stereo Vision Segmentation Based Stereo Michael Bleyer LVA Stereo Vision What happened last time? Once again, we have looked at our energy function: E ( D) = m( p, dp) + p I < p, q > We have investigated the matching

More information

MRT based Fixed Block size Transform Coding

MRT based Fixed Block size Transform Coding 3 MRT based Fixed Block size Transform Coding Contents 3.1 Transform Coding..64 3.1.1 Transform Selection...65 3.1.2 Sub-image size selection... 66 3.1.3 Bit Allocation.....67 3.2 Transform coding using

More information

Lecture 13 Video Coding H.264 / MPEG4 AVC

Lecture 13 Video Coding H.264 / MPEG4 AVC Lecture 13 Video Coding H.264 / MPEG4 AVC Last time we saw the macro block partition of H.264, the integer DCT transform, and the cascade using the DC coefficients with the WHT. H.264 has more interesting

More information

Lecture 7: Most Common Edge Detectors

Lecture 7: Most Common Edge Detectors #1 Lecture 7: Most Common Edge Detectors Saad Bedros sbedros@umn.edu Edge Detection Goal: Identify sudden changes (discontinuities) in an image Intuitively, most semantic and shape information from the

More information

A deblocking filter with two separate modes in block-based video coding

A deblocking filter with two separate modes in block-based video coding A deblocing filter with two separate modes in bloc-based video coding Sung Deu Kim Jaeyoun Yi and Jong Beom Ra Dept. of Electrical Engineering Korea Advanced Institute of Science and Technology 7- Kusongdong

More information

MAXIMIZING BANDWIDTH EFFICIENCY

MAXIMIZING BANDWIDTH EFFICIENCY MAXIMIZING BANDWIDTH EFFICIENCY Benefits of Mezzanine Encoding Rev PA1 Ericsson AB 2016 1 (19) 1 Motivation 1.1 Consumption of Available Bandwidth Pressure on available fiber bandwidth continues to outpace

More information

[2006] IEEE. Reprinted, with permission, from [Wenjing Jia, Huaifeng Zhang, Xiangjian He, and Qiang Wu, A Comparison on Histogram Based Image

[2006] IEEE. Reprinted, with permission, from [Wenjing Jia, Huaifeng Zhang, Xiangjian He, and Qiang Wu, A Comparison on Histogram Based Image [6] IEEE. Reprinted, with permission, from [Wenjing Jia, Huaifeng Zhang, Xiangjian He, and Qiang Wu, A Comparison on Histogram Based Image Matching Methods, Video and Signal Based Surveillance, 6. AVSS

More information

2014 Summer School on MPEG/VCEG Video. Video Coding Concept

2014 Summer School on MPEG/VCEG Video. Video Coding Concept 2014 Summer School on MPEG/VCEG Video 1 Video Coding Concept Outline 2 Introduction Capture and representation of digital video Fundamentals of video coding Summary Outline 3 Introduction Capture and representation

More information

Structure-adaptive Image Denoising with 3D Collaborative Filtering

Structure-adaptive Image Denoising with 3D Collaborative Filtering , pp.42-47 http://dx.doi.org/10.14257/astl.2015.80.09 Structure-adaptive Image Denoising with 3D Collaborative Filtering Xuemei Wang 1, Dengyin Zhang 2, Min Zhu 2,3, Yingtian Ji 2, Jin Wang 4 1 College

More information

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 5, MAY

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 5, MAY IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 5, MAY 2015 1573 Graph-Based Representation for Multiview Image Geometry Thomas Maugey, Member, IEEE, Antonio Ortega, Fellow Member, IEEE, and Pascal

More information

International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)

International Journal of Emerging Technology and Advanced Engineering Website:   (ISSN , Volume 2, Issue 4, April 2012) A Technical Analysis Towards Digital Video Compression Rutika Joshi 1, Rajesh Rai 2, Rajesh Nema 3 1 Student, Electronics and Communication Department, NIIST College, Bhopal, 2,3 Prof., Electronics and

More information

An Approach for Reduction of Rain Streaks from a Single Image

An Approach for Reduction of Rain Streaks from a Single Image An Approach for Reduction of Rain Streaks from a Single Image Vijayakumar Majjagi 1, Netravati U M 2 1 4 th Semester, M. Tech, Digital Electronics, Department of Electronics and Communication G M Institute

More information

Multimedia Systems Image III (Image Compression, JPEG) Mahdi Amiri April 2011 Sharif University of Technology

Multimedia Systems Image III (Image Compression, JPEG) Mahdi Amiri April 2011 Sharif University of Technology Course Presentation Multimedia Systems Image III (Image Compression, JPEG) Mahdi Amiri April 2011 Sharif University of Technology Image Compression Basics Large amount of data in digital images File size

More information

Mesh Based Interpolative Coding (MBIC)

Mesh Based Interpolative Coding (MBIC) Mesh Based Interpolative Coding (MBIC) Eckhart Baum, Joachim Speidel Institut für Nachrichtenübertragung, University of Stuttgart An alternative method to H.6 encoding of moving images at bit rates below

More information

Outline Introduction MPEG-2 MPEG-4. Video Compression. Introduction to MPEG. Prof. Pratikgiri Goswami

Outline Introduction MPEG-2 MPEG-4. Video Compression. Introduction to MPEG. Prof. Pratikgiri Goswami to MPEG Prof. Pratikgiri Goswami Electronics & Communication Department, Shree Swami Atmanand Saraswati Institute of Technology, Surat. Outline of Topics 1 2 Coding 3 Video Object Representation Outline

More information

Stereo Vision II: Dense Stereo Matching

Stereo Vision II: Dense Stereo Matching Stereo Vision II: Dense Stereo Matching Nassir Navab Slides prepared by Christian Unger Outline. Hardware. Challenges. Taxonomy of Stereo Matching. Analysis of Different Problems. Practical Considerations.

More information

EE 5359 MULTIMEDIA PROCESSING SPRING Final Report IMPLEMENTATION AND ANALYSIS OF DIRECTIONAL DISCRETE COSINE TRANSFORM IN H.

EE 5359 MULTIMEDIA PROCESSING SPRING Final Report IMPLEMENTATION AND ANALYSIS OF DIRECTIONAL DISCRETE COSINE TRANSFORM IN H. EE 5359 MULTIMEDIA PROCESSING SPRING 2011 Final Report IMPLEMENTATION AND ANALYSIS OF DIRECTIONAL DISCRETE COSINE TRANSFORM IN H.264 Under guidance of DR K R RAO DEPARTMENT OF ELECTRICAL ENGINEERING UNIVERSITY

More information

CIRCULAR MOIRÉ PATTERNS IN 3D COMPUTER VISION APPLICATIONS

CIRCULAR MOIRÉ PATTERNS IN 3D COMPUTER VISION APPLICATIONS CIRCULAR MOIRÉ PATTERNS IN 3D COMPUTER VISION APPLICATIONS Setiawan Hadi Mathematics Department, Universitas Padjadjaran e-mail : shadi@unpad.ac.id Abstract Geometric patterns generated by superimposing

More information

Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction

Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction Yongying Gao and Hayder Radha Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48823 email:

More information

Performance Comparison between DWT-based and DCT-based Encoders

Performance Comparison between DWT-based and DCT-based Encoders , pp.83-87 http://dx.doi.org/10.14257/astl.2014.75.19 Performance Comparison between DWT-based and DCT-based Encoders Xin Lu 1 and Xuesong Jin 2 * 1 School of Electronics and Information Engineering, Harbin

More information

Computer Vision 2. SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung. Computer Vision 2 Dr. Benjamin Guthier

Computer Vision 2. SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung. Computer Vision 2 Dr. Benjamin Guthier Computer Vision 2 SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung Computer Vision 2 Dr. Benjamin Guthier 1. IMAGE PROCESSING Computer Vision 2 Dr. Benjamin Guthier Content of this Chapter Non-linear

More information

Perceptual Grouping from Motion Cues Using Tensor Voting

Perceptual Grouping from Motion Cues Using Tensor Voting Perceptual Grouping from Motion Cues Using Tensor Voting 1. Research Team Project Leader: Graduate Students: Prof. Gérard Medioni, Computer Science Mircea Nicolescu, Changki Min 2. Statement of Project

More information

A Statistical Consistency Check for the Space Carving Algorithm.

A Statistical Consistency Check for the Space Carving Algorithm. A Statistical Consistency Check for the Space Carving Algorithm. A. Broadhurst and R. Cipolla Dept. of Engineering, Univ. of Cambridge, Cambridge, CB2 1PZ aeb29 cipolla @eng.cam.ac.uk Abstract This paper

More information

Implementation and analysis of Directional DCT in H.264

Implementation and analysis of Directional DCT in H.264 Implementation and analysis of Directional DCT in H.264 EE 5359 Multimedia Processing Guidance: Dr K R Rao Priyadarshini Anjanappa UTA ID: 1000730236 priyadarshini.anjanappa@mavs.uta.edu Introduction A

More information

Compression of Stereo Images using a Huffman-Zip Scheme

Compression of Stereo Images using a Huffman-Zip Scheme Compression of Stereo Images using a Huffman-Zip Scheme John Hamann, Vickey Yeh Department of Electrical Engineering, Stanford University Stanford, CA 94304 jhamann@stanford.edu, vickey@stanford.edu Abstract

More information

Video Coding Using Spatially Varying Transform

Video Coding Using Spatially Varying Transform Video Coding Using Spatially Varying Transform Cixun Zhang 1, Kemal Ugur 2, Jani Lainema 2, and Moncef Gabbouj 1 1 Tampere University of Technology, Tampere, Finland {cixun.zhang,moncef.gabbouj}@tut.fi

More information

Anno accademico 2006/2007. Davide Migliore

Anno accademico 2006/2007. Davide Migliore Robotica Anno accademico 6/7 Davide Migliore migliore@elet.polimi.it Today What is a feature? Some useful information The world of features: Detectors Edges detection Corners/Points detection Descriptors?!?!?

More information

Motion and Tracking. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE)

Motion and Tracking. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE) Motion and Tracking Andrea Torsello DAIS Università Ca Foscari via Torino 155, 30172 Mestre (VE) Motion Segmentation Segment the video into multiple coherently moving objects Motion and Perceptual Organization

More information

Motion Estimation for Video Coding Standards

Motion Estimation for Video Coding Standards Motion Estimation for Video Coding Standards Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Introduction of Motion Estimation The goal of video compression

More information

Accurate 3D Face and Body Modeling from a Single Fixed Kinect

Accurate 3D Face and Body Modeling from a Single Fixed Kinect Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this

More information

MR IMAGE SEGMENTATION

MR IMAGE SEGMENTATION MR IMAGE SEGMENTATION Prepared by : Monil Shah What is Segmentation? Partitioning a region or regions of interest in images such that each region corresponds to one or more anatomic structures Classification

More information

Image Restoration and Reconstruction

Image Restoration and Reconstruction Image Restoration and Reconstruction Image restoration Objective process to improve an image, as opposed to the subjective process of image enhancement Enhancement uses heuristics to improve the image

More information

Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging

Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging Florin C. Ghesu 1, Thomas Köhler 1,2, Sven Haase 1, Joachim Hornegger 1,2 04.09.2014 1 Pattern

More information

Locally Weighted Least Squares Regression for Image Denoising, Reconstruction and Up-sampling

Locally Weighted Least Squares Regression for Image Denoising, Reconstruction and Up-sampling Locally Weighted Least Squares Regression for Image Denoising, Reconstruction and Up-sampling Moritz Baecher May 15, 29 1 Introduction Edge-preserving smoothing and super-resolution are classic and important

More information

Digital Image Processing. Prof. P. K. Biswas. Department of Electronic & Electrical Communication Engineering

Digital Image Processing. Prof. P. K. Biswas. Department of Electronic & Electrical Communication Engineering Digital Image Processing Prof. P. K. Biswas Department of Electronic & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture - 21 Image Enhancement Frequency Domain Processing

More information

Depth. Common Classification Tasks. Example: AlexNet. Another Example: Inception. Another Example: Inception. Depth

Depth. Common Classification Tasks. Example: AlexNet. Another Example: Inception. Another Example: Inception. Depth Common Classification Tasks Recognition of individual objects/faces Analyze object-specific features (e.g., key points) Train with images from different viewing angles Recognition of object classes Analyze

More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information

Image Processing Via Pixel Permutations

Image Processing Via Pixel Permutations Image Processing Via Pixel Permutations Michael Elad The Computer Science Department The Technion Israel Institute of technology Haifa 32000, Israel Joint work with Idan Ram Israel Cohen The Electrical

More information

x' = c 1 x + c 2 y + c 3 xy + c 4 y' = c 5 x + c 6 y + c 7 xy + c 8

x' = c 1 x + c 2 y + c 3 xy + c 4 y' = c 5 x + c 6 y + c 7 xy + c 8 1. Explain about gray level interpolation. The distortion correction equations yield non integer values for x' and y'. Because the distorted image g is digital, its pixel values are defined only at integer

More information

Video Quality Analysis for H.264 Based on Human Visual System

Video Quality Analysis for H.264 Based on Human Visual System IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021 ISSN (p): 2278-8719 Vol. 04 Issue 08 (August. 2014) V4 PP 01-07 www.iosrjen.org Subrahmanyam.Ch 1 Dr.D.Venkata Rao 2 Dr.N.Usha Rani 3 1 (Research

More information

Image Processing Lecture 10

Image Processing Lecture 10 Image Restoration Image restoration attempts to reconstruct or recover an image that has been degraded by a degradation phenomenon. Thus, restoration techniques are oriented toward modeling the degradation

More information

Image Restoration Using DNN

Image Restoration Using DNN Image Restoration Using DNN Hila Levi & Eran Amar Images were taken from: http://people.tuebingen.mpg.de/burger/neural_denoising/ Agenda Domain Expertise vs. End-to-End optimization Image Denoising and

More information

Express Letters. A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation. Jianhua Lu and Ming L. Liou

Express Letters. A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation. Jianhua Lu and Ming L. Liou IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 7, NO. 2, APRIL 1997 429 Express Letters A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation Jianhua Lu and

More information

VC 12/13 T16 Video Compression

VC 12/13 T16 Video Compression VC 12/13 T16 Video Compression Mestrado em Ciência de Computadores Mestrado Integrado em Engenharia de Redes e Sistemas Informáticos Miguel Tavares Coimbra Outline The need for compression Types of redundancy

More information

Image Segmentation Via Iterative Geodesic Averaging

Image Segmentation Via Iterative Geodesic Averaging Image Segmentation Via Iterative Geodesic Averaging Asmaa Hosni, Michael Bleyer and Margrit Gelautz Institute for Software Technology and Interactive Systems, Vienna University of Technology Favoritenstr.

More information

Investigation of the GoP Structure for H.26L Video Streams

Investigation of the GoP Structure for H.26L Video Streams Investigation of the GoP Structure for H.26L Video Streams F. Fitzek P. Seeling M. Reisslein M. Rossi M. Zorzi acticom GmbH mobile networks R & D Group Germany [fitzek seeling]@acticom.de Arizona State

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter

Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter Y. Vatis, B. Edler, I. Wassermann, D. T. Nguyen and J. Ostermann ABSTRACT Standard video compression techniques

More information

CS4442/9542b Artificial Intelligence II prof. Olga Veksler

CS4442/9542b Artificial Intelligence II prof. Olga Veksler CS4442/9542b Artificial Intelligence II prof. Olga Veksler Lecture 8 Computer Vision Introduction, Filtering Some slides from: D. Jacobs, D. Lowe, S. Seitz, A.Efros, X. Li, R. Fergus, J. Hayes, S. Lazebnik,

More information

Segmentation and Tracking of Partial Planar Templates

Segmentation and Tracking of Partial Planar Templates Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract

More information

VIDEO COMPRESSION STANDARDS

VIDEO COMPRESSION STANDARDS VIDEO COMPRESSION STANDARDS Family of standards: the evolution of the coding model state of the art (and implementation technology support): H.261: videoconference x64 (1988) MPEG-1: CD storage (up to

More information

Image Frame Fusion using 3D Anisotropic Diffusion

Image Frame Fusion using 3D Anisotropic Diffusion Image Frame Fusion using 3D Anisotropic Diffusion Fatih Kahraman 1, C. Deniz Mendi 1, Muhittin Gökmen 2 1 TUBITAK Marmara Research Center, Informatics Institute, Kocaeli, Turkey 2 ITU Computer Engineering

More information

Digital Image Processing COSC 6380/4393

Digital Image Processing COSC 6380/4393 Digital Image Processing COSC 6380/4393 Lecture 21 Nov 16 th, 2017 Pranav Mantini Ack: Shah. M Image Processing Geometric Transformation Point Operations Filtering (spatial, Frequency) Input Restoration/

More information

Spatio-Temporal Stereo Disparity Integration

Spatio-Temporal Stereo Disparity Integration Spatio-Temporal Stereo Disparity Integration Sandino Morales and Reinhard Klette The.enpeda.. Project, The University of Auckland Tamaki Innovation Campus, Auckland, New Zealand pmor085@aucklanduni.ac.nz

More information

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Course Presentation Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Video Coding Correlation in Video Sequence Spatial correlation Similar pixels seem

More information

BLIND QUALITY ASSESSMENT OF JPEG2000 COMPRESSED IMAGES USING NATURAL SCENE STATISTICS. Hamid R. Sheikh, Alan C. Bovik and Lawrence Cormack

BLIND QUALITY ASSESSMENT OF JPEG2000 COMPRESSED IMAGES USING NATURAL SCENE STATISTICS. Hamid R. Sheikh, Alan C. Bovik and Lawrence Cormack BLIND QUALITY ASSESSMENT OF JPEG2 COMPRESSED IMAGES USING NATURAL SCENE STATISTICS Hamid R. Sheikh, Alan C. Bovik and Lawrence Cormack Laboratory for Image and Video Engineering, Department of Electrical

More information

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong)

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) References: [1] http://homepages.inf.ed.ac.uk/rbf/hipr2/index.htm [2] http://www.cs.wisc.edu/~dyer/cs540/notes/vision.html

More information