Final report on coding algorithms for mobile 3DTV. Gerhard Tech Karsten Müller Philipp Merkle Heribert Brust Lina Jin


MOBILE3DTV Project No. 216503

Final report on coding algorithms for mobile 3DTV

Gerhard Tech, Karsten Müller, Philipp Merkle, Heribert Brust, Lina Jin

Abstract: A low complexity view synthesis algorithm suitable for mobile devices has been developed and implemented. The implemented renderer provides two different modes. The first mode enables very fast processing by rounding disparities to integer values to avoid interpolation. The second mode supports floating point disparities by interpolation at sub-pixel positions. For both modes, pre-processing filters for the depth data and post-processing filters for the rendered view have been implemented. Furthermore, a method for removing irrelevant information from depth maps in Video plus Depth coding is presented. Irrelevant edges and features in the depth map can be damped while the quality of the rendered view is retained, so the processed depth maps can be coded at a reduced rate compared to unaltered data. Coding experiments show gains of up to 0.5 dB for the rendered view at the same bit rate. The integration of the PSNR-HVS metric into the JMVC software for multiview coding is described. A QP-dependent correction factor for the Lagrange multiplier has been determined. The modified rate-distortion optimization process leads to gains of up to 1.6 dB PSNR-HVS measured with the new video quality metric. Finally, a summary of the stereo video formats and coding methods evaluated in the Mobile3DTV project is given; results and advancements are pointed out.

Keywords: 3DTV, coding algorithms, rendering, depth map filtering, perceptual video coding

Executive Summary

This deliverable is tripartite: the first part describes the advances achieved in Video plus Depth coding, presenting a rendering approach supporting sub-pixel accuracy and a filter for removing irrelevant signal parts from depth data. The second part describes a software encoder using a new video quality metric. The third part gives a final summary on coding algorithms for mobile 3DTV.

For the Video plus Depth approach a low complexity view synthesis algorithm suitable for mobile devices has been developed and implemented. For fast processing, each row of the synthesized view is rendered sequentially from one line of the corresponding video and depth frame. This minimizes the amount of needed memory as well as the number of memory accesses. The implemented renderer provides two different modes. The first mode enables very fast processing by rounding disparities to integer values to avoid interpolation. The second mode supports floating point disparities by interpolation at sub-pixel positions. For both modes, pre-processing filters for the depth data and post-processing filters for the rendered view have been implemented. The renderer supports different data formats for disparity, e.g. inverse depth data or scaled disparities.

A method for removing irrelevant information from depth maps in Video plus Depth coding is presented. The depth map is filtered in several iterations using a diffusion approach. In each iteration, smoothing is carried out in local sample neighborhoods, considering the distortion introduced into a rendered view; smoothing is only applied when the rendered view is not affected. Therefore, irrelevant edges and features in the depth map can be damped while the quality of the rendered view is retained. The processed depth maps can be coded at a reduced rate compared to unaltered data. Coding experiments show gains of up to 0.5 dB for the rendered view at the same bit rate. The new filter is adapted to the new renderer; hence it performs an integrated optimization of the depth data that leads to higher coding gains.

The second part of the deliverable describes the integration of a new video quality metric, the PSNR-HVS, into the JMVC software for multiview coding. For this, the software structure of JMVC, and in particular the distortion classes, the rate-distortion interface and the macroblock encoding class of the encoder, has been modified. Coding experiments have been carried out to evaluate the gains achieved by the modified rate-distortion process. Constant scaling factors for the Lagrange multiplier used in the rate-distortion optimization have been evaluated, and a QP-dependent correction factor for the Lagrange multiplier has been determined for the new video quality metric (NVQM). With the optimized Lagrange multiplier the rate-distortion optimization process leads to gains of up to 1.6 dB at high bit rates, measured with the new video quality metric, compared to an encoder using the SSD for optimization.

The last part of the deliverable gives a summary of the stereo video formats and coding methods evaluated in the Mobile3DTV project; results and advancements are pointed out. It can be concluded that at the current level of technology development the CSV representation format using MVC and the Video plus Depth format using MPEG-C Part 3 perform best as coding approaches for Mobile3DTV.

MOBILE3DTV D2.6 Final report on coding algorithms for mobile 3DTV

Table of Contents

1 Introduction
2 Video plus Depth Coding
  2.1 View Synthesis for Video plus Depth Coding
    2.1.1 Relationship between depth and disparity
    2.1.2 Implemented Renderer
    2.1.3 Evaluation of results
    2.1.4 Conclusion
  2.2 Reduction of irrelevant information from depth maps
    2.2.1 Proposed Method
    2.2.2 Evaluation of results
    2.2.3 Conclusion and Outlook
3 Multi View Coding
  3.1 Software Encoder using a new Video Quality Metric
    3.1.1 New Video Quality Metric (PSNR-HVS)
    3.1.2 Rate-distortion optimization
    3.1.3 Rate-distortion optimization using the new VQM
    3.1.4 Conclusion and Outlook
4 Overview of coding algorithms for mobile 3DTV
  4.1 Overview of representation formats and coding approaches
    4.1.1 Representation Formats
    4.1.2 Coding Approaches
  4.2 Evaluation and advancement of stereo video coding for Mobile3DTV
    4.2.1 Subjective Evaluation
    4.2.2 Mixed Resolution Stereo representation and coding
    4.2.3 Video plus Depth
    4.2.4 MVC
  4.3 Conclusion
5 Summary
References

1 Introduction

The deliverable is tripartite. The first part deals with the advances achieved in Video plus Depth coding; a rendering approach supporting sub-pixel accuracy and a filter for removing irrelevant signal parts from depth data are presented. The second part describes a software encoder using a new video quality metric. The third part gives a final summary on coding algorithms for mobile 3DTV.

For the Video plus Depth approach a low complexity view synthesis algorithm suitable for mobile devices has been developed and implemented. For fast processing, each row of the synthesized view is rendered sequentially from one line of the corresponding video and depth frame. This minimizes the amount of needed memory as well as the number of memory accesses. The implemented renderer provides two different modes. The first mode enables very fast processing by rounding disparities to integer values to avoid interpolation. The second mode supports floating point disparities by interpolation at sub-pixel positions. For both modes, pre-processing filters for the depth data and post-processing filters for the rendered view have been implemented. The renderer supports different data formats for disparity, e.g. inverse depth data or scaled disparities. The new rendering approach is presented in section 2.1.

A method for removing irrelevant information from depth maps in Video plus Depth coding is presented in section 2.2. The depth map is filtered in several iterations using a diffusion approach. In each iteration, smoothing is carried out in local sample neighborhoods, considering the distortion introduced to a rendered view; smoothing is only applied when the rendered view is not affected. Therefore, irrelevant edges and features in the depth map can be damped while the quality of the rendered view is retained.

The integration of a new video quality metric, the PSNR-HVS, into the JMVC software for multiview coding is described in section 3. For this, the software structure of JMVC was updated; in particular, the distortion classes, the rate-distortion interface and the macroblock encoding class of the encoder have been modified. Coding experiments have been carried out to evaluate the gains achieved by the modified rate-distortion process. Constant scaling factors for the Lagrange multiplier used in the rate-distortion optimization have been evaluated, and a QP-dependent correction factor for the Lagrange multiplier has been determined for the new video quality metric (NVQM).

Section 4 gives an overall project summary of the stereo video formats and coding methods evaluated in Mobile3DTV; results and advancements are pointed out.

Author acknowledgement: Section 2 on the advancement of the Video plus Depth approach, section 3 on the integration of the new video quality metric and the final overview of the achievements of the coding approaches in section 4.2 have been authored by Gerhard Tech. Lina Jin provided section 3.1.1 on the new video quality metric and Philipp Merkle section 4.1 on coding approaches and representation formats. Karsten Müller and Heribert Brust assisted in the overall compilation of the deliverable.

2 Video plus Depth Coding

2.1 View Synthesis for Video plus Depth Coding

For the Video plus Depth approach a low complexity view synthesis algorithm suitable for mobile devices has been developed and implemented. For fast processing, each row of the synthesized view is rendered sequentially from one line of the corresponding video and depth frame. This minimizes the amount of required memory as well as the number of memory accesses. The implemented renderer provides two different modes: the first mode enables very fast processing by rounding disparities to integer values to avoid interpolation; the second mode supports floating point disparities by interpolation at sub-pixel positions. For both modes, pre-processing filters for the depth data and post-processing filters for the rendered view have been implemented. The implemented renderer supports different data formats for disparity, e.g. inverse depth data or scaled disparities.

2.1.1 Relationship between depth and disparity

Fig. 1 Relationship between disparity and depth in a parallel pin-hole setup

The implemented renderer supports the synthesis of rectified views from an input view and its depth map. Hence the rendered view is generated as a parallel second view, i.e. as if shot with a camera whose optical axis is parallel to and whose rotation parameters are equal to those of the original camera. These constraints simplify the rendering process to a shift of the pixels of the first view by disparities retrieved from the depth map. The relationship between depth and disparity is depicted in figure 1. The point P is shot by camera 0; its image in camera 0 is the point x_0. To generate the corresponding point x_1 in virtual camera view 1, x_0 must be shifted by the disparity d. Using the pin-hole projections

    x_0 = f (u + b/2) / z,    x_1 = f (u - b/2) / z    (1)

with b denoting the virtual camera distance (baseline), f denoting the focal length of the camera, u denoting the horizontal distance of P from the center of the stereo camera pair and z denoting the depth of P, the disparity can be computed as

    d = x_0 - x_1 = f b / z    (2)

For example, f = 500 pixels, b = 0.05 m and z = 2 m yield a disparity of d = 500 · 0.05 / 2 = 12.5 pixels.

2.1.2 Implemented Renderer

Fig. 2 Basic processing steps of the renderer; dashed steps are optional

The implemented renderer supports the warping of the samples of a left view to render a right view. The distance by which the sample positions are shifted is given by a depth map for the left view. Figure 2 shows the basic processing steps of the renderer; optional steps are marked with dashed lines. The renderer supports differently scaled depth and disparity data. In step (1) these formats are converted to disparities. An optional pre-processing of the disparity data is carried out in step (2). The warping of the input samples is done in step (3); here two different approaches are possible: simple integer warping, and interpolated warping using interpolation at sub-pixel positions. In step (4) hole filling is carried out. An optional post-processing can be done in step (5).

2.1.2.1 Input data formats

The renderer supports scaled depth maps as well as scaled disparity maps. The data must be provided as 8-bit integer YUV data.

Scaled depth maps

Scaled depth maps are for example used by MPEG [1]. An 8-bit depth value v is mapped to a disparity by

    d = f b [ (v/255) (1/z_near - 1/z_far) + 1/z_far ]    (3)

with f as focal length of the camera, b as the baseline of the camera pair, and z_near and z_far as minimal and maximal depth of the depicted scene.

Scaled disparities

Disparities are reconstructed by rescaling the given 8-bit data v using

    d = s v + o    (4)

with s denoting a scaling factor and o representing a disparity offset. Assuming an approximately equally distributed disparity range for all sequences, s and o can be fixed to constants. With this approach an additional transmission of scaling data can be omitted; moreover, a fast implementation using the constant values is possible. However, with fixed values an optimal usage of the 8-bit depth range is no longer assured. A further possibility, also applicable to scaled depth maps, is to keep the virtual baseline b variable. By choosing a suitable b, the user of the mobile device could be enabled to select a subjectively optimal depth impression.

2.1.2.2 Pre-processing of depth data

Pre-processing of depth maps by applying low pass filtering can significantly reduce artifacts in the rendered view [2]. The proposed rendering algorithm supports binomial filtering of the input depth map with a selectable number of taps. The filter is separated into horizontal and vertical direction. Note that this filtering is the only operation of the implemented renderer that is not performed in row direction only; hence it causes an increased number of memory accesses.

2.1.2.3 Sample warping

The implemented renderer supports two modes to render the right (second) view: fast warping without interpolation, and interpolated warping using sub-pixel accuracy.

Fast warping

The fast warping method is e.g. presented in [2]. The disparity is rounded to the nearest integer,

    d̂(x) = round( d(x) )    (5)

and used to identify the sample position in the right view that is set to the left view sample value:

    I_R( x - d̂(x) ) = I_L( x )    (6)

At left foreground object edges (disparity increasing from left to right) occlusion will occur, since some samples are shifted backward onto positions already filled from the background. Hence the mapping from the left view to the right view is not unique: a background and a foreground value are mapped to the same position. However, by processing the input samples from left to right it is assured that the foreground sample value is assigned last. At right foreground object edges (disparity decreasing from left to right) values from the left view are not assigned to all positions of the right view; hence disocclusions occur. To track such holes a binary map is generated while warping, indicating the positions already filled by left view samples.

Interpolated warping

An advantage of the fast warping approach is that the samples of one line of the input view can be processed sequentially; thus the number of memory accesses is minimized. The idea of the interpolated warping mode is to keep this sequential processing and to incorporate an interpolation at sub-pixel positions into the warping process instead of rounding.
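As an illustration, the fast warping step can be sketched as follows. This is a minimal Python sketch, not the project's implementation: it operates on a single 1-D row and assumes the disparity map has already been converted to pixel units.

```python
import numpy as np

def fast_warp_row(tex_left, disp):
    """Fast warping of one image row: each left-view sample is shifted by its
    rounded disparity (integer warping, no interpolation)."""
    width = tex_left.size
    tex_right = np.zeros_like(tex_left)
    filled = np.zeros(width, dtype=bool)   # binary hole map built while warping
    for x in range(width):                 # left to right: foreground assigned last
        x_r = x - int(round(disp[x]))      # target position in the right view
        if 0 <= x_r < width:
            tex_right[x_r] = tex_left[x]   # occluded background is overwritten
            filled[x_r] = True
    return tex_right, filled

# Toy row: a foreground object (disparity 2) on the left of a background
# (disparity 0); the foreground shifts out of the view, so positions 0 and 1
# of the right view remain holes, marked False in the binary map.
tex = np.array([10, 20, 30, 40])
disp = np.array([2, 2, 0, 0])
warped, filled = fast_warp_row(tex, disp)
```

The binary `filled` map is exactly what the subsequent hole filling step consumes; positions left at `False` are disocclusions to be filled from the background.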

In the first step, the positions x'_{i-1} and x'_i are calculated with sub-pixel accuracy using

    x'_i = x_i - d(x_i)    (7)

i.e. x'_{i-1} and x'_i are the exact, unrounded positions of the shifted samples at x_{i-1} and x_i. After that, the difference x'_i - x'_{i-1} is evaluated. If x'_i - x'_{i-1} ≤ 0, a left foreground edge starts between x_{i-1} and x_i that occludes the background, and it is tested whether the edge value at x'_i is extrapolated to the left. If x'_i - x'_{i-1} > 1, a right foreground edge ends between x_{i-1} and x_i; here a disocclusion occurs, and it is tested whether the edge value at x'_{i-1} is extrapolated to the right. In all other cases the sample values at the integer positions of the target grid between x'_{i-1} and x'_i are interpolated.

Left foreground edge

In case of a left foreground edge it is evaluated whether the distance between x'_i and the previous left integer sample position is smaller than half a sample:

    x'_i - ⌊x'_i⌋ < 0.5    (8)

In this case the edge value is extrapolated to the left:

    I_R( ⌊x'_i⌋ ) = I_L( x_i )    (9)

Right foreground edge

In case of a right foreground edge it is evaluated whether the distance between x'_{i-1} and the next right integer sample position is smaller than half a sample:

    ⌈x'_{i-1}⌉ - x'_{i-1} < 0.5    (10)

In this case the edge value is extrapolated to the right:

    I_R( ⌈x'_{i-1}⌉ ) = I_L( x_{i-1} )    (11)

Interpolation

All integer sample positions x with x'_{i-1} ≤ x ≤ x'_i are linearly interpolated using

    I_R(x) = I_L(x_{i-1}) + (x - x'_{i-1}) / (x'_i - x'_{i-1}) · ( I_L(x_i) - I_L(x_{i-1}) )    (12)

Similar to the fast warping approach, holes in the rendered view are tracked by a binary map indicating the positions already filled by left view samples; this map is generated during warping.

2.1.2.4 Hole filling

Holes in the warped view emerge from disocclusions or from rounding disparity values. For hole filling a simple, straightforward background extension process is used: sample positions marked as unfilled in the binary map are filled by extrapolating the value of the background object. When rendering from left to right the background object is always located to the right of a disocclusion; therefore the value of the next warped sample can be used for filling.

2.1.2.5 Post-processing

Errors in the depth map can lead to single-pixel errors in the rendered view, known as boundary noise. To fill such missing pixels a three-tap median filter in row direction can be applied.
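The hole filling and post-processing steps can be sketched as follows; again an illustrative Python version operating on single 1-D rows under the binary hole map convention, not the project's implementation.

```python
import numpy as np

def fill_holes_row(tex, filled):
    """Hole filling by background extension: an unfilled position takes the
    value of the next warped sample to its right (the background side of a
    disocclusion when rendering from left to right)."""
    out = tex.copy()
    next_val = 0                           # fallback at the right image border
    for x in range(out.size - 1, -1, -1):  # scan right to left
        if filled[x]:
            next_val = out[x]
        else:
            out[x] = next_val
    return out

def median3_row(tex):
    """Three-tap median filter in row direction against boundary noise."""
    out = tex.copy()
    for x in range(1, tex.size - 1):
        out[x] = sorted((tex[x - 1], tex[x], tex[x + 1]))[1]
    return out

row    = np.array([50, 0, 50, 50, 200, 50])
filled = np.array([True, False, True, True, True, True])
healed = fill_holes_row(row, filled)   # the hole at x=1 takes the value 50
smooth = median3_row(healed)           # the single-pixel outlier 200 is removed
```

The example shows both artifact classes named in the text: a disocclusion hole (position 1) repaired by background extension, and a single-pixel boundary-noise error (position 4) removed by the median filter.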

2.1.2.6 Color planes

The implemented renderer supports direct YUV 4:2:0 processing as described in [2] as well as RGB 4:4:4 processing. However, processing in RGB color space is more complex, since it requires up-sampling of the U and V channels as well as the color space conversion.

2.1.3 Evaluation of results

2.1.3.1 Effects of the pre-filter

Figure 3 shows a comparison between a view synthesized from unprocessed depth data (a,c) and from depth data smoothed with the binomial filter (b,d). It can be seen that the pre-processing significantly reduces the artifacts on the right side of the sunshade. However, the 3D impression of the stereo view is affected as well, since the low pass filtering extends the edges of foreground objects.

Fig. 3 Synthesized views; (a) from unfiltered depth data, (c) detail view; (b) from pre-filtered depth data, (d) detail view

Fig. 4 Comparison of fast warping (a,c,e,g) and interpolated warping (b,d,f,h); (a,b): synthesized views of the sequence horse; (c,d): detailed cutouts; (e,f): histograms of effectively used disparities; (g,h): effectively used disparity maps

2.1.3.2 Comparison of fast and interpolated warping

Figure 4 depicts a comparison of the fast and the interpolated warping mode. Figures 4 (a,c,e,g) relate to the fast mode and figures 4 (b,d,f,h) to the interpolated mode. The views synthesized with the different warping modes show two major differences: one is related to the texture data and depicted in figure 4 (a-d); the other concerns the depth impression and is indicated in figure 4 (e-h).

Figure 4 (c) shows artifacts originating at the left side of a foreground object's edge. Due to rounding, some samples in the foreground (the horse's leg) have not been filled with foreground object sample values but rather with values from background samples. Note that such artifacts are not holes and cannot be repaired by the hole filling process. In the interpolated mode, as depicted in figure 4 (d), these artifacts do not emerge; the reason is the continuous interpolation between two warped samples (equation 12).

The second difference between the fast and the interpolated warping mode is the depth impression in the stereo view. In the fast warping mode disparity values are rounded to the next integer position (equation 5). The rounding enlarges the quantization step size of the disparity values. The histogram of effectively used disparities and the corresponding disparity map are shown in figure 4 (e,g). It can be seen that the depth data is quantized to several distinct layers, which are also visible in the rendered view. Although the layers have minor influence on the depth quality if the image content consists of a stack of objects at particular depths, they can be annoying if, e.g., a plane reaching from the foreground to the background is depicted. With the interpolated warping mode using sub-pixel accuracy the layering is reduced to the quantization step size given by the input depth data, as shown in figure 4 (f,h). The advantage of reduced layering is of course only given if the input depth data provides sub-pixel disparities; in these cases PSNR gains of up to 1 dB have been found for some sequences. Sub-pixel disparities are given if the depth estimation for a sequence has been carried out on the full-scale sequence before down-sampling to mobile display size.

2.1.3.3 Effects of the post-filter

Figure 5 shows a comparison between a rendered view without post-processing (a,c) and a rendered view post-processed with the 3-tap median filter (b,d). The renderer was set to the fast warping mode. As explained in section 2.1.3.2, some samples have erroneously been filled with values from background samples. It can be seen that the post-processing significantly reduces these artifacts. A disadvantage of this approach is the decreased sharpness introduced by the median filtering. However, due to binocular suppression effects this loss of sharpness is subjectively reduced when watching the stereo sequence.

Fig. 5 Synthesized views; (a) without post-processing, (c) detail view; (b) with 3-tap median post-filtering, (d) detail view

2.1.4 Conclusion

A renderer suitable for low-complexity rendering on mobile devices has been presented. The renderer supports different input data formats and incorporates two modes for sample warping. The fast warping mode offers minimal computational complexity and is suitable for full-pixel accurate disparities. The more complex interpolated warping mode allows rendering with sub-pixel accurate disparities and provides an improved depth impression. Hole filling is carried out using line-wise background pixel filling. Pre- and post-processing filters for the input depth and the synthesized view have been implemented to reduce artifacts.

2.2 Reduction of irrelevant information from depth maps

This section presents a method for further improvement of the Video plus Depth approach for stereo video. The depth map is optimized with regard to the synthesis of a second view at stereo distance. The basic idea of the proposed method is that some signal parts of a depth map created by a depth estimation algorithm are irrelevant for rendering; removing or damping these high frequency parts increases coding efficiency and leads to an improved overall quality. Therefore the proposed algorithm applies a diffusion process to the depth data, considering the distortion introduced to a rendered view. Diffusion filtering has been proposed by Perona and Malik [3] and has already been applied to depth maps ([4],[5]). In contrast to the proposed method, the approaches presented in [4] and [5] use edge information from the video for depth map enhancement. The proposed approach and its single steps are presented in section 2.2.1. An evaluation of the approach is given in section 2.2.2. Finally, section 2.2.3 provides the conclusion and an outlook.

2.2.1 Proposed Method

The main concept of the proposed approach is the smoothing of the depth map in small steps and multiple iterations. In each iteration, all samples of a frame are processed consecutively. The smoothing applied to a sample at position (x,y) in iteration i is controlled by the error introduced to the rendered view, as depicted in figure 6.

Fig. 6 Iteration steps of the proposed method; subsequent to the calculation of smoothed candidate values, each sample is evaluated to determine whether its candidate value is used in the processed depth map

An iteration of the proposed method starts with the calculation of a depth map with smoothed depth value candidates from the input depth map using a diffusion approach. Subsequently, all samples are processed successively to evaluate the obtained depth value candidates. The order in which the samples are processed has an influence on the filtering result; to minimize this influence the order is permuted for each iteration.
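The overall iteration described above can be sketched as follows. This is a simplified Python sketch under stated assumptions: the helper names `smooth_candidates` and `render_error` are hypothetical, standing in for the diffusion step and the rendered-view error measure detailed in the following subsections, and the permutation is a plain random shuffle rather than the project's distance-maximizing scheme.

```python
import random
import numpy as np

def filter_depth(depth, smooth_candidates, render_error, t_max=0.0,
                 iterations=100, seed=0):
    """Iterative, relevance-constrained depth smoothing (sketch).
    smooth_candidates(s) returns a candidate value per sample;
    render_error(s, x, y, v) returns the error introduced to the rendered
    view when the sample at (x, y) is set to v."""
    rng = random.Random(seed)
    s = depth.astype(float).copy()
    coords = [(y, x) for y in range(s.shape[0]) for x in range(s.shape[1])]
    for _ in range(iterations):
        cand = smooth_candidates(s)      # diffusion step for all samples
        rng.shuffle(coords)              # permute the processing order
        for y, x in coords:
            # accept a candidate only if the rendered view is not impaired
            if render_error(s, x, y, cand[y, x]) <= t_max:
                s[y, x] = cand[y, x]
    return s

# Toy run: every candidate is 5.0; the sample at (0, 0) is "relevant" for
# rendering (non-zero error), so it must keep its original value.
depth = np.array([[1.0, 9.0], [9.0, 1.0]])
flat = filter_depth(depth,
                    smooth_candidates=lambda s: np.full_like(s, 5.0),
                    render_error=lambda s, x, y, v: 1.0 if (x, y) == (0, 0) else 0.0,
                    iterations=1)
```

The toy run illustrates the key property of the method: smoothing is applied everywhere except where it would change the rendered view.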

The depth map representing the current state of the processing is denoted. First is initialized with. While processing, is true for samples at positions that already have been processed in iteration and is true for samples that have not been processed. At the end of iteration, is equal to for all. The decision if a depth candidate at position is accepted is based on the error introduced to the view rendered from when changing from to. If the introduced error is below a threshold, the candidate value is accepted. Otherwise, the sample remains unchanged. The iterative filtering process can be terminated when the filter output converges. In the following sections the single aspects of the proposed method are discussed in detail. 2.2.1.1 Diffusion Filtering Smoothing is carried out using an approach similar to the diffusion process proposed by Perona and Malik in [3]. In [3] an image is smoothed by addition of its locally weighted Laplacian. A weighting (diffusion) coefficient is determined from the images gradient. It can be shown that this approach is similar to Gaussian smoothing for constant diffusion coefficient and multiple iterations. For the proposed method the diffusion process is modified to (13) with denoting the 4-nearest-neighbors discrete 2D-laplacian operator and denoting the quantization step size of the depth data. Thus the depth value of a sample converges to the mean of its horizontal and vertical neighbor samples with step size. Reason for the modification is the decision step. In this step a large change of depth might be rejected due to a large introduced error in the rendered view, whereas multiple small changes attained by equation (13) might be allowed. Nevertheless, a smaller change per iteration increases the total number of required iterations. 2.2.1.2 Error Calculation Figure 7 depicts the error calculation process. 
The error introduced by changing the depth value at a sample position to a candidate value is estimated as follows: Create a depth map according to equation (14), in which only the value of the sample under evaluation is changed while all other depth samples retain their current values.

MOBILE3DTV D2.6 Final report on coding algorithms for mobile 3DTV

Render the reference view from the input video data and the current depth map. Rendering of this view must only be carried out once. Then render the output view using the altered depth map. Rendering using the altered depth map must be carried out for each sample and iteration. Nevertheless, computational complexity is low since only image parts influenced by the sample under evaluation must be re-rendered. The introduced error (15) is the maximum squared error between the image rendered from the processed depth data and the reference image. The proposed approach uses the subpel-accurate rendering method presented in section 0. This method shifts the samples of the view using the disparity calculated from the depth values and interpolates the sample values at positions of the target grid. Disocclusions are filled using a straightforward line-wise extrapolation of the boundary background sample value. Both views are rendered using the coded video. This approach enables a stronger smoothing of the depth data, since details and noise removed from the video data by coding are neglected when calculating the error introduced by the modified depth map.
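To make the procedure concrete, the single-sample error check can be sketched as follows. This is a minimal illustration, not the project implementation: the `render` callable stands in for the sub-sample-accurate renderer, and a real implementation would re-render only the image region influenced by the altered sample.

```python
def introduced_error(render, video, depth, x, y, new_value):
    """Sketch of the error calculation: render the view from the current
    depth map and from a map in which only the sample at (x, y) is
    altered (cf. equation 14), then return the maximum squared sample
    difference between the two rendered views (cf. equation 15)."""
    reference = render(video, depth)        # rendered once per iteration
    altered = [row[:] for row in depth]     # copy the current depth map
    altered[y][x] = new_value               # change one sample only
    candidate = render(video, altered)
    return max((a - b) ** 2
               for row_c, row_r in zip(candidate, reference)
               for a, b in zip(row_c, row_r))
```

With a toy renderer such as `lambda v, d: [[vv + dd for vv, dd in zip(rv, rd)] for rv, rd in zip(v, d)]`, changing a depth sample from 1 to 3 yields an introduced error of 4.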

Fig. 7 Error calculation step of the proposed method; an intermediate depth map is created from the current map by changing one depth value; the view rendered from this depth map is compared to the reference view.

2.2.1.3 Decision Step

In the decision step the introduced error is compared to a given threshold that determines the maximum allowed error for a sample in the rendered view. If the introduced error exceeds this threshold, the diffusion step is rejected. This is summarized in equation (16). In the scope of these experiments only the removal of irrelevant information from the depth data is targeted; hence the threshold is set to zero. A higher threshold enables stronger smoothing but also leads to an impaired rendered view.

2.2.1.4 Iterative Processing

As stated before, the diffusion of the depth map is performed by processing the samples in succession. The order in which the samples are processed has an influence on the filtering result. To minimize this influence the order is permuted for each iteration in a way that the distance between two consecutively processed samples is maximized. The iterative filtering process can be terminated when the filter output converges, e.g. when the difference between consecutive iterations is below a threshold. Experiments show that approx. 100 iterations are enough to obtain a good smoothing result.
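The interplay of the diffusion step (13), the error calculation (15) and the decision step (16) can be sketched as follows. This is a hedged sketch, not the project code: the `error` callable is assumed to implement equation (15), and a random permutation stands in for the maximum-distance processing order.

```python
import random

def diffusion_candidate(depth, x, y, step=1):
    """One diffusion step toward the 4-neighbour mean, cf. equation (13).
    `step` stands in for the quantization step size of the depth data."""
    h, w = len(depth), len(depth[0])
    neighbours = [depth[y + dy][x + dx]
                  for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                  if 0 <= x + dx < w and 0 <= y + dy < h]
    diff = sum(neighbours) / len(neighbours) - depth[y][x]
    return depth[y][x] + max(-step, min(step, diff))

def diffuse_depth(depth, error, t_max=0.0, iterations=100):
    """Iterative filtering with the decision step of section 2.2.1.3.
    `error(depth, x, y, value)` is assumed to implement equation (15)."""
    h, w = len(depth), len(depth[0])
    positions = [(x, y) for y in range(h) for x in range(w)]
    for _ in range(iterations):
        random.shuffle(positions)              # permute processing order
        changed = False
        for x, y in positions:
            value = diffusion_candidate(depth, x, y)
            if value != depth[y][x] and error(depth, x, y, value) <= t_max:
                depth[y][x] = value            # accept the diffusion step
                changed = True
        if not changed:                        # filter output has converged
            break
    return depth
```

With a threshold of zero, only changes that leave the rendered view untouched survive the decision step.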

Fig. 8 Sequences Champagne Tower (left) and Book Arrival (right); (a), (d): video data; (b), (e): unprocessed depth; (c), (f): processed depth; Note that rendering using the unprocessed and processed depth provides an identical result, since only irrelevant signal parts have been reduced.

2.2.2 Evaluation of results

2.2.2.1 Diffusion process

Figure 8 shows results of the proposed diffusion filter. A frame from the sequence Champagne Tower is depicted in figure 8 (a). The sequence is downscaled to a size of 320x240 samples, which is typical for mobile 3DTV displays. The corresponding unprocessed depth map is shown in figure 8 (b). It can be seen that the depth in the background is very noisy. Although this noise is irrelevant for rendering and does not affect the rendering process, it leads to higher data rates when compressed with a conventional encoder. The depth map processed with the proposed algorithm is presented in figure 8 (c). Here an error threshold of zero has been used, so rendering with the processed and unprocessed depth data results in the same synthesized view. Nevertheless the noise in the background, and also on the table in the foreground, is removed, while the edges in the depth map that are important for correct rendering are retained. A region clipped from the sequence Book Arrival can be seen in figure 8 (d). The full-sized sequence has a resolution of 1024x768 pixels. The unprocessed depth data shown in figure 8 (e) is currently used in MPEG exploration experiments [6]. For processing, the threshold has again been chosen such that the rendered view remains unchanged. It can be seen that irrelevant edges are removed by the proposed method. Figure 8 (f) shows that diffusion has been carried out to the left of foreground edges (marked green) and in regions with homogeneous video texture (marked blue). The reason for this filtering behavior is depicted in figures 9 and 10. Figure 9 schematically depicts the reason for diffusion in homogeneous texture regions for one row of the input data. Figure 9 (a) shows the video samples and figure 9 (b) their disparity values. For the unprocessed case a depth peak is shown that can be regarded as noise. The shift conducted in the warping process is depicted in a disparity-over-position space in figure 9 (c), where the vertical axis denotes the disparity and the horizontal axis the position of a sample.
In the warping process samples move horizontally by their disparity from their original sample position. Note that samples with a positive disparity are in the foreground here. The final rendering result can be seen in figure 9 (d). It consists of the values of the foreground samples. The resulting gaps have been interpolated by hole filling from neighboring sample values. On the right side of figure 9 the rendering of processed depth data is presented. The peak of the depth data has been smoothed out in figure 9 (f). However, the rendering result shown in figure 9 (h) is the same as for the unprocessed data, and the error determined by equation (15) is zero. The reason for diffusion to the left of foreground objects is depicted in figure 10. The input data contains an edge in the video data (figure 10 (a)) as well as in the unprocessed depth data (figure 10 (b)). The rendering result for the unprocessed data is depicted in figure 10 (d). After applying the diffusion filter, the disparity data has been smoothed next to the left side of the foreground object, as shown in figure 10 (f). However, the sample belonging to the changed disparity value belongs to the background and is occluded. Hence the rendering result shown in figure 10 (h) is the same as for the unprocessed case. Please note that for the Champagne Tower sequence the diffusion in occluded regions has been disabled. The reason for this is described in the next section.
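A one-dimensional sketch of this warping behavior, under simplifying assumptions (integer disparities, hole filling from the left neighbor rather than full boundary-background extrapolation), shows why a smoothed background sample cannot change the output:

```python
def warp_row(video_row, disparity_row):
    """Warp one image row: each sample moves by its disparity, samples
    with larger disparity (foreground) win conflicts, and remaining gaps
    are filled line-wise from the neighboring sample."""
    w = len(video_row)
    out = [None] * w
    best = [None] * w                  # disparity of the winning sample
    for x, (v, d) in enumerate(zip(video_row, disparity_row)):
        t = x + d                      # target position after the shift
        if 0 <= t < w and (best[t] is None or d > best[t]):
            out[t], best[t] = v, d
    for x in range(w):                 # simple hole filling
        if out[x] is None:
            out[x] = out[x - 1] if x > 0 else 0
    return out
```

For example, `warp_row([1, 2, 3, 4], [0, 0, 1, 0])` warps sample 3 over sample 4 and fills the resulting gap, giving `[1, 2, 2, 3]`; the occluded background sample could be given any smaller disparity without changing this output.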

Fig. 9 Diffusion in homogeneous regions; rendering the video samples (a), (e) using the unprocessed (b) and processed depth data (f) leads to the same results (d) and (h)

Fig. 10 Diffusion in occluded regions; rendering the video samples (a), (e) using the unprocessed (b) and processed depth data (f) leads to the same results (d) and (h)

2.2.2.2 Diffusion in occluded regions

As shown before, the values of depth samples belonging to occluded regions are not important as long as the samples stay in the background. Hence a change of these samples does not introduce an error in the rendered view, and a strong smoothing of the depth data next to a foreground object's edge occurs. Although this smoothed area does not impair edges in the rendered view for the uncoded case, it was found that impairments can occur after coding. This is caused by the block partitioning applied in the rate-distortion optimization process carried out by the encoder. For an edge smoothed to one side usually a large block size is chosen, while for sharp edges a large block is further subdivided. This effect is depicted in figure 11. In some cases the depth value of important foreground samples is better preserved in a small block in the subsequent transform and quantization steps. To avoid impairments caused by the changed block partitioning, smoothed sample values can be rejected for all occluded samples in the decision step.

Fig. 11 Unprocessed (a) and processed (b) depth data; samples important for rendering are marked red; for the unprocessed data a smaller block size is chosen

2.2.2.3 Coding Results

To evaluate the impact of the proposed method on compression efficiency, coding experiments have been carried out. The video and depth data of the sequences Champagne Tower and Book Arrival have been coded using the H.264/AVC reference software JM. The encoder has been configured to use the Main profile with hierarchical B-pictures, a GOP size of 8 and an intra period of 16. The depth data has been filtered with the proposed approach, using the video data coded with a QP of 30 for generation of the rendered reference view. For Champagne Tower diffusion in occluded regions was disabled; for Book Arrival it was not. Then the processed and unprocessed depth maps have been coded.
The views rendered from the coded video and the coded unprocessed and processed depth have been compared to the view rendered from uncoded and unprocessed video and depth data. The results are depicted in figure 12. Here the PSNR obtained by this comparison is plotted versus the bit rate used for the depth data. Note that the maximum achievable PSNR is also limited by the impairment caused by the coded texture. It can be seen that gains of up to 0.5 dB can be achieved with the proposed method.

Fig. 12 Coding results for sequences Champagne Tower (a) and Book Arrival (b); PSNR-Y of the rendered view vs. bit rate of the depth map. The view rendered from uncoded and unprocessed data is used as reference.

2.2.3 Conclusion and Outlook

A diffusion algorithm for the enhancement of depth maps in video-plus-depth coding has been presented. The diffusion process is controlled by the distortion introduced in the rendered view, taking the rendering algorithm and the coded video data into account. Hence only irrelevant high-frequency parts are damped. The resulting depth maps can be coded at lower bit rates while providing the same quality in the rendering process. The applicability of the approach has been demonstrated for two sequences. PSNR gains of up to 0.5 dB at the same bit rate have been shown, using a view rendered from uncoded and unprocessed data as reference. The proposed approach can be advanced in several ways: an optimization and evaluation using original views instead of rendered views as reference promises higher coding gains than presented here, since not only signal parts irrelevant for rendering but also signal parts introducing noise to the rendered view can be reduced. Possible extensions regarding the diffusion process are anisotropic diffusion filtering and diffusion in the temporal direction. Moreover, an adaptation to Multiview Video plus Depth (MVD) data is conceivable.

3 Multi View Coding

3.1 Software Encoder using a new Video Quality Metric

This section presents the integration of a new video quality metric (NVQM) into the JMVC Software for Multi View Coding [7]. Section 3.1.1 introduces this metric. In section 3.1.2 the basics of the rate-distortion optimization are discussed as implemented in the JMVC Software. The integration of the NVQM into the JMVC Software is described and evaluated in section 3.1.3. Finally, section 3.1.4 provides the conclusion and gives a suggestion for the integration of a new stereo video quality metric (NSVQM).

3.1.1 New Video Quality Metric (PSNR-HVS)

PSNR-HVS, proposed in [8], is a full-reference image quality metric. It supplies an algorithm for computing the PSNR while taking into account the peculiarities of the human visual system (HVS), hence the abbreviation PSNR-HVS. Many studies have confirmed that the HVS is more sensitive to low-frequency distortions than to high-frequency ones. It is also very sensitive to contrast changes and noise. PSNR-HVS removes the mean shifting and the contrast stretching. The modified version of the PSNR utilizes the decorrelation properties of the block DCT and the effect of individual DCT coefficients on the overall perception. More specifically, the modified PSNR is calculated from a modified MSE as given in equation (17). The modified MSE takes the HVS features into account as given in equation (18): it sums, over all DCT blocks of the image, the squared differences between the DCT coefficients of each block of the distorted image and the DCT coefficients of the corresponding block of the original image, each difference weighted by a matrix of correcting factors [8] and the sum normalized by the image size. PSNR-HVS-M is designed based on PSNR-HVS, additionally taking into account the Contrast Sensitivity Function (CSF) and between-coefficient contrast masking of the DCT basis functions [9]. The model operates on the DCT coefficients of 8x8 pixel blocks of an image.
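The per-block computation behind equations (17) and (18) can be sketched as follows. The naive DCT and the uniform `weights` matrix are assumptions, standing in for a fast transform and the actual matrix of correcting factors from [8]:

```python
import math

def dct2(block):
    """Naive 2-D DCT-II of an N x N block (reference implementation)."""
    n = len(block)
    def c(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = c(u) * c(v) * s
    return out

def hvs_mse(block, ref, weights):
    """Weighted MSE of one block in the DCT domain, cf. equation (18);
    `weights` stands in for the matrix of correcting factors of [8]."""
    a, b = dct2(block), dct2(ref)
    n = len(block)
    return sum(((a[u][v] - b[u][v]) * weights[u][v]) ** 2
               for u in range(n) for v in range(n)) / (n * n)

def psnr_hvs(mse, peak=255.0):
    """PSNR from the HVS-weighted MSE, cf. equation (17)."""
    return float('inf') if mse == 0 else 10.0 * math.log10(peak * peak / mse)
```

For identical blocks the weighted MSE is zero and the PSNR is infinite; any CSF-style weighting can be plugged in through `weights`.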
For each DCT coefficient of the block the model allows calculating the maximum distortion that is not visible due to the between-coefficient masking. PSNR-HVS-M assumes that the masking degree of each coefficient depends upon its squared value (power) and the human eye's sensitivity to this DCT basis function as determined by the Contrast Sensitivity Function (CSF). Several basis functions can jointly mask one or a few other basis functions. Their masking effect then depends on the sum of their weighted powers. PSNR-HVS-M reduces the value of contrast masking in accordance with the proposed model [9]. The two metrics have been modified to work for both 8x8 and 4x4 block sizes by adjusting the masking coefficients in the calculation of the MSE. The availability of 4x4 blocks allows using the metric for macroblocks of H.264/MPEG-4 AVC encoders.

3.1.2 Rate-distortion optimization

In this section the rate-distortion optimization carried out by the JMVC Software is introduced. Section 3.1.2.1 gives an overview of the encoding modes available in the H.264/MPEG-4 AVC standard. The rate-distortion optimized selection of one of these modes is discussed in sections 3.1.2.2 and 3.1.2.3.

3.1.2.1 Macroblock encoding modes and partitions

The H.264/MPEG-4 AVC standard supports different modes to encode a macroblock [10], [11]. These modes provide different options to split the macroblock into partitions and to predict these partitions. Whether a mode is available depends on the slice type, the position of the macroblock in the slice and the selected codec profile. The partitionings possible for encoding a macroblock in an I-slice are shown in figure 13. Note that the 8x8 intra modes are only available in the High profile. The prediction is carried out using sample values from the borders of already coded partitions located to the left, top and top right of the current partition. The four prediction modes supported for the 16x16 partitioning are depicted in figure 14. The direction of the prediction is indicated by the red arrows.

Fig. 13 Macroblock partitioning for intra coded macroblocks; note that the 8x8 partitioning is only possible in the High profile.

Fig. 14 The four prediction modes for 16x16 partitions.

Fig. 15 The nine prediction modes for 4x4 partitions.
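For illustration, two of the intra prediction modes can be sketched from the border samples. The DC rounding matches the standard's `(sum + 4) >> 3` form, but the simplified signatures (both borders always available) are an assumption:

```python
def intra_4x4_dc(top, left):
    """DC intra prediction for a 4x4 partition (one of the nine modes):
    every sample is predicted as the rounded mean of the four
    reconstructed border samples above and the four to the left."""
    pred = (sum(top) + sum(left) + 4) // 8
    return [[pred] * 4 for _ in range(4)]

def intra_4x4_vertical(top, left):
    """Vertical prediction: each column repeats the border sample above."""
    return [list(top) for _ in range(4)]
```

The remaining directional modes extrapolate the same border samples along the angles indicated by the red arrows in figure 15.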

Inter predicted macroblocks are encoded using a motion-compensated prediction from one or more reference frames that have been encoded prior to the current frame. To increase the prediction quality the macroblock can be split into multiple partitions. A motion-compensated predictor is estimated for each of the partitions. Possible partitions are depicted in figure 16. The 8x8 partition can be further split into one, two or four sub-partitions. Moreover, the H.264/MPEG-4 AVC standard provides a skip mode for macroblocks in P- and B-slices and a direct mode for macroblocks in B-slices. For both modes the motion vectors are inferred from adjacent blocks. In the skip mode no residual data is transmitted.

Fig. 16 Macroblock partitioning for inter predicted macroblocks

3.1.2.2 Rate-distortion optimized mode selection

The target of the rate-distortion optimized mode selection is the choice of a macroblock partitioning and prediction mode that globally minimizes the rate for a given distortion, or the distortion for a given rate. For this purpose a Lagrangian optimization is commonly used, as described for example in [12]. The Lagrangian optimization minimizes the rate-distortion functional

J(m) = D(m) + lambda_mode * R(m), (19)

with m denoting a mode under test. R(m) and D(m) represent the rate and the distortion obtained when encoding a macroblock using mode m. The rate-distortion optimized selection is carried out by coding a macroblock in all possible modes and finally using the mode providing the minimal cost. The Lagrange multiplier lambda_mode controls the trade-off between the rate and the distortion. An optimization of lambda_mode leads to a globally optimized rate-distortion characteristic. The Lagrange multiplier found to be optimal when using the sum of squared differences (SSD) between the original data and the encoded macroblock was determined in [12] as

lambda_mode = 0.85 * Q^2, (20)

with Q as quantization step size.
Thus the optimal Lagrange multiplier depends on the quantization step size. In the JMVC Software for Multiview Coding [7] it is determined as

lambda_mode = 0.85 * 2^((QP - 12) / 3), (21)

with QP denoting the quantization parameter of the JMVC Software. With the approximation Q ~ 2^(QP/6) for the quantization step size it can be seen that lambda_mode grows proportionally to Q^2, as presented in equation (20).
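The quantities above can be sketched directly. The constant 0.85 and the exact exponent follow the commonly cited JM reference-software formulas and are assumptions about the particular software build:

```python
import math

def lambda_mode(qp):
    """Mode-decision Lagrange multiplier, lambda = 0.85 * 2^((QP-12)/3),
    cf. equation (21); 0.85 is the commonly cited JM constant."""
    return 0.85 * 2.0 ** ((qp - 12) / 3.0)

def lambda_motion(qp):
    """For SAD-based motion estimation the square root of the
    mode-decision multiplier is commonly used."""
    return math.sqrt(lambda_mode(qp))

def rd_cost(distortion, rate, lam):
    """Rate-distortion functional J = D + lambda * R, cf. equation (19)."""
    return distortion + lam * rate
```

The mode search then amounts to evaluating `rd_cost` for every candidate mode and keeping the minimum.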

3.1.2.3 Rate-distortion optimized motion estimation

When coding a macroblock using inter prediction, motion compensation is carried out to generate an optimal predictor. In the motion compensation process, motion vectors are estimated using a rate-distortion optimized approach by minimizing the functional

J = D_SAD + lambda_motion * R_motion, (22)

with R_motion denoting the rate needed to code the motion vector and D_SAD the sum of absolute differences (SAD) between the predictor and the partition to encode. When using the SAD, the Lagrange multiplier must be set to [12]

lambda_motion = sqrt(lambda_mode). (23)

3.1.3 Rate-distortion optimization using the new VQM

This section gives a brief overview of how the rate-distortion optimization is implemented in the JMVC Software and what changes had to be carried out to integrate the new video quality metric. The performance of the new video quality metric for intra coding as well as for inter and intra coding is evaluated by coding experiments. Results from these experiments are used to optimize the Lagrange multiplier. Finally, a QP-dependent correction factor for the Lagrange multiplier is investigated.

3.1.3.1 Integration into the JMVC Software for Multiview Coding

Overview of the encoding process of the JMVC Software

The hierarchy of the MVC encoding modules is depicted in figure 17. In the H264AVCEncoderTest class the encoder is initialized and frame buffers are set up. Subsequently the frames of the sequence are processed. In the CreaterH264AVCEncoder class further objects used in the encoding process are initialized. The PicEncoder class initializes the slice header, sets up the Lagrange multiplier depending on the used QP and finally starts the encoding of a frame or a field. The reference frames used by the current frame or field are set in the SliceEncoder class. Moreover, the processing of slice groups and macroblocks is started there. Single macroblocks are finally encoded using the MbEncoder class.

Fig. 17 Structure of the encoding process in the JMVC Software.

Structure of the rate-distortion optimization in the JMVC encoder

Fig. 18 Hierarchy of the rate-distortion optimization search

The structure of the rate-distortion optimization carried out in the macroblock encoder module is shown in figure 18. The search for the mode that provides the minimum rate-distortion cost is carried out hierarchically. At the highest level the possible partitioning sizes as given in section 3.1.2.1 are tested. The skip mode is tested for P-slices only. The direct mode is only tested for B-slices. Inter prediction is carried out for P- and B-slices and intra prediction for I-, P- and B-slices. Four or nine prediction modes are tested for intra coded macroblocks. For each inter coded macroblock partition a search is carried out to find an optimal reference frame and motion vector. For inter coded macroblocks further subdivisions are tested. Moreover, an 8x8 transform size is evaluated for the inter coded macroblocks when using the High profile (EstimateMb8x8Frext). The search process performed in inter coding is depicted in figure 19. The JMVC Software supports different video quality metrics at different levels of the rate-distortion optimization process. In the blocks marked orange or red in figure 18 the sum of squared errors (SSE) is used to determine the distortion, and the mode-decision Lagrange multiplier of equation (21) is used in the calculation of the rate-distortion cost. To determine the video quality in the motion estimation process, JMVC provides a choice of block difference measures, i.e. SAD, SSE, HADAMARD and SAD-YUV for full-pixel accurate motion estimation (marked green) and SAD, SSE and HADAMARD for sub-pixel accurate motion estimation (marked blue). If the SSE is used in motion estimation, the mode-decision Lagrange multiplier is used for the computation of the rate-distortion costs; otherwise the motion-estimation multiplier of equation (23) is used. The computation of the distortion is realized in the XDistortion class of the JMVC Software. This class provides member functions for distortion computation for different block sizes. To choose a VQM, a parameter can be passed to these functions. A pointer to a function implementing the selected VQM for the particular block size is then selected and called. The XRateDistortion class of the JMVC Software provides functions to calculate the rate-distortion cost given a particular distortion and rate. The Lagrange multiplier for mode selection as well as for motion estimation is a member of this class.

Fig. 19 Inter search of the JMVC encoder

Changes to the JMVC encoder

Functions to compute the new metric as given in equation (18) for different block sizes have been added to the XDistortion class of the JMVC Software. Both the PSNR-HVS and the PSNR-HVS-M can be used in the rate-distortion optimization. However, the encoder has been optimized for the PSNR-HVS; hence in the following the new video quality metric refers to the PSNR-HVS. The minimum block size in H.264/AVC is 4x4 samples. This requires a computation of the new distortion metric for 4x4 blocks in the rate-distortion optimization. To have a consistent metric it was chosen to calculate the distortion for a larger block by summing the distortions of its sub-blocks. A new parameter VQMMode has been added to the encoder configuration to decide whether and to what extent the new video quality metric is used.
Depending on that parameter, the quality metric is chosen at different levels of the hierarchical search process. Three different settings are possible: the first setting uses the new metric for I-frames only (orange blocks in figure 18). The second setting allows the usage of the new metric for intra and inter mode decision (orange and red blocks in figures 18 and 19). The third setting enables the use of the new video quality metric for intra and inter mode decision as well as for motion estimation. An optimization of the Lagrange multiplier is presented in [12] and gives equation (20). However, this optimization has been carried out for the SSE. To find an optimal Lagrange multiplier for the new video quality metric, two additional parameters have been added to the JMVC encoder configuration. These parameters, called LambdaScale and LambdaScaleME, are members of the RateDistortion class and scale the Lagrange multiplier. A linear scaling has been used based on the assumption that the proportionality between the squared quantization step size and the optimal Lagrange multiplier as given in equation (20) is still valid for the new video quality metric.
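The summation rule for larger blocks can be sketched generically; `metric` is a placeholder for any 4x4 distortion function, such as the SSE or the new metric:

```python
def block_distortion(metric, block, ref, sub=4):
    """Distortion of a larger block computed as the sum of the
    distortions of its `sub` x `sub` sub-blocks, mirroring the
    consistency choice described above."""
    h, w = len(block), len(block[0])
    total = 0.0
    for y0 in range(0, h, sub):
        for x0 in range(0, w, sub):
            total += metric(
                [row[x0:x0 + sub] for row in block[y0:y0 + sub]],
                [row[x0:x0 + sub] for row in ref[y0:y0 + sub]])
    return total
```

Because the total is a plain sum, the same metric value is obtained whether a 16x16 macroblock is evaluated at once or partition by partition.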

3.1.3.2 Evaluation of the new VQM in the rate-distortion optimization process

The performance of the new video quality metric in the rate-distortion optimization process has been evaluated for the different cases that can be set by the VQMMode parameter. An overview of the setups is depicted in table 1, where NVQM denotes the new video quality metric. To show the gains achieved by optimizing for the new video quality metric, encoding with the conventional distortion metrics (SSE, SAD) has been carried out for reference. The test sequences Car, Horse, Butterfly, Mountain, Soccer2 and Bullinger from the coding test set of the stereo video database [13] have been used. The sequences have been coded varying the quantization parameter of the encoder from 20 to 42 with a step size of 3. To evaluate the influence of different Lagrange multipliers, the LambdaScale parameter has been set to 0.25, 0.5, 0.75, 1 and 2. For the inter encoding test the period of I-frames has been set to 8 and an IPPP GOP structure has been chosen.

Tab. 1 Encoder setups used for evaluation

Intra mode decision

Results for the intra encoding setup are depicted in figure 20. The sequences have been encoded using I-frames only. The black curve shows the results for the reference encoding setup. It can be seen that using the new video quality metric in the rate-distortion optimization process leads to gains for all sequences. Gains increase for higher bit rates. The maximum gain is achieved for the Mountain sequence. An evaluation of the influence of the LambdaScale parameter shows that for high bit rates a low LambdaScale is optimal, whereas a high LambdaScale is optimal for low bit rates. Hence the assumption of a linear relationship between the optimal Lagrange multiplier and the value determined by equation (21) does not hold any more.

Intra and Inter mode decision

Figure 21 shows the results for the Inter and Intra configuration as given in table 1.
Note that for this configuration the LambdaScaleME parameter was set to a fixed value and differs from the LambdaScale parameter. The conclusions here are similar to those found for the intra-only mode: the optimal LambdaScale decreases for increasing bit rates, and gains increase for higher bit rates.

Intra and Inter mode decision and motion estimation

Results for the Intra, Inter and Motion Estimation configuration are shown in figure 22. Here the LambdaScaleME parameter is equal to the LambdaScale parameter. It can be seen that gains decrease when using the new video quality metric in the motion estimation process. This might be due to the energy of the residual attained with the new video quality metric: using the new quality metric in the rate-distortion optimized motion estimation leads to a predictor minimizing the new video quality metric but not minimizing the energy of the residual signal. The additional rate used to encode the residual data might lead to the observed decrease of the overall performance.

Fig. 20 Evaluation of Intra configuration only; average NVQM-Y of both views vs. total bit rate

Fig. 21 Evaluation of Inter and Intra configuration; average NVQM-Y of both views vs. total bit rate

Fig. 22 Evaluation of Inter and Intra and ME configuration; average NVQM-Y of both views vs. total bit rate

3.1.3.3 Optimization of the Lagrange multiplier

The evaluation of the encoding results shows that the assumption of a linear relationship between the optimal Lagrange multiplier and the value attained from equation (21) does not hold. The encoding process cannot be optimized by selecting a constant scale for the Lagrange multiplier independent of the QP. In [12] the relationship between the Lagrange multiplier and the QP is determined by fixing the multiplier and encoding macroblocks multiple times with different QPs to find the optimal combination of QP and multiplier. This approach could be redone for the new video quality metric. However, for reasons of simplicity, only an optimal correction factor (LambdaScale) is calculated here. For optimal encoder performance the Lagrange multiplier must be corrected by a factor depending on the QP value; the final Lagrange multiplier is then the value from equation (21) scaled by this QP-dependent factor (24). The coding experiments provide the rate and the distortion for several combinations of the scaling factor and the QP. The optimal relationship between the scaling factor and the QP can now be obtained by evaluating several rate points: given the set of combinations of QP and scaling factor that produce a particular rate (25), the optimal combination at that rate point is the one that minimizes the distortion (26).

Figure 23 depicts the optimization procedure for the sequence Car. Coding experiments have been carried out varying the scaling factor with a step size of 0.1 and the QP with a step size of two. Intermediate values have been interpolated. Sets of combinations leading to equal rates can be obtained from figure 23 (a); these iso-rate lines are marked black. The combinations minimizing the distortion by maximizing the NVQM can be found in figure 23 (b) on the iso-rate lines and are highlighted by red circles.
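The selection rule of equation (26) amounts to a maximization along an iso-rate line. As a sketch, with the tuple layout of the `measurements` list being a hypothetical format for the experiment results:

```python
def optimal_scale(measurements, rate, tol=0.0):
    """Among all (QP, LambdaScale) combinations whose measured bit rate
    lies on the iso-rate line `rate` (within `tol`), return the one with
    maximum NVQM, i.e. minimum distortion (cf. equation 26).
    `measurements` is a list of (qp, scale, rate, nvqm) tuples."""
    on_line = [m for m in measurements if abs(m[2] - rate) <= tol]
    best = max(on_line, key=lambda m: m[3])    # highest quality wins
    return best[0], best[1]
```

Sweeping `rate` over several rate points yields the QP-to-scale relationship plotted in figure 25.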

Fig. 23 Optimization of the LambdaScale for the sequence Car; the black lines mark points with equal rate; combinations leading to minimal distortions are marked red

The relationship between the QP and the scaling factor for different rates is depicted in figure 24 (b). The blue lines mark combinations leading to the same rate. Again, the combinations maximizing the NVQM within a particular set are marked red. This can also be seen in figure 24 (c), where the NVQM-Y is plotted versus the LambdaScale. As found before in section 3.1.3.2, the optimal scaling factor converges to small values for high rates and increases for low rates. Moreover, it can be observed that the scaling factor has only a minor influence on the distortion for low rates: the iso-rate lines run almost horizontally. The spike that can be observed in the NVQM curve at low rates results from this. Due to the small change in the NVQM-Y, noise introduced by the numerical solution of equation (26) has a strong influence on the determined scaling factor. However, since the change in distortion is very small, this effect has only a minor influence on the optimization. For the sake of completeness, the relationship between the QP and the NVQM-Y is depicted in figure 24 (a).

Fig. 24 Optimization of the LambdaScale; (a) relationship between NVQM and QP; (b) relationship between QP and scaling factor; (c) relationship between LambdaScale and NVQM; the blue lines mark points with equal rate; combinations leading to minimal distortions are marked red

Fig. 25 Relationship between LambdaScale and QP for all sequences of the test set

The optimal combinations of QP and scaling factor have been determined for all sequences of the coding test set. The results can be found in figure 25. The figure shows that the optimal relationship between the QP and the scaling factor is sequence-dependent and varies by up to 0.5. However, the optimal scaling factor increases for all sequences with increasing QP. An approximation of the relationship depicted in figure 25 is given in equation (27). Using equation (27) together with equations (24) and (21) leads to a changed Lagrange multiplier for the NVQM, given in equation (28).

3.1.3.4 Evaluation of Results

Coding results using the Lagrange multiplier as calculated from equation (28) are depicted in figure 26. Gains compared to the reference can especially be achieved at high rates and range up to 1.6 dB. The correction of the Lagrange multiplier by the QP-dependent factor allows maximizing the encoder performance compared to a correction with a constant factor. This can be seen, for example, for the Horse sequence: here a scale of 0.25 provides the best results at 1450 kbit/s and a larger scale at 600 kbit/s; with the QP-dependent factor the encoder operates optimally at both rate points. However, due to the sequence dependency of the QP-dependent scale, the approximation is not optimal for all sequences. This effect can be observed, for example, for the Mountain sequence at 1600 kbit/s, where a scale of 0.25 would increase the bit rate. In addition to the evaluation using the NVQM, evaluations using the PSNR for comparison to the reference have been carried out. An average decrease in PSNR has been found.

Fig. 26 Coding results attained with a QP-dependent correction factor for the Lagrange multiplier

3.1.4 Conclusion and Outlook

A new video quality metric, the PSNR-HVS, has been integrated into the JMVC software for multiview coding. To this end, the distortion classes, the rate-distortion interface and the macroblock-encoding class have been modified. Coding experiments have been carried out to evaluate the gains achieved by the modified rate-distortion optimization. Constant scaling factors for the Lagrange multiplier used in the rate-distortion optimization have been evaluated, and a QP-dependent correction factor for the Lagrange multiplier has been determined for the NVQM. With the optimized Lagrange multiplier, the rate-distortion optimization process yields gains of up to 1.6 dB at high bit rates in terms of the new video quality metric, compared to an encoder using the SSD for optimization. Since the new video quality metric has been designed to emulate the human visual system [8], a subjectively increased video quality can be assumed as well.

The integrated 2D video quality metric is a first step towards an encoder using a stereo video quality metric. The new metric can be used to optimize the quality of the first coded view. The encoder for the second view could then use rate-distortion optimization with respect to an error calculated from the second view together with the already coded first view. The concept of this approach is depicted in figure 27.

Fig. 27 Concept for an encoder using a new stereo video quality metric (NSVQM)

In the first step, the first view is encoded using the new video quality metric. In the second step, the second view is encoded using a new stereo video quality metric. This metric utilizes four inputs to the rate-distortion optimization process: the currently tested macroblock, the original second view, the original first view and the reconstructed first view. Further extensions could include a rate-distortion optimized quantization using the new metric and an optimization targeting blocks larger than a single macroblock.
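The four-input idea behind the proposed stereo metric can be sketched as follows. This is not the NSVQM itself (which the report leaves open); the function, the squared-error terms and the 50/50 weighting are all assumptions chosen only to show how a monoscopic error on the second view could be combined with an inter-view consistency term involving the already coded first view.

```python
def nsvqm_distortion(tested_mb, orig_second, orig_first, recon_first):
    """Hypothetical stereo distortion with the four inputs named in the
    report: tested macroblock, original second view, original first view,
    reconstructed first view. Weights and error terms are illustrative."""
    # Monoscopic error of the candidate macroblock against the original
    # second view (SSD, as in conventional rate-distortion optimization).
    mono = sum((a - b) ** 2 for a, b in zip(tested_mb, orig_second))
    # Inter-view consistency: how far the coded first view has drifted
    # from its original in the co-located region.
    consistency = sum((a - b) ** 2 for a, b in zip(recon_first, orig_first))
    return 0.5 * mono + 0.5 * consistency
```

A real metric would of course model binocular perception rather than average two SSD terms, but the data flow (four inputs into one distortion value driving the mode decision) is the one shown in figure 27.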

4 Overview of coding algorithms for Mobile3DTV

This section gives an overview of the evaluated coding methods. The evaluations and results of this and the other Mobile3DTV deliverables on stereo video coding are summarized.

4.1 Overview of representation formats and coding approaches

Stereo video can be represented in different formats, namely Conventional Stereo Video (CSV), Mixed Resolution Stereo (MRS) and Video plus Depth (V+D). These formats are depicted in figure 28 and are discussed in section 4.1.1. For the stereoscopic representation formats, different standardized coding methods exist: AVC Simulcast, AVC with Stereo SEI message, AVC Auxiliary Picture Syntax, MPEG-C Part 3 with AVC, and MVC. An overview can be found in figure 29 and in section 4.1.2. In principle the coding methods can be applied to any of the representation formats; however, not every combination is practical. Reasonable combinations are listed in table 2.

Tab. 2 Reasonable combinations of representation formats and coding methods are marked with +

4.1.1 Representation Formats

This section provides a description of the commonly used representation formats. A graphical overview is given in figure 28, where the three representation formats are shown.

4.1.1.1 Conventional Stereo Video

Stereo video consists of a pair of sequences showing the same scene for the left and the right eye view, as shown in figure 28 (left). Compared to conventional monoscopic video, stereo video has twice the amount of data to be stored or transmitted. Especially for mobile video services with their bandwidth and memory limitations, very efficient compression of stereo video is required to realize 3D instead of conventional 2D video. Fortunately, efficient compression of stereo video can take advantage of the fact that the left and the right view of a stereo pair show the same scene from slightly different perspectives and are therefore highly redundant.
For CSV, the representation format equals the display format, such that no conversion processing is required.

Fig. 28 Representation formats and their processing to the common display format: Conventional Stereo Video (left), Mixed Resolution Stereo (middle) and Video plus Depth (right)

4.1.1.2 Mixed Resolution Stereo

A reduction of the transmission rate can be achieved by exploiting the binocular suppression theory [16]. For stereo sequences in which the sharpness of the left eye and right eye views differs (see figure 28, middle), the perceived binocular quality was rated close to that of the sharper view [17], [18]. In contrast, if both views exhibit different amounts of blocking artifacts, the binocular quality was rated close to the mean quality of both views. This leads to the assumption that a stereoscopic sequence in which one view has a reduced resolution (mixed resolution representation, MR) is perceived with the same subjective quality as the full resolution (FR) case. Thus, a lower bit rate at equal quality is achievable for MR. For the conversion of MRS into the 2-view stereoscopic display format, post-processing in the form of upsampling is required. Advancements achieved for the Mixed Resolution Stereo representation can be found in section 4.2.2.

4.1.1.3 Video plus Depth Representation

The video plus depth format consists of a conventional monoscopic color video and an associated per-pixel depth map (figure 28, right), which can be regarded as a monochromatic, luminance-only video signal. Thus, a lower bit rate is achievable. The depth data is usually generated by depth/disparity estimation from a captured stereo pair. Such algorithms can be highly complex and are still error-prone. The advantage of this format is the possibility of baseline variation, such that stereo pairs with baselines other than that of the original camera pair can be generated. On the other hand, this format requires the most complex conversion from representation to display format of the presented formats: view synthesis is used to generate the second view from the V+D format.
The major challenge is the visual quality of the synthesized view, as rendering artifacts may result in a wrong and thereby annoying 3D impression if the left and right views are inconsistent. Advancements achieved for the Video plus Depth representation can be found in section 4.2.3.
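The two conversion steps just described, upsampling for MRS and view synthesis for V+D, can be sketched in a few lines. Both functions are deliberately minimal: pixel repetition stands in for a proper interpolation filter, and the one-dimensional renderer ignores occlusion handling; all parameter names and values are illustrative assumptions, not the project's algorithms.

```python
def upsample_double(frame):
    """Bring the low-resolution view of an MRS pair back to display
    resolution by pixel repetition (factor 2 in both directions).
    Real decoders would use an interpolation filter instead."""
    wide = [[p for p in row for _ in (0, 1)] for row in frame]
    return [row for row in wide for _ in (0, 1)]

def synthesize_view(color_row, depth_row, max_disparity=8, depth_max=255):
    """Minimal 1-D depth-image-based rendering for V+D: shift each pixel
    horizontally by a disparity proportional to its depth value, then fill
    holes by propagating the last known pixel to the right."""
    width = len(color_row)
    out = [None] * width
    for x in range(width):
        disparity = round(depth_row[x] * max_disparity / depth_max)
        tx = x - disparity
        if 0 <= tx < width:
            out[tx] = color_row[x]
    last = 0  # naive hole filling; holes are exactly the rendering
    for x in range(width):  # artifacts the text warns about
        if out[x] is None:
            out[x] = last
        else:
            last = out[x]
    return out
```

Even this toy renderer makes the quality argument of the text concrete: wherever the disparity shift leaves a hole, the fill-in value is guessed, and such guesses are what produce the inconsistencies between left and right views.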

4.1.2 Coding Approaches

Fig. 29 Coding approaches suitable for Mobile3DTV

4.1.2.1 AVC Simulcast

A simple coding method for stereo content is H.264/AVC Simulcast. It is specified as the individual application of an H.264/AVC-conforming coder to several video sequences in a generic way [11]. H.264/AVC is the latest video coding standard of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). H.264/AVC has recently become the most widely adopted video coding standard and covers all common video applications, ranging from mobile services and videoconferencing to IPTV, HDTV, and HD video storage. For stereo video, the diagram in figure 29 illustrates the coding procedure of H.264/AVC Simulcast with the left and right view of a stereo pair. The H.264/AVC encoder is applied to each of the two input sequences independently, resulting in two encoded bit- or transport streams (BS/TS). After transmission over the channel, the two streams are decoded independently, resulting in the distorted video sequences of the stereo pair. AVC Simulcast can be applied to all representation formats.

4.1.2.2 AVC with Stereo SEI Message

According to the H.264/AVC standard [11], the stereo video information SEI message is specified as follows: this SEI message provides the decoder with an indication that the entire coded video sequence consists of pairs of pictures forming stereo-view content. It defines six flags to control the mapping of frames or fields of the coded video sequence to the left and right views.
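The simulcast procedure of section 4.1.2.1 can be sketched in one line: the same single-view encoder is applied to each view independently, so no inter-view redundancy is exploited and the total rate is the sum of the per-view rates. The `encode` callable below is a stand-in for a real H.264/AVC encoder; the toy encoder is purely illustrative.

```python
def simulcast(views, encode):
    """Apply a single-view encoder to each view independently and return
    one bitstream per view, as in H.264/AVC Simulcast."""
    return [encode(view) for view in views]

def toy_encode(frames):
    """Toy stand-in for an encoder: 'rate' is just the total payload size,
    which makes the additive-rate property of simulcast visible."""
    return sum(len(f) for f in frames)
```

For example, `simulcast([left_frames, right_frames], toy_encode)` returns the two per-view rates, whose sum is the simulcast rate; exploiting the inter-view redundancy instead is exactly what MVC adds on top of this baseline.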