Wyner-Ziv-Based Multiview Video Coding

Xun Guo, Yan Lu, Member, IEEE, Feng Wu, Senior Member, IEEE, Debin Zhao, and Wen Gao, Senior Member, IEEE


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 18, NO. 6, JUNE 2008

Abstract: Exploiting the correlations among views improves the coding efficiency of multiview video compression, but doing so usually requires an expensive system that collects the videos from different cameras and compresses them jointly. Building on recent developments in distributed video coding, this paper proposes a new multiview video coding scheme based on the Wyner-Ziv (WZ) coding technique, in which the complicated exploration of temporal and inter-view correlations is shifted from the encoder to the decoder, so that broadband raw-data traffic and the intensive computation of joint encoding are avoided. At the encoder, a wavelet-based WZ scheme is proposed to compress the video of every camera. Furthermore, to better utilize the correlation in the wavelet domain, all coefficients are organized bit plane by bit plane, as in SPIHT. At the decoder, a more flexible prediction technique that jointly utilizes temporal and view correlations is proposed to generate side information. Experimental results show that the proposed scheme significantly outperforms conventional intra-frame coding (used for better random access) and comes close to inter-frame coding (used for better efficiency). Furthermore, the compressed data is much more robust when it is transmitted over an error-prone channel.

Index Terms: Distributed source coding, multiview video coding (MVC), wavelet, Wyner-Ziv (WZ) video coding.

I. INTRODUCTION

In recent years, multiview video systems have become increasingly popular due to the adoption of interactive multimedia applications such as 3-D television, surveillance, and wireless sensor networks.
In fact, some practical multiview video systems have been reported for research and application purposes, e.g., the multicamera array developed at Stanford University [1] and the real-time multiview video system developed at MSRA [2]. Interactivity is the main characteristic of a multiview video system: it allows users to selectively watch some views or panoramic 3-D information. Due to the large data volume, transmission of multiview video requires much more bandwidth than that of single-view video. Thus, how to efficiently compress multiview video has become a popular research topic.

Manuscript received January 15, 2007, revised September 23, This work was performed at Microsoft Research Asia, Beijing, China, and was supported in part by the National Science Foundation of China under Grant and Grant This paper was recommended by Associate Editor A. Smolic. X. Guo was with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. He is now with MediaTek Inc., Beijing, China (xun.guo@mediatek.com). Y. Lu and F. Wu are with Microsoft Research Asia, Beijing, China (yanlu@microsoft.com; fengwu@microsoft.com). D. Zhao is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China (dbzhao@vilab.hit.edu.cn). W. Gao is with the School of Electronic Engineering and Computer Science, Peking University, Beijing, China (wgao@pku.edu.cn). Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /TCSVT

In the past years, various multiview video coding (MVC) techniques have been developed. Since multiview video consists of video sequences captured by multiple cameras aimed at the same scene but from different angles and locations, significant correlations may exist among views. In [3], an inter-view matching cost and a purely geometrical constraint are used to estimate disparity and to identify the occluded areas in the views.
In [4], a sprite generation algorithm for multiview sequences is proposed to improve coding efficiency. Recently, efforts have been made to develop practical MVC schemes. In the MPEG 3DAV group, an AVC-based MVC scheme with joint hierarchical B prediction and inter-view prediction was selected as the reference for the forthcoming core experiments [5]. A comparative study of different MVC prediction structures with regard to compression efficiency and complexity was presented in [6]. The study shows that most of the gain of MVC over simulcast comes from inter-view prediction of anchor frames. The Joint Video Team (JVT) of MPEG and ITU-T then selected the AVC-based MVC scheme as reference software, namely the Joint Multi-View Video Model (JMVM). Since then, the performance of MVC has improved considerably. Besides the AVC-based MVC schemes, MVC schemes based on high-dimensional wavelet coding have also been proposed in [7] and [8].

Although inter-view prediction does improve coding performance considerably compared with simulcast video coding, it still suffers from some shortcomings in practical applications. In fact, the above inter-view prediction methods rely on the following two assumptions when they are used in a real system. First, the video data from different views can be freely exchanged or are simultaneously available at the encoder. Second, all cameras can work simultaneously, and the video sequences can be compressed and transmitted with low latency. However, transmission channels between cameras are usually unavailable in practice. Moreover, the high computational complexity is a heavy burden for a practical multiview video capture system. Therefore, it is hard to use the previous MVC schemes in camera arrays even though they perform well. A question then arises: is there any way to encode each view separately while keeping the coding performance as good as that of joint encoding?
In theory, distributed source coding (DSC) can provide a solution to this problem. The Slepian-Wolf theory shows that even if correlated sources are encoded without information from each other, the coding performance can be as good as that of joint encoding, provided the compressed signals are jointly decoded [9]. Later, Wyner and Ziv extended the theory to lossy source coding with side information at the decoder [10], which is more suitable for practical video coding.

Fig. 1. Structure of proposed WZ-based MVC scheme.

Recently, several practical Wyner-Ziv (WZ) coding techniques have been proposed for video coding, known as distributed video coding (DVC). In [11] and [12], Pradhan and Ramchandran propose a DVC framework based on the syndromes of codeword cosets. In this scheme, a WZ frame (W-frame) is transformed using a block-wise discrete cosine transform (DCT), and the transformed coefficients are quantized with a uniform scalar quantizer. A mode decision strategy then decides whether a block is coded as a WZ block or as an intra-block. In [13] and [14], Aaron and Girod propose a DVC scheme using turbo codes in the transform domain. A block-wise DCT with a 4×4 or 8×8 block size is used. The low-frequency coefficients are extracted into different subbands. Each subband is encoded independently using turbo codes, and the side information generated at the decoder by motion-compensated prediction is used to help the decoding and reconstruction of the W-frame.

Thanks to the work on practical WZ video coding schemes, distributed video coding can be further applied to multiview video streaming systems. WZ theory is well suited to distributed source coding. However, DVC suffers from lower coding performance than traditional hybrid video coding. Why can the theoretical bound not be reached in DVC? The main reason lies in the difficulty of estimating the correlation channel between the frame to be coded and the side information. In a practical distributed video coding scheme, side information has to be generated from the frames adjacent to the W-frame by using a motion-compensated interpolation algorithm.
This process has to be done without the current frame, and the generated prediction may not be acceptable in many regions, especially those containing fast-moving objects. Thus, the distribution model between the W-frame and the side information will not be accurate enough. The MVC scenario can partially tackle this problem because inter-view correlations can be utilized in addition to the traditional temporal correlations. The first attempt, a distributed compression system for large camera arrays, is reported in [15].

This paper presents a generic structure for MVC using WZ coding technology. In this scheme, two specific requirements of a practical multiview system are considered, i.e., low-complexity encoding and inter-view prediction. The multiview video is treated as a 2-D image matrix. Based on the predefined coding structure, each frame in the multiview video is independently encoded as either a traditional intra-frame (I-frame) or a W-frame. The basic idea has been presented in our previous work [16]. In this paper, two major extensions are included to further improve the coding performance. First, a more flexible side information generation algorithm that considers both a temporal prediction mode and an inter-view prediction mode is proposed to achieve high prediction accuracy. Before the prediction mode of the pixels in a W-frame is determined, the prediction modes of their corresponding pixels in the adjacent views are computed for reference. Second, an efficient wavelet- and SPIHT-based WZ coding scheme presented in our previous work [17] is extended to the coding of W-frames. The inherent characteristics of the DWT and SPIHT make the DVC scenario more efficient in exploiting both spatial and temporal correlations.

The rest of this paper is organized as follows. Section II presents the overall structure of the proposed WZ-based MVC system. Section III discusses in detail the wavelet-based WZ video coding scheme adopted in this paper.
In Section IV, a flexible side information generation algorithm is presented and its advantages are analyzed. Experimental results are given in Section V, and some conclusions are drawn in Section VI.

II. STRUCTURE OF WZ-BASED MVC SCHEME

Fig. 1 shows the structure of the proposed WZ-based MVC system, in which each camera consists of a capture part and an encoding part. Considering the scenario of low-cost camera arrays, a structure with low-complexity encoding and high-complexity decoding is employed. Captured multiview video frames are first encoded by the WZ encoder or the intra-encoder and then transmitted to the decoder. The correlation exploitation module in the decoder jointly decodes the received I- and W-frames by utilizing both temporal and inter-view correlations. Since the W-frames are intra-encoded and inter-decoded, the whole system consists of independent encoders and a joint decoder. Therefore, high coding performance can be achieved with low encoding complexity by shifting correlation exploration from the encoder to the decoder. The proposed MVC scheme has three main advantages.

1) The communication between different cameras can be removed. In the previous MVC schemes, inter-view correlation is exploited at the encoder through disparity-compensated prediction. In practical applications, however, this kind of data exchange between cameras is very difficult. In the proposed system, W-frames are also independently encoded, so no communication between cameras is needed. This advantage is very significant in the case of a dense multicamera system.

2) Low computational complexity allows the multiview video data to be transmitted with low delay. Real-time processing can be achieved because only I-frames and W-frames are encoded. Although the complexity of the decoder is increased by temporal and inter-view correlation exploitation, fast algorithms can be used for on-line decoding. For off-line decoding, which is more common in the DVC scenario, the complexity is not a major concern.

3) The selection of the views to be decoded is more flexible. In traditional MVC schemes based on hybrid video coding, the reference frames (including those from the neighboring views) are decided in advance during encoding. Thus, all reference frames have to be decoded before the current frame, no matter which view they come from. The proposed MVC scheme avoids this redundancy because inter-view prediction is done only at the decoder, and the adjacent views to decode can be chosen freely.

To realize the above advantages, we predefine the coding structure of the WZ-based MVC scheme as shown in Fig. 2.
In this system, multiview video frames are classified into two categories: I-frames and W-frames. I-frames are coded with a traditional intra-coding method (e.g., H.263+). W-frames are inserted between two successive I-frames, and the number of W-frames can be adjusted according to the coding requirement. Side information plays an important role in the decoding of a W-frame. As shown in the figure, each W-frame needs its own side information, which is generated at the decoder through motion/disparity-compensated prediction.

Fig. 2. Coding process of the proposed MVC scheme.

III. WAVELET-BASED WZ VIDEO CODING

The WZ video coding scheme is the key part of the proposed MVC scheme and dominates the coding performance of the whole system. Considering the network application scenario of the proposed MVC scheme, the WZ video coding scheme should not only achieve high coding efficiency but also adapt to the different bit-rate requirements of networks or users. Based on this consideration, we choose a wavelet-based WZ video coding scheme as the core coding module. Besides its good performance, which is described in Section III-A, the inherent level structure of the wavelet transform makes spatial and quality scalability easy to achieve. When the bandwidth becomes low, high-frequency subbands can be discarded without being transmitted.

Fig. 3. WZ theory on source coding.

A. WZ Theory on Video Coding

WZ theory on lossy compression with side information at the decoder shows that, for two correlated sources $X$ and $Y$, if $X$ is encoded independently and decoded with access to $Y$ at the decoder as side information, and a distortion $D$ is acceptable, the coding rate of $X$ can achieve the bit rate required when $Y$ is also available at the encoder. As shown in Fig. 3, if $X$ is encoded independently and decoded together with $Y$, then $Y$ can be regarded as an error-corrupted version of $X$ transmitted through a virtual correlation channel.
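For reference, the rate relation behind this statement is commonly written as below. The symbols are conventional notation assumed here rather than taken from the text: $X$ is the source, $Y$ the side information, $D$ the allowed distortion, $R_{WZ}(D)$ the minimum rate when $Y$ is known only at the decoder, and $R_{X|Y}(D)$ the minimum rate when $Y$ is known at both encoder and decoder.

```latex
% Wyner-Ziv rate relation (notation assumed for illustration)
R_{WZ}(D) \;\ge\; R_{X|Y}(D), \qquad D \ge 0
```

Equality is known to hold, for example, when $X$ and $Y$ are jointly Gaussian and distortion is measured by mean squared error; in general, encoding without the side information can cost some rate.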
The typical model of a WZ video coding scheme is shown in Fig. 4. When a W-frame is input into a WZ encoder, it is first transformed and the coefficients are quantized. The quantized coefficients are then encoded with a Slepian-Wolf encoder. At the decoder, a prediction of the W-frame is used as the side information and helps the decoding. Finally, the reconstruction of the W-frame is generated with the help of the side information.

Fig. 4. Practical WZ video coding system.

B. Wavelet- and SPIHT-Based WZ Video Coding

The proposed wavelet-based WZ video coding scheme is shown in Fig. 5, in which the frames of the input video are classified into two categories: I-frames and W-frames. I-frames are coded with a traditional intra-coding method such as SPIHT or H.263+ intra-coding. The remaining problem is how to efficiently compress the W-frames.

Fig. 5. WZ video coding scheme based on wavelet and SPIHT.

As shown in Fig. 5, at the encoder, a DWT is applied to the W-frame to generate a coefficient set. This set is then reordered using a set partitioning process similar to the zerotree generation in SPIHT [18]. In this process, the coefficients are mapped, in each bit plane, into three kinds of bits: significance bits, sign bits, and refinement bits. Significance bits are the most important information because they indicate the ordered structure of the bit plane. These bits are encoded with an intra-coding method and transmitted to the decoder before the other two kinds of bits. Then, the sign and refinement bits are coded with a Slepian-Wolf coder, which consists of a turbo encoder, a parity buffer, and a turbo decoder. We use a rate-compatible punctured turbo (RCPT) code of rate 1/2, similar to [19], as the core of the Slepian-Wolf coder. The turbo code consists of two identical recursive systematic convolutional (RSC) constituent codes defined by a generator matrix.

Fig. 6. Structure of the RSC code.

Fig. 6 shows the structure of the RSC code. The input bit is passed through unchanged as the systematic output bit, and the parity bit is a convolved version of the input bit. Each D represents one of the four states of the RSC code; when an input bit is processed, the four states change as well. The numerator of the generator matrix gives the convolution taps applied to the input, and the denominator gives the convolution taps applied to the parity bits. To achieve compression, the parity bits are stored in a buffer and only partially transmitted according to a puncturing schedule.

At the decoder, the W-frame is decoded together with the side information, which is a prediction of the W-frame generated from adjacent I-frames.
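The encoder-side RSC constituent code and the puncturing of parity bits described above can be sketched as follows. This is a minimal illustration, not the paper's coder: the generator (1, (1 + D^2) / (1 + D + D^2)) with two delay elements, the function names `rsc_encode` and `puncture`, and the periodic puncturing pattern are all assumptions chosen for brevity, and the interleaver and second constituent encoder of a full turbo code are omitted.

```python
from typing import List, Set, Tuple


def rsc_encode(bits: List[int]) -> Tuple[List[int], List[int]]:
    """Encode bits with a small recursive systematic convolutional code.

    Assumed generator: (1, (1 + D^2) / (1 + D + D^2)). The denominator
    defines the feedback taps and the numerator the parity taps.
    """
    s1 = s2 = 0  # register contents: feedback values one and two steps back
    systematic, parity = [], []
    for u in bits:
        a = u ^ s1 ^ s2   # feedback sum (denominator 1 + D + D^2)
        p = a ^ s2        # parity output (numerator 1 + D^2)
        systematic.append(u)
        parity.append(p)
        s1, s2 = a, s1    # shift the register
    return systematic, parity


def puncture(parity: List[int], period: int, keep: Set[int]) -> List[int]:
    """Keep only the parity bits whose index modulo `period` is in `keep`."""
    return [p for i, p in enumerate(parity) if i % period in keep]


if __name__ == "__main__":
    sys_bits, par_bits = rsc_encode([1, 0, 0, 0])
    # In a Wyner-Ziv setting the systematic bits are not sent; the decoder
    # substitutes the side information for them.
    print(par_bits, puncture(par_bits, period=2, keep={0}))
```

Sending a larger `keep` set on request corresponds to releasing more of the buffered parity bits when the decoder cannot yet reach an acceptable error rate.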
After the DWT is applied to the side information, the decoder uses the significance bits to extract the coefficients that correspond to those of the W-frame and forms the side-information coefficient set. This set is then sent to the turbo decoder, which decodes the W-frame together with the received parity bits. The decoder successively decodes the coefficients of a bit plane until an acceptable bit error rate (BER) is achieved. It should be noted that, in the MVC scenario, the side information can also be generated using I-frames from adjacent views.

C. Bit-Plane Encoding of Proposed DVC Scheme

Bit-plane encoding has been widely used in existing video coding systems due to its inherent scalability. We use bit-plane coding in the proposed WZ video coding scheme, however, mainly because it makes the DWT and the SPIHT algorithm better suited to the DVC scenario. In existing frame-based DVC schemes, temporal correlations are exploited by using side information through channel coding, and spatial correlations are exploited by using a transform. However, these methods seldom address the high-order statistical correlations among transform coefficients. Traditional entropy coders usually exploit the high-order statistical correlations by reorganizing the transform coefficients with, for example, run-length coding. In the DVC scenario, however, the data structures of the current frame and the side information frame must match each other after reorganization, which prevents the use of run-length coding.

Fig. 7. Coding process of two bit-plane coding methods. (a) Wavelet-based bit-plane coding without SPIHT. (b) SPIHT-based bit-plane coding.

Bit-plane coding based on the DWT and SPIHT can tackle this problem. In existing DVC schemes, the DCT and fixed-length quantization are often used before channel coding. This approach is limited in how fully it can utilize the correlations among different coefficients and the correlations among different bit planes within a coefficient. If only the DWT is used, its contribution to coding performance is similar to that of the DCT. However, its capability can be further enhanced by SPIHT in the DVC scenario. The DWT has an inherent level structure. When it is used to decompose a W-frame, the self-similarity across co-located scales can be utilized to further exploit the correlations between coefficients. In the set of scales generated by the DWT, each coefficient in a given scale can be related to a set of coefficients of similar orientation in the next finer scale. This hierarchical structure means that if a coefficient in a coarse scale is smaller than the threshold of a given bit plane, its descendants in the finer scales are very likely to be smaller than the threshold as well. SPIHT is a promising algorithm for utilizing this characteristic. Following this idea, we employ SPIHT to reorder the transformed coefficients before turbo coding.

Fig. 7 shows the coding process for wavelet-based bit-plane coding with and without SPIHT. In Fig. 7, each vertical slice represents a transform coefficient, and the blocks within the slice, except the block with a mesh inside, represent different bit planes. A white block is a zero bit and a gray block is a nonzero bit. The mesh block is the sign of a coefficient, and its position indicates where the sign is encoded. Fig. 7(a) presents the coding process of the DWT with uniform quantization, while Fig. 7(b) presents the coding process of the DWT with SPIHT. In Fig. 7(a), the coefficients are not sorted and all sign bits are encoded at the first position of the coefficients. Many bits are clearly wasted on coding zeros, both across coefficients and across bit planes. In Fig. 7(b), SPIHT sorts the coefficients and outputs a sign just after the first nonzero bit of the coefficient is encoded. Thus, the bits are used efficiently.

After being reordered by SPIHT, the coefficients are classified into three types of bits: significance bits, sign bits, and refinement bits. At each bit plane, SPIHT first outputs the significance bits, then the sign bits, and finally the refinement bits. The coding process of SPIHT shows that the three types of bits still carry the characteristics of the real coefficient value. Significance bits mark the first nonzero bit of a coefficient and therefore its largest magnitude component; sign bits represent the sign of the coefficient; and refinement bits are the bits that follow the first nonzero position of the coefficient. Therefore, the correlation between the W-frame and the side information is preserved.

The coding process can be described as follows. First, the significance bits are encoded with an entropy coding method such as an arithmetic coder and transmitted. Then, the sign bits and refinement bits are WZ coded. Although the significance bits in the W-frame are still correlated with those in the side information frame, we intra-code these bits first because they guide the subsequent coding of the sign and refinement bits. The sign bits are coded with the WZ method because their correlations are captured by the joint distribution model. They can be decoded using the significance bits, which have already been decoded at the decoder. A Laplacian model, $f(d) = \frac{\alpha}{2} e^{-\alpha |d|}$, is assumed to represent the distribution between the W-frame and the side information. Here $d$ is the difference between corresponding coefficients in the W-frame and the side information, and $\alpha$ is the parameter of the distribution model.
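As a rough sketch of the ideas above, the following code splits integer coefficients into significance, sign, and refinement bits plane by plane, and estimates the Laplacian parameter from coefficient differences. Several things here are assumptions made for illustration: the function names, the omission of SPIHT's zerotree set partitioning (every still-insignificant coefficient is tested individually), the sign convention (1 for negative), and the maximum-likelihood estimate of the parameter for the density (alpha / 2) * exp(-alpha * |d|).

```python
from typing import Dict, List


def classify_bit_planes(coeffs: List[int], num_planes: int) -> List[Dict[str, List[int]]]:
    """Split integer coefficients into three bit classes per bit plane.

    Planes are visited from the most significant (num_planes - 1) down
    to 0. Assumes abs(c) < 2 ** num_planes for every coefficient.
    """
    significant = [False] * len(coeffs)
    planes = []
    for k in range(num_planes - 1, -1, -1):
        out = {"significance": [], "sign": [], "refinement": []}
        newly_significant = []
        for i, c in enumerate(coeffs):
            bit = (abs(c) >> k) & 1
            if significant[i]:
                # Coefficient became significant in a higher plane.
                out["refinement"].append(bit)
            else:
                out["significance"].append(bit)
                if bit:
                    # Sign follows the first nonzero bit (1 means negative).
                    out["sign"].append(1 if c < 0 else 0)
                    newly_significant.append(i)
        for i in newly_significant:
            significant[i] = True
        planes.append(out)
    return planes


def fit_laplacian_alpha(w: List[float], s: List[float]) -> float:
    """Maximum-likelihood alpha for f(d) = (alpha / 2) * exp(-alpha * |d|).

    `w` and `s` hold corresponding coefficients of the W-frame and the
    side information; at least one pair must differ.
    """
    mean_abs = sum(abs(a - b) for a, b in zip(w, s)) / len(w)
    return 1.0 / mean_abs


if __name__ == "__main__":
    for plane in classify_bit_planes([5, -2, 0], num_planes=3):
        print(plane)
    print(fit_laplacian_alpha([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))
```

In this sketch only the `significance` lists would be intra-coded; the `sign` and `refinement` lists are the bits that would go to the Slepian-Wolf coder, with the fitted parameter describing how closely the side information tracks them.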
This distribution is also used to describe the prediction error in traditional hybrid video coding schemes.

D. Bit-Plane Decoding of Proposed DVC Scheme

The decoding process is an important factor that can affect coding performance considerably. The significance bits, which have been intra-coded at the encoder with an arithmetic code, are decoded first at the decoder. The significance bits are then used to help decode the subsequent WZ bits, i.e., the sign bits and refinement bits. For the WZ bits, a Log-MAP algorithm [20] is used to decode them successively with the side information until an acceptable BER is achieved.

Consider a WZ bit in a given bit plane of the coefficient at a given position in the W-frame coefficient set, together with the coefficient at the same position in the side-information coefficient set. The decoder computes the log-likelihood ratio (LLR) of the information bit as in (1). Considering practical trellis codes, (1) can be expressed as a ratio of two sums over trellis transitions: one over the set of all transitions between consecutive encoder states with input 0, and the other over the similarly defined set with input 1, where the states are taken from the set of all possible encoder states. Each term of these sums is computed from three quantities: a forward quantity that is computed recursively from its initial conditions, a backward quantity that is computed in a backward recursion from its boundary conditions at the end of the block (whose length is the number of input bits), and a transition term that is computed from the estimated value of the information bit and the estimated output of the parity bits, taking into account which parity bits are punctured.

A WZ-coded bit can be either a sign bit or a refinement bit, and its conditional probability is computed differently in the two cases. If it is a sign bit, the previously decoded significance bit should be used to compute the probability. Assuming a Laplacian model between the W-frame coefficient and the side-information coefficient, the probability is evaluated at a reconstructed value of the coefficient, which is formed from the magnitude of the current bit plane plus an offset, as in (10). The offset is an estimated value used to compensate for the lower part of the coefficient, and a sign term is used to adjust the sign of the reconstructed value.

If it is a refinement bit, the previously decoded bit planes and the sign bit should be jointly considered, as in (11) and (12). The reconstructed value then also includes the value of the previously decoded bit planes, and the remaining terms are defined similarly to those in (10). In this case, the sign term is determined by the previously decoded sign bit rather than by the bit being decoded. The value of the offset depends on the distribution parameter and the bit plane. After the current bit plane is decoded, it is used to help the decoding of the next bit plane.

IV. SIDE INFORMATION GENERATION

In recent years, distributed source coding techniques have developed very fast. The performance of many DSC systems [21], [22] can be very close to the theoretical bound.
It seems that WZ theory has been well validated in the field of DSC. However, the situation is different in practical video coding, where there is still a large performance gap between DVC and traditional video coding. The major reason is as follows. In DSC, the virtual correlation channel between the input source and the side information can be described by a distribution model with explicit parameters. In practical video coding, however, the side information in DVC is usually not as accurate as the reference frame obtained by motion-compensated prediction in traditional video coding. Moreover, the correlation model between the current frame and its side information is not exactly known without correlation exploration at the encoder.

In traditional video coding, correlation exploitation (i.e., motion-compensated prediction) is done at the encoder between the current frame and the previously decoded frame. The prediction errors are then sent to an entropy encoder, which exploits the statistical dependencies within the error frame. In existing hybrid video coding systems such as H.264 [23], motion-compensated prediction has been improved considerably by rate-distortion optimized mode decision, so the correlations between the current frame and the reference frame are well exploited. This is an important reason why the coding performance of traditional video coding can be close to the theoretical bound.

Fig. 8. Motion compensated interpolation for side information.

Fig. 9. Disparity compensated interpolation for side information.

In distributed video coding, the correlation exploration is shifted to the decoder. Thus, the decoder has to estimate the motion between the current frame and its reference frames without access to the current frame. This sometimes leads to a serious mismatch in some regions of the generated side information, especially regions containing high motion. In other words, DVC has an inherent shortcoming that prevents it from reaching high performance. Side information therefore becomes one of the most important factors affecting coding performance in a DVC scheme.

In the MVC scenario, the drawbacks of DVC can be compensated to some extent. Inter-view correlations exist in multiview video sequences in addition to temporal correlations. Thus, inter-view correlations can be used to help the texture and motion prediction from one view to its adjacent view. In this way, the prediction errors will not be significant even if temporal prediction does not work well at the decoder. Inter-view prediction between a stereoscopic image pair has been proposed in [24]. Following this idea, we propose a side information generation algorithm in which temporal prediction and inter-view prediction are jointly considered to help control the prediction errors between the W-frame and its side information.

A. Temporal Motion Prediction

Motion-compensated interpolation is the general method used in DVC schemes. Since temporal correlations are generally stronger than inter-view correlations, especially in areas with smooth motion, we employ temporal prediction as the basic generation method for side information. This method is similar to the symmetric prediction used for traditional B-frames.
As shown in Fig. 8, the W-frame at a given time instant lies between its two adjacent key frames. Motion-compensated prediction has to be performed for the W-frame although the frame itself is absent at the decoder. We assume that most motion across three successive frames is linear, so the motion vectors of the W-frame can be derived from the motion vectors between the two adjacent key frames. For forward prediction, the motion vector of a block in the W-frame is derived from the motion vector of the co-located block in the adjacent key frame. The backward prediction motion vector is obtained in the same way. The two prediction blocks of the W-frame block are then fetched from the two key frames, and the prediction value is computed from them. In this way, most areas of the W-frame can be predicted and the side information is obtained.

Fig. 10. Flexible prediction for side information.

B. Inter-View Disparity Prediction

In multiview video, there exist inter-view texture and motion correlations. To fully utilize the inter-view correlation, a prediction method using frames from adjacent views is proposed here. Frames at the same time instant in different views are captured by cameras at different angles and locations. This kind of disparity between adjacent views can be described with global motion models, which have been used extensively for pixel prediction in existing MVC schemes. In this paper, we propose to use a six-parameter affine model [25] to exploit the inter-view correlations. The affine model can be described by the following equation:

$$x' = a_1 x + a_2 y + a_3, \qquad y' = a_4 x + a_5 y + a_6 \qquad (13)$$

where $(x, y)$ and $(x', y')$ represent the locations in the current frame and the reference frame, respectively, and $a_1, \ldots, a_6$ are the global motion parameters. As shown in Fig. 9, when inter-view prediction is used, a block in the W-frame can be predicted by finding the corresponding pixels in the two co-located frames, i.e., the frames at the same time instant in the left view and the right view of the W-frame.
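The inter-view prediction with a six-parameter affine model can be sketched as below. This is only an illustration under stated assumptions: the parameter ordering `(a1, ..., a6)`, the function names, the use of a single reference view, rounding to the nearest integer position, and the fill value for positions that map outside the reference frame are all choices made here, and the estimation of the global motion parameters themselves is not shown.

```python
import math
from typing import List, Sequence, Tuple

Frame = List[List[int]]


def affine_map(x: int, y: int, params: Sequence[float]) -> Tuple[float, float]:
    """Map a current-frame position (x, y) to the reference frame."""
    a1, a2, a3, a4, a5, a6 = params
    return a1 * x + a2 * y + a3, a4 * x + a5 * y + a6


def warp_predict(ref: Frame, params: Sequence[float], fill: int = 0) -> Frame:
    """Predict a frame by sampling `ref` at the affine-mapped positions.

    Fractional positions are rounded to the nearest integer position
    (ties round up). Pixels that map outside `ref` receive `fill`.
    """
    height, width = len(ref), len(ref[0])
    pred = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            xr, yr = affine_map(x, y, params)
            xi, yi = math.floor(xr + 0.5), math.floor(yr + 0.5)
            if 0 <= xi < width and 0 <= yi < height:
                pred[y][x] = ref[yi][xi]
    return pred


if __name__ == "__main__":
    left_view = [[1, 2, 3], [4, 5, 6]]
    # Pure horizontal shift by one pixel: x' = x + 1, y' = y.
    print(warp_predict(left_view, (1, 0, 1, 0, 1, 0)))
```

A pure translation is used in the example only because its result is easy to check by hand; rotation, scaling, and shear are expressed through the other four parameters.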

Fig. 11. Simulated results for: (a) Foreman, (b) Mother&Daughter, and (c) Akiyo.

For a pixel at a given coordinate within a block of the W-frame, the inter-view prediction value is given by (14): it is taken from the reference view at the position obtained by applying the global warp toward the left view or the right view to the pixel coordinate. Note that the corresponding position may be fractional. In this case, the closest integer position is chosen.

C. Flexible Prediction

For WZ-based MVC, the side information can be generated from the temporal direction or the view direction, which can compensate for the drawbacks of mono-view DVC schemes. Since the current frame is not available when prediction is done in the DVC scenario, the motion information cannot be estimated accurately. In this case, inter-view correlation becomes more helpful than it is in traditional video coding schemes. We therefore propose a more flexible and accurate side information generation algorithm that considers both the temporal direction and the view direction. The key point is how to judge when temporal correlation or inter-view correlation should be used. In our previous work [8], we have shown that there is strong motion correlation between two adjacent views. Thus, we propose to use the motion information of the adjacent views as an indicator of prediction reliability for the current view.

As shown in Fig. 10, consider a pixel in the current W-frame and its prediction, i.e., the side information. We want to find the best prediction value for this pixel. For convenience, we take the left view as an example to describe the algorithm. First, we find the corresponding pixel in the left view using the global warp. Then, the block containing this corresponding pixel is located. After temporal mode decision is performed for this block between the key frames of the left view, the coding mode of the block is selected. The same procedure is then used to obtain the coding mode for the right view. The mode decision scheme of the H.264 reference software is adopted here. Finally, we use the two coding modes from the left and right views to decide the prediction direction of the pixel.
If the two modes are both inter-modes, the value of is computed using temporal prediction method. Otherwise, it is computed using inter-view prediction method. It should be noted that, in practical implementation, the temporal prediction frame generated using and, and the motion search in left and right views can be finished in advance.
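The selection rule above can be sketched as follows. This is only an illustrative sketch: the function name, the per-pixel mode maps, and the `"inter"`/`"intra"` labels are hypothetical, and the temporal prediction, inter-view prediction, and warped mode decisions are assumed to have been computed beforehand.

```python
from typing import List

Frame = List[List[int]]      # frame[y][x], one sample per pixel
ModeMap = List[List[str]]    # hypothetical labels: "inter" or "intra" per pixel


def flexible_prediction(temporal: Frame, inter_view: Frame,
                        mode_left: ModeMap, mode_right: ModeMap) -> Frame:
    """Build the side information frame pixel by pixel.

    mode_left / mode_right hold the coding mode of the block that each pixel
    maps to in the left / right view (after the global warp).
    """
    height, width = len(temporal), len(temporal[0])
    side_info = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            both_inter = (mode_left[y][x] == "inter"
                          and mode_right[y][x] == "inter")
            # Both neighbours found good temporal matches: trust temporal prediction.
            # Otherwise fall back to the inter-view prediction.
            side_info[y][x] = temporal[y][x] if both_inter else inter_view[y][x]
    return side_info


# Usage on a one-row toy frame.
side = flexible_prediction(
    temporal=[[1, 2, 3]],
    inter_view=[[9, 8, 7]],
    mode_left=[["inter", "inter", "intra"]],
    mode_right=[["inter", "intra", "intra"]],
)
```

Because the decision depends only on the neighbouring views' modes, it needs no information from the absent W-frame itself.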


Fig. 12. Typical side information frame generated from inter-view prediction, temporal prediction and flexible prediction. (a) Inter-view interpolation. (b) Temporal interpolation. (c) Reconstructed WZ frame using flexible prediction method.

V. EXPERIMENTAL RESULTS

A. Evaluation of Wavelet-Based WZ Video Coding

In order to verify the coding efficiency of the proposed wavelet-based WZ video coding scheme, results for three test sequences, Foreman, Mother&Daughter, and Akiyo (QCIF, 4:2:0), are presented. In each sequence, 200 frames are selected and the GOP structure is IWIW, where each I-frame is encoded with the H.263+ intra-coder and each W-frame is encoded with the proposed WZ coder. To evaluate the performance of WZ coding, we change the quality of the side information for W-frames according to the corresponding I-frames. At each point, the scheme chooses the best performance by adjusting the number of coded bit planes of the WZ coefficients. All frames, including I-frames and W-frames, are counted in the results.

Fig. 11 shows the rate-distortion (R-D) curves. We approximate the parameters of the Laplacian model by fitting the difference between the reconstructed I-frame and its side information frame, and hence each bit plane may have a different value. The curve of H.263+ I-frame indicates the results of intra-coding the W-frames using H.263+, which is taken as the benchmark. The curve of H.263+ IBIB is taken as the upper bound since joint encoding is used in this case. We also take pixel-domain WZ coding as an anchor, denoted as pixel domain. In pixel-domain coding, the pixels of a W-frame are quantized into different levels without being transformed. Then, the quantized pixels are sent to the turbo coder. Similar to [14], we choose three quantization levels: 2, 4, and 16. The difference is that we replace fixed side information generation by dynamic generation. Thus, the best quantization level at each point is chosen from them.
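As a rough illustration of the Laplacian fitting mentioned above, the sketch below estimates the parameter of a zero-mean Laplacian density f(d) = (alpha / 2) exp(-alpha |d|) from the sample differences between two frames. The paper does not state which estimator is used, so the zero-mean assumption, the maximum-likelihood estimate alpha = N / sum(|d|), and the function name are all assumptions made for illustration.

```python
from typing import List

Frame = List[List[int]]  # frame[y][x], one sample per pixel


def fit_laplacian_alpha(frame: Frame, side_info: Frame) -> float:
    """Maximum-likelihood alpha for a zero-mean Laplacian on frame - side_info.

    For f(d) = (alpha / 2) * exp(-alpha * |d|), the log-likelihood is
    N * log(alpha / 2) - alpha * sum(|d|), which is maximised at
    alpha = N / sum(|d|).
    """
    abs_sum = 0
    count = 0
    for row, row_si in zip(frame, side_info):
        for p, q in zip(row, row_si):
            abs_sum += abs(p - q)
            count += 1
    if abs_sum == 0:
        raise ValueError("identical frames: alpha is unbounded")
    return count / abs_sum


# Usage: differences are 1, -2, 0, -1, so sum(|d|) = 4 over N = 4 samples.
alpha = fit_laplacian_alpha([[10, 12], [8, 10]],
                            [[9, 14], [8, 11]])
```

A larger alpha means the side information is closer to the frame, so fewer parity bits should be needed; fitting per bit plane or per subband would simply call the function on the corresponding subset of samples.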
Results of the wavelet-based approach and the SPIHT-based approach are also given, denoted as Wavelet and SPIHT, respectively. The former uses only the DWT, without the proposed SPIHT-based coding. In particular, the transform coefficients of a subband are quantized to a fixed length, and the quantized coefficients are coded using turbo codes. According to the test results, the proposed SPIHT-based approach is much better than H.263+ intra-coding, and also outperforms the wavelet-domain WZ coding by up to 1.2 dB.

B. Evaluation of WZ-Based MVC

In order to verify the coding efficiency of the proposed WZ-based MVC approach, experiments on real multiview video sequences are also performed. The test sequences Race1 and Crowd [26], Rena, and Breakdancers are used in the testing. We choose three views from each sequence with 128 frames in each view. The GOP structure is IWIW, and hence a W-frame can be predicted from its forward and backward I-frames and/or the co-located I-frames from adjacent views. The flexible prediction method considering both temporal prediction and inter-view prediction is used to generate the side information at the decoder.

Fig. 13. Coding performance of proposed scheme for Crowd (left-top), Race1 (right-top), Rena (left-bottom) and Breakdancers (right-bottom).

Before giving the overall coding performance, we first show the visualized prediction results. Fig. 12 shows the results of the side information interpolation and the reconstructed W-frame. Fig. 12(a) is the side information generated only from inter-view prediction and Fig. 12(b) is the side information generated only from temporal prediction. We can clearly observe that the temporal prediction method performs well in most areas with constant motion or static content. However, in areas with high motion (e.g., the boxed region), occlusions and newly emerging regions may occur at the boundaries of the moving objects, and large prediction errors occur in this case. For inter-view prediction, the situation is different. In the regions that cannot be predicted well by temporal prediction, inter-view prediction can give good compensation. If hybrid video coding were applied, these regions would most probably be intra-coded blocks even when traditional motion estimation is performed. With the flexible prediction method proposed above, we can generate a side information frame with far fewer errors than the cases shown in Fig. 12(a) and (b). The reconstructed WZ frame using the flexible prediction method is shown in Fig. 12(c). The high-quality side information frame is very helpful for improving the coding efficiency, while the bits of WZ coding are still necessary to remove the large local errors.

The overall coding performance is also evaluated. Fig. 13 gives the R-D curves of the four multiview videos. We compare three WZ coding methods, as shown in the figure.
The curve of H.263+ I-frames indicates the results of H.263+ intra-coding; the curve of Temporal indicates WZ coding with temporal prediction only; the curve of Temporal_View indicates WZ coding with joint temporal and inter-view prediction; the curve of H.263+ IBIB represents simulcast coding using H.263+; the curve of H.263+ View represents the results of H.263+ using both temporal and inter-view prediction; and the curve of H.263+ Z-M represents simulcast coding using H.263+ with zero motion, which is used for the complexity-performance analysis later. As shown in Fig. 13, the proposed MVC approach with joint temporal and inter-view prediction outperforms H.263+ intra-coding by up to 7 dB, and also outperforms WZ coding with only temporal prediction by up to 1.5 dB. In other words, at the same PSNR value, the proposed method can significantly reduce the bit rate compared to H.263+ intra-coding. For example, for Crowd, the PSNR of the I-frames used for side information generation is about 36 dB. According to the R-D curves, at the point of 36 dB, more than 50% of the bits can be saved.

We also use inter-view prediction coding of H.263+ with B-frames as an anchor. Compared to the simulcast case, each B-frame has four reference frames in this case: two from the temporal direction and two from adjacent views. From the figure, we can see that the gap between the simulcast case and the inter-view prediction case is small. This is mainly because inter-view prediction is only utilized in B-frames, whereas the B-frames have already been well predicted from their two temporal neighbors. From the figure, we can also observe that inter-view prediction is more helpful in the distributed MVC scenario than in the traditional MVC scenario.

The complexity of the WZ encoder mainly comes from the DWT, SPIHT coding, entropy coding of the significance bits, and turbo coding. DWT decomposition, entropy coding, and turbo coding have computing complexity similar to that of the transform and variable length coding in a traditional video encoder. The SPIHT coding itself uses a very fast approach that precomputes all significant coefficients in one pass, so its complexity is similar to a table look-up. Therefore, the WZ scheme has complexity similar to I-frame coding without intra-prediction. Moreover, the buffer needed by a W-frame is smaller than that needed by a B-frame. In summary, the overall cost of the WZ scheme is much less than that of the IBIB scheme.

Further, we compare the proposed scheme with H.263+ IBIB coding with zero motion (H.263+ Z-M). In this case, the complexity of B-frames is largely reduced and becomes similar to that of WZ frames. Meanwhile, the temporal correlation is utilized at the encoder side, which can improve the coding efficiency compared with I-frames. As shown in the figure, for the sequences containing static background and simple motions (e.g., Rena and Breakdancers), the performance of H.263+ Z-M is better than that of the WZ method. However, for the sequences with complex local motions (e.g., Crowd and Race1), the performance of H.263+ Z-M becomes worse. It should be noted that the improvement in the latter case is more significant, because the corresponding sequences are hard to compress.
In other words, the proposed method is intended to achieve improvements in the difficult cases.

VI. CONCLUSION

In this paper, we have presented a novel MVC system based on WZ coding technology. In the proposed approach, a wavelet and SPIHT-based WZ video coding scheme is used as the core. A more flexible prediction method considering both temporal and inter-view prediction is then proposed to help the side information generation at the decoder. With the proposed scheme, inter-camera communication is avoided and the large computing complexity is moved from the encoder to the decoder. Meanwhile, the coding performance is very promising compared to traditional intra-coding.

REFERENCES

[1] B. Wilburn, N. Joshi, V. Vaish, M. Levoy, and M. Horowitz, "High speed video using a dense camera array," Proc. CVPR.
[2] J. Lou, H. Cai, and J. Li, "A real time interactive multiview video system," presented at the 13th ACM Int. Conf. Multimedia, Singapore, Nov.
[3] R. S. Wang and Y. Wang, "Multiview video sequence analysis, compression, and virtual viewpoint synthesis," IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 4, Apr.
[4] N. Grammalidis and M. G. Strintzis, "Disparity and occlusion estimation in multiocular systems and their coding for the communication of multiview image sequences," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 3, Jun.
[5] K. Mueller, P. Merkle, A. Smolic, and T. Wiegand, "Multiview coding using AVC," presented at the MPEG2006/m12945, 75th MPEG meeting, Bangkok, Thailand, Jan.
[6] P. Merkle, A. Smolic, K. Muller, and T. Wiegand, "Comparative study of MVC prediction structures," in Joint Video Team of ISO/IEC MPEG & ITU-T VCEG, Marrakech, Morocco, Jan. 2007, Doc. JVT-V13.
[7] W. Yang, Y. Lu, F. Wu, J. Cai, K.-N. Ngan, and S. Li, "4D wavelet-based multiview video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 11, Nov.
[8] X. Guo, Y. Lu, F. Wu, and W. Gao, "Inter-view direct mode in multiview video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 12, Dec.
[9] J. D. Slepian and J. K. Wolf, "Noiseless coding of correlated information sources," IEEE Trans. Inf. Theory, vol. IT-19, no. 7, Jul.
[10] A. D. Wyner and J. Ziv, "The rate-distortion function for source coding with side information at the decoder," IEEE Trans. Inf. Theory, vol. 22, no. 1, pp. 1-10, Jan.
[11] R. Puri and K. Ramchandran, "PRISM: A new robust video coding architecture based on distributed compression principles," presented at the 40th Allerton Conf. Communication, Control Computing, Allerton, IL, Oct.
[12] S. S. Pradhan and K. Ramchandran, "Distributed source coding using syndromes (DISCUS): Design and construction," IEEE Trans. Inf. Theory, vol. 49, no. 3, Mar.
[13] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero, "Distributed video coding," Proc. IEEE, vol. 93, no. 1, Jan.
[14] A. Aaron, S. Rane, E. Setton, and B. Girod, "Transform-domain Wyner-Ziv codec for video," presented at the SPIE Visual Commun. Image Process. Conf., San Jose, CA.
[15] X. Zhu, A. Aaron, and B. Girod, "Distributed compression for large camera arrays," in Proc. IEEE Workshop Statistical Signal Process., St. Louis, MO, Sep. 2003.
[16] X. Guo, Y. Lu, F. Wu, W. Gao, and S. Li, "Distributed multiview video coding," in Proc. SPIE Visual Commun. Imaging, San Jose, CA, Jan. 2006, vol. 6077.
[17] X. Guo, Y. Lu, F. Wu, W. Gao, and S. Li, "Wyner-Ziv video coding based on set partitioning in hierarchical tree," in Proc. ICIP, 2006.
[18] A. Said and W. Pearlman, "A new, fast and efficient image codec based on set partitioning in hierarchical trees," IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, Jun.
[19] D. Rowitch and L. Milstein, "On the performance of hybrid FEC/ARQ systems using rate compatible punctured turbo codes," IEEE Trans. Commun., vol. 48, no. 6, Jun.
[20] C. Berrou and A. Glavieux, "Near optimum error correcting coding and decoding: Turbo-codes," IEEE Trans. Commun., vol. 44, no. 10, Oct.
[21] Z. Xiong, A. Liveris, and S. Cheng, "Distributed source coding for sensor networks," IEEE Signal Process. Mag., vol. 21, no. 5, Sep.
[22] A. Liveris, Z. Xiong, and C. Georghiades, "Compression of binary sources with side information at the decoder using LDPC codes," IEEE Commun. Lett., vol. 6, no. 10, Oct.
[23] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, Jul.
[24] M. Flierl and P. Vandergheynst, "Distributed coding of dynamic scenes with motion-compensated wavelets," presented at the IEEE MMSP, Siena, Italy, Sep.
[25] F. Dufaux and J. Konrad, "Efficient, robust, and fast global motion estimation for video coding," IEEE Trans. Image Process., vol. 9, Mar.
[26] R. Kawada, "KDDI multiview video sequences for MPEG 3DAV use," in 68th MPEG Meeting, Munich, Germany, Mar. 2004, MPEG2004/M10533.

Xun Guo received the B.S., M.S., and Ph.D. degrees from Harbin Institute of Technology (HIT), Harbin, China, in 1999, 2001, and 2007, respectively, all in computer science. He was with the Joint R&D Lab (JDL) for advanced computing and communication, Chinese Academy of Sciences, Beijing, China, beginning in 2001. He was with Microsoft Research Asia (MSRA) as an intern from 2004 to 2006. He joined MediaTek, Inc., Beijing, China, in 2007, where he is currently a Senior Researcher. His research interests include video coding and streaming, image processing, and HVS-based compression.

Yan Lu (S'02–M'07) received the B.S., M.S., and Ph.D. degrees from Harbin Institute of Technology (HIT), Harbin, China, in 1997, 1999, and 2003, respectively, all in computer science. He was a Research Assistant with the Department of Computer Science, City University of Hong Kong, Hong Kong SAR, beginning in 1999. He was with the Joint R&D Lab (JDL) for advanced computing and communication, Chinese Academy of Sciences, Beijing, China, during 2001 to 2004. Since April 2004, he has been with Microsoft Research Asia, where he is currently a Researcher. His research interests include image and video coding, multimedia streaming, and compression-enabled graphics applications. Dr. Lu won the IS&T/SPIE Visual Communications and Image Processing Best Paper Award.

Feng Wu (M'99–SM'06) received the B.S. degree in electrical engineering from Xidian University, Xi'an, China, in 1992 and the M.S. and Ph.D. degrees in computer science from Harbin Institute of Technology, Harbin, China, in 1996 and 1999, respectively. He joined Microsoft Research Asia, Beijing, China, as an Associate Researcher in 1999, and later became a Researcher. His research interests include image and video representation, media compression and communication, and computer vision and graphics. He has been an active contributor to ISO/MPEG and ITU-T standards. Some of his techniques have been adopted by MPEG-4 FGS, H.264/MPEG-4 AVC, and the coming H.264 SVC standard. He served as the chairman of the China AVS video group and led the efforts on developing the China AVS video standard 1.0. He has authored or co-authored over 100 conference and journal papers. He has about 30 U.S. patents granted or pending in video and image coding.

Debin Zhao received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology (HIT), Harbin, China, in 1985, 1988, and 1998, respectively. He joined the Department of Computer Science of HIT as an Associate Professor. Currently, he is a Professor with HIT and the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He has been a Research Fellow in the Department of Computer Science, City University of Hong Kong. His research interests include multimedia compression and its related applications. He has authored or co-authored over 80 publications. He received a National Science and Technology Progress Award (Second Prize). Dr. Zhao received an Excellent Teaching Award from the Baogang Foundation.

Wen Gao (M'99–SM'05) received the M.S. and Ph.D. degrees in computer science from Harbin Institute of Technology, Harbin, China, in 1985 and 1988, respectively, and the Ph.D. degree in electronics engineering from the University of Tokyo, Tokyo, Japan. He was a Research Fellow with the Institute of Medical Electronics Engineering, University of Tokyo, in 1992, and a Visiting Professor at the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. From 1994 to 1995, he was a Visiting Professor with the Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge. Currently, he is a Professor with the School of Electronic Engineering and Computer Science, Peking University, Beijing, China, a Professor in computer science at City University of Hong Kong, and an External Fellow of the International Computer Science Institute, University of California, Berkeley. He has published seven books and over 200 scientific papers. His research interests are in the areas of signal processing, image and video communication, computer vision, and artificial intelligence. Dr. Gao is Editor-in-Chief of the Chinese Journal of Computers. He chairs the Audio Video coding Standard (AVS) workgroup of China. He is the head of the Chinese National Delegation to the MPEG working group (ISO/SC29/WG11).
