Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6)
9th Meeting: San Diego, 2-5 September 2003

Document: JVT-I032d1
Filename: JVT-I032d5.doc
Title: SNR-scalable Extension of H.264/AVC
Status: Input Document to JVT
Purpose: Information
Author(s) or Contact(s): Heiko Schwarz, Detlev Marpe, and Thomas Wiegand
  Image Processing Department, Einsteinufer 37, D-10587 Berlin, Germany
  Tel: +49 30 31002 617
  Email: hschwarz@hhi.de, marpe@hhi.de, wiegand@hhi.de
Source: Fraunhofer Institute for Telecommunications - Heinrich Hertz Institute (HHI)

Abstract

This document contains a description of an SNR-scalable extension of H.264/AVC [1]. To achieve an efficient SNR-scalable bitstream representation of a video sequence, the temporal dependency between pictures is coded using an open-loop subband approach. In this codec, most components of H.264/AVC are used as specified in the standard, while only a few have been adjusted to the subband coding structure. We have tested a first version of the approach with QCIF and CIF resolution sequences, obtaining promising results.

1. Introduction

Inspired by recent advances [2][3][4] in temporal subband coding of video sequences, we have investigated the possibility of an SNR-scalable extension of H.264/AVC. The main enabler of these recent advances is the lifting representation of a filterbank, as originally suggested in [5]. This lifting representation of temporal subband decompositions permits the use of known methods for motion-compensated prediction. Moreover, most other components of a hybrid video codec such as H.264/AVC can be used without modification, while only a few parts need to be adjusted. In the following, a brief review of the lifting framework is given together with a presentation of two practically important examples. The generic lifting scheme consists of three steps, the polyphase decomposition, the prediction step, and the update step, as depicted in Figure 1 (a).
In the following, we describe these steps as performed by the analysis filterbank, i.e., at the encoder side. The polyphase decomposition separates the even and the odd samples of a given signal s[k]. In the case of temporal subband coding of video sequences, the samples s[k] correspond to pictures, but for simplicity the s[k] are assumed to be scalar values for now. Since the correlation structure typically shows a local characteristic, the even and odd polyphase components are highly correlated, and therefore, in a subsequent step, a prediction of the odd samples from the even samples is performed. The corresponding prediction operator P for each odd sample s_odd[k] = s[2k+1] is a linear combination of its neighboring even samples s_even[k] = s[2k], i.e.,

  P(s_even)[k] = sum_l p_l * s_even[k + l].

File: JVT-I032d6.doc  Page: 1  Date Saved: 2003-09-01
As a result of the prediction step, the odd samples are replaced by their corresponding prediction residuals

  h[k] = s_odd[k] - P(s_even)[k].

Note that the prediction step is equivalent to applying the high-pass filter of a two-channel filterbank [6], and in the case of video sequence coding it is similar to motion-compensated prediction, e.g., as described in [1].

[Figure 1: The lifting representation of an analysis filterbank (a), and the inverse lifting representation of the corresponding synthesis filterbank (b).]

Finally, in the update step of the lifting scheme, a low-pass filtering is performed by updating the even samples s_even[k] with a linear combination of the prediction residuals h[k]. The corresponding update operator U is given by

  U(h)[k] = sum_l u_l * h[k + l].

By replacing the even samples with

  l[k] = s_even[k] + U(h)[k],

the given signal s[k] can finally be represented by l[k] and h[k], each at half the temporal sampling rate of s[k]. Since both the update and the prediction step are fully invertible, the corresponding transform can be interpreted as a critically sampled perfect reconstruction filterbank. In fact, it can be shown that any biorthogonal family of FIR filters can be realized with a sequence of prediction and update steps [6]. For a normalization of the low- and high-pass components, appropriately chosen scaling factors F_l and F_h are applied, respectively. At the decoder side, the inverse of the described lifting scheme is performed; the corresponding synthesis filterbank is shown in Figure 1 (b). The synthesis filterbank simply applies the prediction and update operators in reversed order and with inverted signs in the summation process, followed by the reconstruction of the signal from the even and odd polyphase components.
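The three lifting steps and their exact inversion can be sketched numerically for scalar samples s[k]. This is an illustrative sketch (the function names are ours), shown here with the Haar operators as one concrete choice of P and U:

```python
def analyze(s, predict, update):
    """Polyphase split, prediction step, update step (analysis side)."""
    even = s[0::2]                       # s_even[k] = s[2k]
    odd = s[1::2]                        # s_odd[k]  = s[2k+1]
    h = [odd[k] - predict(even, k) for k in range(len(odd))]
    l = [even[k] + update(h, k) for k in range(len(even))]
    return l, h

def synthesize(l, h, predict, update):
    """Inverse lifting: same operators, reversed order, inverted signs."""
    even = [l[k] - update(h, k) for k in range(len(l))]
    odd = [h[k] + predict(even, k) for k in range(len(h))]
    s = [0.0] * (len(even) + len(odd))
    s[0::2], s[1::2] = even, odd         # re-interleave the polyphase components
    return s

# Haar operators as a concrete instance of P and U
p_haar = lambda even, k: even[k]         # P(s_even)[k] = s[2k]
u_haar = lambda h, k: 0.5 * h[k]         # U(h)[k] = h[k]/2

s = [3.0, 5.0, 4.0, 8.0]
l, h = analyze(s, p_haar, u_haar)        # l = [4.0, 6.0], h = [2.0, 4.0]
assert synthesize(l, h, p_haar, u_haar) == s   # perfect reconstruction
```

Because the synthesis side subtracts exactly what the analysis side added, reconstruction is exact for any choice of the operators P and U.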
The Haar wavelet

In the case of the Haar wavelet, the prediction and update operators are simply given by

  P_Haar(s_even)[k] = s[2k]  and  U_Haar(h)[k] = (1/2) h[k],

such that

  h[k] = s[2k+1] - s[2k]  and  l[k] = s[2k] + (1/2) h[k] = (1/2) (s[2k] + s[2k+1])

correspond to the (non-normalized) high-pass and low-pass (analysis) output of the Haar filter, respectively. It should be noted that there is a correspondence between the Haar wavelet and predictive coding as specified, e.g., for P slices in [1], as described later.
The 5/3 bi-orthogonal spline wavelet

The low- and high-pass analysis filters of the 5/3 spline wavelet have 5 and 3 taps, respectively, and its corresponding scaling function is a B-spline of order 2, hence the naming of the wavelet filter. Its simplicity together with its remarkably good performance in still image coding applications (like JPEG2000) recommends its use in a temporal subband coding scheme. In the lifting framework, the corresponding prediction and update operators of the 5/3 transform are given by

  P_5/3(s_even)[k] = (1/2) (s[2k] + s[2k+2])  and  U_5/3(h)[k] = (1/4) (h[k] + h[k-1]).

It should be noted that there is a correspondence between the 5/3 bi-orthogonal spline wavelet and bi-predictive coding as specified, e.g., for B slices in [1], as described later.

[Figure 2: A temporal subband codec with the encoder containing the analysis filterbank and the quantizer (transform, scaling, quantization) and the decoder containing the inverse transform and scaling and the synthesis filterbank.]
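A minimal sketch of the 5/3 lifting steps; the symmetric extension at the signal borders is our assumption, since the border treatment is not specified above:

```python
def analyze_53(s):
    """5/3 lifting analysis; repeats the border sample (our edge-handling assumption)."""
    even, odd = s[0::2], s[1::2]
    n = len(odd)
    # P_5/3: average of the two neighboring even samples
    h = [odd[k] - 0.5 * (even[k] + even[min(k + 1, len(even) - 1)]) for k in range(n)]
    # U_5/3: quarter of the current and previous high-pass samples
    l = [even[k] + 0.25 * (h[min(k, n - 1)] + h[max(k - 1, 0)]) for k in range(len(even))]
    return l, h

def synthesize_53(l, h):
    """Inverse lifting: same operators, reversed order, inverted signs."""
    n = len(h)
    even = [l[k] - 0.25 * (h[min(k, n - 1)] + h[max(k - 1, 0)]) for k in range(len(l))]
    odd = [h[k] + 0.5 * (even[k] + even[min(k + 1, len(even) - 1)]) for k in range(n)]
    s = [0.0] * (len(even) + len(odd))
    s[0::2], s[1::2] = even, odd
    return s

s = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
l, h = analyze_53(s)
assert h == [0.0, 0.0, 1.0]           # linear ramp: interior high-pass samples vanish
assert synthesize_53(l, h) == s       # perfect reconstruction
```

Note that a linear signal yields zero interior high-pass samples, reflecting the two vanishing moments of the 5/3 high-pass filter.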
2. Description of the codec

2.1. Analysis-synthesis filterbank

Figure 2 depicts the general structure of the utilized filterbank. The depicted filterbank shows a 4-layer dyadic temporal decomposition of the video signal, requiring the processing of 2^4 = 16 pictures to arrive at the lowest temporal resolution representation. The delay introduced by this approach is also 16 pictures, making it unsuitable for interactive applications such as videoconferencing. The depicted filterbank utilizes the iterated application of the Haar-based motion-compensated lifting scheme, which consists of a motion-compensated prediction step (M_i0) as in H.264/AVC and a motion-compensated update step (M_i1). Both the prediction and the update step utilize the motion compensation process as specified in [1], followed by the deblocking filter process as specified in [1]. In each stage of the analysis filterbank, two pictures (either original pictures or pictures representing low-pass signals generated in a previous analysis stage) are decomposed into a low-pass signal, which can be considered a representation of what the input pictures have in common, and a high-pass signal, which can be considered a representation of the difference between the input pictures. In the corresponding stage of the synthesis filterbank, the two input pictures are reconstructed from the low- and high-pass signals. Since the synthesis step performs the inverse operations of the analysis step, the analysis-synthesis filterbank (without quantization) guarantees perfect reconstruction. When both motion fields M_i0 and M_i1 are equal to zero, the basic temporal decomposition-composition scheme corresponds to a lifting representation of the Haar filter as discussed in Section 1. In the following, the prediction and update steps of the analysis and synthesis process are described in more detail.
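Neglecting motion (M_i0 = M_i1 = 0), the integer lifting used in Secs. 2.1.2-2.1.5 reduces to h = b - a and l = a + (h >> 1), and the iterated dyadic decomposition of a 16-picture GOP can be sketched with scalar stand-ins for pictures:

```python
# Python's >> floor-divides, matching the bit-shift in the update step,
# so the cascade inverts exactly in integer arithmetic.

def decompose(gop):
    highs = []
    while len(gop) > 1:                  # one analysis stage per iteration
        a, b = gop[0::2], gop[1::2]
        h = [bi - ai for ai, bi in zip(a, b)]
        l = [ai + (hi >> 1) for ai, hi in zip(a, h)]
        highs.append(h)                  # 8, 4, 2, 1 high-pass pictures
        gop = l
    return gop[0], highs                 # 1 low-pass + 15 high-pass pictures

def reconstruct(low, highs):
    gop = [low]
    for h in reversed(highs):            # inverse stages in reverse order
        a = [li - (hi >> 1) for li, hi in zip(gop, h)]
        b = [hi + ai for ai, hi in zip(a, h)]
        gop = [x for pair in zip(a, b) for x in pair]
    return gop

pics = list(range(100, 116))             # 16 stand-in "pictures"
low, highs = decompose(pics)
assert sum(len(h) for h in highs) == 15  # 2^4 - 1 residual pictures
assert reconstruct(low, highs) == pics   # perfect reconstruction
```

The count of one low-pass plus 2^4 - 1 high-pass pictures matches the dyadic tree structure described in Sec. 2.2.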
The motion fields M_i0 and M_i1 generally specify the motion between two pictures using a subset of the P slice syntax of H.264/AVC [1]. For the motion fields M_i0 used by the prediction steps, we incorporated an intra macroblock type, in which the (motion-compensated) prediction signal for a macroblock is specified by a 4x4 array of luma transform coefficient levels and two 2x2 arrays of chroma transform coefficient levels, similar to the INTRA_16x16 macroblock type of H.264/AVC with all AC coefficients set to zero. In the motion fields M_i1 used for the update steps, this macroblock type is not included.

2.1.1 General motion-compensated prediction

This section describes a general motion-compensated prediction process, which is used by the prediction and update steps at both the analysis and synthesis side. Input to this process is a reference picture R, a quantization parameter QP (if required), and a block-wise motion field M with the following properties: For each macroblock of the motion-compensated picture P, the motion field M specifies a macroblock mode, which can be equal to P_16x16, P_16x8, P_8x16, P_8x8, or INTRA. When the macroblock mode is equal to P_8x8, for each 8x8 sub-macroblock a corresponding sub-macroblock mode is specified (P_8x8, P_8x4, P_4x8, P_4x4). If the macroblock mode is equal to INTRA, the generation of the prediction signal is specified by a 4x4 array of luminance coefficient levels and two 2x2 arrays of chrominance coefficient levels. Otherwise, the generation of the prediction signal is specified by one motion vector with quarter-sample accuracy for each macroblock or sub-macroblock partition.
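One possible in-memory layout for such a motion field description is sketched below; the names are ours for illustration only and are not part of any normative syntax:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MacroblockMotion:
    mode: str                          # "P_16x16", "P_16x8", "P_8x16", "P_8x8", "INTRA"
    sub_modes: list = field(default_factory=list)  # per 8x8 block, only when mode == "P_8x8"
    mvs: list = field(default_factory=list)        # one quarter-sample MV per partition
    luma_dc: Optional[list] = None     # 4x4 array of luma DC levels (INTRA only)
    chroma_dc: Optional[list] = None   # two 2x2 arrays of chroma DC levels (INTRA only)

def valid_for_update(mb: MacroblockMotion) -> bool:
    """Update-step fields M_i1 do not include the INTRA macroblock type."""
    return mb.mode != "INTRA"

assert valid_for_update(MacroblockMotion(mode="P_16x16", mvs=[(1, -2)]))
assert not valid_for_update(MacroblockMotion(mode="INTRA", luma_dc=[[0] * 4] * 4))
```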
Given the reference picture R and the motion field description M, the prediction signal P is constructed in a macroblock-wise manner as described in the following: If the macroblock mode specified in M is not equal to INTRA, for each macroblock or sub-macroblock partition the following applies:
o The luma and chroma samples of the picture P that are covered by the regarded macroblock or sub-macroblock partition are obtained by quarter-sample accurate motion-compensated prediction as specified in [1]:

  p[i,j] = M_interp(r, i - m_x, j - m_y),

where [m_x, m_y]^T is the motion vector of the regarded macroblock or sub-macroblock partition given by M, r[] is the array of luma or chroma samples of the reference picture R, and M_interp(.) represents the interpolation process specified for the motion-compensated prediction in H.264/AVC, with the exception that the clipping to the interval [0; 255] is removed.
Otherwise (the macroblock mode is equal to INTRA), the following applies:
o The given 4x4 array of luminance transform coefficient levels is treated as the array of DC luma coefficient levels for the INTRA_16x16 macroblock type in H.264/AVC, and the inverse scaling and transform process specified in [1] is applied using the given quantization parameter QP, while all AC transform coefficient levels are assumed to be equal to zero. As a result, a 16x16 array res[] of residual luma samples is obtained. The luma samples of the prediction picture P covering the regarded macroblock are constructed according to p[i,j] = 128 + res[i,j]. Note that for each 4x4 luma block, the obtained prediction signal p[] is constant and represents an approximation of the average of the original 4x4 luma block.
o For each chrominance component, the given 2x2 array of chrominance transform coefficient levels is treated as the array of DC chroma coefficient levels, and the inverse scaling and transform process for chroma coefficients specified in [1] is applied using the given quantization parameter QP, while all AC transform coefficient levels are assumed to be equal to zero. As a result, an 8x8 array res[] of residual chroma samples is obtained. The chroma samples of the prediction picture P covering the macroblock are constructed according to p[i,j] = 128 + res[i,j]. Note that for each 4x4 chroma block, the obtained prediction signal p[] is constant and represents an approximation of the average of the original 4x4 chroma block.

After generating the whole prediction picture P, the de-blocking filter as specified in [1] is applied to that prediction picture, whereas the derivation of the boundary filter strength is based
only on the macroblock modes (information about INTRA) and the motion vectors specified in the motion description M; furthermore, the clipping to the interval [0; 255] is removed.

As can be seen from the above description, the general process of generating (motion-compensated) prediction pictures is nearly identical to the reconstruction process of P slices as described in H.264/AVC [1]. The following differences can be identified:
o Removal of the clipping to the interval [0; 255] in the processes of motion-compensated prediction and de-blocking.
o Simplified INTRA mode reconstruction without intra prediction and with all AC transform coefficient levels set to zero.
o Simplified reconstruction for motion-compensated prediction modes without residual information.

2.1.2 Prediction step at analysis (encoder) side

Given two input pictures A and B, as well as the motion vector array M_i0 for the block-based motion compensation of picture A towards picture B and a quantization parameter QP, the following operations are performed to obtain a residual picture H:
- The picture P representing a prediction of the picture B is generated by invoking the process specified in Sec. 2.1.1 with the reference picture A, the motion field description M_i0, and the quantization parameter QP as input.
- The residual picture H is generated by h[i,j] = b[i,j] - p[i,j], where h[], b[], and p[] represent the luma or chroma sample arrays of the pictures H, B, and P, respectively.

2.1.3 Update step at analysis (encoder) side

Given the input picture A, the residual picture H obtained in the prediction step, as well as the motion vector array M_i1 for the block-based motion compensation of picture B towards picture A, the following operations are performed to obtain a picture L representing the temporal low-pass signal:
- A picture P is generated by invoking the process specified in Sec. 2.1.1 with the reference picture H and the motion field description M_i1 as input.
- The low-pass picture L is generated by l[i,j] = a[i,j] + (p[i,j] >> 1), where l[], a[], and p[] represent the luma or chroma sample arrays of the pictures L, A, and P, respectively.

2.1.4 Update step at synthesis (decoder) side

Given the (quantized/reconstructed) low-pass picture L', the quantized residual picture H', as well as the motion vector array M_i1, the following operations are performed to obtain the decoded picture A':
- The picture P' is generated by invoking the process specified in Sec. 2.1.1 with the reference picture H' and the motion field description M_i1 as input.
- The reconstructed picture A' is generated by a'[i,j] = l'[i,j] - (p'[i,j] >> 1), where a'[], l'[], and p'[] represent the sample arrays of the pictures A', L', and P', respectively.

2.1.5 Prediction step at synthesis (decoder) side

Given the quantized residual picture H', the reconstructed picture A' obtained in the update step at the decoder, as well as the motion field M_i0, the following operations are performed to obtain the decoded picture B':
- A picture P' representing a prediction of the picture B' is generated by invoking the process specified in Sec. 2.1.1 with the reference picture A', the motion field description M_i0, and the quantization parameter QP as input.
- The reconstructed picture B' is generated by b'[i,j] = h'[i,j] + p'[i,j], where b'[], h'[], and p'[] represent the sample arrays of the pictures B', H', and P', respectively.

By cascading the basic pair-wise picture decomposition stages, a dyadic tree structure is obtained, which decomposes a group of 2^n pictures into 2^n - 1 residual pictures and a single low-pass (or intra) picture, as depicted in Figure 3 for a group of 8 pictures.

[Figure 3: Temporal decomposition of a group of 8 pictures over three stages into an intra picture (low-pass signal) and residual pictures (high-pass signals).]

It is worth noting that the inverse lifting steps at the decoder require twice the amount of motion compensation and deblocking filter operations compared to decoding the same number of pictures in a hybrid video decoder, where one I picture is coded and all remaining pictures are coded as P pictures.

2.2. General coding of pictures and motion fields (Base Layer)

For a group of 2^n pictures, (2^(n+1) - 2) prediction data arrays (motion vectors and intra predictors), (2^n - 1) residual pictures, as well as a single low-pass (or intra) picture have to be transmitted. We use slice data partitioning with a few modifications to map these data to NAL units.

Prediction data arrays

The prediction data arrays are coded using a subset of the H.264/AVC slice layer syntax consisting of the following syntax elements:
- slice header (with changed meaning of some elements)
- slice data (subset)
  o macroblock layer (subset)
    - mb_type (P_16x16, P_16x8, P_8x16, P_8x8, INTRA)
    - if( mb_type = = P_8x8 )
        sub_mb_type (P_8x8, P_8x4, P_4x8, P_4x4)
    - if( mb_type = = INTRA )
        mb_qp_delta
        residual blocks (only LUMA_DC and CHROMA_DC)
      else
        motion vector differences
  o end_of_slice_flag
- rbsp_slice_trailing_bits

The motion vector predictors are derived as specified in [1].

Residual pictures (high-pass signals)

The residual pictures are coded using a subset of the H.264/AVC slice layer syntax consisting of the following syntax elements:
- slice header (with changed meaning of some elements)
- slice data (subset)
  o macroblock layer (subset)
    - coded_block_pattern
    - mb_qp_delta
    - residual blocks
  o end_of_slice_flag
- rbsp_slice_trailing_bits

Low-pass pictures

The low-pass pictures are generally coded using the syntax of H.264/AVC [1]. In the simplest version, the low-pass pictures of each GOP are coded independently as intra pictures. The coding efficiency can be improved if the correlations between the low-pass pictures of successive GOPs are exploited. Thus, in a more general version, the low-pass pictures are coded as P pictures using reconstructed low-pass pictures of previous GOPs as references; intra (IDR) pictures are inserted at regular intervals only to provide random access points. The low-pass pictures are decoded and reconstructed as specified in [1], including the de-blocking filter operation.

2.3. SNR-Scalability: Coding of enhancement layers

The open-loop structure of the subband approach provides the possibility to efficiently incorporate SNR scalability. We propose a very simple SNR-scalable extension, in which the base layer is coded as described in Sec. 2.2, and the enhancement layers consist of refinement pictures for the subband signals, which are also coded using the residual picture syntax described in Sec. 2.2.
At the encoder side, residual pictures are computed between the original subband pictures generated by the analysis filterbank and the reconstructed subband pictures obtained after decoding the base or a previous enhancement layer. These residual pictures are quantized using a smaller quantization parameter than in the base or previous enhancement layer(s) and encoded using the residual picture syntax described in Sec. 2.2. At the decoder side, the subband representations of the base layer and the refinement signals of the various enhancement layers can be decoded independently, whereas the final enhancement layer subband representation is obtained by simply adding up the corresponding base layer and enhancement layer residual data, either in the transform or in the spatial domain. Our simulations have shown that the performance losses in comparison to the single-layer approach are reasonably small if the quantization parameters are decreased by a value of six from one layer to the next; this bisection of the quantization step size approximately results in a doubling of the bit-rate from one enhancement layer to the next.

3. Operational encoder control

3.1 Selection of the quantization parameters

When neglecting the motion and replacing the bit-shift to the right in the update step by a real-valued multiplication by a factor of 1/2, the basic two-channel analysis step can be normalized by multiplying the high-pass samples of the picture H by a factor of 1/sqrt(2) and the low-pass samples by a factor of sqrt(2). Since we omit this normalization in the realization of the analysis and synthesis filterbanks to keep the range of the sample values nearly constant, we have to take it into account during the quantization of the temporal subbands. For the basic two-channel analysis-synthesis filterbank, this can easily be done by quantizing the low-pass signal with half of the quantization step size that is used for quantizing the high-pass signal. This leads to the following quantizer selection process for the specified dyadic decomposition structure of a group of 2^n pictures: Let QP_L(n) be the quantization parameter used for coding the low-pass picture obtained after the n-th decomposition stage. The quantization parameters used for coding the high-pass pictures obtained after the i-th decomposition stage are derived by

  QP_H(i) = QP_L(n) + 3 * (n + 2 - i).

Within each temporal subband picture, the quantization parameter QP is held constant in the encoder version used for generating the simulation results.
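The quantizer selection above can be written as a small helper; the example values assume the settings used later in Sec. 4 (n = 4 decomposition stages, QP_L(4) = 28):

```python
# An increment of 3 QP values corresponds to scaling the quantization step
# size by about sqrt(2), matching the per-stage sqrt(2) normalization that
# the filterbank realization omits.

def high_pass_qp(qp_low: int, n: int, i: int) -> int:
    """QP for the high-pass pictures of stage i in an n-stage decomposition."""
    return qp_low + 3 * (n + 2 - i)

qps = {i: high_pass_qp(28, 4, i) for i in range(1, 5)}
# coarsest stage (i = 4): QP 34; finest stage (i = 1): QP 43
assert qps == {1: 43, 2: 40, 3: 37, 4: 34}
```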
The quantization parameter QP_INTRA(i) that is used for quantizing the intra prediction signals of the motion field descriptions M_(i-1)0, which are used in the i-th decomposition stage, is derived from the quantization parameter QP_H(i) for the high-pass pictures generated in this decomposition stage by

  QP_INTRA(i) = QP_H(i) - 6.

3.2 Motion estimation and mode selection

The motion field descriptions M_i0 and M_i1 that are used in the prediction and update steps, respectively, are estimated independently. In the following, the process for estimating the motion field description M_i0 used in the prediction step is described. The process for estimating M_i1 is obtained by interchanging the original pictures A and B and removing the INTRA mode from the set of possible macroblock modes.
Given the pictures A and B, which are either original pictures or pictures representing low-pass signals generated in a previous analysis stage, and the corresponding arrays of luma samples a[] and b[], the motion description M_i0 is estimated in a macroblock-wise manner by the following process: For all possible macroblock and sub-macroblock partitions of a macroblock i inside the picture B, the associated motion vectors m_i = [m_x, m_y]^T are obtained by minimizing the Lagrangian functional

  m_i = arg min_{m in S} { D_SAD(i, m) + lambda * R(i, m) }

with the distortion term being given as

  D_SAD(i, m) = sum_{(x,y) in P} | b[x, y] - a[x - m_x, y - m_y] |.

Here, S specifies the motion vector search range inside the reference picture A, P is the area covered by the regarded macroblock or sub-macroblock partition, R(i,m) specifies the number of bits needed to transmit all components of the motion vector m, and lambda is a fixed Lagrange multiplier. The motion search proceeds first over all integer-sample accurate motion vectors in the given search range S. Then, given the best integer motion vector, the eight surrounding half-sample accurate motion vectors are tested, and finally, given the best half-sample accurate motion vector, the eight surrounding quarter-sample accurate motion vectors are tested. For the half- and quarter-sample accurate motion vector refinement, the term a[x - m_x, y - m_y] has to be interpreted as an interpolation operator. The mode decision for the macroblock and sub-macroblock modes follows basically the same approach. From a given set of possible macroblock or sub-macroblock modes S_mode, the mode p_i that minimizes the following Lagrangian functional is chosen:

  p_i = arg min_{p in S_mode} { D_SAD(i, p) + lambda * R(i, p) }.
The distortion term is given as

  D_SAD(i, p) = sum_{(x,y) in P} | b[x, y] - a[x - m_x[p, x, y], y - m_y[p, x, y]] |,

where P specifies the macroblock or sub-macroblock area and m[p,x,y] is the motion vector associated with the macroblock or sub-macroblock mode p and the partition or sub-macroblock partition covering the luma location (x, y). The rate term R(i,p) represents the number of bits associated with choosing the coding mode p. For the motion-compensated coding modes it includes the bits for the macroblock type (if applicable), the sub-macroblock type(s) (if applicable), and the motion vector(s); for the INTRA mode, it includes the bits for the macroblock mode and the arrays of quantized luma and chroma transform coefficient levels. The set of possible sub-macroblock types is given by {P_8x8, P_8x4, P_4x8, P_4x4}, and the set of possible macroblock types is given by {P_16x16, P_16x8, P_8x16, P_8x8, INTRA}, where the INTRA type is only included if a motion field description M_i0 used for the prediction step is estimated.
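A toy version of this Lagrangian search, restricted to integer-sample motion vectors for a single block; the reference content, block position, and rate model R are illustrative assumptions, and sub-sample refinement and mode decision are omitted:

```python
def sad(blk, ref, bx, by, mx, my):
    """D_SAD with the document's convention p[x,y] = a[x - m_x, y - m_y]."""
    return sum(abs(blk[y][x] - ref[by + y - my][bx + x - mx])
               for y in range(len(blk)) for x in range(len(blk[0])))

def motion_search(blk, ref, bx, by, search, lam, rate):
    """Exhaustive integer-sample search minimizing D_SAD + lambda * R."""
    best_cost, best_mv = float("inf"), (0, 0)
    for my in range(-search, search + 1):
        for mx in range(-search, search + 1):
            cost = sad(blk, ref, bx, by, mx, my) + lam * rate(mx, my)
            if cost < best_cost:
                best_cost, best_mv = cost, (mx, my)
    return best_mv

ref = [[x * x + 3 * y * y for x in range(12)] for y in range(12)]
# the current 4x4 block at (4,4) equals the reference displaced by (mx,my) = (2,1)
blk = [[ref[4 + y - 1][4 + x - 2] for x in range(4)] for y in range(4)]
rate = lambda mx, my: abs(mx) + abs(my)      # crude proxy for the MV bits
assert motion_search(blk, ref, 4, 4, search=2, lam=0.5, rate=rate) == (2, 1)
```

The search trades a small rate penalty against distortion exactly as in the functional above; with a zero-SAD match available, the true displacement wins despite its nonzero rate term.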
The Lagrangian multiplier lambda is set in dependence on the base-layer quantization parameter QP_H(i) for the high-pass picture(s) of the decomposition stage for which the motion field is estimated:

  lambda = 0.33 * 2^(QP_H(i)/3 - 4).

3.3 Temporal placement of the low-pass signals

The basic two-channel analysis filterbank decomposes two input pictures A and B into a low-pass picture L and a high-pass picture H. Following the notation in this contribution, the low-pass picture L shares the coordinate system with the original picture A. Thus, assuming perfect (error-free) motion compensation, the pictures A and L are identical. The decomposition structure depicted in Figure 1 is obtained if, in all decomposition stages, the even input pictures at temporal sampling positions 0, 2, 4, ... are treated as input pictures A and the odd input pictures at temporal sampling positions 1, 3, 5, ... are treated as input pictures B. This scheme enables efficient temporal scalability, allowing temporal sub-sampling down to very small frame rates. However, the temporal distance between the pictures that are decomposed in each two-channel analysis filterbank increases by a factor of 2 from one decomposition stage to the next, and it is well known that the efficiency of motion-compensated prediction decreases when the temporal distance between the reference picture and the picture to be predicted increases. It is possible to realize decomposition schemes in which the temporal distance between the pictures to be decomposed by the two-channel filterbank increases by a factor of less than 2 from one decomposition stage to the next. However, such schemes do not provide the feature of efficient temporal scalability allowing temporal sub-sampling down to very small frame rates, since the distances between neighboring low-pass pictures vary in most of the decomposition stages.
In our simulations, we used the decomposition scheme depicted in Figure 4, which we believe constitutes a reasonable compromise between temporal scalability and coding efficiency. The sequence of original pictures is treated as a sequence of input pictures ABABAB...AB; thus, this scheme provides one stage of optimal temporal scalability (equal distance between the low-pass pictures). The sequences of low-pass pictures used as input to all following decomposition stages are treated as sequences of input pictures BAABBA...AB, whereby the distances between the low-pass pictures that will be decomposed in following two-channel analysis steps are kept small.
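Our reading of this placement can be checked with a short script that tracks, for a GOP of 16 pictures, which temporal position each surviving low-pass picture occupies and the temporal distance bridged by each two-channel analysis step; the pattern handling follows the AB/BAAB description above:

```python
def lowpass_positions(positions, first_stage):
    """One analysis stage: pair consecutive pictures, keep the A position."""
    kept, dists = [], []
    for k in range(len(positions) // 2):
        a, b = positions[2 * k], positions[2 * k + 1]
        dists.append(b - a)
        if first_stage:
            kept.append(a)                        # pattern A B A B ...: A comes first
        else:
            kept.append(b if k % 2 == 0 else a)   # pattern B A A B B A A B ...
    return kept, dists

pos, first, stages = list(range(16)), True, []
while len(pos) > 1:
    pos, dists = lowpass_positions(pos, first)
    stages.append((list(pos), dists))
    first = False

# surviving low-pass positions per stage:
assert [s[0] for s in stages] == [[0, 2, 4, 6, 8, 10, 12, 14],
                                  [2, 4, 10, 12], [4, 10], [10]]
# temporal distances bridged per stage stay small until the last stage:
assert [s[1] for s in stages] == [[1] * 8, [2, 2, 2, 2], [2, 2], [6]]
```

For comparison, the purely dyadic placement of Figure 1 would bridge distances of 1, 2, 4, and 8; here the distances grow only as 1, 2, 2, and 6, at the cost of unequal low-pass spacing in the intermediate stages.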
[Figure 4: Temporal placement of low-pass pictures for a group of 16 pictures (original GOP: A B A B A B A B A B A B A B A B; 1st stage: B A A B B A A B; 2nd stage: B A A B; 3rd stage: B A; 4th stage).]

4. Results

For evaluating the coding efficiency of the proposed SNR-scalable extension of H.264/AVC, we compared it to an H.264/AVC compliant encoder using a similar degree of encoder optimization. The set of input sequences for this comparison consists of six test sequences with widely varying content; all sequences have been encoded using different resolutions and frame rates as shown in Table 1.

Table 1: Test sequences

  Sequence           Duration   Resolutions and frame rates
  Basketball         6.4 sec    QCIF 10Hz, QCIF 15Hz, QCIF 30Hz, CIF 15Hz, CIF 30Hz
  Flowers & Garden   6.4 sec    QCIF 10Hz, QCIF 15Hz, QCIF 30Hz, CIF 15Hz, CIF 30Hz
  Foreman            9.6 sec    QCIF 10Hz, QCIF 15Hz, QCIF 30Hz, CIF 15Hz, CIF 30Hz
  Mobile & Calendar  9.6 sec    QCIF 10Hz, QCIF 15Hz, QCIF 30Hz, CIF 15Hz, CIF 30Hz
  Paris              9.6 sec    QCIF 10Hz, QCIF 15Hz, QCIF 30Hz, CIF 15Hz, CIF 30Hz
  Tempete            6.4 sec    QCIF 10Hz, QCIF 15Hz, QCIF 30Hz, CIF 15Hz, CIF 30Hz

For all sequences, resolutions, and frame rates, two versions of the scalable encoder have been tested. In the first version ("Scalable"), the temporal low-pass pictures of each group of pictures (GOP) are coded independently as IDR pictures; in the second version ("Scalable+DPCM"), only the temporal low-pass picture of the first group of pictures is coded as an IDR picture, and all remaining temporal low-pass pictures are coded as P pictures using the reconstructed temporal low-pass pictures of the previous GOP as reference. The video sequences are generally processed in groups of 16 pictures using the encoder control described in Section 3. The rate-distortion curves for the scalable encoders have been obtained by decoding a different number of enhancement layers of a single scalable bit-stream, which consists of one base and three enhancement layers.
The quantization parameter for the temporal low-pass picture of the base layer, QP_L(4), was set to 28, and for generating the enhancement layers the quantization parameters have been decreased by a value of 6 from one layer to the next.
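The resulting layer QPs and the approximate step-size halving per layer can be checked with the rule of thumb that the H.264/AVC quantization step size doubles about every 6 QP values:

```python
base_qp = 28
layer_qps = [base_qp - 6 * k for k in range(4)]      # base + 3 enhancement layers
assert layer_qps == [28, 22, 16, 10]

# proportional model of the QP-to-step-size relation (rule of thumb only)
qstep = lambda qp: 2 ** (qp / 6)
ratios = [qstep(layer_qps[k]) / qstep(layer_qps[k + 1]) for k in range(3)]
assert all(abs(r - 2.0) < 1e-9 for r in ratios)      # step size halves per layer
```

This halving of the step size per layer is what underlies the approximate doubling of the bit-rate from one enhancement layer to the next noted in Sec. 2.3.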
The H.264/AVC compliant encoder was run with three different configurations. In the first configuration ("IDR16, 1ref"), an IDR picture is inserted every 16 pictures, the remaining pictures are coded as P pictures, and only the previous reconstructed picture is used as reference for motion-compensated prediction in the P pictures. In the second configuration ("IDR0, 1ref"), only the first picture of a video sequence is coded as an IDR picture, and all following pictures are coded as P pictures using a single reference picture for motion-compensated prediction. These two encoder configurations are considered reasonable references for the two versions of the scalable encoder. The rate-distortion curves have been obtained by encoding the video sequences with fixed quantization parameters QP in {40, 36, 32, 28, 24, 20}. We added a third encoder configuration ("IDR0, 2B, 5ref") to the comparison, which is considered to represent nearly optimal H.264/AVC compliant encoding results. In this configuration, only the first picture of a video sequence is coded as an IDR picture, two B pictures are inserted between each pair of P pictures, and 5 reference pictures are used. All three H.264/AVC compliant encoders are operated using the Lagrangian coder control described in [7], which uses a similar amount of encoder optimization as the operational control that was used for the scalable encoders (see Section 3). For all tested encoders, CABAC was used as the entropy coding method. The motion estimation was carried out using an exhaustive search. The search range was set to ±16 integer pixels for QCIF and ±32 integer pixels for CIF sequences if the reference picture is a neighboring picture of the current picture, and it was enlarged to ±24 integer pixels for QCIF and ±48 integer pixels for CIF sequences if the reference picture is not a neighboring picture.
Diagrams with the rate-distortion curves for all tested encoders, test sequences, resolutions, and frame rates are contained in the accompanying Excel document. Figure 5 to Figure 10 show the rate-distortion curves for the CIF versions of the test sequences at a frame rate of 30Hz. The simulation results indicate that the coding efficiency of our first version of the SNR-scalable codec strongly depends on the characteristics of the input sequence. For sequences such as Mobile & Calendar, Tempete, or Flowers & Garden, which mainly contain global motion, the coding efficiency of our scalable codecs is, especially in the low bit-rate range, nearly comparable to that of the H.264/AVC compliant encoders. The coding efficiency of the scalable encoders decreases drastically in comparison to the H.264/AVC compliant encoders if the input sequences contain complex motion or a large amount of occlusion areas. Generally, the coding efficiency of the scalable encoders decreases if the frame rate is reduced. When looking at the reconstructed sequences, it can be seen that most coding artifacts occur in image regions that are covered or uncovered inside a group of pictures (Foreman, Paris) or that undergo complex motion (Basketball). This indicates that our simple bit-allocation algorithm, which uses a fixed quantization parameter inside a temporal subband picture, is not optimal. This is related to the fact that the orthonormality of the filterbank is only approximately given if the motion field used in the update step is the inverse of the motion field used in the prediction step.
Figure 5: Comparison of the coding efficiency of the H.264/AVC compliant encoder (DPCM: IDR16, 1ref / IDR0, 1ref / IDR0, 2B, 5ref) and the proposed SNR-scalable extension (Scalable: GOP16 / GOP16 + DPCM) for the sequence Basketball in CIF resolution with a frame rate of 30Hz; 192 frames (6.4 sec), Y-PSNR [dB] over bit-rate [kbit/s].
Figure 6: Comparison of the coding efficiency of the H.264/AVC compliant encoder (DPCM) and the proposed SNR-scalable extension (Scalable) for the sequence Flowers & Garden in CIF resolution with a frame rate of 30Hz; 192 frames (6.4 sec), Y-PSNR [dB] over bit-rate [kbit/s].
Figure 7: Comparison of the coding efficiency of the H.264/AVC compliant encoder (DPCM) and the proposed SNR-scalable extension (Scalable) for the sequence Foreman in CIF resolution with a frame rate of 30Hz; 288 frames (9.6 sec), Y-PSNR [dB] over bit-rate [kbit/s].
Figure 8: Comparison of the coding efficiency of the H.264/AVC compliant encoder (DPCM) and the proposed SNR-scalable extension (Scalable) for the sequence Mobile & Calendar in CIF resolution with a frame rate of 30Hz; 288 frames (9.6 sec), Y-PSNR [dB] over bit-rate [kbit/s].
Figure 9: Comparison of the coding efficiency of the H.264/AVC compliant encoder (DPCM) and the proposed SNR-scalable extension (Scalable) for the sequence Paris in CIF resolution with a frame rate of 30Hz; 288 frames (9.6 sec), Y-PSNR [dB] over bit-rate [kbit/s].
Figure 10: Comparison of the coding efficiency of the H.264/AVC compliant encoder (DPCM) and the proposed SNR-scalable extension (Scalable) for the sequence Tempete in CIF resolution with a frame rate of 30Hz; 192 frames (6.4 sec), Y-PSNR [dB] over bit-rate [kbit/s].
5. Future research items In the presented approach, the motion vector arrays M_i0 and M_i1, which are used in the prediction and update step, respectively, are estimated and encoded independently. This not only increases the bit-rate needed for transmitting the motion parameters, but probably also has a negative influence on the connectivity of these two motion fields, which seems to have an important influence on the coding efficiency of the subband approach. Thus, we believe that the coding efficiency can be improved if the motion fields M_i1 used in the update step are not independently estimated and coded, but derived from the motion fields M_i0 used in the prediction steps in such a way that they still represent block-wise motion compatible with the H.264/AVC specification. As a side effect, this might also limit the complexity needed for the update step. Our current analysis-synthesis structure represents a lifting representation of the simple Haar filters. This scheme can be extended to a lifting representation of the bi-orthogonal 5/3 filters, which leads to bi-predictive motion compensation. The most promising approach is probably to adaptively switch between the lifting representation of the Haar filters and that of the 5/3 filters on a block basis, for which the motion-compensated prediction as specified for B slices in H.264/AVC can be applied. In addition, it may be beneficial to adaptively choose the GOP size for the temporal subband decomposition. Since the usage of multiple reference pictures has improved the performance of hybrid video coding schemes considerably, the incorporation of this approach into the subband scheme is an interesting research item. Another important point is the development of more suitable bit-allocation algorithms that reduce the annoying SNR fluctuations inside a group of pictures, which we have observed for some of the test sequences (e.g., Foreman). 
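The two lifting schemes discussed above can be illustrated on scalar samples, following the scalar simplification used in the introduction (a sketch of ours: motion compensation is omitted, and boundary handling is simplified to periodic extension via np.roll, which differs from a practical implementation):

```python
import numpy as np

def haar_lifting(s):
    """Haar analysis via lifting: predict each odd sample from the
    preceding even sample (prediction step), then add half the
    high-pass band back to the even samples (update step)."""
    even, odd = s[0::2].astype(float), s[1::2].astype(float)
    h = odd - even          # prediction step -> high-pass band
    l = even + h / 2        # update step -> low-pass band
    return l, h

def lifting_53(s):
    """5/3 (bi-orthogonal) analysis via lifting: each odd sample is
    predicted from BOTH neighboring even samples -- the scalar analogue
    of bi-predictive motion compensation."""
    even, odd = s[0::2].astype(float), s[1::2].astype(float)
    h = odd - (even + np.roll(even, -1)) / 2    # bi-directional prediction
    l = even + (np.roll(h, 1) + h) / 4          # symmetric update
    return l, h
```

Because lifting steps are individually invertible, the synthesis filterbank is obtained by running the same steps with reversed signs and order, which is what makes the open-loop subband structure exactly reconstructible before quantization.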
Furthermore, it would be worthwhile to examine new techniques for transform coefficient coding, which could improve the SNR scalability and perhaps additionally provide a certain degree of spatial scalability, which is also on our list of topics to be investigated. References [1] T. Wiegand and G. J. Sullivan (eds.), "Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)," Doc. JVT-G050r1, May 2003. [2] D. Taubman, "Successive refinement of video: fundamental issues, past efforts and new directions," Proc. of SPIE (VCIP 2003), vol. 5150, pp. 649-663, 2003. [3] J.-R. Ohm, "Complexity and delay analysis of MCTF interframe wavelet structures," ISO/IEC JTC1/WG11 Doc. M8520, July 2002. [4] M. Flierl and B. Girod, "Video coding with motion-compensated lifted wavelet transforms," Proc. PCS 2003. [5] W. Sweldens, "A custom-design construction of biorthogonal wavelets," J. Appl. Comp. Harm. Anal., vol. 3, no. 2, pp. 186-200, 1996. [6] I. Daubechies and W. Sweldens, "Factoring wavelet transforms into lifting steps," J. Fourier Anal. Appl., vol. 4, no. 3, pp. 247-269, 1998. [7] T. Wiegand et al., "Rate-Constrained Coder Control and Comparison of Video Coding Standards," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 688-703, July 2003.