Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG
(ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6)
9th Meeting: 2-5 September 2003, San Diego

Document: JVT-I032d1
Filename: JVT-I032d5.doc
Title: SNR-scalable Extension of H.264/AVC
Status: Input Document to JVT
Purpose: Information
Author(s) or Contact(s): Heiko Schwarz, Detlev Marpe, and Thomas Wiegand
    Image Processing Department, Einsteinufer 37, D-10587 Berlin, Germany
    Tel: +49 30 31002 617
    Email: hschwarz@hhi.de, marpe@hhi.de, wiegand@hhi.de
Source: Fraunhofer Institute for Telecommunications - Heinrich Hertz Institute (HHI)

Abstract

This document contains a description of an SNR-scalable extension of H.264/AVC [1]. To achieve an efficient SNR-scalable bitstream representation of a video sequence, the temporal dependency between pictures is coded using an open-loop subband approach. In this codec, most components of H.264/AVC are used as specified in the standard, while only a few have been adjusted to the subband coding structure. We have tested a first version of the approach with QCIF and CIF resolution sequences, obtaining promising results.

1. Introduction

Inspired by recent advances [2][3][4] in temporal subband coding of video sequences, we have investigated the possibility of an SNR-scalable extension of H.264/AVC. The main reason for these recent advances is the lifting representation of a filterbank as originally suggested in [5]. This lifting representation of temporal subband decompositions permits the use of known methods for motion-compensated prediction. Moreover, most other components of a hybrid video codec such as H.264/AVC can be used without modifications, while only a few parts need to be adjusted. In the following, a brief review of the lifting framework is given together with a presentation of two practically important examples.

The generic lifting scheme consists of three steps: the polyphase decomposition, the prediction step, and the update step, as depicted in Figure 1 (a). In the following, we describe these steps as performed for the analysis filterbank, i.e., at the encoder side. The polyphase decomposition separates the even and the odd samples of a given signal s[k]. In the case of temporal subband coding of video sequences, the samples s[k] correspond to pictures, but for simplicity assume for now that the s[k] are scalar values. Since the correlation structure typically shows a local characteristic, the even and odd polyphase components are highly correlated, and therefore, in a subsequent step, a prediction of the odd samples from the even samples is performed. The corresponding prediction operator P for each odd sample s_odd[k] = s[2k+1] is a linear combination of its neighboring even samples s_even[k] = s[2k], i.e.,

    P(s_even)[k] = Σ_l p_l · s_even[k + l].

As a result of the prediction step, the odd samples are replaced by their corresponding prediction residuals

    h[k] = s_odd[k] − P(s_even)[k].

Note that the prediction step is equivalent to applying the high-pass filter of a two-channel filterbank [6], and in the case of video sequence coding it is similar to motion-compensated prediction, e.g. as described in [1].

Figure 1: The lifting representation of an analysis filterbank (a), and the inverse lifting representation of the corresponding synthesis filterbank (b).

Finally, in the update step of the lifting scheme, a low-pass filtering is performed by updating the even samples s_even[k] with a linear combination of the prediction residuals h[k]. The corresponding update operator U is given by

    U(h)[k] = Σ_l u_l · h[k + l].

By replacing the even samples with

    l[k] = s_even[k] + U(h)[k],

the given signal s[k] can finally be represented by l[k] and h[k], each at half the temporal sampling rate of s[k]. Since both the update and the prediction step are fully invertible, the corresponding transform can be interpreted as a critically sampled perfect reconstruction filterbank. In fact, it can be shown that any biorthogonal family of FIR filters can be realized with a sequence of prediction and update steps [6]. For a normalization of the low- and high-pass components, appropriately chosen scaling factors F_l and F_h are applied, respectively.

At the decoder side, the inverse of the described lifting scheme is performed, which corresponds to the synthesis filterbank shown in Figure 1 (b). The synthesis filterbank simply consists of the application of the prediction and update operators in reversed order with inverted signs in the summation process, followed by the reconstruction of the signal from the even and odd polyphase components.

The Haar wavelet

In the case of the Haar wavelet, the prediction and update operators are

    P_Haar(s_even)[k] = s[2k]  and  U_Haar(h)[k] = (1/2) · h[k],

such that

    h[k] = s[2k+1] − s[2k]  and  l[k] = s[2k] + (1/2) · h[k] = (1/2) · (s[2k] + s[2k+1])

correspond to the (non-normalized) high-pass and low-pass (analysis) output of the Haar filter, respectively. It should be noted that there is a correspondence between the Haar wavelet and predictive coding as specified, e.g., for P slices in [1], which is described later.
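
As a minimal illustration of the generic lifting steps, the following sketch implements one analysis and synthesis level for a 1-D signal in NumPy. It is only an illustration of the equations above: the operators P and U are represented by coefficient tables p and u keyed by the offset l, and the periodic signal extension via np.roll is an assumption made for brevity.

    import numpy as np

    def lifting_analysis(s, p, u):
        # One level of the generic lifting scheme: polyphase split,
        # prediction step, update step.
        s = np.asarray(s, dtype=float)
        even, odd = s[0::2].copy(), s[1::2].copy()
        # Prediction: h[k] = s_odd[k] - sum_l p_l * s_even[k + l]
        h = odd - sum(c * np.roll(even, -l) for l, c in p.items())
        # Update: l[k] = s_even[k] + sum_l u_l * h[k + l]
        low = even + sum(c * np.roll(h, -l) for l, c in u.items())
        return low, h

    def lifting_synthesis(low, h, p, u):
        # Inverse lifting: the same operators in reversed order with inverted
        # signs, followed by the polyphase reconstruction.
        even = low - sum(c * np.roll(h, -l) for l, c in u.items())
        odd = h + sum(c * np.roll(even, -l) for l, c in p.items())
        s = np.empty(2 * len(even))
        s[0::2], s[1::2] = even, odd
        return s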

The 5/3 bi-orthogonal spline wavelet

The low- and high-pass analysis filters of the 5/3 spline wavelet have 5 and 3 taps, respectively, and its corresponding scaling function is a B-spline of order 2, hence the naming of the wavelet filter. Its simplicity together with a remarkably good performance in still image coding applications (like JPEG2000) recommends its use in a temporal subband coding scheme. In the lifting framework, the corresponding prediction and update operators of the 5/3 transform are given by

    P_5/3(s_even)[k] = (1/2) · (s[2k] + s[2k+2])  and  U_5/3(h)[k] = (1/4) · (h[k] + h[k−1]).

It should be noted that there is a correspondence between the 5/3 bi-orthogonal spline wavelet and bi-predictive coding as specified, e.g., for B slices in [1], which is described later.

Figure 2: A temporal subband codec with the encoder containing the analysis filterbank and the quantizer (transform, scaling, quantization) and the decoder containing the inverse transform and scaling and the synthesis filterbank.
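
Expressed with the lifting_analysis/lifting_synthesis sketch above, the Haar and 5/3 operators reduce to two short coefficient tables, and perfect reconstruction can be checked numerically (again a sketch only, assuming the periodic extension from above; the update offsets follow the formulas just given):

    import numpy as np

    haar = {"p": {0: 1.0}, "u": {0: 0.5}}
    spline_5_3 = {"p": {0: 0.5, 1: 0.5}, "u": {0: 0.25, -1: 0.25}}

    s = np.random.default_rng(0).normal(size=32)
    for ops in (haar, spline_5_3):
        low, high = lifting_analysis(s, ops["p"], ops["u"])
        # Invertibility holds regardless of the operator choice.
        assert np.allclose(lifting_synthesis(low, high, ops["p"], ops["u"]), s)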

2. Description of the codec

2.1. Analysis-synthesis filterbank

Figure 2 depicts the general structure of the utilized filterbank. The depicted filterbank shows a 4-layer dyadic temporal decomposition of the video signal, requiring the processing of 2^4 = 16 pictures to arrive at the lowest temporal resolution representation. The delay introduced by this approach is also 16 pictures, which makes it unsuitable for interactive applications such as videoconferencing. The depicted filterbank utilizes the iterated application of the Haar-based motion-compensated lifting scheme, which consists of a motion-compensated prediction step (M_i0) as in H.264/AVC and a motion-compensated update step (M_i1). Both the prediction and the update step utilize the motion compensation process as specified in [1], followed by the deblocking filter process as specified in [1].

In each stage of the analysis filterbank, two pictures (either original pictures or pictures representing low-pass signals generated in a previous analysis stage) are decomposed into a low-pass signal, which can be considered a representation of what the two input pictures have in common, and a high-pass signal, which can be considered a representation of the difference between the input pictures. In the corresponding stage of the synthesis filterbank, the two input pictures are reconstructed from the low- and high-pass signals. Since, in the synthesis step, the inverse operations of the analysis step are performed, the analysis-synthesis filterbank (without quantization) guarantees perfect reconstruction. When both motion fields M_i0 and M_i1 are equal to zero, the basic temporal decomposition-composition scheme corresponds to a lifting representation of the Haar filter as discussed in Section 1. In the following, the prediction and update steps of the analysis and synthesis process are described in more detail.

The motion fields M_i0 and M_i1 generally specify the motion between two pictures using a subset of the P slice syntax of H.264/AVC [1]. For the motion fields M_i0 used by the prediction steps, we incorporated an intra macroblock type, in which the (motion-compensated) prediction signal for a macroblock is specified by a 4x4 array of luma transform coefficient levels and two 2x2 arrays of chroma transform coefficient levels, similar to the INTRA_16x16 macroblock type of H.264/AVC with all AC coefficients set to zero. In the motion fields M_i1 used for the update steps, this macroblock type is not included.

2.1.1 General motion compensated prediction

This section describes a general motion-compensated prediction process, which is used by the prediction and update steps at both the analysis and synthesis side. Input to this process is a reference picture R, a quantization parameter QP (if required), and a block-wise motion field M with the following properties:

- For each macroblock of the motion-compensated picture P, the motion field M specifies a macroblock mode, which can be equal to P_16x16, P_16x8, P_8x16, P_8x8, or INTRA.
- When the macroblock mode is equal to P_8x8, a corresponding sub-macroblock mode (P_8x8, P_8x4, P_4x8, P_4x4) is specified for each 8x8 sub-macroblock.
- If the macroblock mode is equal to INTRA, the generation of the prediction signal is specified by a 4x4 array of luminance coefficient levels and two 2x2 arrays of chrominance coefficient levels. Otherwise, the generation of the prediction signal is specified by one motion vector with quarter-sample accuracy for each macroblock or sub-macroblock partition.

Given the reference picture R and the motion field description M, the prediction signal P is constructed in a macroblock-wise manner as described in the following:

If the macroblock mode specified in M is not equal to INTRA, the following applies for each macroblock or sub-macroblock partition:

o The luma and chroma samples of the picture P that are covered by the regarded macroblock or sub-macroblock partition are obtained by quarter-sample accurate motion-compensated prediction as specified in [1]:

      p[i,j] = M_interp(r, i − m_x, j − m_y),

  where [m_x, m_y]^T is the motion vector of the regarded macroblock or sub-macroblock partition given by M, r[] is the array of luma or chroma samples of the reference picture R, and M_interp(.) represents the interpolation process specified for the motion-compensated prediction in H.264/AVC, with the exception that the clipping to the interval [0; 255] is removed.

Otherwise (the macroblock mode is equal to INTRA), the following applies:

o The given 4x4 array of luminance transform coefficient levels is treated as the array of DC luma coefficient levels for the INTRA_16x16 macroblock type in H.264/AVC, and the inverse scaling and transform process specified in [1] is applied using the given quantization parameter QP, while it is assumed that all AC transform coefficient levels are equal to zero. As a result, a 16x16 array res[] of residual luma samples is obtained. The luma samples of the prediction picture P covering the regarded macroblock are constructed according to

      p[i,j] = 128 + res[i,j].

  Note that for each 4x4 luma block, the obtained prediction signal p[] is constant and represents an approximation of the average of the original 4x4 luma block.

o For each chrominance component, the given 2x2 array of chrominance transform coefficient levels is treated as the array of DC chroma coefficient levels, and the inverse scaling and transform process for chroma coefficients specified in [1] is applied using the given quantization parameter QP, while it is assumed that all AC transform coefficient levels are equal to zero. As a result, an 8x8 array res[] of residual chroma samples is obtained. The chroma samples of the prediction picture P covering the macroblock are constructed according to

      p[i,j] = 128 + res[i,j].

  Note that for each 4x4 chroma block, the obtained prediction signal p[] is constant and represents an approximation of the average of the original 4x4 chroma block.

After generating the whole prediction picture P, the de-blocking filter as specified in [1] is applied to that prediction picture, whereas the derivation of the boundary filter strength is based

only on the macroblock modes (information about INTRA) and the motion vectors specified in the motion description M; furthermore, the clipping to the interval [0; 255] is removed.

As can be seen from the above description, the general process of generating (motion-compensated) prediction pictures is nearly identical to the reconstruction process of P slices as described in H.264/AVC [1]. The following differences can be identified:

o Removal of the clipping to the interval [0; 255] in the processes of motion-compensated prediction and de-blocking.
o Simplified INTRA mode reconstruction without intra prediction and with all AC transform coefficient levels set to zero.
o Simplified reconstruction for motion-compensated prediction modes without residual information.

2.1.2 Prediction step at analysis (encoder) side

Given two input pictures A and B, as well as the motion vector array M_i0 for the block-based motion compensation of picture A towards picture B and a quantization parameter QP, the following operations are performed to obtain a residual picture H:

- The picture P representing a prediction of the picture B is generated by invoking the process specified in Sec. 2.1.1 with the reference picture A, the motion field description M_i0, and the quantization parameter QP as input.
- The residual picture H is generated by h[i,j] = b[i,j] − p[i,j], where h[], b[], and p[] represent the luma or chroma sample arrays of the pictures H, B, and P, respectively.

2.1.3 Update step at analysis (encoder) side

Given the input picture A, the residual picture H obtained in the prediction step, as well as the motion vector array M_i1 for the block-based motion compensation of picture B towards picture A, the following operations are performed to obtain a picture L representing the temporal low-pass signal:

- A picture P is generated by invoking the process specified in Sec. 2.1.1 with the reference picture H and the motion field description M_i1 as input.
- The low-pass picture L is generated by l[i,j] = a[i,j] + (p[i,j] >> 1), where l[], a[], and p[] represent the luma or chroma sample arrays of the pictures L, A, and P, respectively.

2.1.4 Update step at synthesis (decoder) side

Given the (quantized/reconstructed) low-pass picture L', the quantized residual picture H', as well as the motion vector array M_i1, the following operations are performed to obtain the decoded picture A':

- The picture P' is generated by invoking the process specified in Sec. 2.1.1 with the reference picture H' and the motion field description M_i1 as input.
- The reconstructed picture A' is generated by a'[i,j] = l'[i,j] − (p'[i,j] >> 1), where a'[], l'[], and p'[] represent the sample arrays of the pictures A', L', and P', respectively.

2.1.5 Prediction step at synthesis (decoder) side

Given the quantized residual picture H', the reconstructed picture A' obtained in the update step at the decoder, as well as the motion field M_i0, the following operations are performed to obtain the decoded picture B':

- A picture P' representing a prediction of the picture B' is generated by invoking the process specified in Sec. 2.1.1 with the reference picture A', the motion field description M_i0, and the quantization parameter QP as input.
- The reconstructed picture B' is generated by b'[i,j] = h'[i,j] + p'[i,j], where b'[], h'[], and p'[] represent the sample arrays of the pictures B', H', and P', respectively.

By cascading the basic pair-wise picture decomposition stages, a dyadic tree structure is obtained, which decomposes a group of 2^n pictures into 2^n − 1 residual pictures and a single low-pass (or intra) picture, as depicted in Figure 3 for a group of 8 pictures.

Figure 3: Temporal decomposition of a group of 8 pictures. (The original GOP of 8 pictures is decomposed in three stages into a single intra picture, i.e. the low-pass signal, and seven residual pictures, i.e. the high-pass signals.)

It is worth noting that the inverse lifting step at the decoder requires twice as many motion compensation and deblocking filter operations as decoding the same number of pictures with a hybrid video decoder in which one picture is coded as an I picture and all remaining pictures are coded as P pictures.
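
Ignoring motion and quantization, the four steps of Secs. 2.1.2-2.1.5 can be exercised with a small sketch. It is an illustration only: mc_predict() is a deliberately trivial stand-in for the Sec. 2.1.1 process (the identity, i.e. a zero motion field, instead of the H.264/AVC interpolation and deblocking chain), which is enough to show that the integer bit-shift in the update step does not break perfect reconstruction:

    import numpy as np

    def mc_predict(reference):
        # Stand-in for the general motion-compensated prediction of Sec. 2.1.1;
        # with a zero motion field it degenerates to the identity.
        return reference

    def analysis_stage(a, b):
        h = b - mc_predict(a)            # Sec. 2.1.2: h = b - p
        l = a + (mc_predict(h) >> 1)     # Sec. 2.1.3: l = a + (p >> 1)
        return l, h

    def synthesis_stage(l, h):
        a = l - (mc_predict(h) >> 1)     # Sec. 2.1.4: a' = l' - (p' >> 1)
        b = h + mc_predict(a)            # Sec. 2.1.5: b' = h' + p'
        return a, b

    rng = np.random.default_rng(1)
    a = rng.integers(0, 256, size=(16, 16), dtype=np.int32)
    b = rng.integers(0, 256, size=(16, 16), dtype=np.int32)
    l, h = analysis_stage(a, b)
    a2, b2 = synthesis_stage(l, h)
    # Without quantization, the shift term cancels exactly at the decoder.
    assert np.array_equal(a, a2) and np.array_equal(b, b2)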

2.2. General coding of pictures and motion fields (Base Layer)

For a group of 2^n pictures, (2^(n+1) − 2) prediction data arrays (motion vectors and intra predictors), (2^n − 1) residual pictures, as well as a single low-pass (or intra) picture have to be transmitted. We use slice data partitioning with a few modifications to map these data to NAL units.

Prediction data arrays

The prediction data arrays are coded using a subset of the H.264/AVC slice layer syntax consisting of the following syntax elements:

- slice header (with changed meaning of some elements)
- slice data (subset)
  o macroblock layer (subset)
      mb_type (P_16x16, P_16x8, P_8x16, P_8x8, INTRA)
      if( mb_type == P_8x8 )
          sub_mb_type (P_8x8, P_8x4, P_4x8, P_4x4)
      if( mb_type == INTRA )
          mb_qp_delta
          residual blocks (only LUMA_DC and CHROMA_DC)
      else
          motion vector differences
  o end_of_slice_flag
- rbsp_slice_trailing_bits

The motion vector predictors are derived as specified in [1].

Residual pictures (high-pass signals)

The residual pictures are coded using a subset of the H.264/AVC slice layer syntax consisting of the following syntax elements:

- slice header (with changed meaning of some elements)
- slice data (subset)
  o macroblock layer (subset)
      coded_block_pattern
      mb_qp_delta
      residual blocks
  o end_of_slice_flag
- rbsp_slice_trailing_bits

Low-pass pictures

The low-pass pictures are generally coded using the syntax of H.264/AVC [1]. In the simplest version, the low-pass pictures of each GOP are coded independently as intra pictures. The coding efficiency can be improved if the correlations between the low-pass pictures of successive GOPs are exploited. Thus, in a more general version, the low-pass pictures are coded as P pictures using reconstructed low-pass pictures of previous GOPs as references; intra (IDR) pictures are inserted at regular intervals only to provide random access points. The low-pass pictures are decoded and reconstructed as specified in [1], including the de-blocking filter operation.

2.3. SNR-Scalability: Coding of enhancement layers

The open-loop structure of the subband approach provides the possibility to efficiently incorporate SNR-scalability. We propose a very simple SNR-scalable extension, in which the base layer is coded as described in Sec. 2.2, and the enhancement layers consist of refinement pictures for the subband signals, which are also coded using the residual picture syntax described in Sec. 2.2. At the encoder side, residual pictures are computed between the original subband pictures generated by the analysis filterbank and the reconstructed subband pictures obtained after decoding the base or a previous enhancement layer. These residual pictures are quantized using a smaller quantization parameter than in the base or previous enhancement layer(s) and encoded using the residual picture syntax described in Sec. 2.2. At the decoder side, the subband representations of the base layer and the refinement signals of the various enhancement layers can be decoded independently, whereas the final enhancement layer subband representation is obtained by simply adding up the corresponding base layer and enhancement layer residual data, either in the transform or in the spatial domain.
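
The layered refinement can be modeled with a toy scalar quantizer. This is a sketch only: qstep() mimics the doubling of the H.264/AVC quantization step size every 6 QP units, the starting QP of 34 is an arbitrary choice for the demonstration, and the transform/scaling chain of the codec is not modeled:

    import numpy as np

    def qstep(qp):
        # Toy model of the H.264/AVC step size: doubles every 6 QP units.
        return 2.0 ** (qp / 6.0)

    def quantize(x, qp):
        return np.round(x / qstep(qp)) * qstep(qp)

    subband = np.random.default_rng(2).normal(scale=64.0, size=256)
    layers, recon, qp = [], np.zeros_like(subband), 34
    for _ in range(4):                       # base layer plus three enhancement layers
        layers.append(quantize(subband - recon, qp))  # refine the previous reconstruction
        recon = recon + layers[-1]           # decoder: add up base + refinement data
        qp -= 6                              # smaller QP, i.e. half the step size, per layer
    for k in range(4):
        err = np.abs(subband - sum(layers[:k + 1])).max()
        print(f"layer {k}: max reconstruction error {err:.2f}")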

Our simulations have shown that the performance losses in comparison to the single-layer approach are reasonably small if the quantization parameters are decreased by a value of six from one layer to the next; this halving of the quantization step size approximately results in a doubling of the bit-rate from one enhancement layer to the next.

3. Operational encoder control

3.1 Selection of the quantization parameters

When neglecting the motion and replacing the bit-shift to the right in the update step by a real-valued multiplication by a factor of 1/2, the basic two-channel analysis step can be normalized by multiplying the high-pass samples of the picture H by a factor of 1/sqrt(2) and the low-pass samples by a factor of sqrt(2). Since we neglect this normalization in the realization of the analysis and synthesis filterbanks in order to keep the range of the sample values nearly constant, we have to take it into account during the quantization of the temporal subbands. For the basic two-channel analysis-synthesis filterbank, this can easily be done by quantizing the low-pass signal with half of the quantization step size that is used for quantizing the high-pass signal. This leads to the following quantizer selection process for the specified dyadic decomposition structure of a group of 2^n pictures: Let QP_L(n) be the quantization parameter used for coding the low-pass picture obtained after the n-th decomposition stage. The quantization parameters used for coding the high-pass pictures obtained after the i-th decomposition stage are derived by

    QP_H(i) = QP_L(n) + 3 · (n + 2 − i).

Within each temporal subband picture, the quantization parameter QP is held constant in the encoder version used for generating the simulation results. The quantization parameter QP_INTRA(i) that is used for quantizing the intra prediction signals of the motion field descriptions M_(i−1)0, which are used in the i-th decomposition stage, is derived from the quantization parameter QP_H(i) for the high-pass pictures generated in this decomposition stage by

    QP_INTRA(i) = QP_H(i) − 6.
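
For the GOP-of-16 configuration used in Sec. 4 (n = 4, QP_L(4) = 28), this rule can be transcribed directly; the snippet below is purely illustrative:

    def qp_high(i, n=4, qp_low=28):
        # QP_H(i) = QP_L(n) + 3 * (n + 2 - i): about one factor of sqrt(2)
        # in quantization step size per additional decomposition stage.
        return qp_low + 3 * (n + 2 - i)

    def qp_intra(i, n=4, qp_low=28):
        # QP_INTRA(i) = QP_H(i) - 6 for the intra prediction signals of M_(i-1)0.
        return qp_high(i, n, qp_low) - 6

    for i in range(1, 5):
        print(f"stage {i}: QP_H = {qp_high(i)}, QP_INTRA = {qp_intra(i)}")
    # stage 1: QP_H = 43 ... stage 4: QP_H = 34; the low-pass picture itself uses QP_L = 28.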

3.2 Motion estimation and mode selection

The motion field descriptions M_i0 and M_i1 that are used in the prediction and update steps, respectively, are estimated independently. In the following, the process for estimating the motion field description M_i0 used in the prediction step is described. The process for estimating M_i1 is obtained by interchanging the original pictures A and B and removing the INTRA mode from the set of possible macroblock modes.

Given the pictures A and B, which are either original pictures or pictures representing low-pass signals generated in a previous analysis stage, and the corresponding arrays of luma samples a[] and b[], the motion description M_i0 is estimated in a macroblock-wise manner by the following process:

For all possible macroblock and sub-macroblock partitions of a macroblock i inside the picture B, the associated motion vectors m_i = [m_x, m_y]^T are obtained by minimizing the Lagrangian functional

    m_i = argmin_{m ∈ S} { D_SAD(i, m) + λ · R(i, m) }

with the distortion term being given as

    D_SAD(i, m) = Σ_{(x,y) ∈ P} | b[x, y] − a[x − m_x, y − m_y] |.

Here, S specifies the motion vector search range inside the reference picture A, P is the area covered by the regarded macroblock or sub-macroblock partition, R(i, m) specifies the number of bits needed to transmit all components of the motion vector m, and λ is a fixed Lagrange multiplier.

The motion search first proceeds over all integer-sample accurate motion vectors in the given search range S. Then, given the best integer motion vector, the eight surrounding half-sample accurate motion vectors are tested, and finally, given the best half-sample accurate motion vector, the eight surrounding quarter-sample accurate motion vectors are tested. For the half- and quarter-sample accurate motion vector refinement, the term a[x − m_x, y − m_y] has to be interpreted as an interpolation operator.

The mode decision for the macroblock and sub-macroblock modes follows basically the same approach. From a given set of possible macroblock or sub-macroblock modes S_mode, the mode p_i that minimizes the following Lagrangian functional is chosen:

    p_i = argmin_{p ∈ S_mode} { D_SAD(i, p) + λ · R(i, p) }.

The distortion term is given as

    D_SAD(i, p) = Σ_{(x,y) ∈ P} | b[x, y] − a[x − m_x[p, x, y], y − m_y[p, x, y]] |,

where P specifies the macroblock or sub-macroblock area, and m[p, x, y] is the motion vector associated with the macroblock or sub-macroblock mode p and the partition or sub-macroblock partition covering the luma location (x, y). The rate term R(i, p) represents the number of bits associated with choosing the coding mode p. For the motion-compensated coding modes, it includes the bits for the macroblock type (if applicable), the sub-macroblock type(s) (if applicable), and the motion vector(s); for the INTRA mode, it includes the bits for the macroblock mode and the arrays of quantized luma and chroma transform coefficient levels.

The set of possible sub-macroblock types is given by {P_8x8, P_8x4, P_4x8, P_4x4}, and the set of possible macroblock types is given by {P_16x16, P_16x8, P_8x16, P_8x8, INTRA}, where the INTRA type is only included if a motion field description M_i0 that is used for the prediction step is estimated.

The Lagrangian multiplier λ is set depending on the base-layer quantization parameter for the high-pass picture(s) QP_H(i) of the decomposition stage for which the motion field is estimated:

    λ = 0.33 · 2^(QP_H(i)/3 − 4).
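
The search and refinement procedure can be illustrated with a compact sketch. It is not the contribution's actual encoder: bilinear() replaces the H.264/AVC sub-sample interpolation filter, mv_bits() is a rough exp-Golomb-style rate model, and the demo assumes the motion-shifted block stays inside the reference picture.

    import numpy as np

    def bilinear(a, x, y):
        # Stand-in for the H.264/AVC sub-sample interpolation filter.
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        fx, fy = x - x0, y - y0
        p = a[y0:y0 + 2, x0:x0 + 2]
        return (1 - fy) * ((1 - fx) * p[0, 0] + fx * p[0, 1]) + \
               fy * ((1 - fx) * p[1, 0] + fx * p[1, 1])

    def sad(block, a, bx, by, m):
        # D_SAD over the partition area for motion vector m = (m_x, m_y).
        h, w = block.shape
        pred = np.array([[bilinear(a, bx + i - m[0], by + j - m[1])
                          for i in range(w)] for j in range(h)])
        return np.abs(block - pred).sum()

    def mv_bits(m):
        # Rough signed exp-Golomb bit count per quarter-sample MV component.
        bits = 0
        for v in m:
            k = 2 * int(round(4 * abs(v)))
            bits += 2 * int(np.log2(k + 1)) + 1
        return bits

    def best_mv(a, b, bx, by, size, search, lam):
        # Lagrangian motion search: full integer-sample search, then half-
        # and quarter-sample refinement around the current best candidate.
        block = b[by:by + size, bx:bx + size].astype(float)
        cost = lambda m: sad(block, a, bx, by, m) + lam * mv_bits(m)
        best = min(((mx, my) for mx in range(-search, search + 1)
                    for my in range(-search, search + 1)), key=cost)
        for step in (0.5, 0.25):
            best = min([(best[0] + dx * step, best[1] + dy * step)
                        for dx in (-1, 0, 1) for dy in (-1, 0, 1)], key=cost)
        return best

    rng = np.random.default_rng(3)
    a = rng.integers(0, 256, size=(64, 64)).astype(float)
    b = np.roll(a, shift=(1, 3), axis=(0, 1))   # B equals A shifted by (m_x, m_y) = (3, 1)
    lam = 0.33 * 2 ** (34 / 3 - 4)              # lambda for QP_H = 34 (4th stage, QP_L = 28)
    print(best_mv(a, b, bx=24, by=24, size=16, search=4, lam=lam))  # -> (3.0, 1.0)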

3.3 Temporal placement of the low-pass signals

The basic two-channel analysis filterbank decomposes two input pictures A and B into a low-pass picture L and a high-pass picture H. Following the notation in this contribution, the low-pass picture L shares the coordinate system with the original picture A. Thus, assuming perfect (error-free) motion compensation, the pictures A and L are identical. The decomposition structure depicted in Figure 1 is obtained if, in all decomposition stages, the even input pictures at temporal sampling positions 0, 2, 4, ... are treated as input pictures A and the odd input pictures at temporal sampling positions 1, 3, 5, ... are treated as input pictures B. This scheme enables efficient temporal scalability, allowing temporal sub-sampling down to very small frame rates. However, the temporal distance between the pictures that are decomposed in each two-channel analysis filterbank is increased by a factor of 2 from one decomposition stage to the next. And it is well known that the efficiency of motion-compensated prediction decreases when the temporal distance between the reference picture and the picture to be predicted increases.

It is possible to realize decomposition schemes in which the temporal distance between the pictures to be decomposed by the two-channel filterbank is increased by a factor of less than 2 from one decomposition stage to the next. However, these schemes do not provide the feature of efficient temporal scalability allowing temporal sub-sampling down to very small frame rates, since the distances between neighboring low-pass pictures vary in most of the decomposition stages.

In our simulations, we used the decomposition scheme depicted in Figure 4, which we believe constitutes a reasonable compromise between temporal scalability and coding efficiency. The sequence of original pictures is treated as a sequence of input pictures ABABAB...AB; thus, this scheme provides one stage of optimal temporal scalability (equal distance between the low-pass pictures). The sequences of low-pass pictures used as input to all following decomposition stages are treated as sequences of input pictures BAABBA...AB, whereby the distances between the low-pass pictures that will be decomposed in following two-channel analysis steps are kept small.

Figure 4: Temporal placement of low-pass pictures for a group of 16 pictures. (Original GOP: A B A B A B A B A B A B A B A B; 1st stage: B A A B B A A B; 2nd stage: B A A B; 3rd stage: B A; 4th stage: single low-pass picture.)

4. Results

For evaluating the coding efficiency of the proposed SNR-scalable extension of H.264/AVC, we compared it to an H.264/AVC compliant encoder using a similar degree of encoder optimization. The set of input sequences for this comparison consists of six test sequences with widely varying content; all sequences have been encoded using the different resolutions and frame rates listed in Table 1.

Table 1: Test sequences

Sequence            Duration   Resolutions and frame rates
Basketball          6.4 sec    QCIF 10 Hz, QCIF 15 Hz, QCIF 30 Hz, CIF 15 Hz, CIF 30 Hz
Flowers & Garden    6.4 sec    QCIF 10 Hz, QCIF 15 Hz, QCIF 30 Hz, CIF 15 Hz, CIF 30 Hz
Foreman             9.6 sec    QCIF 10 Hz, QCIF 15 Hz, QCIF 30 Hz, CIF 15 Hz, CIF 30 Hz
Mobile & Calendar   9.6 sec    QCIF 10 Hz, QCIF 15 Hz, QCIF 30 Hz, CIF 15 Hz, CIF 30 Hz
Paris               9.6 sec    QCIF 10 Hz, QCIF 15 Hz, QCIF 30 Hz, CIF 15 Hz, CIF 30 Hz
Tempete             6.4 sec    QCIF 10 Hz, QCIF 15 Hz, QCIF 30 Hz, CIF 15 Hz, CIF 30 Hz

For all sequences, resolutions, and frame rates, two versions of the scalable encoder have been tested. In the first version ("Scalable"), the temporal low-pass pictures of each group of pictures (GOP) are coded independently as IDR pictures; in the second version ("Scalable+DPCM"), only the temporal low-pass picture of the first group of pictures is coded as an IDR picture, and all remaining temporal low-pass pictures are coded as P pictures using the reconstructed temporal low-pass pictures of the previous GOP as reference. The video sequences are generally processed in groups of 16 pictures using the encoder control described in Section 3. The rate-distortion curves for the scalable encoders have been obtained by decoding a different number of enhancement layers of a single scalable bit-stream, which consists of one base and three enhancement layers. The quantization parameter for the temporal low-pass picture of the base layer, QP_L(4), was set to 28, and for generating the enhancement layers the quantization parameters have been decreased by a value of 6 from one layer to the next.

The H.264/AVC compliant encoder was run with three different configurations. In the first configuration ("IDR16, 1ref"), an IDR picture is inserted every 16 pictures, the remaining pictures are coded as P pictures, and only the previous reconstructed picture is used as reference for motion-compensated prediction in the P pictures. In the second configuration ("IDR0, 1ref"), only the first picture of a video sequence is coded as an IDR picture, and all following pictures are coded as P pictures using a single reference picture for motion-compensated prediction. These two encoder configurations are considered reasonable references for the two versions of the scalable encoder. The rate-distortion curves have been obtained by encoding the video sequences with fixed quantization parameters QP ∈ {40, 36, 32, 28, 24, 20}. We added a third encoder configuration ("IDR0, 2B, 5ref") to the comparison, which is considered to represent nearly optimal H.264/AVC compliant encoding results. In this configuration, only the first picture of a video sequence is coded as an IDR picture, two B pictures are inserted between each pair of P pictures, and 5 reference pictures are used. All three H.264/AVC compliant encoders are operated using the Lagrangian coder control described in [7], which uses a similar amount of encoder optimization as the operational control used for the scalable encoders (see Section 3).

For all tested encoders, CABAC was used as the entropy coding method. The motion estimation was carried out using an exhaustive search. The search range was set to ±16 integer pixels for QCIF and ±32 integer pixels for CIF sequences if the reference picture represents a neighboring picture of the current picture, and it was enlarged to ±24 integer pixels for QCIF and ±48 integer pixels for CIF sequences if the reference picture does not represent a neighboring picture.

Diagrams with the rate-distortion curves for all tested encoders, test sequences, resolutions, and frame rates are contained in the accompanying Excel document. Figures 5 to 10 show the rate-distortion curves for the CIF versions of the test sequences at a frame rate of 30 Hz. The simulation results indicate that the coding efficiency of our first version of the SNR-scalable codec strongly depends on the characteristics of the input sequence. For sequences like Mobile & Calendar, Tempete, or Flowers & Garden that mainly contain global motion, the coding efficiency of our scalable codecs is, especially in the low bit-rate range, nearly comparable to that of the H.264/AVC compliant encoders. The coding efficiency of the scalable encoders decreases drastically in comparison to the H.264/AVC compliant encoders if the input sequences contain complex motion or a large amount of occlusion areas. Generally, the coding efficiency of the scalable encoders decreases if the frame rate is reduced.

When looking at the reconstructed sequences, it can be seen that most coding artifacts occur in image regions that are covered or uncovered inside a group of pictures (Foreman, Paris) or that undergo complex motion (Basketball). This indicates that our simple bit-allocation algorithm, which uses a fixed quantization parameter inside a temporal subband picture, is not optimal. This is related to the fact that the orthonormality of the filterbank is only approximately given if the motion field used in the update step is the inverse of the motion field used in the prediction step.

Figure 5: Comparison of the coding efficiency of the H.264/AVC compliant encoder ("DPCM") and the proposed SNR-scalable extension ("Scalable") for the sequence Basketball in CIF resolution with a frame rate of 30 Hz. (Plot: Y-PSNR [dB] over bit-rate [kbit/s], 192 frames / 6.4 sec; curves: DPCM: IDR16, 1ref; DPCM: IDR0, 1ref; DPCM: IDR0, 2B, 5ref; Scalable: GOP16; Scalable: GOP16 + DPCM.)

Figure 6: Comparison of the coding efficiency of the H.264/AVC compliant encoder ("DPCM") and the proposed SNR-scalable extension ("Scalable") for the sequence Flowers & Garden in CIF resolution with a frame rate of 30 Hz. (Plot: Y-PSNR [dB] over bit-rate [kbit/s], 192 frames / 6.4 sec; same set of curves as Figure 5.)

Figure 7: Comparison of the coding efficiency of the H.264/AVC compliant encoder ("DPCM") and the proposed SNR-scalable extension ("Scalable") for the sequence Foreman in CIF resolution with a frame rate of 30 Hz. (Plot: Y-PSNR [dB] over bit-rate [kbit/s], 288 frames / 9.6 sec; same set of curves as Figure 5.)

Figure 8: Comparison of the coding efficiency of the H.264/AVC compliant encoder ("DPCM") and the proposed SNR-scalable extension ("Scalable") for the sequence Mobile & Calendar in CIF resolution with a frame rate of 30 Hz. (Plot: Y-PSNR [dB] over bit-rate [kbit/s], 288 frames / 9.6 sec; same set of curves as Figure 5.)

Figure 9: Comparison of the coding efficiency of the H.264/AVC compliant encoder ("DPCM") and the proposed SNR-scalable extension ("Scalable") for the sequence Paris in CIF resolution with a frame rate of 30 Hz. (Plot: Y-PSNR [dB] over bit-rate [kbit/s], 288 frames / 9.6 sec; same set of curves as Figure 5.)

Figure 10: Comparison of the coding efficiency of the H.264/AVC compliant encoder ("DPCM") and the proposed SNR-scalable extension ("Scalable") for the sequence Tempete in CIF resolution with a frame rate of 30 Hz. (Plot: Y-PSNR [dB] over bit-rate [kbit/s], 192 frames / 6.4 sec; same set of curves as Figure 5.)

5. Future research items

In the presented approach, the motion vector arrays M_i0 and M_i1, which are used in the prediction and update step, respectively, are estimated and encoded independently. This not only increases the bit-rate needed for transmitting the motion parameters, but probably also has a negative influence on the connectivity of these two motion fields, which seems to have an important influence on the coding efficiency of the subband approach. Thus, we believe that the coding efficiency can be improved if the motion fields M_i1 used in the update step are not independently estimated and coded, but derived from the motion fields M_i0 used in the prediction steps in such a way that they still represent block-wise motion compatible with the H.264/AVC specification. As a side effect, this might also limit the complexity needed for the update step.

Our current analysis-synthesis structure represents a lifting representation of the simple Haar filters. This scheme can be extended to a lifting representation of the bi-orthogonal 5/3 filters, which leads to bi-predictive motion compensation. The most promising approach is probably to adaptively switch between the lifting representations of the Haar filters and that of the 5/3 filters on a block basis, for which the motion-compensated prediction as specified for B slices in H.264/AVC can be applied. In addition, it may be beneficial to adaptively choose the GOP size for the temporal subband decomposition. Since the usage of multiple reference pictures has considerably improved the performance of hybrid video coding schemes, the incorporation of this approach into the subband scheme is an interesting research item.

Another important point is the development of more suitable bit-allocation algorithms that reduce the annoying SNR fluctuations inside a group of pictures, which we have observed for some of the test sequences (e.g. Foreman). Furthermore, it should be worthwhile to examine new techniques for transform coefficient coding, which could improve the SNR-scalability and perhaps additionally provide a certain degree of spatial scalability, which itself is also on our list of items to be investigated.

References

[1] T. Wiegand and G. J. Sullivan (eds.), "Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)", Doc. JVT-G050r1, May 2003.
[2] D. Taubman, "Successive refinement of video: fundamental issues, past efforts and new directions", Proc. of SPIE (VCIP 2003), vol. 5150, pp. 649-663, 2003.
[3] J.-R. Ohm, "Complexity and delay analysis of MCTF interframe wavelet structures", ISO/IEC JTC1/WG11 Doc. M8520, July 2002.
[4] M. Flierl and B. Girod, "Video coding with motion-compensated lifted wavelet transforms", Proc. PCS 2003.
[5] W. Sweldens, "A custom-design construction of biorthogonal wavelets", J. Appl. Comp. Harm. Anal., vol. 3, no. 2, pp. 186-200, 1996.
[6] I. Daubechies and W. Sweldens, "Factoring wavelet transforms into lifting steps", J. Fourier Anal. Appl., vol. 4, no. 3, pp. 247-269, 1998.
[7] T. Wiegand et al., "Rate-Constrained Coder Control and Comparison of Video Coding Standards", IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 688-703, July 2003.