1774 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 16, NO. 7, JULY 2007

Rate Distortion Optimization for H.264 Interframe Coding: A General Framework and Algorithms

En-Hui Yang, Senior Member, IEEE, and Xiang Yu

Abstract: Rate distortion (RD) optimization for H.264 interframe coding with complete baseline decoding compatibility is investigated on a frame basis. Using soft decision quantization (SDQ) rather than the standard hard decision quantization, we first establish a general framework in which motion estimation, quantization, and entropy coding (in H.264) for the current frame can be jointly designed to minimize a true RD cost given previously coded reference frames. We then propose three RD optimization algorithms: a graph-based algorithm for near-optimal SDQ in H.264 baseline encoding given motion estimation and quantization step sizes, an algorithm for near-optimal residual coding in H.264 baseline encoding given motion estimation, and an iterative overall algorithm to optimize H.264 baseline encoding for each individual frame given previously coded reference frames, with them embedded in the indicated order. The graph-based algorithm for near-optimal SDQ is the core; given motion estimation and quantization step sizes, it is guaranteed to perform optimal SDQ if the weak adjacent-block dependency utilized in the context adaptive variable length coding of H.264 is ignored for optimization. The proposed algorithms have been implemented based on the reference encoder JM82 of H.264 with complete compatibility to the baseline profile. Experiments show that, for a set of typical video test sequences, the graph-based algorithm for near-optimal SDQ, the algorithm for near-optimal residual coding, and the overall algorithm achieve, on average, 6%, 8%, and 12% rate reduction, respectively, at the same PSNR (ranging from 30 to 38 dB) when compared with the RD optimization method implemented in the H.264 reference software.
Index Terms: Fixed-slope lossy compression, H.264 hybrid coding, rate distortion (RD) optimization, soft decision quantization (SDQ).

Manuscript received December 8, 2006; revised March 1, 2007. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under Grants RGPIN203035-02 and RGPIN203035-06 and under a Collaborative Research and Development Grant, in part by the Premier's Research Excellence Award, in part by the Canadian Foundation for Innovation, in part by the Ontario Distinguished Researcher Award, and in part by the Canada Research Chairs Program. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Tamas Sziranyi. The authors are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: ehyang@uwaterloo.ca; x23yu@bbcr.uwaterloo.ca). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2007.896685

I. INTRODUCTION

H.264, the newest hybrid video compression standard [2], has proved its superiority in coding efficiency over its predecessors; e.g., it shows a more than 40% rate reduction over H.263 [5]. However, as the enormous volume of video data constantly fuels the demand for better and better compression [19], [20], it is desirable to study how to further enhance compression performance in an H.264 standard-compliant coding environment.

Fig. 1. Illustration of a hybrid coding structure.

H.264 utilizes a well-known hybrid structure, as shown in Fig. 1. Since the quantization part introduces permanent information loss to the video data, the hybrid scheme leads to a lossy compression whose performance is characterized by the rate distortion (RD) function of the source [1].
The four coding parts all contribute to the RD function, and there is no easy way to quantitatively separate their contributions. Therefore, the fundamental tradeoff in the design of a hybrid video compression system, including H.264, is its overall RD performance, based on which many optimization methods, broadly referred to as RD methods, have been developed and widely used in video compression applications [5], [20].

RD methods for video compression can be classified into two categories. The first category computes the theoretical RD function based on a given statistical model for the video data, e.g., [15]-[17]. In general, the challenge in designing a method in the first category is model mismatch due to the nonstationary nature of video data. The second category uses an operational RD function, which is computed from the data to be compressed. Here there are mainly two problems. First, in most operational RD methods, the formulated optimization problem is restricted: the RD cost is optimized only over motion estimation and quantization step sizes. Second, there is no simple way to solve even this restricted optimization problem if the actual RD cost is used. By the actual RD cost, we mean a cost based on the final reconstruction error and the entire coding rate. Because hard decision quantization (HDQ) is used, there is no simple analytic formula expressing the actual RD cost as a function of motion estimation and quantization step sizes, and, hence, a brute-force approach with high computational complexity is likely needed to solve the restricted optimization problem [20]. For this reason, an approximate RD cost is often used in the restricted optimization problem of many operational RD methods. For example, the optimization of motion estimation in [5] is based on the prediction error instead of the actual distortion, which is the quantization error. This paper proposes an operational RD method using the actual RD cost.
The target is RD optimization for hybrid video coding subject to the syntax constraints of the H.264 baseline profile. We first discuss a somewhat hidden parameter to be optimized in addition to prediction mode, reference frame

indexes, motion vectors, and quantization step sizes, and formulate a joint optimization framework. Specifically, using soft decision quantization (SDQ) instead of HDQ, we observe that the quantized residual itself is a free parameter that can be optimized to improve compression performance. Through SDQ, entropy coding is brought into the quantization design. The general optimization framework can then be formulated as jointly designing motion estimation, quantization, and entropy coding in the H.264 hybrid video coding structure. Surprisingly, this generality not only improves the compression performance in terms of the RD tradeoff, but also makes the optimization problem tractable, at least algorithmically. Indeed, with respect to the baseline profile of H.264, we propose three RD optimization algorithms: a graph-based algorithm for near-optimal SDQ, an algorithm for near-optimal residual coding, and an iterative overall algorithm to optimize H.264 baseline profile encoding, with them embedded in the indicated order. The SDQ algorithm is the core. It brings all coding components into the optimization scheme with the actual RD cost as its objective function, enabling us to jointly design them in the hybrid coding structure.

The proposed RD optimization algorithms for H.264 video coding are inspired by a fixed-slope universal lossy data compression scheme considered in [7], which was first initiated in [6] and later extended in [8]. Other related works on practical SDQ include, without limitation, SDQ in JPEG image coding and H.263 video coding (see [9]-[11], [24], [25], and references therein). In [9] and [10], a partial SDQ called rate-distortion optimal thresholding was considered. Recently, Yang and Wang [11] successfully developed an algorithm for optimal SDQ in JPEG image coding to further improve the compression performance of a standard JPEG image codec.
Without considering optimization over motion estimation and quantization step sizes, Wen et al. [24] proposed a trellis-based algorithm for optimal SDQ in H.263 video coding, which, however, is not applicable to SDQ design in H.264 due to the inherent difference between the entropy coding stages of H.264 and H.263. In [25], Schumitsch et al. studied interframe optimization of transform coefficient levels (i.e., quantized transform coefficients) based on a simplified linear model of interframe dependencies. Although the SDQ principle is not new, and this paper is not the first attempt to apply SDQ to practical coding standards, designing algorithms for optimal or near-optimal SDQ in conjunction with a specific entropy coding method is still quite challenging, especially when the involved entropy coding method is complicated. Different entropy coding methods require different algorithms for SDQ. In some cases, for example, SDQ for GIF/PNG coding, where the entropy coding methods are the Lempel-Ziv algorithms [29], [30], the SDQ design problem is still open [31]. Fortunately, in the case of H.264, we are able to tackle the SDQ design issue associated with the context adaptive variable length coding (CAVLC) of H.264 by putting it into the fixed-slope framework. (Related to fixed-slope compression are entropy-constrained [14] and conditional entropy-constrained scalar/vector quantization; see [8] and [10] for their differences and similarities.) Furthermore, our study of SDQ within the fixed-slope scheme naturally leads to a new framework for jointly designing all key components in Fig. 1 for hybrid video coding. Application of the proposed framework to the syntax-constrained optimization of H.264 has shown a significant improvement in RD performance.

This paper is organized as follows. In Section II, we review the hybrid coding in H.264 and RD optimization methods for video compression in the literature.
In Section III, we develop a framework for jointly designing the hybrid coding structure in H.264, with discussions on algorithm designs for residual coding optimization, motion estimation, and the overall joint optimization. Section IV is then dedicated to the core algorithm of SDQ based on CAVLC. Experimental results are presented in Section V, and, finally, conclusions are drawn in Section VI.

II. BACKGROUND

RD optimization of hybrid video coding with H.264 compatibility is subject to the decoding syntax constraints specified in the standard. This section reviews the hybrid coding in H.264 and some related RD methods.

A. Hybrid Video Compression in H.264

The motion estimation design in H.264 has been significantly improved over previous standards. It allows various block sizes from 4×4 to 16×16. It also uses a higher prediction accuracy of 1/4-pixel. According to Girod's study [18], this is the highest precision required to achieve the best performance for motion estimation. For the transform part, H.264 uses the well-known discrete cosine transform (DCT) with a block size of 4×4 in its baseline profile.

Quantization in H.264 is simply achieved by a scalar quantizer. It is defined by 52 step sizes based on an index parameter $QP \in \{0, 1, \ldots, 51\}$. The quantization step size $q$ for a given $QP$ is specified as

$q(QP) = q(QP\,\%\,6) \cdot 2^{\lfloor QP/6 \rfloor}$   (1)

where $QP\,\%\,6$ and $\lfloor QP/6 \rfloor$ are the remainder and quotient of $QP$ divided by 6, so that the step size doubles with every increment of 6 in $QP$. For the purpose of fast implementation, quantization and transform in H.264 are combined together. Specifically, suppose that the decoder receives the quantized transform coefficients $u$ and the quantization parameter $QP$ for a 4×4 block. Then the de-quantization and inverse transform are performed together in one combined integer operation

$\hat{z} = T^{-1}\big(u \cdot q(QP)\big)$   (2)

where the scaling constants absorbed into the inverse transform $T^{-1}$ are defined in the decoding syntax of H.264 (see [21] for details). H.264 supports two entropy coding methods for residual coding, i.e., CAVLC [4] and context adaptive binary arithmetic coding (CABAC) [2]. In the baseline profile, only CAVLC is supported.
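The quantization step-size rule described above (52 step sizes indexed by QP, doubling with every 6 steps) can be sketched in a few lines. The base step sizes for QP = 0 to 5 below are the commonly published H.264 values, supplied here for illustration rather than taken from the text:

```python
# Sketch of the H.264 quantization step-size rule: the step size is indexed
# by QP in {0, ..., 51} and doubles for every increment of 6 in QP.
# The base step sizes for QP = 0..5 are the commonly published values.

BASE_QSTEP = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def qstep(qp: int) -> float:
    """Quantization step size for a given quantization parameter QP."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in [0, 51]")
    # the remainder selects the base step, the quotient doubles it
    return BASE_QSTEP[qp % 6] * (1 << (qp // 6))

# e.g., qstep(4) == 1.0 and qstep(10) == 2.0: adding 6 to QP doubles the step
```

With these base values, the largest step size is qstep(51) = 0.875 × 256 = 224, which matches the commonly cited maximum for H.264.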
As discussed above, each individual coding part in H.264 has been well designed to achieve good coding performance using state-of-the-art technologies. Optimizing an individual part of H.264 alone is thus unlikely to bring much improvement. Meanwhile, a joint optimal design of the whole encoding structure is possible because the standard only specifies a syntax

Fig. 2. Signal flow of a typical hybrid codec as in H.264.

for the coded bit stream, leaving the details of the encoding process open to the designer. In this paper, we propose a joint optimization framework and its algorithm designs for hybrid video coding with complete decoding compatibility to the H.264 baseline profile.

B. Review of Related Rate Distortion Optimization Work

Using the generalized Lagrangian multiplier method [22], Wiegand et al. proposed a simple, effective operational RD method for motion estimation optimization [5], [13]. The mode selection for motion estimation is conducted based on the actual RD cost in a macroblock-by-macroblock manner. For a given prediction mode, motion estimation is optimized based on an operational RD cost, which approximates the actual RD cost, as follows:

$J = d\big(x,\, p(m, f, v)\big) + \lambda \big(r(f) + r(v)\big)$   (3)

where $x$ stands for the original image block, $p(m, f, v)$ is the prediction with given prediction mode $m$, reference index $f$, and motion vector $v$, $d(\cdot,\cdot)$ is a distortion measure, $r(f)$ is the number of bits for coding $f$, $r(v)$ is the number of bits for coding $v$, and $\lambda$ is the Lagrangian multiplier.

Wen et al. [24] proposed an operational RD method for residual coding optimization in H.263 using a trellis-based SDQ design. In H.263, residuals are coded with run-length codes followed by variable length coding (VLC). The VLC in H.263 is simple and does not introduce any dependency among neighboring coefficients; the dependency mainly comes from the run-length code. Therefore, a trellis structure is used to decouple the dependency so that a dynamic programming algorithm can find the optimal path for quantization decisions. In the baseline profile of H.264, however, context adaptive VLC is applied after the run-length coding. The context adaptivity introduces strong dependency among neighboring coefficients; thus, a new design criterion is needed to handle the context adaptivity when designing SDQ for H.264.
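The Lagrangian motion-search cost of the kind described above can be sketched as follows. The SAD distortion measure, the exp-Golomb-style bit estimate, and the candidate list are illustrative assumptions, not the reference-encoder implementation:

```python
# Illustrative sketch of a Lagrangian motion-search cost:
# J = SAD(block, prediction) + lambda * (bits for reference index + motion vector).
# SAD, the exp-Golomb bit estimate, and the candidate format are simplifications.

def sad(block, pred):
    """Sum of absolute differences between two equally sized pixel lists."""
    return sum(abs(a - b) for a, b in zip(block, pred))

def ue_bits(v: int) -> int:
    """Bits of an unsigned exp-Golomb code, a common estimate for side-info bits."""
    return 2 * (v + 1).bit_length() - 1

def motion_cost(block, pred, ref_idx, mv, lam):
    rate = ue_bits(ref_idx) + ue_bits(abs(mv[0])) + ue_bits(abs(mv[1]))
    return sad(block, pred) + lam * rate

def best_candidate(block, candidates, lam):
    """candidates: list of (prediction, ref_idx, mv) triples; returns the cheapest."""
    return min(candidates, key=lambda c: motion_cost(block, c[0], c[1], c[2], lam))
```

Increasing lam shifts the winner toward candidates with cheaper side information (small motion vectors, low reference indexes), which is exactly the rate-distortion tradeoff the multiplier controls.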
A recent study of SDQ in [25] developed a linear model of interframe dependencies and a simplified rate model to formulate an optimization problem that computes the quantization outputs using a quadratic program. From the problem-formulation point of view, our SDQ formulation shares the same spirit as that in [25], except that the latter is more ambitious, as it targets interframe dependencies. From the algorithm-design point of view, [25] gives an optimized determination of transform coefficient levels by considering temporal dependencies while neglecting other factors such as the specific entropy coding method, whereas the graph-based SDQ design to be presented later in this paper provides the optimal SDQ under certain conditions, i.e., the prediction is given and CAVLC is used for entropy coding.

III. SYNTAX-CONSTRAINED OPTIMIZATION FRAMEWORK FOR H.264 INTERFRAME COMPRESSION

In this section, we investigate the syntax-constrained optimization problem for H.264 video coding. By exploring all possible optimization variables within the H.264 hybrid coding scheme, we first establish a general framework in which motion estimation, quantization, and entropy coding for the current frame, given previously coded reference frames, can be jointly designed to minimize the actual RD cost, and then present three RD optimization algorithms as a solution to the syntax-constrained optimization problem.

A. Problem Formulation

Fig. 2 illustrates the signal flow of a typical hybrid encoder as in H.264. Note that the previously coded frames are assumed known in the frame buffer when we discuss optimization of the current frame. For a given distortion measure $d(\cdot,\cdot)$, the actual reproduction error for coding a whole frame $X$ is $d(X, \hat{X})$, where $\hat{X}$ is the reconstruction of $X$.
Correspondingly, the entire rate for coding $X$ comprises five parts: the prediction modes $M$, reference frame indexes $F$, motion vectors $V$, quantization step sizes $Q$, and quantized transform coefficients $U$. For a given entropy coding method with rate function $r(\cdot)$, the entire coding rate is $r(M, F, V, Q, U)$. The actual RD cost for coding $X$ is then

$J = d(X, \hat{X}) + \lambda \cdot r(M, F, V, Q, U)$   (4)

where $\lambda$ is a positive constant, determined by end users based on both the available bandwidth and the expected video quality. From an RD-theoretic point of view, a good coding design finds a set of encoding and decoding algorithms that minimizes the actual RD cost given in (4). However, in the syntax-constrained optimization scenario, the decoding algorithms have already been selected and fixed. Specifically, consider a given 4×4 quantized transform coefficient block $u$ and the corresponding prediction mode $m$, reference index $f$, motion vector $v$, and quantization step size $q$. [Note that (4) is defined for a whole frame while H.264 specifies a block-based coding scheme; for simplicity, however, the block subscript is omitted

hereafter when the discussion is focused on the block-based coding syntax.] The reconstruction $\hat{x}$ is computed by

$\hat{x} = p(m, f, v) + \hat{z}(u, q)$   (5)

where $\hat{z}(u, q)$ is the de-quantized, inverse-transformed residual defined as in (2). Under this constraint, we examine the maximal variability and flexibility an encoder can enjoy before establishing our optimization problem based on the actual RD cost of (4).

Conventionally, the constraint of (5) is used to derive a deterministic quantization procedure, i.e.,

$u = \mathrm{sign}(c) \cdot \lfloor |c| / q + f_o \rfloor$   (6)

which mainly minimizes the quantization distortion, where $c$ is a transform coefficient. The factor $f_o$ is an offset parameter for adapting the quantization outputs to the source distribution to some extent; e.g., there are empirical studies on determining $f_o$ according to the signal statistics to improve the RD compression efficiency. From the syntax-constrained optimization point of view, however, there is no deterministic relationship between $c$ and $u$. Indeed, inspired by the fixed-slope lossy data compression scheme in [7], we see that given $q$, each $u$ (per block, or equivalently $U$ per frame) is itself a free parameter, and one has the flexibility to choose the desired $u$ to minimize (4). Such a way of determining $u$ (or equivalently $U$) is called soft decision quantization.

The idea of trading off a little distortion for better RD performance has already been used, partially and in an ad hoc way, in the H.264 reference software: a whole block of quantized coefficients is discarded under certain conditions, e.g., when there is only one nonzero coefficient taking a value of 1 or -1. This is equivalent to quantizing that coefficient to 0, although a hard decision scalar quantizer outputs 1 or -1. Such practice is well justified by experimental results [2]. To get better compression performance, it is interesting and desirable to study SDQ in a systematic way. The purpose of SDQ is to minimize the actual RD cost by adapting quantization to a specific entropy coding method. Fig.
3 shows the structure of the fixed-slope lossy compression method. Given a residual block $z$ and a quantization step size $q$, the RD-optimal residual coding solves the minimization problem

$\min_{u}\; d(z, \hat{z}) + \lambda \cdot r(u)$   (7)

where $d(z, \hat{z})$ is the actual distortion due to quantization error, $r(u)$ is the total rate for residual coding, and $\lambda$ is a constant, which has an interpretation as the slope of the resulting RD curve. In the case of syntax-constrained optimization, the decoding mapping and the lossless coding algorithms are fixed by the standard, i.e., they accord to CAVLC and the de-quantization syntax of (2). In this case, the problem of (7) reduces to finding $u$ to minimize the RD cost

$\min_{u}\; d\big(z, \hat{z}(u, q)\big) + \lambda \cdot r_{\mathrm{cavlc}}(u)$   (8)

where $q$ is a given quantization step size and the minimization is over all possible quantized values $u$. Such a $u$ is not achieved, in general, by the hard decision process of (6).

Fig. 3. Universal lossy compression scheme for residual coding.

Having described SDQ, we can now state the complete syntax-constrained optimization problem for H.264 hybrid video coding as follows:

$\min_{M, F, V, Q, U}\; d(X, \hat{X}) + \lambda \cdot r(M, F, V, Q, U)$   (9)

In general, the overall solution to (9) represents the best compression performance an encoder under H.264 syntax constraints can possibly achieve for the current frame given previously encoded frames. The optimization problem (9), together with its solution, gives a general framework in which motion estimation and residual coding for the current frame can be jointly designed to minimize the actual RD cost.

B. Problem Solution

In general, (9) is difficult due to the mutual dependency among $M$, $F$, $V$, $Q$, and $U$. To make the problem tractable, we propose an iterative solution in which motion estimation and residual coding are optimized alternately. Specifically, three RD optimization algorithms are developed as follows.

1) Optimal Soft Decision Quantization: Given $(M, F, V, Q)$, SDQ computes

$U = \arg\min_{U}\; d(X, \hat{X}) + \lambda \cdot r(M, F, V, Q, U)$   (10)

Details of our SDQ design based on H.264 baseline coding are presented in the next section.
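The contrast between hard and soft decision quantization can be illustrated with a toy sketch: HDQ applies the deterministic rounding rule per coefficient, while the brute-force "SDQ" below searches a few candidate levels against a cost of the form d + λ·r. The rate model is a crude stand-in for CAVLC, and the offset 1/6 is a typical inter-coding choice; both are assumptions for illustration only:

```python
# Sketch of hard vs. soft decision quantization for one coefficient block.
# HDQ rounds each coefficient deterministically; the brute-force "SDQ" here
# tries a few candidate levels per coefficient against d + lambda*r, with a
# toy rate model (zeros cheap, large levels expensive) standing in for CAVLC.
import math

def hdq(c, q, offset=1/6):
    """Hard decision quantization with a rounding offset (1/6 is typical)."""
    return int(math.copysign(math.floor(abs(c) / q + offset), c))

def toy_rate(u):
    # crude stand-in for the CAVLC rate: ~1 bit for a zero, more for big levels
    return 1 if u == 0 else 3 + 2 * abs(u)

def sdq_block(coeffs, q, lam):
    """Pick each level from {hdq, hdq shrunk by one, 0} to minimize d + lambda*r."""
    out = []
    for c in coeffs:
        h = hdq(c, q)
        cands = {h, h - int(math.copysign(1, h)) if h else 0, 0}
        out.append(min(cands, key=lambda u: (c - u * q) ** 2 + lam * toy_rate(u)))
    return out
```

With λ = 0, the search reproduces the distortion-minimizing hard decision; as λ grows, small coefficients get zeroed because the rate saving outweighs the added distortion, which is exactly the behavior SDQ exploits systematically.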
2) Residual Coding Optimization: Given $(M, F, V)$, residual coding optimization computes

$(Q, U) = \arg\min_{Q, U}\; d(X, \hat{X}) + \lambda \cdot r(M, F, V, Q, U)$   (11)

Examining the distortion term in (11), we see that it is macroblock-wise additive. As will be discussed in the next section, even though the rate term for $U$ is not strictly macroblock-wise additive, the adjacent-block dependency used in coding $U$ is so weak that we can ignore it in our optimization and simply regard the rate as block-wise additive. Thus, the main difficulty lies in the rate term for $Q$, which represents a first-order predictive coding method [2]. As such, the optimization problem in (11) cannot be solved in a macroblock-by-macroblock manner. To tackle the adjacent-macroblock dependency from the coding of $Q$, we develop a trellis structure with one stage per macroblock and 52 states at each stage. Each stage corresponds to a macroblock, while each state corresponds to a quantization step size. States between two neighboring stages are fully connected with each other. The RD cost for a transition from the $i$th state at the $(k-1)$th stage to the $j$th state at the $k$th stage is computed in two parts, i.e., the rate for coding $q_j$ given $q_i$ and the RD cost for coding the $k$th macroblock using $q_j$, which is computed using SDQ. The RD cost for each state at the initial stage is equal to the RD cost resulting from encoding the first macroblock using that state's quantization step size and the

corresponding optimal SDQ. Then, dynamic programming can be used to solve (11).

Apparently, the above solution is computationally expensive, as it involves running SDQ for each of the 52 states at each stage and then searching the whole trellis. In practice, however, there is no need for this full-scale dynamic programming, because the RD cost of coding a macroblock is much greater than the rate for coding its quantization step size. This implies that, very likely, the globally optimal quantization step size for each macroblock will lie within a small neighborhood of the best quantization step size obtained when the rate for coding $Q$ is ignored in the cost, and one can apply dynamic programming to a much-reduced trellis with states at each stage limited to such a small neighborhood. To this end, we first propose the following procedure to find the best $q$ when the rate for coding $Q$ is ignored.

Step 1) Initialize $q$ from the given $\lambda$ using the empirical relation between $\lambda$ and the quantization step size proposed in [13].   (12)
Step 2) Compute $U$ by the SDQ algorithm.
Step 3) Fix $U$. Compute $q$ by solving $\partial J / \partial q = 0$. As a result, we obtain a refined $q$, which is then rounded to one of the 52 predefined values in H.264.
Step 4) Repeat Steps 2 and 3 until the decrement of the RD cost is less than a prescribed threshold.

Simulations show that (12) provides a good initial point. After one iteration, the obtained $q$ is quite close to the best quantization step size with the rate for coding $Q$ ignored. We then select a neighboring region of this $q$ to build up the trellis at each stage, and, hence, the computational complexity is greatly reduced. Our experiments show that dynamic programming applied to this reduced trellis achieves almost the same performance as that applied to the full trellis.

3) Joint Optimization Algorithm: Based on the algorithm for near-optimal residual coding, a joint optimization algorithm for solving (9) is proposed to alternately optimize motion estimation and residual coding as follows.
Step 1) (Motion estimation) For given residual reconstruction, we compute by (13), which is equivalent to (9) for given.
Step 2) (Residual coding) For given, the process in Section III-B2 is used to find.
Step 3) Repeat Steps 1 and 2 until the decrement of the actual RD cost is less than a given threshold.

We now study the solution to (13), which involves mode selection and motion estimation. In [5], the prediction mode is selected for each macroblock by computing the actual RD cost corresponding to each mode and choosing the one with the minimum. This method of mode selection is also used in this paper. Then, for a pixel block with its residual reconstruction and a given mode, is computed by (14)

Compare (14) with (3). For given, (14) is equivalent to searching for a prediction to match in (3). Thus, the same search algorithm is used to solve (14) as the one for (3) in [5]. The computational complexity for (14) and (3) is almost the same, since the time for computing is negligible. For a given, the joint optimization algorithm starts with, which is equivalent to using the motion estimation in [5] as a starting point. Experiments show that, with this initialization, the algorithm converges very quickly; after two iterations, the decrement in the total cost is almost negligible.

C. Comparing the Proposed Scheme With the Conventional One

We first review the conventional optimization framework based on HDQ. With HDQ, quantization outputs are given by a deterministic function with respect to, and, as shown by (6) for H.264. Therefore, in the conventional framework, the true RD cost is minimized over, and, i.e., (15)

A comparison between the proposed framework in (9) and the conventional one in (15) reveals two advantages of the proposed framework. First, we have, since, for given, we can always apply the SDQ of Section III-B1 to reduce the RD cost. Second, the problem of optimizing the true RD cost becomes algorithmically tractable; i.e., as discussed in Section III-B, an iterative solution is easily established to optimize over and. The solution is at least feasible, although it may not be provably globally optimal. With the conventional framework of (15), on the other hand, it is impractical to optimize the true RD cost over, and, because doing so would require going through the residual coding procedure to evaluate the cost for all possible, and. Overall, due to SDQ, the new framework supports better RD performance and features a feasible solution to minimizing the true RD cost for hybrid video coding.

IV. SOFT DECISION QUANTIZATION ALGORITHM DESIGN

In this section, we present our core graph-based SDQ algorithm for solving the minimization problem given in (10). In general, SDQ is a search in a vector space of quantization outputs for a tradeoff between quality and rate. The efficiency of the search largely depends on how we discover and exploit the structure of the vector space, which is shaped by the de-quantization syntax and the entropy coding method of CAVLC. In this paper, we propose to use dynamic programming to perform the search, which requires an additive evaluation of the RD cost. In the following, we first show the additive distortion computation in the DCT domain based on the de-quantization syntax reviewed in Section II-A. Second, we design a graph for additive evaluation of the rate based on an analysis of CAVLC, with states defined according to the level coding and connections specified according to the run coding. Finally, we discuss the optimality of the graph-based algorithm, showing that the graph design helps to solve the minimization problem of (10).
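The alternating procedure of Steps 1)-3) is a generic coordinate-descent loop. The sketch below illustrates its control flow in Python; cost, motion_step, and residual_step are hypothetical stand-ins for the true RD cost, the motion estimation of Step 1, and the SDQ-based residual coding of Step 2, not the actual H.264 modules.

```python
# Sketch of the alternating joint optimization (Steps 1-3).
# All three callables are hypothetical stand-ins for the real modules.

def joint_optimize(cost, motion_step, residual_step, m0, u0,
                   threshold=1e-6, max_iters=20):
    """Alternate motion estimation and residual coding until the
    decrement of the cost falls below `threshold`."""
    m, u = m0, u0
    prev = cost(m, u)
    cur = prev
    for _ in range(max_iters):
        m = motion_step(u)            # Step 1: motion for fixed residual
        u = residual_step(m)          # Step 2: residual for fixed motion
        cur = cost(m, u)
        if prev - cur < threshold:    # Step 3: stop on small decrement
            break
        prev = cur
    return m, u, cur

# Toy illustration: a separable quadratic cost minimized at m = u = 3,
# with each step solving its own one-variable subproblem exactly.
cost = lambda m, u: (m - u) ** 2 + (u - 3) ** 2
m, u, c = joint_optimize(cost,
                         motion_step=lambda u: u,              # argmin over m for fixed u
                         residual_step=lambda m: (m + 3) / 2,  # argmin over u for fixed m
                         m0=0.0, u0=0.0)
```

Because each exact step can only decrease the cost, the loop converges monotonically; the paper observes that two iterations already capture most of the gain.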

YANG AND YU: RATE DISTORTION OPTIMIZATION FOR H.264 INTERFRAME CODING 1779

Fig. 4. Graph structure for SDQ based on CAVLC. There are 16 columns according to the 16 coefficients. A column consists of multiple state groups, according to different ZL. The left panel shows the connections between these groups. Each group initially contains the set of states defined on the right panel, while eventually only those states that receive valid connections remain.

A. Distortion Computation in DCT Domain

The distortion term in (10) is additive in the pixel domain. However, it contains the inverse DCT, which is not only time consuming, but also makes the optimization problem intractable. Consider that the DCT is a unitary transform, which preserves the Euclidean distance. We choose the Euclidean distance for so that the distortion can be computed in the transform domain in an additive manner. Specifically, for a given residual block, the distortion is computed as in [3] by (16) where, and are constants specified by the standard syntax. This equation brings two advantages. The first is high efficiency in computing the distortion. Note that is computed before SDQ for given. Thus, the evaluation of consumes only two integer multiplications together with some shifts and additions per coefficient. More importantly, the second advantage is the resulting element-wise additive computation of the distortion, which enables us to solve the SDQ problem using the Viterbi algorithm, as presented later.

B. Graph Design for Soft Decision Quantization

While CAVLC is designed for each individual block, the coding of CoeffToken (see [4] for details) introduces a certain dependency among neighboring blocks. However, the dependency is very weak. Therefore, in the optimization problem given in (10) for the whole frame, we decouple this weak dependency. In doing so, the optimization of the whole frame can be solved in a block-by-block manner, with each block being 4×4.
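The claim in Section IV-A that the Euclidean distortion can be computed additively in the transform domain follows from unitarity. Below is a small numerical illustration using the ideal orthonormal DCT-II; this is a sketch, not the scaled integer DCT of H.264, whose constants appear in (16).

```python
import math
import random

N = 16  # a zig-zag ordered 4x4 block has 16 coefficients

def dct_matrix(n):
    """Orthonormal DCT-II matrix; a unitary transform, so it preserves
    Euclidean distances (H.264 uses an integer approximation of this,
    absorbed into constant scaling factors)."""
    m = []
    for k in range(n):
        a = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        m.append([a * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n)])
    return m

def apply(mat, v):
    """Multiply matrix `mat` by vector `v`."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in mat]

random.seed(1)
x = [random.uniform(-10, 10) for _ in range(N)]  # residual samples
y = [round(v) for v in x]                        # a toy reconstruction
T = dct_matrix(N)
X, Y = apply(T, x), apply(T, y)

# Squared error is identical in the pixel and transform domains,
# and it decomposes as an element-wise (additive) sum in both.
pixel_sse = sum((a - b) ** 2 for a, b in zip(x, y))
coef_sse = sum((a - b) ** 2 for a, b in zip(X, Y))
```

This element-wise decomposition in the coefficient domain is exactly what lets the distortion be accumulated transition by transition in the graph of Fig. 4.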
That is, the optimal can be determined independently for each. By omitting the subscript, the optimization problem given in (10) reduces to (17) where is the number of bits needed for CAVLC to encode given that its two neighboring blocks have been optimized. Applying the result of (16) to (17), the problem becomes (18) Note that every bold symbol here, e.g.,, represents a 4×4 matrix. For entropy coding, the 4×4 matrix of will be zig-zag ordered into a 1×16 sequence. To facilitate the following discussion on graph design, we introduce new notation: a bar on top of a bold symbol indicates the zig-zag ordered sequence of the corresponding matrix. Then, (18) is rewritten as follows: (19) where we still use the symbol to indicate element-wise multiplication between two vectors. The problem of (19) is equivalent to a search in a vector space of. We now construct a graph, as shown in Fig. 4, to represent this vector space. In the designed graph, each transition stands for a run-level pair, while each path from the initial state HOS to the end state EOS gives a unique sequence of. Moreover, the graph enables an additive rate evaluation corresponding to CAVLC. In the following, we give more details on how to construct this graph.

1) Definition of States According to CAVLC Level Coding: CAVLC encodes levels based on adaptive contexts, which are used to select VLC tables. These adaptive contexts are represented by different states in the graph of Fig. 4. Let us first examine the trailing-one coding rule (see [4] for details). The trailing ones are a set of levels with three features. First, they must be handled at the beginning of the coding process (note that coding is conducted in reverse order of the zig-zag sequence). Second, they are consecutive. Third, there is a restriction to consider at most three of them. To meet these three requirements, we design three types of states.
In addition, CAVLC requires knowledge of the number of trailing ones, i.e.,, both at the beginning of the coding process ( is transmitted) and at the point where the level coding table is initialized.

Fig. 5. Left panel: States and connections defined according to the trailing-one coding rule of CAVLC. HOS is a dummy state, indicating the start of encoding. Right panel: States and connections defined according to the level coding process of CAVLC.

As such, we define six states, Tn3H, Tn2H, Tn1H, Tn2T, Tn1T, and Tn1TH, as shown in the left panel of Fig. 5, where TnjH in the column of represents that is the first trailing one and, in the column of represents that is the th trailing one and, and Tn1TH in the column of represents that is the second trailing one and. Hereafter, these states are also referred to as T-states.

More states are defined based on features for coding levels other than trailing ones. The important factors for coding these levels are the seven coding tables and the table selection criteria. Specifically, denote the seven tables as Vlc(0) to Vlc(6), and the corresponding thresholds for table selection as. Note that, meaning that always switches to another table. Also, is beyond the range of a possible output, meaning that once is selected, it will be used until the end of the current block. Otherwise, the coding table will be switched from to when the current level is greater than for. Therefore, each coding table except needs two states in order to determine the context for choosing the coding table for the next level according to the current level. As shown in the right panel of Fig. 5, 13 states are defined, named either V or. These states are referred to as V-states.

2) Definition of State Groups According to Run Coding: We now examine the run coding process of CAVLC and explain why and how states are clustered into groups. The context for choosing a table to code runs depends on a parameter ZerosLeft (referred to as hereafter), which involves future states in the graph structure.
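The level-table adaptation described in Section IV-B1, in which Vlc(0) always switches and Vlc(6) is used to the end of the block, can be sketched as follows. The threshold values 0, 3, 6, 12, 24, 48 are the commonly cited CAVLC table-switch thresholds; treat them as an assumption to check against the standard text rather than as normative here.

```python
# Sketch of CAVLC level-table adaptation.  Thresholds are the commonly
# cited switch values (an assumption, not quoted from the standard).
THRESHOLDS = [0, 3, 6, 12, 24, 48]  # for Vlc(0)..Vlc(5); Vlc(6) is terminal

def next_table(table, level):
    """Return the VLC table index for the *next* level, given the table
    used for the current level and the current level's value."""
    if table < 6 and abs(level) > THRESHOLDS[table]:
        return table + 1   # current level exceeded the threshold: switch
    return table           # otherwise keep the current table

def table_trace(levels, start=0):
    """Trace the table index over a sequence of levels (CAVLC codes
    levels in reverse zig-zag order)."""
    t, trace = start, []
    for lv in levels:
        trace.append(t)
        t = next_table(t, lv)
    return trace
```

Because the table for the next level depends only on the current table and on whether the current level exceeds its threshold, each table except Vlc(6) needs two states in the graph, matching the V-state construction above.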
To build this dependency into the definition of states, we define a state group for each different. As shown in Fig. 4, a state group initially consists of all T-states and V-states. For the column of coefficient, there are groups, corresponding to. Besides the run coding table selection, the formation of state groups according to provides two other advantages. First, it naturally leads us to know TotalZeros for every path in the graph. Second, it enables us to include the coding rate of CoeffToken in the optimization process by providing the value of (referred to as hereafter). In addition, is also used to initialize the level coding table.

3) Connecting States to Build Up a Graph: Connections from one column to another are established in two steps. The first is to connect state groups, and the second is to further specify connections between states in two connected groups. Specifically, HOS is connected to all groups, while a group in the column of is connected to EOS only if its equals. Moreover, consider the th group in the column of with being and the th group in the column of with being, where. These two groups are connected if and only if. The outcome of this rule is illustrated in Fig. 4.

Now we discuss connections between two groups. First, two rules are defined as and between T-states, as shown in the left panel of Fig. 5. Second, connections between V-states are established by two rules, as illustrated in the right panel of Fig. 5.
1) The state will go to both and.
2) The state will go to both and.
Third, we utilize the level coding table initialization rule to set up the other necessary connections, including those from the initial state HOS and those to the end state EOS.
1) Connections from HOS to T-states. HOS is connected to Tn3H in the column corresponding to when; HOS is connected to Tn2H in the column corresponding to when; HOS is connected to all Tn1H states.
2) Connections from HOS to V-states in a group with in the column corresponding to: This is for the case where. Connect HOS to and if; connect HOS to and if.
3) Connections from Tn1H to V-states in a group with in the column corresponding to: This is for the case where. Connect Tn1H to and if; connect Tn1H to and if.
4) Connections from Tn1TH to V-states in a group with in the column corresponding to: This is for the case where. Connect Tn1TH to and if; connect Tn1TH to and if.
5) Connecting Tn1T to and.
Eventually, while each group initially contains 19 states as shown in Fig. 4, only those states that receive valid connections remain. The graph ends at a dummy state EOS.

4) Metric Assignment: In general, because the output of a V-state can be any integer within a given range, there exist multiple transitions, called parallel transitions, for a connection to a V-state. Consider a connection from a state in the column of to a state in the column of. Denote the output range of as. There will be parallel transitions from to, each corresponding to a unique output. Now, we assign metrics to three types of transitions, i.e., a transition starting from HOS, a transition ending at EOS, and a transition from a state in the column of to another state

in the column of. The metric for a transition from HOS to in the column of is (20) where the first term is the distortion for quantizing to zeros as the encoding starts with, the last two terms correspond to the RD cost for quantizing to, and is the th element of the constant vector in (18). The metric for a transition from in the column of to in the column of is defined as (21) where the first term computes the distortion for quantizing coefficients to zero, the second term is the rate cost for coding the run with given by the run coding table at state, and the last two terms are the RD cost for quantizing to with determined by the level coding table at state. Finally, for a transition from a state in the column corresponding to to EOS, the RD cost is (22) which corresponds to the distortion for quantizing all remaining coefficients from to to zeros.

C. Algorithm, Optimality, and Complexity

With the above metric assignments, the problem of (19) can be solved by running dynamic programming over the graph of Fig. 4. In other words, the optimal path resulting from dynamic programming applied to the graph of Fig. 4 gives rise to an optimal solution to (19), as shown in the following theorem.

Theorem: Given a 4×4 residual block, applying dynamic programming for a search in the proposed graph gives the optimal solution to the SDQ problem of (19).

The proof of the above theorem is sketched as follows. For a given input sequence, any possible sequence of quantization outputs corresponds to a path in the proposed graph, and vice versa. Define a metric for each transition in the graph by (20) to (22). A careful examination of the details of CAVLC shows that the accumulated metric along any path leads to the same value as evaluating the RD cost in (19) for the corresponding output sequence.
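The proof sketch hinges on the metric being additive along paths. The following toy Viterbi-style search over a generic trellis (a stand-in, not the CAVLC graph of Fig. 4) shows how additive edge metrics let dynamic programming recover the minimum-cost path:

```python
# Minimal Viterbi-style search over a toy trellis with additive edge
# metrics.  The states and metrics here are illustrative only.

def viterbi(columns, edges, start="HOS", end="EOS"):
    """columns: list of state lists, with [start] first and [end] last.
    edges: dict mapping (u, v) -> additive metric for that transition.
    Returns (minimum accumulated metric, corresponding path)."""
    best = {start: (0.0, [start])}  # state -> (best cost so far, path)
    for col in columns[1:]:
        nxt = {}
        for v in col:
            cands = [(c + edges[(u, v)], path + [v])
                     for u, (c, path) in best.items() if (u, v) in edges]
            if cands:
                nxt[v] = min(cands)  # keep only the best survivor per state
        best = nxt
    return best[end]

# A small trellis: the cheapest path is HOS -> a -> d -> EOS (cost 3).
columns = [["HOS"], ["a", "b"], ["c", "d"], ["EOS"]]
edges = {("HOS", "a"): 1, ("HOS", "b"): 2,
         ("a", "c"): 5, ("a", "d"): 1, ("b", "c"): 1, ("b", "d"): 4,
         ("c", "EOS"): 1, ("d", "EOS"): 1}
best_cost, best_path = viterbi(columns, edges)
```

Because each column keeps only one survivor per state, the work grows linearly in the number of columns rather than exponentially in the number of paths, which is what makes the 16-column search of Fig. 4 practical.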
Thus, when dynamic programming, e.g., the Viterbi algorithm, is applied to find the path with the minimum RD cost, the obtained path gives the quantization output sequence that solves (19).

The complexity of the proposed graph-based SDQ algorithm (i.e., dynamic programming applied to the graph of Fig. 4) mainly depends on three factors, i.e., the number of columns (16), the number of states in each column, and the number of parallel transitions for each connection. Expansion of the graph of Fig. 4 into a full graph reveals that the number of states varies from 17 to 171. With states selectively connected, the major computational cost is to handle the parallel transitions. For a connection from a state in one column to a state in another column, the number of parallel transitions is, where is the range of all possible quantization outputs at the state. From (20) and (21), it follows that the only difference among the RD costs assigned to these parallel transitions is in the RD costs arising from different quantization outputs. Studies on CAVLC show that the rate variation due to different is insignificant compared to the quadratic distortion. This implies that the quantization output for the optimal transition will very likely lie within a small neighborhood of the hard-decision quantization output, which minimizes the quadratic distortion. Thus, the number of parallel transitions to be examined in practice can be much smaller. Our experiments show that it is sufficient to compare as few as four parallel transitions around, and, hence, the complexity is reduced to a fairly low level.

V. EXPERIMENTAL RESULTS

Experiments have been conducted to study the coding performance of the three proposed algorithms for SDQ, residual coding optimization, and overall joint optimization. These algorithms are implemented based on the H.264 reference software JM82 [26]. B-frames are not used, since we target baseline decoder compatibility. Each sequence is divided into and encoded by groups of frames.
In each group, there is one standard I-frame,3 while all the subsequent frames are coded as P-frames. Experimental results are reported with a group size of 21. The range for full-pixel motion estimation is, and five reference frames are used for motion estimation. Comparative studies of the coding performance are shown by RD curves, with the distortion measured by the PSNR defined as, where MSE is the mean square error.

Fig. 6 shows the RD curves for coding various sequences. The RD performance is measured over P-frames only, since I-frames are not optimized. The result is reported on the luma component, as usual. Comparisons are conducted among four encoders: a baseline encoder with the proposed overall joint optimization method; a main-profile reference encoder with the RD optimization method in [5] and CABAC (the coding setting of this encoder is the same as that of the baseline profile except that CABAC is used instead of CAVLC); a baseline reference encoder with the RD optimization method in [5]; and a baseline reference encoder with compromised RD optimization.4 The RD curve for the proposed method is obtained by varying the slope in (4), while the RD curves for the other methods result from varying the quantization step size. Specifically, the six points on the curve of the proposed joint optimization method correspond to. As illustrated in Fig. 6, the baseline encoder with the proposed overall joint optimization method achieves a significant rate reduction over the baseline reference encoder with the RD optimization in [5]. Moreover, experiments over a set of eight video sequences (i.e., Highway, Carphone, Foreman, Salesman,

3 Intraframes are not optimized in this paper. The joint optimization is designed based on interprediction. However, the proposed SDQ is applicable to residual coding for intraframes.
4 This is conducted by disabling the RD optimization option in the JM software.
In this case, empirical formulas are used to compute the RD cost for mode selection, resulting in a compromised RD performance.
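The PSNR used as the distortion measure above is, for 8-bit video, conventionally computed as PSNR = 10 log10(255^2/MSE), with MSE the mean square error over the luma samples. A minimal helper (an illustration, not code from the paper):

```python
import math

def psnr(orig, recon, peak=255.0):
    """PSNR in dB between two equal-length sequences of luma samples.
    `peak` is the maximum sample value (255 for 8-bit video)."""
    assert len(orig) == len(recon) and orig
    mse = sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig)
    if mse == 0:
        return float("inf")  # identical signals: unbounded PSNR
    return 10.0 * math.log10(peak * peak / mse)
```

For example, reconstruction errors of (1, 2, 1, 2) on four samples give MSE = 2.5 and a PSNR of about 44 dB, within the 30-38 dB range over which the RD curves are reported.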

Fig. 6. RD curves of four coding methods for coding the video sequences Foreman, Highway, and Carphone.

Fig. 7. Comparison of the coding gain for the three proposed algorithms, Enc(SDQ), Enc(SDQ + QP), and Enc(SDQ + QP + ME).

Silent, Container, Mother-Daughter, Grandma) show that the proposed joint optimization method achieves an average 12% rate reduction while preserving the same PSNR over the RD optimization in [5] with the baseline profile, and a 23% rate reduction over the baseline encoder with compromised RD optimization.

Fig. 7 compares the coding gain for the three proposed algorithms. For simplicity, the encoders with the proposed algorithms are referred to as, and, while the fourth encoder is called Enc (baseline, [5]). For, motion estimation and quantization step sizes are computed using the baseline method in [5]; for, the proposed residual coding optimization is performed based on the motion estimation obtained using the baseline method in [5]. It is shown that approximately half of the gain for overall joint optimization comes from SDQ,5 while QP and ME together contribute the other half. On average, our experiments show rate reductions of 6%, 8%, and 12% while preserving PSNR by, and, respectively, over Enc (baseline, [5]). In terms of program execution time with our current implementation, the baseline encoder using the RD optimization of [5] takes 1 s to encode a P-frame; SDQ adds 1 s for each P-frame; takes 6 s to encode each frame; and the overall optimization with takes 15 s per frame.

5 It may be interesting to relate the SDQ gain to the picture texture. In general, they can be related to each other qualitatively through the effectiveness of motion estimation, i.e., the gain from SDQ is higher when the energy of the residual signal is greater. Usually, this corresponds to less effective motion estimation, which may be observed for highly textured pictures.
The complexity of comes from the process of exploring a neighboring region of five quantization step sizes. The complexity of the overall algorithm mainly comes from the iterative procedure, for which two iterations are used, since by observation the RD cost does not decrease much after two iterations. Frankly, the current implementation is not efficient, and there is plenty of room to improve the software structure and efficiency. Meanwhile, compared with the RD method in [5] and the compromised RD method, the proposed approach seeks better RD performance while maintaining the decoding complexity. It targets off-line applications such as video delivery, for which RD performance is more important and a complicated encoder is normally acceptable, since encoding is carried out only once.

The proposed joint optimization algorithm works in a frame-by-frame manner. Clearly, the optimization of the current P-frame encoding will impact the coding of the next

Fig. 8. Relative rate savings averaged over various numbers of frames for coding the sequence Salesman.

P-frame. Thus, it is interesting to see this impact as the number of optimized P-frames increases. Fig. 8 shows the relative rate savings (see its definition in [5]) of the proposed joint optimization algorithm over the baseline reference encoder with compromised RD optimization for various numbers of P-frames. Also shown in Fig. 8 is the result for the RD method in [5]. Although the proposed joint optimization algorithm consistently provides better gains than the RD method in [5], the relative rate savings decreases as increases in both cases. This warrants the joint optimization of a group of frames, which is left open for future research.

VI. CONCLUSION AND DISCUSSION

Using SDQ, we have proposed a general framework in which motion estimation, quantization, and entropy coding in the hybrid coding structure for the current frame can be jointly designed to minimize a true RD cost given previously coded reference frames. Within the framework, we have then developed three RD optimization algorithms: a graph-based algorithm for near-optimal SDQ in H.264 baseline encoding given motion estimation and quantization step sizes, an algorithm for near-optimal residual coding in H.264 baseline encoding given motion estimation, and an iterative overall algorithm to optimize H.264 baseline encoding for each individual frame given previously coded reference frames, with these algorithms embedded in the indicated order. It has been shown that if the weak adjacent-block dependency utilized in the CAVLC of H.264 is ignored for optimization, the proposed graph-based algorithm for SDQ is indeed optimal, and so is the algorithm for residual coding. These algorithms have been implemented based on the reference encoder JM82 of H.264 with complete compatibility to the baseline profile.
Experiments have demonstrated that for a set of typical video test sequences, the graph-based SDQ algorithm, the algorithm for residual coding, and the iterative overall algorithm achieve, on average, 6%, 8%, and 12% rate reduction, respectively, at the same PSNR (ranging from 30 to 38 dB) when compared with the RD optimization method implemented in the H.264 reference software.

Although we have focused mainly on H.264, especially its baseline profile, our proposed optimization framework is applicable to other hybrid video coding methods such as H.263, MPEG-2, and MPEG-4 as well. Of course, the detailed optimization algorithm design, especially the SDQ design, will depend on each specific video coding method. The SDQ design proposed in this paper is based on CAVLC in H.264. To improve the coding performance of the main-profile encoder for H.264, SDQ can be designed based on the CABAC method and embedded into the joint optimization framework, as shown in [28].

Many problems concerning RD optimization both within and beyond our proposed framework remain open, however. For example, within the proposed framework, it is interesting to see how to further reduce the computational complexity of the proposed algorithm for residual coding and of the iterative overall joint optimization algorithm while maintaining the RD performance. It is also interesting to seek an optimal solution to (13). A more challenging problem is to extend our proposed optimization framework to the joint optimization of a group of frames. These issues are left open for future research.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their constructive comments, which have helped to improve the presentation of this paper.

REFERENCES

[1] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[2] T. Wiegand, G. J. Sullivan, and A. Luthra, Draft ITU-T Rec.
H.264/ISO/IEC 14496-10 AVC, presented at the JVT ISO/IEC MPEG, ITU-T VCEG, Doc. JVT-G050r1, 2003.
[3] E.-H. Yang and X. Yu, On joint optimization of motion compensation, quantization and baseline entropy coding in H.264 with complete decoder compatibility, in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Philadelphia, PA, Mar. 2005, pp. II-325-328.
[4] G. Bjøntegaard and K. Lillevold, Context-adaptive VLC (CVLC) coding of coefficients, presented at the JVT-C028, Joint Video Team (JVT) ISO/IEC MPEG, ITU-T VCEG, 3rd Meeting, Fairfax, VA, May 6-10, 2002.
[5] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, Rate-constrained coder control and comparison of video coding standards, IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 688-703, Jul. 2003.
[6] E.-H. Yang and S.-Y. Shen, Distortion program-size complexity with respect to a fidelity criterion and rate distortion function, IEEE Trans. Inf. Theory, vol. 39, no. 1, pp. 288-292, Jan. 1993.
[7] E.-H. Yang, Z. Zhang, and T. Berger, Fixed-slope universal lossy data compression, IEEE Trans. Inf. Theory, vol. 43, no. 5, pp. 1465-1476, Sep. 1997.
[8] E.-H. Yang and Z. Zhang, Variable-rate trellis source encoding, IEEE Trans. Inf. Theory, vol. 45, no. 3, pp. 586-608, Mar. 1999.
[9] K. Ramchandran and M. Vetterli, Rate-distortion optimal fast thresholding with complete JPEG/MPEG decoder compatibility, IEEE Trans. Image Process., vol. 3, no. 9, pp. 700-704, Sep. 1994.
[10] M. Crouse and K. Ramchandran, Joint thresholding and quantizer selection for transform image coding: Entropy constrained analysis and applications to baseline JPEG, IEEE Trans. Image Process., vol. 6, no. 2, pp. 285-297, Feb. 1997.
[11] E.-H. Yang and L. Wang, Joint optimization of run-length coding, Huffman coding and quantization table with complete baseline JPEG decoder compatibility, U.S. Patent Application, 2004.

[12] E.-H. Yang and X. Yu, Optimal soft decision quantization design for H.264, in Proc. 9th Canad. Workshop on Information Theory, Montréal, QC, Canada, Jun. 2005, pp. 223-226.
[13] T. Wiegand and B. Girod, Lagrangian multiplier selection in hybrid video coder control, in Proc. Int. Conf. Image Processing, Oct. 2001, pp. 542-545.
[14] P. A. Chou, T. Lookabaugh, and R. M. Gray, Entropy-constrained vector quantization, IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 1, pp. 31-42, Jan. 1989.
[15] W. Ding and B. Liu, Rate control of MPEG video coding and recording by rate quantization modeling, IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 2, pp. 12-20, Feb. 1996.
[16] H. M. Hang and J. J. Chen, Source model for transform video coder and application-part I: Fundamental theory, IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 4, pp. 287-298, Apr. 1997.
[17] N. Kamaci and Y. Altunbasak, Frame bit allocation for H.264 using Cauchy-distribution based source modelling, in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Philadelphia, PA, Mar. 2005, pp. II-57-60.
[18] B. Girod, Efficiency analysis of multihypothesis motion-compensated prediction for video coding, IEEE Trans. Image Process., vol. 9, no. 2, pp. 173-183, Feb. 2000.
[19] G. J. Sullivan and T. Wiegand, Rate-distortion optimization for video compression, IEEE Signal Process. Mag., vol. 15, no. 6, pp. 74-90, Nov. 1998.
[20] A. Ortega and K. Ramchandran, Rate-distortion methods for image and video compression, IEEE Signal Process. Mag., vol. 15, no. 6, pp. 23-49, Nov. 1998.
[21] I. E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next Generation Multimedia. Hoboken, NJ: Wiley, 2003.
[22] H. Everett, Generalized Lagrange multiplier method for solving problems of optimum allocation of resources, Oper. Res., vol. 11, no. 3, pp. 399-417, Jun. 1963.
[23] K. Ramchandran, A. Ortega, and M.
Vetterli, Bit allocation for dependent quantization with applications to multiresolution and MPEG video coders, IEEE Trans. Image Process., vol. 3, no. 5, pp. 533-545, Sep. 1994.
[24] J. Wen, M. Luttrell, and J. Villasenor, Trellis-based R-D optimal quantization in H.263+, IEEE Trans. Image Process., vol. 9, no. 8, pp. 1431-1434, Aug. 2000.
[25] B. Schumitsch, H. Schwarz, and T. Wiegand, Inter-frame optimization of transform coefficient selection in hybrid video coding, presented at the Picture Coding Symp., San Francisco, CA, Dec. 2004.
[26] HHI, H.264 Reference Software. [Online]. Available: http://bs.hhi.de/suehring/tml/
[27] T. Wiegand, M. Lightstone, D. Mukherjee, T. G. Campbell, and S. K. Mitra, Rate-distortion optimized mode selection for very low bit rate video coding and the emerging H.263 standard, IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 2, pp. 182-190, Apr. 1996.
[28] E.-H. Yang and X. Yu, Rate distortion optimization of H.264 with main profile compatibility, in Proc. IEEE Int. Symp. Information Theory, Seattle, WA, Jul. 9-14, 2006, pp. 282-286.
[29] J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE Trans. Inf. Theory, vol. IT-23, no. 3, pp. 337-342, May 1977.
[30] J. Ziv and A. Lempel, Compression of individual sequences via variable-rate coding, IEEE Trans. Inf. Theory, vol. IT-24, no. 5, pp. 530-536, Sep. 1978.
[31] E.-H. Yang and J. Zeng, Method, system, and software product for color image encoding, U.S. Patent Application 10/831 656, Apr. 23, 2004.

En-Hui Yang (M'97-SM'00) was born in Jiangxi, China, on December 26, 1966. He received the B.S. degree in applied mathematics from HuaQiao University, Quanzhou, China, and the Ph.D. degree in mathematics from Nankai University, Tianjin, China, in 1986 and 1991, respectively.
Since June 1997, he has been with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada, where he is now a Professor and Canada Research Chair in information theory and multimedia compression. He held a Visiting Professor position at the Chinese University of Hong Kong from September 2003 to June 2004; positions of Research Associate and Visiting Scientist at the University of Minnesota, Minneapolis-St. Paul, the University of Bielefeld, Bielefeld, Germany, and the University of Southern California, Los Angeles, from January 1993 to May 1997; and a faculty position (first as an Assistant Professor and then an Associate Professor) at Nankai University from 1991 to 1992. He is the founding Director of the Leitch-University of Waterloo Multimedia Communications Lab and a Co-Founder of SlipStream Data, Inc. (now a subsidiary of Research In Motion). His current research interests are multimedia compression, multimedia watermarking, multimedia transmission, digital communications, information theory, source and channel coding including distributed source coding and space-time coding, Kolmogorov complexity theory, quantum information theory, and applied probability theory and statistics.

Dr. Yang is a recipient of several research awards, including the 1992 Tianjin Science and Technology Promotion Award for Young Investigators; the 1992 Third Science and Technology Promotion Award of the Chinese National Education Committee; the 2000 Ontario Premier's Research Excellence Award, Canada; the 2000 Marsland Award for Research Excellence, University of Waterloo; and the 2002 Ontario Distinguished Researcher Award. Products based on his inventions and commercialized by SlipStream received the 2006 Ontario Global Traders Provincial Award and were deployed by over 2200 service providers in more than 50 countries, serving millions of home subscribers worldwide every day.
He served, among many other roles, as a Technical Program Vice-Chair of the 2006 IEEE International Conference on Multimedia & Expo (ICME), the Chair of the award committee for the 2004 Canadian Award in Telecommunications, a Co-Editor of the 2004 Special Issue of the IEEE TRANSACTIONS ON INFORMATION THEORY, a Co-Chair of the 2003 U.S. National Science Foundation (NSF) workshop on the interface of Information Theory and Computer Science, and a Co-Chair of the 2003 Canadian Workshop on Information Theory. Xiang Yu received the M.E. degree in physics in 1994 from Tsinghua University, Beijing, China, and the M.E. degree in electrical engineering from Peking University, Beijing, China, in 1997. He is currently pursuing the Ph.D. degree in electrical and computer engineering at the University of Waterloo, Waterloo, ON, Canada. His research interests include data compression, multimedia communications, information theory, image processing, and machine learning.