
84 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 18, NO. 1, JANUARY 2008

The Technique of Prescaled Integer Transform: Concept, Design and Applications

Cixun Zhang, Lu Yu, Member, IEEE, Jian Lou, Student Member, IEEE, Wai-Kuen Cham, Senior Member, IEEE, and Jie Dong

Abstract: Integer cosine transform (ICT) is adopted by H.264/AVC for its bit-exact implementation and significant complexity reduction compared to the discrete cosine transform (DCT), with an impact on peak signal-to-noise ratio (PSNR) of less than 0.02 dB. In this paper, a new technique, named prescaled integer transform (PIT), is proposed. With PIT, all the merits of ICT are kept while the implementation complexity of the decoder is further reduced compared to the corresponding conventional ICT, which is especially important and beneficial for implementation on low-end processors. Since not all PIT kernels are good with respect to coding efficiency, design rules that lead to good PIT kernels are considered in this paper. Different types of PIT and their target applications are examined. Both fixed block-size transform and adaptive block-size transform (ABT) schemes of PIT are also studied. Experimental results show that no penalty in performance is observed with PIT when the PIT kernels employed are derived from the design rules. Up to 0.2 dB of improvement in PSNR for all-intra-frame coding compared to H.264/AVC can be achieved, and the subjective quality is also slightly improved when the PIT scheme is carefully designed. Using the same concept, a variation of PIT, the post-scaled integer transform, can also potentially be designed to simplify the encoder in some special applications. PIT has been adopted in the audio video coding standard (AVS), the Chinese national coding standard.
Index Terms: Adaptive block-size transform (ABT), audio video coding standard (AVS), complexity reduction, discrete cosine transform (DCT), H.264/AVC, integer cosine transform (ICT), prescaled integer transform (PIT), standard, transform, video coding.

Manuscript received September 6, 2005; revised February 14, This work was supported by Natural Science Foundation of China under Grant and Grant This paper was recommended by Associate Editor I. Ahmad. C. Zhang was with the Institute of Information and Communication Engineering, Zhejiang University, Hangzhou , China. He is now with the Institute of Signal Processing, Tampere University of Technology, Tampere FIN-33101, Finland. L. Yu is with the Institute of Information and Communication Engineering, Zhejiang University, Hangzhou , China (yul@zju.edu.cn). J. Lou was with the Institute of Information and Communication Engineering, Zhejiang University, Hangzhou , China. He is now with the Department of Electrical Engineering, University of Washington, Seattle, WA USA. W.-K. Cham is with the Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong. J. Dong was with the Institute of Information and Communication Engineering, Zhejiang University, Hangzhou , China. She is now with the Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong. Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /TCSVT

I. INTRODUCTION

INTEGER cosine transform (ICT) was first introduced by W. K. Cham in 1989 [1] and has been further developed in recent years. It has been proved that some ICTs have almost the same compression efficiency as the discrete cosine transform (DCT) but a much simpler implementation, because only addition and shift operations are needed [2]. Moreover, ICT can avoid the inverse transform mismatch problems of the DCT.
Due to these advantages, the latest international video coding standard H.264/AVC adopted order-4 and order-8 ICTs [3], [4]. In this paper, a technique called prescaled integer transform (PIT) is proposed. PIT can further reduce the implementation complexity of the corresponding ICT while no penalty in performance is observed.

The paper is organized as follows. Fundamentals of the DCT and ICT are first reviewed in Section II. Then the concept of PIT is introduced and examined in detail in Section III. The proposed PIT scheme and its benefits compared to a conventional ICT scheme such as that in H.264/AVC are elaborated. In Section IV, the design rules that lead to good PIT kernels with respect to performance are considered. We also find that different types of PITs have different characteristics and are suitable for different applications. Experimental results and analysis of the fixed block-size transform (FBT) scheme of PIT are given in Section V. In Section VI, adaptive block-size transform (ABT) schemes of PIT are also studied, and the corresponding experimental results and analysis are presented. Section VII concludes the paper.

II. FUNDAMENTALS OF THE DISCRETE COSINE TRANSFORM AND INTEGER COSINE TRANSFORM

The forward and inverse DCT [5] are defined as (1) (2) where and stand for the original input matrix and the DCT coefficient matrix while serves as both forward and inverse DCT kernels. ICT originates from the DCT and was derived using the principle of dyadic symmetry [1]. The forward and inverse ICT are defined as (3) (4) where is the ICT coefficient matrix and is the ICT kernel. The ICT kernel includes two parts, and [1]. For the same ICT, the choice of and is not unique and can be represented as (5) where the, used in forward transform are denoted as, and the, used in inverse transform are denoted as,, respectively. The properties of,,, and their relationship with each other are described in the following paragraphs and will be frequently used in this paper.
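The two-part structure of an ICT kernel described above can be illustrated numerically. The following is a minimal sketch rather than the paper's own formulation: it assumes the kernel is the product of a diagonal normalizing matrix (diagonal `k` below) and an integer matrix (`E` below), and it assumes the usual dyadic-symmetry row layout for an order-4 kernel written as [a,b;c]. The names `ict4_integer_matrix`, `E`, `k` and `T` are hypothetical and chosen only for this illustration.

```python
import math

def ict4_integer_matrix(a, b, c):
    """Integer part of an order-4 ICT kernel [a,b;c] (assumed row layout)."""
    return [
        [c,  c,  c,  c],
        [a,  b, -b, -a],
        [c, -c, -c,  c],
        [b, -a,  a, -b],
    ]

def matmul(p, q):
    """Plain matrix product of two lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def transpose(m):
    return [list(col) for col in zip(*m)]

E = ict4_integer_matrix(2, 1, 1)   # order-4 kernel [2,1;1] used in H.264/AVC
gram = matmul(E, transpose(E))     # diagonal when the integer rows are orthogonal

# Diagonal normalizer: reciprocal Euclidean norm of each integer row.
k = [1.0 / math.sqrt(gram[i][i]) for i in range(4)]

# Normalized kernel: every row of E scaled to unit length.
T = [[k[i] * E[i][j] for j in range(4)] for i in range(4)]
should_be_identity = matmul(T, transpose(T))

print(gram)
```

The integer rows are mutually orthogonal but have different norms, so all the non-integer arithmetic is confined to the diagonal part, which is what later allows the scaling to be merged with quantization.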
In this paper, we will concentrate on the order-4 and order-8 transforms since they are the most useful ones in practical applications and the discussions can be easily extended to transforms

other than order-4 and order-8. Traditionally, order-8 transforms have been used for image and video coding. The order-8 size has the advantage of being large enough to capture the redundancy while being small enough to provide good adaptation to the inhomogeneity inside an image. However, a small-size transform such as the order-4 transform has the advantage of reducing ringing artifacts at edges and discontinuities [2]–[4]. For these reasons, H.264/AVC uses both order-4 and order-8 ICTs [4].

For an order-8 ICT kernel specified as [a,b,c,d;e,f;g] in this paper, and in (5) are defined as follows where subscript 8 is used to indicate the transform size is 8 × 8: (6) and is an 8 × 8 diagonal matrix with its th diagonal elements such that where is the th row vector of. The values of a, b, c, d, e, f, and g in an ICT should be integers. However, they are sometimes expressed as rational numbers by suitably adjusting. For example, in H.264/AVC, the choice of a, b, c, d, e, f, and g is 12/8, 10/8, 6/8, 3/8, 1, 4/8, and 1, which can be regarded as the same as 12, 10, 6, 3, 8, 4, and 8, respectively.

Similarly, for an order-4 ICT kernel specified as [a,b;c] in this paper, and in (5) are defined as follows where subscript 4 is used to indicate the transform size is 4 × 4: (7) where a, b, and c are integers and is a 4 × 4 diagonal matrix with its th diagonal elements such that where is the th row vector of. In H.264/AVC, the values of a, b, and c in the inverse transform are expressed as 1, 1/2, 1, which can be regarded as the same as choosing a, b, and c as 2, 1, 1, respectively.

In the rest of this paper, for convenience, we use the notation, which is a column vector, to represent the main diagonal of an matrix, i.e., (8) Two operators and used in this paper are defined as follows. When,, are all matrices, means (9) and similarly, means (10)

III. CONCEPT OF PRESCALED INTEGER TRANSFORM

A. Conventional Integer Cosine Transform Scheme

Unlike the popular order-8 DCT used in previous standards, such as MPEG1/2/4, H.261 and H.263, H.264/AVC employs order-4 and order-8 ICTs, thus avoiding inverse transform mismatch problems. In H.264/AVC, on the encoder side, using (5), the forward transform and quantization process can be represented as (11). For convenience, we do not take the practical rounding operation in the quantization process into account and here is a real number matrix rather than an integer matrix (11) where is the input matrix; is the quantized transformed coefficient matrix; is the quantization matrix on the encoder side and the dequantization matrix on the decoder side, which is dependent on the quantization parameter (QP), when uniform quantization is used, can be simply replaced by QP; is the forward scaling matrix; and is the quantization-scaling matrix. On the decoder side, the corresponding inverse transform and dequantization process is (12) where is the reconstructed matrix, which theoretically is equal to ; is the inverse scaling matrix; and is the dequantization-scaling matrix.

Equations (11) and (12) above represent the merging of the forward/inverse scaling and the quantization/dequantization operations into one step so as to reduce the computational complexity, as in H.264/AVC. Fig. 1 is the block diagram for the conventional ICT scheme. Note that theoretically, in (11) and in (12) normally contain non-integer elements. However, in actual video coding standards, these are usually implemented using multiplications and shifts to reduce the complexity. For example, in H.264/AVC, in (11) and in (12) contain only integer elements and a right shift is applied to the results of (11) and (12).

Fig. 1. Block diagram of conventional ICT scheme in H.264.

B. Proposed Prescaled Integer Transform Scheme

Let us take the order-4 transform as an example first. In H.264/AVC, the QP period is 6 and a dequantization-scaling matrix is used for 4 × 4 ICT [2], [3], [32]. (13)

For the th transformed coefficient in a block, the search rule for the corresponding dequantization-scaling element in is defined by the three-case rule (14), where % is the modulus operator and x%y means the remainder of x divided by y, defined only for integers x and y with and. The main problem here is that for every transformed coefficient, we need to conduct a 3-D operation, which uses QP and the coordinates of the coefficient in the block, to search for the corresponding element in. Some computations as in (14) are also needed. In order to reduce the computational complexity, and at the same time facilitate parallel processing and take advantage of the efficient multiply/accumulate architecture of many processors, we can fully expand the to ', as in (15), and correspondingly the search rule is simplified to (16): (15) (16) However, the storage will increase from bytes to bytes. This memory size will be much larger when we also take the 8 × 8 ICT into account. In that case, a total memory of bytes should be allocated. Fortunately, the required memory size and computational complexity can be reduced at the same time if the technique of PIT is used. The concept of PIT was first proposed by the authors in [6] and is elaborated below.

Substituting (11) into (12), we get (17) Let (18) which will be called the combined-scaling-quantization matrix in the following. Then the forward transform and quantization process can be represented as (19) In order to derive the PIT kernel, we have (20) and, therefore, the forward PIT kernel is (21) Correspondingly, the inverse PIT kernel is (22) From (17)–(19) above, the inverse transform and dequantization process can be represented as (23) From (23), we can also derive the theoretical inverse PIT kernel, which is the same as in (22).

Based on the derivation above, theoretically we can define the forward and inverse PIT as (24) (25) However, it should be noted that, similar to the case of ICT shown in (11) and (12), we always implement PIT using (19) and (23) instead of using (24) and (25) directly. In the rest of this paper, for order-4 and order-8 ICT kernels [a,b;c] and [a,b,c,d;e,f;g], the corresponding order-4 and order-8 PIT kernels are denoted as [a,b;c] and [a,b,c,d;e,f;g], respectively. The main idea of PIT is that inverse scaling is moved to the encoder side and combined with forward scaling and quantization as one single process. The fact that no scaling is needed on the decoder side anymore distinguishes PIT from conventional ICT, and this is exactly the reason why PIT can reduce the required memory size and computational complexity on decoder side at

the same time. The block diagram for the PIT scheme is shown in Fig. 2.

Fig. 2. Block diagram of proposed PIT scheme.

When PIT is used, the dequantization-scaling matrix (or more accurately, the dequantization matrix, since no scaling is included any more) is as follows: (26) And for the th transformed coefficient in a block, the search rule for the corresponding element in is further simplified to (27) Comparing (26) and (27) to (13) and (14), and to (15) and (16), respectively, we can clearly see that the decoding complexity is reduced with PIT. First, if the dequantization-scaling matrix is fully expanded, when order-4 ICT is used, a total memory of bytes can be replaced by a memory of only 6 bytes. The memory required when using PIT is much lower than that required using conventional ICT. Otherwise, if the dequantization-scaling matrix is not expanded, a total memory of bytes can be reduced to 6 bytes. Though this saving is trivial when only order-4 ICT is considered, every non-zero coefficient in this case needs a lookup operation and some extra operations when using ICT. The computational complexity of PIT is therefore lower, because the 3-D lookup operation is replaced by a 1-D lookup operation which only uses QP, and at the same time no extra operations as in (14) are needed. In either case, on the decoder side, the required memory size can be reduced and the 3-D lookup operation can always be replaced by a 1-D lookup operation; thus pipelining and parallel processing are facilitated and extra computation or storage memory can be saved. On the encoder side, comparing (11) to (19), we find that since forward scaling, inverse scaling and quantization are combined as one single process and the original quantization-scaling matrix is replaced by a combined-scaling-quantization matrix of the same size, both the computational and storage complexity remain unchanged.
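The difference between the two schemes can be sketched in a few lines. The following is an illustration under stated assumptions, not the standardized integer implementation: rounding is ignored (as in the derivation above), a single real-valued step size `qstep` stands in for the QP-dependent quantization matrix, the scaling matrix is taken to be the outer product of the reciprocal row norms of the integer kernel, and the order-4 kernel [3,1;2] with the usual dyadic-symmetry row layout is used. All function and variable names are hypothetical.

```python
import math

def ict4_integer_matrix(a, b, c):
    """Integer part of an order-4 kernel [a,b;c] (assumed row layout)."""
    return [[c, c, c, c], [a, b, -b, -a], [c, -c, -c, c], [b, -a, a, -b]]

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def scaling_matrix(e):
    """s[i][j] = k_i * k_j, with k_i the reciprocal norm of row i of e."""
    k = [1.0 / math.sqrt(sum(v * v for v in row)) for row in e]
    return [[ki * kj for kj in k] for ki in k]

def encode_ict(e, x, qstep):
    """Conventional scheme: forward scaling merged with quantization."""
    s = scaling_matrix(e)
    core = matmul(matmul(e, x), transpose(e))
    return [[core[i][j] * s[i][j] / qstep for j in range(4)] for i in range(4)]

def decode_ict(e, levels, qstep):
    """Conventional scheme: the factor depends on QP and on position (i, j)."""
    s = scaling_matrix(e)
    y = [[levels[i][j] * qstep * s[i][j] for j in range(4)] for i in range(4)]
    return matmul(matmul(transpose(e), y), e)

def encode_pit(e, x, qstep):
    """PIT: forward scaling, inverse scaling and quantization in one step."""
    s = scaling_matrix(e)
    core = matmul(matmul(e, x), transpose(e))
    return [[core[i][j] * s[i][j] * s[i][j] / qstep for j in range(4)]
            for i in range(4)]

def decode_pit(e, levels, qstep):
    """PIT: the factor depends on QP only, so no scaling table is needed."""
    y = [[levels[i][j] * qstep for j in range(4)] for i in range(4)]
    return matmul(matmul(transpose(e), y), e)

E = ict4_integer_matrix(3, 1, 2)
X = [[12, -3, 7, 0], [5, 8, -6, 1], [-9, 4, 2, 10], [3, -1, 0, 6]]
QSTEP = 2.5
rec_ict = decode_ict(E, encode_ict(E, X, QSTEP), QSTEP)
rec_pit = decode_pit(E, encode_pit(E, X, QSTEP), QSTEP)
```

Both decoders return the input block up to floating-point error; the only structural difference is that `decode_ict` indexes its factor by step size and coefficient position, whereas `decode_pit` uses the step size alone.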
We only consider the order-4 transform in the discussion above. When both order-4 and order-8 transforms are used, a total memory of bytes can be saved and a memory of only 6 bytes is needed instead, assuming that the order-4 and order-8 PITs employed are compatible, which will be discussed in more detail in Section VI. Further, besides H.264/AVC, in some other video coding standards like AVS part 2 [12], which is the Chinese national coding standard for digital TV broadcasting and HD-DVD, the scheme of periodic QP is not used in order to reduce computational complexity. In this case, PIT can provide an even larger saving of memory. Only a memory of 64 bytes is needed instead of a memory of bytes because the QP range is 0–63 and order-8 PIT is employed in AVS part 2. In order to save memory, the scaling and quantization/dequantization are separated in AVS part 2. In this case, a memory of bytes for the inverse scaling matrix can be saved and at the same time the computational complexity is reduced when PIT is used because no scaling is needed on the decoder side any more.

C. Post-Scaled Integer Transform

A variation of PIT might be the post-scaled integer transform, in which the scaling of the forward transform is moved to the decoder side. This can potentially be used to simplify the encoder in some kinds of applications. The issues of the post-scaled integer transform are similar to those of PIT. In this paper, we consider only PIT; similar analysis approaches can be carried out for the post-scaled integer transform.

D. Fixed-Point Error Analysis of Prescaled Integer Transform

In this part, we first summarize our previous work on fixed-point error analysis of ICT and PIT, and then give some comparison between them. In [7], an analysis of the statistical error behavior due to rounding of the transformed coefficients for the DCT and ICT is presented. The analysis considers a system which forwardly transforms pixels into coefficients and then inversely transforms back into reconstructed pixels.
The transform kernel is assumed to be implemented precisely but the transformed coefficients are represented using a finite number of bits and so errors are generated. Both theoretical and experimental results show that [10,9,6,2;9,3;8] generates a smaller mean square error between the original and reconstructed pixels than the DCT when the same number of bits is used for the representation of the transformed coefficients. When the number of bits increases, this mean square error decreases. It approaches zero faster in the case of ICT than the DCT. In [8], similar theoretical analysis and experiments are reported for [10,9,6,2;10,4;8], which is also found to generate a smaller mean square error (MSE) between the original and reconstructed pixels than the DCT when the same number of bits is used for the representation of the coefficients. Fig. 3 shows the experimental results for 1-D and 2-D transform systems. The image data are integers generated randomly with uniform distribution over [−128, 127]. Also, it is assumed that the 2-D transform introduces rounding error due to the use of a finite number of bits for the 2-D transform coefficients at the output only and no rounding error is introduced inside the 2-D transform.

The results in [7] and [8] were obtained under the assumption that and are implemented precisely without error. Under this condition, we would expect that PIT and the corresponding ICT have the same performance. However, it is interesting to note that the implementation of cannot be exact because it contains irrational numbers, but when PIT is used the implementation of can be exact, especially when which is usually true, because it contains rational numbers. Furthermore, if the PIT kernel is selected carefully, e.g., [2,1;1], [3,1;2], when, can be represented using few bits. In this case, lossless coding of PIT coefficients can be achieved easily. For the case of ICT, however, lossless coding of transformed coefficients is difficult.
For example, in H.264/AVC, only lossless coding of residue is supported [4].
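The remark about irrational versus rational scaling factors can be checked with exact arithmetic. The sketch below rests on assumptions not stated in the text: the ICT scaling factors are taken to be products of reciprocal row norms of the integer kernel, the PIT weights are taken to be their squares, and the order-4 kernel [3,1;2] uses the usual dyadic-symmetry row layout. Python's `fractions.Fraction` stands in for an exact fixed-point implementation; all names are hypothetical.

```python
import math
from fractions import Fraction

def ict4_integer_matrix(a, b, c):
    """Integer part of an order-4 kernel [a,b;c] (assumed row layout)."""
    return [[c, c, c, c], [a, b, -b, -a], [c, -c, -c, c], [b, -a, a, -b]]

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def is_perfect_square(n):
    r = math.isqrt(n)
    return r * r == n

E = ict4_integer_matrix(3, 1, 2)
norms_sq = [sum(v * v for v in row) for row in E]   # squared row norms

# ICT factors 1/sqrt(n_i * n_j) are irrational wherever n_i * n_j is not a
# perfect square, so they can only be approximated in fixed point.
irrational_positions = [(i, j) for i in range(4) for j in range(4)
                        if not is_perfect_square(norms_sq[i] * norms_sq[j])]

# Assumed PIT weights 1/(n_i * n_j) are rational at every position.
W = [[Fraction(1, norms_sq[i] * norms_sq[j]) for j in range(4)]
     for i in range(4)]

X = [[12, -3, 7, 0], [5, 8, -6, 1], [-9, 4, 2, 10], [3, -1, 0, 6]]
core = matmul(matmul(E, X), transpose(E))
Y = [[core[i][j] * W[i][j] for j in range(4)] for i in range(4)]  # exact
X_rec = matmul(matmul(transpose(E), Y), E)   # inverse uses integers only

print(X_rec == X)
```

Because every weight is a ratio of small integers, the coefficients can be carried without rounding and the integer-only inverse returns the input block exactly, which is the property the lossless-coding remark relies on.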

Fig. 3. MSE of reconstructed pixels versus coefficient bit length in 1-D image vector and 2-D image data block [8]. Left: MSE of reconstructed pixels versus coefficient bit length in 1-D image vector. Right: MSE of reconstructed pixels versus coefficient bit length in 2-D image data block.

IV. DESIGN OF PRESCALED INTEGER TRANSFORM

Just like ICT, not every PIT kernel performs well. It is practically important to find design rules that lead to good PIT kernels. For convenience of discussion, (21) and (22) are rewritten here as (28) and (29): (28) (29) We can see that is constructed by and while is constructed by and. To choose good PIT kernels, we should take these two factors into account. It turns out that, considering compression ability, we should choose a good and correspondingly, while considering bit representation of transformed coefficients we should choose a good and correspondingly. In the following subsections, these two factors are examined in detail and, based on the analysis, design rules are obtained.

A. Consideration of Compression Ability

In (28) and (29), and can be regarded as scaling factors applied to basis vectors of the corresponding forward and inverse ICT kernels to construct the corresponding PIT kernels. Since is diagonal, we can see that the normalized basis vectors of the forward and inverse PIT kernels are the same as the corresponding normalized basis vectors of the forward and inverse ICT kernels, respectively. This observation suggests that PIT kernels and the corresponding ICT kernels have similar compression ability. In the following, transform coding gain [9] and DCT frequency distortion [10] are studied.

1) Transform Coding Gain: Strictly speaking, PIT is not an orthogonal transform, thus we should use the biorthogonal transform coding gain [9] here.
Biorthogonal transform coding gain measures the energy compacting ability of the transform in the transform coding system. Assume that the input vector is transformed into coefficient vector by an order-n forward transform, i.e., (30) then the covariance matrix of the vector in the transform domain is given as (31) where is the expectation operator. If the input signal is modeled by a zero-mean, first-order autoregressive (AR(1)) process which is characterized by the correlation coefficient, then the covariance matrix of the input sampled vector x has the form of a Toeplitz matrix, i.e., is equal to (32) In this situation, the biorthogonal transform coding gain of a given transform is a function of the correlation coefficient and can be calculated analytically as follows [9]: (33) where is the norm of the th inverse transform basis vector. Note that (33) can also be used for an orthogonal transform such as conventional ICT. Although and are often used as a measure of the compression ability of the given transform kernel, in this paper, we define (34) as the measure of the compression ability of a given transform, in order to take into account different correlation coefficients of the input data. This is because in advanced video coding standards such as H.264/AVC and AVS, the input data fed into the transform is the residue data after intra or inter prediction, and the correlation will be less than 0.9 in most cases [2]. Generally, larger indicates better compression ability of the given transform kernel.
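A numerical sketch of the coding-gain computation may help fix ideas. Since the display equations are not reproduced here, the code assumes the commonly used biorthogonal form: for unit input variance, the gain in dB is minus (10/n) times the sum over i of log10 of the product of the i-th coefficient variance and the squared norm of the i-th inverse basis vector, with an AR(1) covariance whose (i, j) entry is rho^|i-j|. The PIT forward kernel is assumed to be the integer rows divided by their squared norms, with the integer matrix itself serving as the inverse. All names are hypothetical.

```python
import math

def ar1_covariance(n, rho):
    """Toeplitz covariance of a unit-variance AR(1) source."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def coding_gain_db(forward, inverse, rho):
    """forward: rows are forward basis vectors.
    inverse: columns are inverse basis vectors."""
    n = len(forward)
    r = ar1_covariance(n, rho)
    log_sum = 0.0
    for i in range(n):
        var_i = sum(forward[i][j] * r[j][k] * forward[i][k]
                    for j in range(n) for k in range(n))
        norm_sq_i = sum(inverse[row][i] ** 2 for row in range(n))
        log_sum += math.log10(var_i * norm_sq_i)
    return -10.0 * log_sum / n

def dct_matrix(n):
    """Orthonormal DCT-II matrix of order n."""
    return [[(math.sqrt(1.0 / n) if i == 0 else math.sqrt(2.0 / n))
             * math.cos((2 * j + 1) * i * math.pi / (2 * n))
             for j in range(n)] for i in range(n)]

# Order-4 integer kernel [3,1;2] in the assumed dyadic-symmetry layout.
E = [[2, 2, 2, 2], [3, 1, -1, -3], [2, -2, -2, 2], [1, -3, 3, -1]]
n_sq = [sum(v * v for v in row) for row in E]

ict_fwd = [[v / math.sqrt(n_sq[i]) for v in E[i]] for i in range(4)]
pit_fwd = [[v / n_sq[i] for v in E[i]] for i in range(4)]   # assumed PIT form

gain_dct8 = coding_gain_db(dct_matrix(8), transpose(dct_matrix(8)), 0.95)
gain_ict = coding_gain_db(ict_fwd, transpose(ict_fwd), 0.95)
gain_pit = coding_gain_db(pit_fwd, transpose(E), 0.95)
print(gain_dct8, gain_ict, gain_pit)
```

Under these assumptions, rescaling a forward basis vector changes its coefficient variance and the squared norm of the matching inverse basis vector by reciprocal factors, so the product inside the logarithm is unchanged and the two order-4 gains agree.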

2) DCT Frequency Distortion: DCT frequency distortion [10] measures the frequency distortion of a given transform with respect to the DCT. For order- transform with transform kernel, the first-order frequency distortion and the second-order frequency distortion are defined as (35) and (36) respectively, where is calculated as (37)

3) Comparison of Prescaled Integer Transform and Integer Cosine Transform: In this subsection, we discuss the relationship between a PIT kernel and its corresponding ICT kernel in terms of transform coding gain and DCT frequency distortion. For transform coding gain, from (33), we find that the transform coding gains of a certain ICT kernel and its corresponding PIT kernel are given by (38) and (39) below, respectively (38) (39) And we have (40) so (41) Equation (41) shows that PIT kernels are the same as the corresponding ICT kernels in terms of transform coding gain. For DCT frequency distortion, from (28) and (37), we have (42) Since is a diagonal matrix, it is easy to see (43) and (44) Equations (43) and (44) show that PIT kernels have the same DCT frequency distortion as the corresponding ICT kernels. So basically, we can and we should use good ICT kernels to obtain good PIT kernels.

B. Consideration of Weighting Factors of Transformed Coefficients

Besides compression ability, because the forward PIT kernel is not normalized, it is important to analyze the effect of which is regarded as scaling factors applied to basis vectors of the corresponding forward ICT kernels. From (24) and (21), we can get (45) It shows that the PIT coefficient matrix can be regarded as a weighted ICT coefficient matrix where is the weighting matrix. While all coefficients of an orthogonal ICT have the same maximum and minimum values, those of PIT do not, because the weighting matrix changes the relative magnitudes of the transformed coefficients.

We define the weighting factor difference (WFD) of a PIT to represent the difference of weighting factors applied to different transformed coefficients (46) The deviation of the WFD of a PIT away from unity may cause a problem. It may result in truncation of some transformed coefficients that could be retained if where is scalar and is a matrix with all its elements equal to 1. However, unfortunately, the condition cannot always be true when PIT is used. In order not to change the transformed coefficients too much, so as to retain the compression ability of the corresponding ICT kernels, elements of the weighting matrix

should have values close to one another. Another advantage of this consideration is that the change from ICT to the corresponding PIT will be small so that other parts of a codec need not be redesigned: the scan order of the transformed coefficients need not be changed and the entropy coding table designed for the ICT can also be reused for the PIT without much change. It is also worth noting here that compensation for the scaling effect of can also potentially be achieved by using a customized quantization matrix, but additional complexity would be introduced. Generally, according to our experience, the following condition should be met (47) That is, the bit-width difference between the largest element and the smallest element in should be no larger than 1/2. One could expect that more sophisticated measures using all the elements in rather than only the maximum and minimum in (46) will give a more precise evaluation; however, (47) already works rather well, which is justified by the experimental results in Section V. Also, since many PITs with WFD less than are already obtained and demonstrated to show good performance therein, there may be no need to search for PITs with WFD larger than because the smaller the WFD is, the closer the PIT is to the corresponding ICT.

TABLE I TEST CONDITIONS AND ENCODING PARAMETERS FOR ORDER-4 ICT AND PIT

TABLE II TEST CONDITIONS AND ENCODING PARAMETERS FOR ORDER-8 ICT AND PIT

C. Design Rules

Based on the discussion above, we can conclude that a good PIT kernel can be obtained by following the steps below.
1) Obtain a good ICT kernel. To derive good ICT kernels, a computer search can be systematically carried out based on transform coding gain and DCT frequency distortions.
2) Choose a good ICT kernel whose WFD is not larger than.

D. Types of Prescaled Integer Transform

Besides the design rules derived above, we have found that different types of PITs have different characteristics and are suitable for different applications, which is also an important issue in practice. There are generally two types of PITs, based on the frequency scale factor (FSF). For order- PIT, FSF measures the scaling effect of on the transformed coefficients and is defined as (48)
1) Type-I PITs, whose FSF is less than 1, have the characteristic that more high-frequency components may be quantized out compared to the corresponding ICT. Type-I PITs may be more suitable for video streaming and conferencing applications where low-resolution (such as QCIF and CIF) video sequences are usually used and coded at relatively low bit rates, because Type-I PITs lead to bit savings without degrading the subjective quality significantly in this situation.
2) Type-II PITs, whose FSF is larger than or equal to 1, have the characteristic that more high-frequency components may be retained after quantization compared to the corresponding ICT. Type-II PITs may be more suitable for entertainment-quality and other professional applications where higher-resolution (such as HD) video sequences are often used and coded at relatively high bit rates, because the chance is higher that more detailed texture will be preserved.

V. EXPERIMENTAL RESULTS OF FIXED BLOCK-SIZE TRANSFORM SCHEME WITH PRESCALED INTEGER TRANSFORM

In order to test the performance of different PIT kernels, extensive experiments have been done based on JM 7.6 and JM 9.3 [28]. The test conditions are listed in Tables I and II. The experiments on order-4 ICTs and PITs target video streaming and conferencing applications, so relatively large QPs and QCIF and CIF sequences are used. The experiments on order-8 ICTs and PITs target entertainment-quality and other professional applications, so relatively small QPs and HD sequences are used.
In all the experiments, context-based adaptive binary arithmetic coding (CABAC) [3], [4], [29] is used and the number of reference frames is set to two. Figs. 4 and 5 plot the transform coding gain and the transform coding gain difference compared with the DCT of different order-4 and order-8 ICTs and PITs where the correlation coefficient is in the range [0,1], while Tables III and IV give the transform coding gain,, and DCT frequency distortion of different order-4 and order-8 ICTs and PITs. All the order-4 and order-8 ICTs and PITs chosen in our experiments have transform coding gains comparable with that of the DCT and little DCT frequency distortion, so they are expected to have good compression performance, which is confirmed by the experimental results given in Tables V–VIII. The experimental results are presented in the form of average peak signal-to-noise ratio (PSNR) gains using the method proposed in [11]. Some rate-distortion curves are given in Figs. 6 and 7, respectively. Among all the ICTs and PITs in our experiments, [1,1/2;1] ([2,1;1]) and are used in H.264/AVC, [10,9,6,2;9,3;8] was proposed in [1] because of its high decorrelation ability and relatively low complexity. [10,9,6,2;10,4;8] is adopted in AVS part 2 [12] because its implementation complexity is similar to


[10,9,6,2;9,3;8] but has a slightly higher transform coding gain, which is shown in Fig. 5 and Table IV.

Fig. 4. Left: Transform coding gain of order-4 DCT and different order-4 ICTs. Right: Transform coding gain difference of different order-4 ICTs compared with order-4 DCT. (Transform coding gain difference larger than 0 means that the ICT has better energy compacting ability than the DCT, and vice versa.) Note that the transform coding gains of these ICTs and the DCT are nearly the same and the curves in the left figure overlap.

Fig. 5. Left: Transform coding gain of order-8 DCT and different order-8 ICTs. Right: Transform coding gain difference of different order-8 ICTs compared with order-8 DCT. (Transform coding gain difference larger than 0 means that the ICT has better energy compacting ability than the DCT, and vice versa.) Note that the transform coding gains of these ICTs and the DCT are nearly the same and the curves in the left figure overlap.

TABLE III TRANSFORM CODING GAIN AND DCT FREQUENCY DISTORTION OF DIFFERENT ORDER-4 ICTS AND PITS

TABLE IV TRANSFORM CODING GAIN AND DCT FREQUENCY DISTORTION OF DIFFERENT ORDER-8 ICTS AND PITS

[3,1;2], [5,2;4], [9,4;7] were proposed in AVS part 7 (also known as AVS-M, where M represents mobility), which targets applications of video communications on mobile devices, with [3,1;2] finally adopted for its simplicity and satisfactory

performance [12]. [17,7;13] is used in the original H.264 design (also known as H.26L) [2] and has the advantage of having the same basis vector norms. In this case, [17,7;13] is in fact the same as [17,7;13]. A technique similar to PIT, with [22,10;17] and [16,15,9,4;16,6;12], is used in WMV9/VC-1 [30], [31].

TABLE V COMPRESSION PERFORMANCE OF DIFFERENT ORDER-4 TRANSFORM MATRICES USING ICT COMPARED TO H.264/AVC

TABLE VI COMPRESSION PERFORMANCE OF DIFFERENT ORDER-4 TRANSFORM MATRICES USING PIT COMPARED TO H.264/AVC

TABLE VII COMPRESSION PERFORMANCE OF DIFFERENT ORDER-8 TRANSFORM MATRICES USING ICT COMPARED TO H.264/AVC

TABLE VIII COMPRESSION PERFORMANCE OF DIFFERENT ORDER-8 TRANSFORM MATRICES USING PIT COMPARED TO H.264/AVC

However, the elements in component in these two transform kernels are relatively larger in order to meet the conditions mentioned in [30] which are much


stricter than (47); thus complexity is increased and some accuracy is lost when implemented in 16-bit arithmetic. Moreover, [22,10;17] and [16,15,9,4;16,6;12] are not completely compatible. Here, compatible means that different transforms can be implemented in a single transform unit rather than several: they can share the same forward/inverse transform butterfly structure and the same quantization-scaling/dequantization-scaling matrix [21].

Fig. 6. Rate-distortion curves of different order-4 PITs for news sequence (CIF, 30 fps) and foreman sequence (QCIF, 15 fps).

Fig. 7. Rate-distortion curves of different order-8 PITs for harbour sequence (HD, IBBP, 60 fps) and night sequence (HD, IPP, 30 fps).

The following conclusions can be drawn based on the experimental results.

1) From Tables III and IV, we can see that [0,1] gives a very good estimation of the energy compacting ability of a given ICT or PIT, which is justified by the experimental results in Tables V–VIII. Generally, and may be also good measures. But from Figs. 4 and 5 we can see that using a fixed correlation coefficient point may not be appropriate, because the transform coding gain difference varies for different correlation coefficients. DCT frequency distortion may also serve as an estimate of the compression performance of a given ICT or PIT, but it treats the distortion of each basis vector equally, whereas intuitively a weighting should be considered. Further, distortion from the DCT might not necessarily be bad. Besides these two theoretical criteria, we should note that an ICT or PIT kernel with larger elements will suffer additional loss in PSNR when implemented in 16-bit arithmetic.

2) From Tables V–VIII, we can see that PIT generally performs as well as the corresponding ICT if (47) is met. Generally, the smaller the WFD of the given PIT is, the smaller the difference in PSNR from the corresponding ICT is, which is expected.
3) According to the experimental results in Tables VI and VIII, Type-I PITs lead to lower bit rate and lower PSNR, while Type-II PITs result in higher bit rate and higher PSNR at the same QP, compared to the corresponding ICTs. This is due to the weighting of the frequency components. The larger the FSF of the given PIT is, the higher the bit rate and PSNR are compared to the corresponding ICT.

4) From Table VI, we can find that [3,1;2] performs as well as or even better than some other PITs, though its transform coding gain is a little lower and its DCT frequency distortion is a little larger. This is mainly because no right shift is needed when it is implemented in 16-bit arithmetic, so precision is retained. [3,1;2] belongs to the Type-I PITs and is suitable for low fidelity video coding. It is also the simplest PIT kernel we can find that meets (47). Due to its low complexity and satisfactory performance, [3,1;2] is adopted by AVS part 7.

5) [10,9,6,2;10,4;8] has a high transform coding gain and approximates the DCT very well. It outperforms other PITs, as shown in Table VIII. [10,9,6,2;10,4;8] is a Type-II PIT and is fit for high fidelity video coding. Due to its good performance and favorable complexity reduction, [10,9,6,2;10,4;8] is adopted by AVS part 2.

VI. ADAPTIVE BLOCK-SIZE TRANSFORM SCHEME WITH PRESCALED INTEGER TRANSFORM

The fixed block-size transform scheme of PIT has been examined in detail in previous sections. In this section we study the ABT scheme of PIT. ABT was once introduced in H.26L but was removed later because of its high implementation complexity. However, ABT can not only improve the coding efficiency significantly [13], but can also provide subjective benefits, especially for HD movie sequences from the point of view of subtle texture preservation [14], such as keeping film details and grain noise, which are crucial to subjective quality, especially for film contents [15]. Due to this, ABT has been considered again and adopted in the fidelity range extensions (FRExt) of H.264/AVC, with complexity reduction as the major concern [16]–[20]. Compatible ABT was one major consideration during the H.264/AVC FRExt standardization work, and the order-8 ICT adopted in H.264/AVC was specifically chosen to be compatible with the existing order-4 ICT [17]. While it is well known that the ICT kernel [a,b,c,d;e,f;g] is compatible with [e,f;g], it is also easy to understand that the corresponding PIT kernel [a,b,c,d;e,f;g] is also compatible with [e,f;g], because there is no need to change the butterfly structures when PIT is used, and it is shown in [21] that [e,f;g] can also reuse the combined-scaling-quantization matrix of [a,b,c,d;e,f;g].

TABLE IX. COMPRESSION PERFORMANCE OF DIFFERENT ORDER-4 AND ORDER-8 TRANSFORM MATRICES USING PIT COMPARED TO H.264/AVC

It has been mentioned that keeping film grain in film contents is crucial to subjective quality, especially for HD sequences. This also received much attention during the H.264/AVC FRExt standardization work [22]–[25]. As analyzed in [22], film grain preservation is always a challenge for encoders because film grain is by nature temporally uncorrelated, and thus the large compression gains brought by temporal prediction cannot be exploited efficiently. As a result, most of the film grain remains in the prediction error and appears as small transformed coefficients at high frequencies in the DCT domain, and thus is typically quantized out with other noise for a wide range of QP values. Even at high bit rates, film grain can only be encoded and preserved at a high compression cost.
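As a rough check of the compatibility property stated earlier in this section, the sketch below builds an order-8 kernel [a,b,c,d;e,f;g] and the order-4 kernel [e,f;g] and verifies that the even-indexed outputs of the former equal the order-4 kernel applied to the folded input x[i] + x[7-i], which is one way to see why a single butterfly structure can serve both block sizes. The row layout assumed for the bracket notation and the helper names `ict4`, `ict8`, `apply_kernel`, and `even_part_via_order4` are illustrative assumptions, not definitions taken from this paper; scaling and quantization are not modeled.

```python
def ict4(e, f, g):
    """Order-4 integer kernel [e,f;g] (row layout assumed for this sketch)."""
    return [
        [g,  g,  g,  g],
        [e,  f, -f, -e],
        [g, -g, -g,  g],
        [f, -e,  e, -f],
    ]


def ict8(a, b, c, d, e, f, g):
    """Order-8 integer kernel [a,b,c,d;e,f;g] (row layout assumed for this sketch)."""
    return [
        [g,  g,  g,  g,  g,  g,  g,  g],
        [a,  b,  c,  d, -d, -c, -b, -a],
        [e,  f, -f, -e, -e, -f,  f,  e],
        [b, -d, -a, -c,  c,  a,  d, -b],
        [g, -g, -g,  g,  g, -g, -g,  g],
        [c, -a,  d,  b, -b, -d,  a, -c],
        [f, -e,  e, -f, -f,  e, -e,  f],
        [d, -c,  b, -a,  a, -b,  c, -d],
    ]


def apply_kernel(kernel, x):
    """Plain matrix-vector product in integer arithmetic."""
    return [sum(k * v for k, v in zip(row, x)) for row in kernel]


def even_part_via_order4(x, e, f, g):
    """Fold the 8 inputs into 4 sums, then reuse the order-4 kernel."""
    folded = [x[i] + x[7 - i] for i in range(4)]
    return apply_kernel(ict4(e, f, g), folded)


# Arbitrary test samples; the kernel values are the AVS part 2 kernel named in the text.
x = [12, -3, 7, 0, 5, 9, -8, 4]
full = apply_kernel(ict8(10, 9, 6, 2, 10, 4, 8), x)
shared = even_part_via_order4(x, 10, 4, 8)
print(full[0::2] == shared)  # prints True
```

The odd-indexed outputs still need their own order-8 stage; only the even half is shared with the order-4 transform in this sketch.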
In order to allow film grain to be encoded more efficiently at lower bit rates, H.264/AVC uses a standardized film grain characteristics supplemental enhancement information (SEI) message. It allows the encoder to generate a parameterized model of the film grain statistics instead of encoding the exact film grain, and to send it along with the video data to the decoder, which uses the information for film grain synthesis as a post-process when decoding [4], [22], [23], [25], [32]. At lower bit rates, the film grain characteristics SEI message has proved to be a good tool for improving picture quality for sequences with film grain. However, it was reported that the SEI message is not accepted by the movie industry, where higher target bit rates are used, mainly because it is very difficult or even impossible to reach transparent picture quality for any kind of input sequence and to have full control over the decoded sequence [26]. From the viewpoint of keeping details, Type-II PITs are more suitable because film grain and other subtle texture that would otherwise be quantized out have a higher chance of survival.

From the experimental results in Section V, it can be inferred that a compatible ABT scheme using PIT [5,2;4] and PIT [10,9,6,2;10,4;8]/2 is promising, because both of them have high compression efficiency and are Type-II PITs. Extensive experiments have been done to evaluate the performance of PIT [5,2;4] and PIT [10,9,6,2;10,4;8]/2. The objective results compared with H.264/AVC high profile and its ICT counterpart (ICT [5,2;4] and ICT [10,9,6,2;10,4;8]/2) are given in Table IX. The operational rate-distortion curves are shown in Fig. 8, and the subjective quality comparison for the stockholm sequence¹ is presented in Fig. 9. The test conditions are the same as those listed in Table II, except that both order-4 and order-8 transforms are used. For the subjective quality comparison, rate control is used, the sequence is encoded at 6 Mbit/s, and the IBBP structure is used. Based on the experimental results in Table IX, the following conclusions can be drawn.
1) Both ABT schemes, ICT [5,2;4] and ICT [10,9,6,2;10,4;8]/2, and PIT [5,2;4] and PIT [10,9,6,2;10,4;8]/2, outperform the one in H.264/AVC. Three GOP structures have been tested: IPPP, IBBP, and intra frame only. We can see that the gain in PSNR for IPPP and IBBP is less than 0.05 dB on average, which is mainly because of the high efficiency of motion compensation; the gain can be expected to be larger if the number of reference frames is reduced to one instead of two and a smaller search range for motion estimation is used in our experiments. However, the gain in PSNR for intra frame coding is not trivial: it can be almost 0.2 dB on average, keeping in mind that H.264/AVC does not rely much on the transform for decorrelation [27]. Noting that in real entertainment-quality and other professional applications such as DVD-Video systems and HDTV, where HD sequences are generally used, frequent periodic intra-coded pictures are typical in order to enable fast random access, the ABT schemes ICT [5,2;4] and ICT [10,9,6,2;10,4;8]/2, and PIT [5,2;4] and PIT [10,9,6,2;10,4;8]/2, are good in terms of coding efficiency.

2) The subjective results in Fig. 9 show that the ABT scheme of PIT [5,2;4] and PIT [10,9,6,2;10,4;8]/2 is slightly better than both its ICT counterpart and that of H.264/AVC. More details are preserved.

It should be noted that in our implementation of the ABT scheme of PIT [5,2;4] and PIT [10,9,6,2;10,4;8]/2, on the decoder side, a memory of only 6 bytes is allocated to store the matrix used for dequantization in (49) and, therefore, only a 1-D lookup operation is needed. Thus, good performance and complexity reduction are achieved at the same time when the ABT scheme of PIT [5,2;4] and PIT [10,9,6,2;10,4;8]/2 is used.

Fig. 8. Rate-distortion curves of H.264/AVC, ICT [5,2;4] + ICT [10,9,6,2;10,4;8]/2, and PIT [5,2;4] + PIT [10,9,6,2;10,4;8]/2 for harbour sequence: (1) HD, IBBP, 60 fps; (2) HD, IPP, 30 fps; (3) HD, intra-frame only, 30 fps.

Fig. 9. Subjective quality for stockholm sequence (720p, IBBP, 30 fps, 6 Mbps, 26th frame, local). Top-left: original; top-right: PIT [5,2;4] + PIT [10,9,6,2;10,4;8]/2; bottom-left: ICT [5,2;4] + ICT [10,9,6,2;10,4;8]/2; bottom-right: H.264/AVC.

¹ [Online]. Available: ftp://ftp.ldv.e-technik.tu-muenchen.de/pub/test_sequences
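The kernel comparisons in Sections V and VI lean on the transform coding gain as a measure of energy compaction. The following is a minimal sketch of the textbook coding gain for a first-order Markov source with correlation coefficient rho, evaluated for the order-4 DCT and for order-4 integer kernels after each row is normalized. It is an assumption that this matches the definition used for Tables III and IV, which is not part of this section; the row layout for [a,b;c] and the names `ict4`, `dct4`, and `coding_gain_db` are likewise illustrative. The prescaling and 16-bit effects discussed in the text are not modeled.

```python
import math


def ict4(a, b, c):
    """Order-4 integer kernel [a,b;c] (row layout assumed for this sketch)."""
    return [
        [c,  c,  c,  c],
        [a,  b, -b, -a],
        [c, -c, -c,  c],
        [b, -a,  a, -b],
    ]


def dct4():
    """Orthonormal order-4 DCT-II basis vectors as rows."""
    rows = []
    for k in range(4):
        scale = math.sqrt(0.25) if k == 0 else math.sqrt(0.5)
        rows.append([scale * math.cos((2 * n + 1) * k * math.pi / 8)
                     for n in range(4)])
    return rows


def coding_gain_db(kernel, rho):
    """Coding gain in dB for a kernel with mutually orthogonal rows.

    Each row is normalized to unit length, so the result describes the
    orthonormal transform behind the kernel. The source is modeled with
    covariance R[i][j] = rho ** |i - j| (an assumption of this sketch).
    """
    n = len(kernel)
    variances = []
    for row in kernel:
        norm_sq = sum(v * v for v in row)
        energy = sum(row[i] * row[j] * rho ** abs(i - j)
                     for i in range(n) for j in range(n))
        variances.append(energy / norm_sq)
    arithmetic_mean = sum(variances) / n
    geometric_mean = math.exp(sum(math.log(v) for v in variances) / n)
    return 10.0 * math.log10(arithmetic_mean / geometric_mean)


rho = 0.95  # a single illustrative correlation point
gain_dct = coding_gain_db(dct4(), rho)
gain_312 = coding_gain_db(ict4(3, 1, 2), rho)
gain_17713 = coding_gain_db(ict4(17, 7, 13), rho)
for name, gain in [("DCT", gain_dct), ("[3,1;2]", gain_312), ("[17,7;13]", gain_17713)]:
    print(f"{name}: {gain:.3f} dB")
```

Sweeping `rho` over a range instead of fixing one point gives curves of the kind Figs. 4 and 5 describe, which is the reason the text cautions against judging a kernel at a single correlation coefficient.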

VII. CONCLUSION

In this paper, a new technique, named PIT, is proposed to further reduce the implementation complexity of conventional ICT schemes such as that adopted in H.264/AVC, without any sacrifice in performance. Since not all PIT kernels perform well, design rules that lead to good PIT kernels are considered in this paper. It is also found that different types of PIT have different characteristics and are suitable for different applications. PIT has been adopted in AVS, the Chinese national coding standard, with [10,9,6,2;10,4;8] in AVS part 2 and [3,1;2] in AVS part 7, due to its reduced implementation complexity and good performance. Besides the fixed block-size transform scheme, the compatible ABT scheme of PIT is also studied in this paper because of its higher coding efficiency and subjective benefits. A compatible ABT scheme using PIT [5,2;4] and PIT [10,9,6,2;10,4;8]/2 is proposed, and it is shown that a gain of up to 0.2 dB in PSNR for all intra frame coding can be achieved compared to the counterpart in H.264/AVC. Besides, subjective quality is also slightly improved because more subtle texture can be preserved. Using the same concept, a variation of PIT, the post-scaled integer transform, can also potentially be used to simplify the encoder in some kinds of applications.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their useful comments to improve this paper.

REFERENCES

[1] W. K. Cham, "Development of integer cosine transforms by the principle of dyadic symmetry," Proc. IEE, I, vol. 136, no. 4, pp , Aug
[2] H. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, "Low-complexity transform and quantization in H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp , Jul
[3] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no.
7, pp , Jul
[4] G. J. Sullivan, P. N. Topiwala, and A. Luthra, "The H.264/AVC advanced video coding standard: Overview and introduction to the fidelity range extensions," in Proc. SPIE, Appl. Dig. Image Process. XXVII, Aug. 2004, vol. 5558, pp
[5] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. Comput., vol. C-23, no. 1, pp , Jan
[6] C.-X. Zhang, J. Lou, L. Yu, J. Dong, and W. K. Cham, "The technique of pre-scaled integer transform," in Proc. IEEE ISCAS, May 2005, pp
[7] W. K. Cham, "Integer sinusoidal transform," Adv. Electron. Electron Phys., vol. 88, pp. 1–61,
[8] W. K. Cham and C. K. Fong, "ICT fixed point error performance analysis," in Proc Int. Symp. Intell. Multimedia, Video Speech Process., Oct. 2004, pp
[9] J. Liang and T. D. Tran, "Fast multiplierless approximations of the DCT with the lifting scheme," IEEE Trans. Signal Process., vol. 49, pp , Dec
[10] M. Wien and S. Sun, "ICT comparison for adaptive block transform," Doc. VCEG-L12. Eibsee, Germany, Jan
[11] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," Doc. VCEG-M33. Austin, TX, Apr
[12] L. Yu, F. Yi, J. Dong, and C. Zhang, "Overview of AVS-Video: Tools, performance and complexity," in Proc. VCIP, Beijing, China, Jul. 2005, pp
[13] M. Wien, "Variable block size transforms for H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp , Jul
[14] S. Gordon, "Adaptive block transform for film grain reproduction in high definition sequences," Doc. JVT-H029. Geneva, Switzerland, May
[15] T. Wedi, Y. Kashiwagi, and T. Takahashi, "H.264/AVC for next generation optical disc: A proposal on FRExt profiles," Doc. JVT-K025. Munich, Germany, Mar
[16] M. Wien, "Clean-up and improved design consistency for ABT," Doc. JVT-E025. Geneva, Switzerland, Oct
[17] F. Bossen, "ABT cleanup and complexity reduction," Doc. JVT-E087. Geneva, Switzerland, Oct
[18] S. Gordon, "Simplified use of transform," Doc. JVT-I022. San Diego, CA, Sep
[19] S. Gordon, D. Marpe, and T.
Wiegand, "Simplified use of transform Proposal," Doc. JVT-J029. Waikoloa, HI, Dec
[20] S. Gordon, D. Marpe, and T. Wiegand, "Simplified use of transform Results," Doc. JVT-J030. Waikoloa, HI, Dec
[21] J. Dong, J. Lou, C.-X. Zhang, and L. Yu, "A new approach to compatible adaptive block-size transforms," in Proc. VCIP, Beijing, China, Jul. 2005, pp
[22] C. Gomila and A. Kobilansky, "SEI message for film grain encoding," Doc. JVT-H022. Geneva, Switzerland, May
[23] C. Gomila, "SEI message for film grain encoding: Syntax and results," Doc. JVT-I013. San Diego, CA, Sep
[24] M. Schlockermann, S. Wittmann, and T. Wedi, "Film grain coding in H.264/AVC," Doc. JVT-I034. San Diego, CA, Sep
[25] C. Gomila and J. Llach, "Film grain modeling versus encoding," Doc. JVT-K036. Munich, Germany, Mar
[26] T. Wedi and S. Wittmann, "Quantization with an adaptive dead zone size for H.264/AVC FRExt," Doc. JVT-K026. Munich, Germany, Mar
[27] A. Puri, X. Chen, and A. Luthra, "Video coding using the H.264/MPEG-4 AVC compression standard," Signal Process.: Image Commun., vol. 19, no. 9, pp , Oct
[28] JM7.6 and JM9.3 H.264 reference software [Online]. Available: iphome.hhi.de/suehring/tml
[29] D. Marpe, H. Schwarz, and T. Wiegand, "Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp , Jul
[30] S. Srinivasan, P. J. Hsu, T. Holcomb, K. Mukerjee, S. L. Regunathan, B. Lin, J. Liang, M. C. Lee, and J. Ribas-Corbera, "Windows media video 9: Overview and applications," Signal Process.: Image Commun., vol. 19, no. 9, pp , Oct
[31] S. Srinivasan and S. L. Regunathan, "An overview of VC-1," in Proc. VCIP, Beijing, China, Jul. 2005, pp
[32] Joint Video Team of ITU-T VCEG and ISO/IEC MPEG, Advanced Video Coding (AVC) ITU-T Rec. T.81 and ISO/IEC (MPEG-4 Part 10), Mar

Cixun Zhang received the B.Eng. and M.Eng. degrees in information engineering from Zhejiang University, Hangzhou, China, in 2004 and 2006, respectively.
He is currently working toward the Ph.D. degree in Department of Information Technology at Tampere University of Technology, Tampere, Finland. Since August 2006, he has been a Researcher in the Institute of Signal Processing, Tampere University of Technology, Tampere, Finland. His research interests include video compression and communication. Mr. Zhang was the recipient of the AVS special award from the AVS working group of China in 2005.

Lu Yu (M'00) received the B.Eng. degree (hons.) in radio engineering and the Ph.D. degree in communication and electronic systems from Zhejiang University, Hangzhou, China, in 1991 and 1996, respectively. Since 1996, she has been with the faculty of Zhejiang University, Hangzhou, China, and is presently Professor of information and communication engineering. She was a Senior Visiting Scholar at University Hannover and the Chinese University of Hong Kong in 2002 and 2004, respectively. Her research areas include video coding, multimedia communication, and related ASIC design, in which she is principal investigator of a number of national research and development projects and inventor or co-inventor of 14 granted and 13 pending patents. She has published more than 80 academic papers and contributed 119 proposals to international and national standards in recent years. Dr. Yu acts as the chair of the video subgroup of the Audio Video coding Standard (AVS) of China, and she was also the co-chair of the implementation subgroup of AVS from 2003 to [...]. She organized the 15th International Workshop on Packet Video as a General Chair in 2006. She organized two special sessions and gave five invited talks and tutorials at international conferences. She now serves as a member of the Technical Committee of Visual Signal Processing and Communication of the IEEE Circuits and Systems Society.

Wai-Kuen Cham (S'77–M'79–SM'91) graduated from the Chinese University of Hong Kong in 1979 in electronics. He received the M.Sc. and Ph.D. degrees from Loughborough University of Technology, Loughborough, U.K., in 1980 and 1983, respectively. From 1984 to 1985, he was a Senior Engineer with Datacraft Hong Kong Limited and a Lecturer in the Department of Electronic Engineering, Hong Kong Polytechnic. Since May 1985, he has been with the Department of Electronic Engineering, the Chinese University of Hong Kong.
His research interests include image coding, image processing and video coding.

Jie Dong received the B.Eng. and M.Eng. degrees in information engineering from Zhejiang University, Hangzhou, China, in 2002 and 2005, respectively. She is currently working toward the Ph.D. degree in electronic engineering at the Chinese University of Hong Kong. Her research interests include HD video compression and processing.

Jian Lou (S'07) was born in Hangzhou, Zhejiang, China. He received the B.E. and M.E. degrees in information science and electronic engineering from Zhejiang University, Hangzhou, China. He is currently working toward the Ph.D. degree in electrical engineering at the University of Washington, Seattle. His research interests include video processing and video compression. He has interned with several research laboratories including Microsoft Corporation, IBM T.J. Watson Research Center, Thomson R&D Laboratory, and Mitsubishi Electric Research Laboratories.

Performance Comparison between DWT-based and DCT-based Encoders

, pp.83-87 http://dx.doi.org/10.14257/astl.2014.75.19 Performance Comparison between DWT-based and DCT-based Encoders Xin Lu 1 and Xuesong Jin 2 * 1 School of Electronics and Information Engineering, Harbin