Improved Transform Methods for Low Complexity, High Quality Video Coding


Tampereen teknillinen yliopisto. Julkaisu 1033
Tampere University of Technology. Publication 1033

Cixun Zhang

Improved Transform Methods for Low Complexity, High Quality Video Coding

Thesis for the degree of Doctor of Science in Technology to be presented with due permission for public examination and criticism in Tietotalo Building, Auditorium TB222, at Tampere University of Technology, on the 26th of March 2012, at 12 noon.

Tampereen teknillinen yliopisto - Tampere University of Technology
Tampere 2012

ISBN
ISSN

ABSTRACT

A video signal requires a very large number of bits if it is represented in an uncompressed form. This renders many video applications impractical, due to the limited bandwidth of communication channels and the limited capacity of storage media. Therefore, video coders, which compress the video signal so that it can be represented with a smaller number of bits, are used in all digital video applications. The goal of all video coders is to maximize the video quality while minimizing the bitrate. This is achieved by exploiting the different redundancies present in the video signal. One type of redundancy is present between neighboring samples in the residual frame (spatial redundancy). The focus of this thesis is to develop algorithms that exploit this spatial redundancy, with the goal of improving the coding efficiency of video coders while keeping their implementation complexity as low as possible.

The first part of this thesis introduces fixed-point design methodologies and several resulting implementations of the inverse discrete cosine transform (IDCT). The overall architecture used in the design of the ISO/IEC IDCT algorithm, which can be characterized by its separable and scaled features, is also described, and the drift characteristics of the integer IDCT approximations are analyzed.

The second part of the thesis discusses the pre-scaled integer transform (PIT), which reduces the implementation complexity of the conventional integer cosine transform (ICT) while maintaining all of its merits, such as bit-exact implementation and good coding efficiency. Design rules that lead to good PIT kernels are developed, and different types of PIT and their target applications are examined. The PIT kernels used in the Audio Video coding Standard (AVS), the Chinese National Coding standard, are also introduced.

The third part of the thesis discusses a novel algorithm called the spatially varying transform (SVT). Unlike state-of-the-art video codecs, in which the position of the transform block is fixed, SVT enables video coders to vary the position of the transform block. In addition to the position, the size of the transform can also be varied within the SVT framework, in order to better localize the prediction error so that the underlying correlations are better exploited. It is shown that by varying the position and size of the transform block, the characteristics of the prediction error are better localized and the coding efficiency is thus improved. A novel low-complexity algorithm, operating at the macroblock and block levels, is also proposed to reduce the encoding complexity of SVT. An extension of SVT, called the Prediction Signal Aided Spatially Varying Transform (PSASVT), is also discussed; it utilizes the gradient of the prediction signal to eliminate unlikely location parameters (LPs). As the number of candidate LPs is reduced, a

smaller number of LPs is searched by the encoder, which reduces the encoding complexity. In addition, fewer overhead bits are needed to code the selected LP, and thus the coding efficiency can be improved. This reduction in encoding complexity is therefore achieved together with a slight increase in coding efficiency, while the increase in decoding complexity is negligible.

ACKNOWLEDGMENT

The work presented in this thesis has been carried out in parts at the Department of Signal Processing of Tampere University of Technology (TUT) and at Nokia Research Center (NRC), both located in Tampere, Finland. During the preparation of this thesis, I have been working in coordination with two research groups: at TUT, the Multimedia Research Group led by Prof. Moncef Gabbouj, and at NRC, the Audio-Visual Content Representation Team.

This thesis would remain incomplete without acknowledging some people. Firstly, I wish to express my deepest gratitude to my supervisor, Prof. Moncef Gabbouj, who has provided me with his support, encouragement and scientific guidance throughout the years of my work and study. Further, I would like to thank a number of people who have directly or indirectly contributed to this thesis or have been working with me on related topics. I would like to thank Kemal Ugur, Jani Lainema and Antti Hallapuro for providing opportunities, technical guidance, valuable comments and fruitful discussions. I would like to thank Ms. Virve Larmila, Ms. Ulla Siltaloppi and Ms. Elina Orava for their great help with routine but important administrative work. I convey special acknowledgement to Zhejiang University, where I received solid academic training during my undergraduate and master's studies, which has benefited me for life. My special thanks go to Prof. Lu Yu, my master's thesis supervisor, who guided me patiently in my early research activities and publications. The financial support of the Nokia Foundation is gratefully acknowledged. Finally, I wish to thank everyone who contributed to the successful completion of this thesis.


CONTENTS

Abstract
Acknowledgment
Contents
List of Publications
List of Abbreviations
List of Tables
List of Figures
1. Introduction
   1.1 Fundamentals of Video Coding
   1.2 Transforms Used in Video Coding
   1.3 Outline and Objectives of the Thesis
   1.4 Author's Contributions to the Publications
2. Fixed-Point Approximations of the 8x8 Inverse Discrete Cosine Transform
   2.1 Precision Requirements for IDCT Implementations in MPEG and ITU-T Standards
   2.2 Fixed-Point Approximations of DCT/IDCT
   2.3 The ISO/IEC IDCT
3. Directional Transforms
   3.1 Reorganization-Based Directional Transform
   3.2 Lifting-Based Directional Transforms
   3.3 Data-Dependent Directional Transforms
4. Pre-Scaled Integer Transform
   4.1 Design Rules of PIT Kernels
   4.2 Types of Pre-Scaled Integer Transform
5. Spatially Varying Transform
   5.1 Design of SVT
      5.1.1 Selection of SVT Block-size
      5.1.2 Selection and Coding of Candidate LPs
      5.1.3 Filtering of SVT Block Boundaries
   5.2 Implementing Spatially Varying Transform in the H.264/AVC Framework
   5.3 Fast Algorithms for Spatially Varying Transform
      5.3.1 Macroblock Level Fast Algorithm
      5.3.2 Block Level Fast Algorithm
   5.4 Experimental Results
6. Prediction Signal Aided Spatially Varying Transform
   6.1 Implementation of Prediction Signal Aided Spatially Varying Transform
   6.2 Experimental Results
7. Conclusion
References

LIST OF PUBLICATIONS

This thesis is written in summary style, with the following publications appended. The publications are referenced in the thesis as [P1], [P2], etc.

[P1] Yuriy A. Reznik, Arianne T. Hinds, Cixun Zhang, Lu Yu, Zhibo Ni, "Efficient Fixed-Point Approximations of 8x8 Inverse Discrete Cosine Transform," in Proceedings SPIE Applications of Digital Image Processing XXX, Vol. 6696, San Diego, 28 Aug. 2007.

[P2] Arianne T. Hinds, Yuriy A. Reznik, Lu Yu, Zhibo Ni, Cixun Zhang, "Drift Analysis for Integer IDCT," in Proceedings SPIE Applications of Digital Image Processing XXX, Vol. 6696, San Diego, 28 Aug. 2007.

[P3] Cixun Zhang, Lu Yu, "Multiplier-less Approximation of the DCT/IDCT with Low Complexity and High Accuracy," in Proceedings SPIE Applications of Digital Image Processing XXX, Vol. 6696, San Diego, 28 Aug. 2007.

[P4] Cixun Zhang, Lu Yu, Jian Lou, Wai-Kuen Cham, Jie Dong, "The Technique of Pre-Scaled Integer Transform: Concept, Design and Applications," IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), Vol. 18, Issue 1, Jan. 2008.

[P5] Cixun Zhang, Kemal Ugur, Jani Lainema, Antti Hallapuro, Moncef Gabbouj, "Video Coding Using Spatially Varying Transform," IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), Vol. 21, Issue 2, Feb. 2011.

[P6] Cixun Zhang, Kemal Ugur, Jani Lainema, Moncef Gabbouj, "Video Coding Using Spatially Varying Transform," in Proceedings Pacific-Rim Symposium on Image and Video Technology (PSIVT), Tokyo, Japan, Jan. 13-16, 2009.

[P7] Cixun Zhang, Kemal Ugur, Jani Lainema, Antti Hallapuro, Moncef Gabbouj, "Prediction Signal Aided Spatially Varying Transform," in IEEE International Conference on Multimedia and Expo (ICME), Barcelona, Spain, July 2011.


LIST OF ABBREVIATIONS

720p          1280 pixels wide by 720 pixels high, progressive scan
1080p         1920 pixels wide by 1080 pixels high, progressive scan
1D, 2D, 3D    One-, Two-, Three-Dimensional
CIF           Common intermediate format
DVD           Digital versatile disc
Blu-ray       Optical disc storage medium with capacity higher than DVD
H.264/AVC     Video compression standard jointly developed by MPEG and VCEG
HEVC          High efficiency video coding, name for the next generation standard
KTA           Key technological area, experimental software maintained by VCEG
Macroblock    A block containing one or more transform blocks
Mbits/s       Megabits per second
MPEG-4        Family of video compression standards developed by MPEG
NAL           Network abstraction layer
JVT           Joint video team of MPEG and VCEG
MPEG          Moving picture experts group, standards developing organization
PSNR          Peak signal to noise ratio
QCIF          Quarter common intermediate format
RD            Rate distortion
RDO           Rate distortion optimization
SSD           Sum of squared differences
TV            Television
VCEG          Video coding experts group, standards developing organization
VCL           Video coding layer
YUV           A color space

LIST OF TABLES

Table 1  The IEEE 1180 / ISO/IEC Pseudo-Random IDCT Test Metrics and Conditions [P1]
Table 2  Average Results of ΔBD-Rate Compared to H.264/AVC
Table 3  Average Results of Percentage of 8×8 SVT and 16×4/4×16 SVT Selected in VBSVT for Different Sequences
Table 4  Encoding/Decoding Time Increase of FVBSVT Compared to H.264/AVC (High Complexity Configuration) (CABAC, 720p)
Table 5  Experimental Results of PSASVT
Table 6  Percentage of Macroblocks Coded in SVT for Different Sequences


LIST OF FIGURES

Figure 1   Illustration of temporal and spatial redundancy in a video frame
Figure 2   Block diagram of the latest video coding standard H.264/AVC
Figure 3   Fixed-point 8x8 IDCT architecture [P1]
Figure 4   Loeffler-Ligtenberg-Moschytz IDCT factorization (left), and its generalized scaled form (right) [P1]
Figure 5   Six directional modes defined in a similar way as in H.264/AVC for the block size 8×8; the vertical and horizontal modes are not included here [9]
Figure 6   N×N image block in which the first 1-D DCT will be performed along the diagonal down-left direction [9]
Figure 7   Example of N=8: arrangement of coefficients after the first DCT (left) and after the second DCT, with the modified zigzag scanning (right) [9]
Figure 8   Factorizing the 8-point DCT into primary operations; x[n] and y[n] (n = 0, 1, ..., 7) are the input signal and output DCT coefficients, respectively, and Oi (i = 1, 2, ..., 35) is a primary operation [28]
Figure 9   Exemplified primary operations: (a) nondirectional and (b) directional, where white circles denote integer pixels and gray circles half pixels [28]
Figure 10  Block diagram of the conventional ICT scheme in H.264/AVC
Figure 11  Block diagram of the proposed PIT scheme
Figure 12  Illustration of 8×8 spatially varying transform
Figure 13  Illustration of 16×4 and 4×16 spatially varying transform
Figure 14  Block diagram of the extended H.264/AVC encoder with spatially varying transform
Figure 15  Illustration of the hierarchical search algorithm
Figure 16  Rate-distortion curves for Night sequence (CABAC) and Panslow sequence (CAVLC)
Figure 17  (A) Distribution of the macroblocks coded with SVT, (B) corresponding SVT blocks for the 32nd prediction error frame (Night sequence)
Figure 18  (A) Distribution of the macroblocks coded with SVT, (B) corresponding SVT blocks for the 65th prediction error frame (Sheriff sequence)
Figure 19  Distribution of (Δx, Δy) of 8×8 SVT for BigShips sequence [P5]


CHAPTER 1
INTRODUCTION

Video coding deals with the representation of video data for transmission and/or storage applications. The goals of video coding are to represent the video data accurately and compactly, to provide means to navigate the video (i.e., search forwards and backwards, random access, etc.), and to provide other author and content benefits such as text (subtitles), meta-information for searching/browsing, and digital rights management. The biggest challenge of video coding is to reduce the size of the video data using compression, a key technology that enables numerous video services in industries ranging from entertainment to communications. Any time a user enjoys a video clip on a popular video sharing site, records a video on a mobile phone or a digital camera, plays a movie from a DVD or a Blu-ray disc, attends a video-conferencing meeting, or uses video surveillance for security, video compression technologies are seamlessly at work.

The need for video compression in these services arises from the fact that a significant amount of space or bandwidth is required to store and transmit uncompressed video. Consider a video sequence with a resolution of 720×576 pixels (a commonly used resolution on DVD discs), played at 25 frames per second. The bitrate of this video, with three color components at 8 bits per sample, is over 200 Mbits/s. To store this video on current DVD discs, compression by a factor of at least 200 is required.

Nowadays, the resolution used in video services is increasing rapidly. High Definition (HD) resolutions of 720p (1280×720 pixels) and 1080p (1920×1080 pixels) are becoming increasingly common for digital TV broadcast, mobile video services (video telephony, mobile TV, etc.) and handheld consumer electronics (digital still cameras, camcorders, etc.). With the advances in display and capture technologies, it is foreseen that even higher resolutions, such as 3840×2160 pixels, will become common in consumer entertainment applications. In addition to increased spatial resolution, the temporal frame rates of video sequences are expected to increase to around 60 frames per second and higher. To enable digital video services at these increased resolutions, more efficient compression technologies need to be designed.

In addition to the compression capability of video coders, implementation complexity is a major concern. Consider the case of watching a video sampled at 1080p resolution at 60 frames per second: the video decoder needs to process more than 6 million pixels in less than 20 milliseconds. A practical video coder therefore needs to achieve significant compression rates while operating at reasonable complexity levels. Implementation complexity is especially important in realizing video services on mobile and other handheld devices with limited computational and storage resources.
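The figures above can be verified with a quick back-of-the-envelope computation. The following minimal Python sketch assumes the 720×576 (PAL DVD) resolution used in the example:

```python
# Back-of-the-envelope check of the figures quoted above.
fps = 25
width, height = 720, 576     # assumed DVD (PAL) resolution
components = 3               # three color components per pixel
bits_per_sample = 8

bitrate = width * height * fps * components * bits_per_sample
print(f"Uncompressed bitrate: {bitrate / 1e6:.1f} Mbit/s")   # ~248.8 Mbit/s

# 1080p at 60 frames per second: per-frame time budget of the decoder.
print(f"Frame interval at 60 fps: {1000 / 60:.1f} ms")       # ~16.7 ms
```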

At the time of writing this thesis, increased compression efficiency at high resolutions and improved computational efficiency were identified as key requirements in the video standardization community for the next-generation video standardization project [1].

1.1 FUNDAMENTALS OF VIDEO CODING

The basic concepts of digital video, including the capture and representation of video in digital form, color spaces, formats and quality measurement, are given below.

Capture: A natural visual scene is spatially and temporally continuous. Representing a visual scene in digital form involves sampling the real scene spatially and temporally. In spatial sampling, samples are taken at the intersection points of a sampling grid, and the sampled image may be reconstructed by representing each sample as a square picture element (pixel). The visual quality of the image is thus influenced by the number of sampling points: choosing a coarse sampling grid produces a low-resolution sampled image, whilst increasing the number of sampling points increases the resolution of the sampled image. In temporal sampling, a moving video image is captured by taking a rectangular snapshot of the signal at periodic time intervals. A higher temporal sampling rate (frame rate) gives apparently smoother motion in the video scene but requires more samples to be captured and stored. A video signal may be sampled as a series of complete frames (progressive sampling) or as a sequence of interlaced fields (interlaced sampling).

Color spaces: The method chosen to represent brightness and color is described as a color space. There are two commonly used color spaces: RGB and YCbCr. In the RGB color space, a color image sample is represented with three numbers that indicate the relative proportions of Red, Green and Blue (the three additive primary colors of light); any color can be created by combining red, green and blue in varying proportions. The YCbCr color space and its variations (sometimes referred to as YUV) are a popular way of efficiently representing color images. The complete description of a color image is given by the luminance component Y and the color differences U and V, which represent the difference between the color intensity and the mean luminance of each image sample.

Format: CIF (Common Intermediate Format, 352×288 pixels) and QCIF (Quarter CIF, 176×144 pixels) are among the most commonly used video formats.

Quality: The quality of video can be measured subjectively or objectively. In subjective quality measurement, the quality of the video is evaluated by viewers. The most commonly used objective quality measure is the Peak Signal to Noise Ratio (PSNR).
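As a concrete illustration of the objective metric mentioned above, the following sketch computes the PSNR between a reference and a reconstructed 8-bit frame (a minimal example; the random frames are stand-ins, not thesis data):

```python
import numpy as np

def psnr(reference: np.ndarray, reconstructed: np.ndarray, peak: float = 255.0) -> float:
    """PSNR in dB between two same-sized 8-bit frames."""
    mse = np.mean((reference.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy usage with random data standing in for luma frames.
ref = np.random.randint(0, 256, (576, 720), dtype=np.uint8)
rec = np.clip(ref.astype(int) + np.random.randint(-3, 4, ref.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, rec):.2f} dB")
```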

A hybrid video encoder consists of three main functional units: a temporal model, a spatial model and an entropy coder.

1. The input to the temporal model is an uncompressed video sequence. The goal of the temporal model is to reduce temporal redundancy (shown in Figure 1) between transmitted frames by forming a predicted frame and subtracting it from the current frame. The key component of the temporal model is motion estimation and compensation, which can be conducted at different block sizes and accuracies. The output of the temporal model is a residual frame and a set of model parameters, typically a set of motion vectors describing how the motion was compensated.

2. The residual frame forms the input to the spatial model, which makes use of similarities between neighboring samples in the residual frame to reduce spatial redundancy (shown in Figure 1). Typically this is achieved by applying a transform to the residual samples and quantizing the results. The transform converts the samples into another domain, in which they are represented by transform coefficients. The coefficients are quantized to remove insignificant values, leaving a small number of significant coefficients that provide a more compact representation of the residual frame. The output of the spatial model is a set of quantized transform coefficients.

3. The entropy coder compresses the parameters of the temporal model (typically motion vectors) and of the spatial model (coefficients). This removes statistical redundancy in the data and produces a compressed bit stream or file that may be transmitted and/or stored. A compressed sequence consists of coded motion vector parameters, coded residual coefficients and header information.

Correspondingly, the video decoder reconstructs a video frame from the compressed bit stream. The coefficients and the motion vectors are decoded by an entropy decoder, after which the spatial model is decoded to reconstruct a version of the residual frame. The decoder uses the motion vector parameters, together with one or more previously decoded frames, to create a prediction of the current frame, and the frame itself is reconstructed by adding the residual frame to this prediction.
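To make the pipeline concrete, here is a deliberately oversimplified, runnable sketch of one block passing through the temporal and spatial models and back. Zero-motion prediction and a floating-point DCT stand in for real motion compensation and integer transforms, and entropy coding is omitted entirely:

```python
import numpy as np
from scipy.fft import dctn, idctn

def encode_block(current, reference, qstep):
    """Temporal model (toy zero-motion prediction) + spatial model (DCT + quantization)."""
    prediction = reference                      # stand-in for motion compensation
    residual = current - prediction             # temporal model output
    coeffs = dctn(residual, norm="ortho")       # spatial model: transform...
    levels = np.round(coeffs / qstep)           # ...and quantization
    return levels, prediction

def decode_block(levels, prediction, qstep):
    residual = idctn(levels * qstep, norm="ortho")  # dequantize + inverse transform
    return prediction + residual                     # add the prediction back

current = np.random.randint(0, 256, (8, 8)).astype(float)
reference = current + np.random.randn(8, 8)          # slightly noisy "previous frame"
levels, prediction = encode_block(current, reference, qstep=4.0)
reconstructed = decode_block(levels, prediction, qstep=4.0)
print(np.abs(reconstructed - current).max())         # small, bounded by the quantizer step
```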

FIGURE 1 ILLUSTRATION OF TEMPORAL AND SPATIAL REDUNDANCY IN A VIDEO FRAME

The latest video coding standard, H.264/AVC [2], is a hybrid video coder developed by the Joint Video Team (JVT), formed by the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) and the Video Coding Experts Group (VCEG) of the International Telecommunication Union Telecommunication Standardization Sector (ITU-T). H.264/AVC is published by MPEG as MPEG-4 Part 10 Advanced Video Coding (AVC) and by ITU-T as ITU-T Recommendation H.264.

FIGURE 2 BLOCK DIAGRAM OF THE LATEST VIDEO CODING STANDARD H.264/AVC

H.264/AVC provides mechanisms for coding video that are optimized for compression efficiency and that aim to meet the needs of practical multimedia communication applications. Figure 2 above shows the diagram of an H.264/AVC codec (encoder and decoder); the key modules are transform/quantization, intra-frame prediction, motion estimation/compensation, the deblocking filter and entropy coding. The first version of H.264/AVC defines a set of three Profiles: Baseline, Main and Extended, each supporting a particular set of coding functions and each specifying what is required of an encoder or decoder that complies with the Profile. Performance limits for codecs are defined by a set of Levels, each placing limits on parameters such as sample processing rate, picture size, coded bitrate and memory requirements.

1. The Baseline Profile supports coded sequences containing I and P slices. I slices contain intra-coded macroblocks, in which each 16×16 or 4×4 luma region and each 8×8 chroma region is predicted from previously coded samples in the same slice. P slices may contain intra-coded, inter-coded or skipped MBs. Inter-coded MBs in a P slice are predicted from a number of previously coded pictures, using motion compensation with quarter-sample (luma) motion vector accuracy. After prediction, the residual data for each MB is transformed using a 4×4 integer transform and quantized. Quantized transform coefficients are reordered, and the syntax elements are entropy coded. In the Baseline Profile, transform coefficients are entropy coded using a context-adaptive variable length coding (CAVLC) scheme, and all other syntax elements are coded using fixed-length or Exponential-Golomb variable length codes. Quantized coefficients are scaled, inverse transformed, reconstructed (added to the prediction formed during encoding) and (optionally) filtered with a de-blocking filter before being stored for possible use in reference pictures for further intra- and inter-coded macroblocks.

2. The Main Profile is almost a superset of the Baseline Profile, except that multiple slice groups, ASO (arbitrary slice ordering) and redundant slices (all included in the Baseline Profile) are not supported. The additional tools provided by the Main Profile are B slices (bi-predicted slices for greater coding efficiency), weighted prediction (providing increased flexibility in creating a motion-compensated prediction block), support for interlaced video (coding of fields as well as frames) and CABAC (an alternative entropy coding method based on context-adaptive arithmetic coding).

3. The Extended Profile may be particularly useful for applications such as video streaming. It includes all of the features of the Baseline Profile (i.e., it is a true superset of the Baseline Profile, unlike the Main Profile), together with B slices, weighted prediction and additional features to support efficient streaming over networks such as the Internet. SP and SI slices facilitate switching between different coded streams and VCR-like functionality, and Data Partitioned slices can provide improved performance in error-prone transmission environments.

A coded H.264/AVC video sequence consists of a series of NAL units, each containing a raw byte sequence payload (RBSP). Coded slices (including Data Partitioned slices and IDR slices) and the End of Sequence RBSP are defined as VCL NAL units, whilst all other elements are non-VCL NAL units.

1.2 TRANSFORMS USED IN VIDEO CODING

Transforms are widely used in video coding to remove spatial redundancy. The Karhunen-Loeve transform (KLT) [3] is the optimal transform in terms of decorrelation and energy compaction of the transformed data. However, the kernel of the KLT is data-dependent and would have to be derived for each image sub-block. Moreover, in general the KLT kernel is non-separable, which would require a full matrix multiplication demanding unreasonable computational resources. It has been shown, however, that for the majority of signal processing applications the type-II Discrete Cosine Transform (DCT) is a close approximation of the KLT [3]. The basis functions of the DCT of length N are defined as:

$$A_{ij} = C_i \cos\left(\frac{(2j+1)\,i\pi}{2N}\right), \quad i, j = 0, \ldots, N-1, \qquad
C_i = \begin{cases}\sqrt{1/N}, & i = 0\\ \sqrt{2/N}, & i > 0\end{cases} \qquad (1.1)$$

where i and j are the row and column indexes of the transform matrix, and $C_i$ is a scale factor that normalizes the transform matrix. The DCT serves as a close approximation of the eigenvectors of the KLT for signals that can be modeled as first-order Markov processes with high correlation.

The output of a 2-D DCT is a set of N×N coefficients representing the image block data in the transform domain; these coefficients can be considered as weights of a set of standard basis patterns. Typically, a large portion of the signal energy is represented by relatively few transform coefficients, so energy compaction is achieved. Another advantage of the DCT is that its multidimensional versions can be computed as a separable product of DCTs along each of the directions: 2-D DCT coefficients may be computed by first applying the one-dimensional DCT to each image row and then using the corresponding coefficients from each row as input to a one-dimensional DCT applied along the columns. There are a number of fast computational algorithms for the DCT when N is not prime and, in particular, when N is a power of 2; in the latter case, these algorithms require $O(N \log_2 N)$ arithmetic operations for an N-point transform.
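The following minimal NumPy sketch builds the DCT matrix directly from (1.1) and demonstrates the separability just described (it is an illustration, not an optimized fast algorithm):

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Type-II DCT basis matrix A from (1.1); rows are basis vectors."""
    a = np.zeros((n, n))
    for i in range(n):
        c = np.sqrt(1.0 / n) if i == 0 else np.sqrt(2.0 / n)
        for j in range(n):
            a[i, j] = c * np.cos((2 * j + 1) * i * np.pi / (2 * n))
    return a

n = 8
A = dct_matrix(n)
X = np.random.randn(n, n)               # stand-in image block

C_direct = A @ X @ A.T                  # 2-D DCT as two matrix products
C_separable = (A @ (A @ X.T).T)         # 1-D DCT of each row, then of each column
assert np.allclose(C_direct, C_separable)
assert np.allclose(A @ A.T, np.eye(n))  # orthonormal: the inverse is the transpose
```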

The (near) optimality of the DCT, the separable nature of the 2-D DCT, and the large number of fast computational algorithms available in the literature [3], [4] and [5] have allowed the wide applicability of this transform. For example, DCT or DCT-like transforms are utilized in transform-based coding schemes such as JPEG, H.26x and MPEG-1/2/4.

In earlier standards, residual decoding carries the possibility of drift (mismatch between the decoded data in the encoder and in the decoder). The drift arises from the fact that the inverse discrete cosine transform (IDCT) is not fully specified in integer arithmetic; rather, it must satisfy statistical tests of accuracy relative to a floating-point implementation of the inverse transform (e.g., the IDCT accuracy specification in H.261 Annex A). For more information on the analysis of the drift and on IDCT design, refer to [P1], [P2] and [P3].

H.264/AVC makes extensive use of prediction, since even the intra coding modes rely upon spatial prediction; as a result, H.264/AVC is very sensitive to prediction drift. To achieve drift-free residual coding, H.264/AVC uses an integer orthogonal approximation to the DCT, namely the Integer Cosine Transform (ICT) [17], allowing bit-exact implementation for all encoders and decoders and thus avoiding the inverse transform mismatch problem and the drift it causes. The 4×4 and 8×8 ICT matrices adopted in H.264/AVC are given in (1.2)-(1.5) below; they have been shown to provide compression efficiency similar to that of the DCT.

$$T_{4f} = \begin{bmatrix} 1 & 1 & 1 & 1\\ 2 & 1 & -1 & -2\\ 1 & -1 & -1 & 1\\ 1 & -2 & 2 & -1 \end{bmatrix} \qquad (1.2)$$

$$T_{4i} = \begin{bmatrix} 1 & 1 & 1 & 1\\ 1 & \tfrac{1}{2} & -\tfrac{1}{2} & -1\\ 1 & -1 & -1 & 1\\ \tfrac{1}{2} & -1 & 1 & -\tfrac{1}{2} \end{bmatrix} \qquad (1.3)$$

$$T_{8f} = \begin{bmatrix}
8 & 8 & 8 & 8 & 8 & 8 & 8 & 8\\
12 & 10 & 6 & 3 & -3 & -6 & -10 & -12\\
8 & 4 & -4 & -8 & -8 & -4 & 4 & 8\\
10 & -3 & -12 & -6 & 6 & 12 & 3 & -10\\
8 & -8 & -8 & 8 & 8 & -8 & -8 & 8\\
6 & -12 & 3 & 10 & -10 & -3 & 12 & -6\\
4 & -8 & 8 & -4 & -4 & 8 & -8 & 4\\
3 & -6 & 10 & -12 & 12 & -10 & 6 & -3
\end{bmatrix} \qquad (1.4)$$

$$T_{8i} = T_{8f}^{\,t} \qquad (1.5)$$
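A quick sanity check of the 4×4 kernel above can be done with the sketch below. This is an illustration only, not the normative H.264/AVC transform pipeline (which also folds the row-norm scaling into quantization):

```python
import numpy as np

# The 4x4 forward ICT kernel T4f from (1.2).
T4f = np.array([[1,  1,  1,  1],
                [2,  1, -1, -2],
                [1, -1, -1,  1],
                [1, -2,  2, -1]])

# Rows are orthogonal; their unequal norms are absorbed by the quantizer scaling.
print(T4f @ T4f.T)                        # diagonal: diag(4, 10, 4, 10)

X = np.random.randint(-255, 256, (4, 4))  # stand-in residual block
C = T4f @ X @ T4f.T                       # exact 2-D forward transform in integer arithmetic
print(C)
```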

Recently, a variation of the ICT, called the Pre-scaled Integer Transform (PIT) [P4], was proposed to further reduce the implementation complexity of the ICT while keeping all of its merits, with no penalty in performance. PIT has been adopted in the Audio Video coding Standard (AVS), the Chinese National Coding standard [6].

1.3 OUTLINE AND OBJECTIVES OF THE THESIS

This thesis presents novel algorithms and methods to improve the coding efficiency and reduce the complexity of video coders. The developed algorithms focus in particular on high-resolution, low-complexity use cases, to enable the emerging high-quality, high-resolution video services on mobile terminals.

The first main contribution of this thesis is to introduce fixed-point design methodologies and several resulting implementations of the inverse discrete cosine transform [P1][P2][P3]; a drift analysis for integer IDCTs is also given. The second main contribution is to introduce the pre-scaled integer transform (PIT) [P4], which reduces the implementation complexity of the conventional integer cosine transform (ICT) while maintaining all of its merits, such as bit-exact implementation and good coding efficiency; design rules that lead to good PIT kernels are developed, and different types of PIT and their target applications are examined. The third main contribution is to introduce a new transform technique called the spatially varying transform, which achieves up to 13.50% bitrate reduction compared to the latest video coding standard H.264/AVC, and does so at low implementation complexity [P5][P6][P7].

The main goal of the research work presented in this thesis is to develop novel algorithms to be used in future video coding standards. It should be noted that, at the time of writing, VCEG and MPEG, the two international standardization organizations, had joined forces and formed the Joint Collaborative Team on Video Coding (JCT-VC). JCT-VC

embarked on a new project to develop the next-generation video coding standard, called High Efficiency Video Coding (HEVC). The preliminary requirements for HEVC were a bit rate reduction of 50% at the same subjective image quality compared to the H.264/AVC High profile, with computational complexity ranging from ½ to 3 times that of the High profile. It was also stated that HEVC should be able to provide a 25% bit rate reduction together with a 50% reduction in complexity at the same perceived video quality as the High profile, or to provide a greater bit rate reduction at somewhat higher complexity. Many new techniques, e.g., extended macroblock sizes, improved interpolation and flexible motion representation, are adopted in HEVC. The novel algorithms and methods proposed in this thesis are developed to improve the coding efficiency and reduce the complexity of video coders, and could thus potentially be used in the HEVC framework.

1.4 AUTHOR'S CONTRIBUTIONS TO THE PUBLICATIONS

Publication [P1] proposes efficient fixed-point approximations of the 8x8 inverse discrete cosine transform. The idea was proposed by Dr. Reznik and Dr. Hinds. The author proposed the extended linearity test and performed some of the simulations.

Publication [P2] presents a drift analysis for integer IDCTs. The idea was proposed by Dr. Hinds and Dr. Reznik. The author helped perform some of the simulations.

Publication [P3] presents a multiplier-less approximation of the DCT/IDCT with low complexity and high accuracy. The idea was proposed by the author, who implemented it, performed all the simulations and wrote most of the publication, with the assistance of the co-authors.

Publication [P4] proposes the technique of the pre-scaled integer transform. The idea was proposed by Prof. Cham and Dr. Lou. The author implemented the ideas in the H.264/AVC reference software, performed the theoretical analysis and all the simulations, and did most of the writing of the publication, with the assistance of the co-authors.

Publication [P5] first introduces the concept, design and implementation of the spatially varying transform. The idea presented in [P5] was co-invented by the author and Jani Lainema. The author implemented the ideas in the H.264/AVC reference software, performed all the simulations and wrote most of the publication, with the assistance of the co-authors.

Publication [P6] extends the analysis and work done in [P5]. The main ideas were proposed by the author, who also implemented the algorithms, performed all the simulations and wrote most of the publication, with the assistance of the co-authors.

Publication [P7] proposes the idea of the prediction signal aided spatially varying transform, which utilizes the gradient of the prediction signal to eliminate unlikely spatially varying transform positions. The original idea was conceived by the author. The author implemented the idea, performed all the simulations and wrote most of the publication, with the assistance of the co-authors, especially Dr. Kemal Ugur.

CHAPTER 2
FIXED-POINT APPROXIMATIONS OF THE 8X8 INVERSE DISCRETE COSINE TRANSFORM

The Discrete Cosine Transform (DCT) is widely used in video and image coding applications, as it is the transform used in both the MPEG [44] and JPEG [45] standards. The Inverse Discrete Cosine Transform (IDCT) is the inverse process of the DCT. The DCT/IDCT is theoretically defined in terms of real-number operations, whose direct implementation is complex; reducing this complexity is therefore one of the most important issues when designing a DCT/IDCT implementation. Another important issue, specific to IDCT implementations, is to reduce the drift problem (i.e., mismatch accumulation leading to degradation in the quality of the reconstructed video), which is caused by different IDCT implementations in the encoder and the decoder.

In early 2005, because of the expiration and withdrawal of the related IDCT precision specification (IEEE Standard 1180 [46]), MPEG decided to improve its handling of this matter and to produce:

- a new precision specification, ISO/IEC 23002-1, replacing the IEEE 1180 standard and harmonizing the treatment of precision requirements across different MPEG standards [47], and
- a new voluntary standard, ISO/IEC 23002-2, providing specific (deterministically defined) examples of fixed-point 8x8 IDCT and DCT implementations.

2.1 PRECISION REQUIREMENTS FOR IDCT IMPLEMENTATIONS IN MPEG AND ITU-T STANDARDS

The specifications of MPEG, JPEG and several ITU-T video coding standards do not require IDCTs to be implemented so as to generate exactly the theoretical output. The MPEG IDCT precision standard, ISO/IEC 23002-1 [48], gives the precise specification of the error tolerances and of how they are to be measured for a given IDCT implementation. This standard includes several tests using a pseudo-random input generator originating from the former IEEE Standard 1180 [46], as well as additional tests required by MPEG standards. Table 1 below summarizes the error metrics defined by the IEEE 1180 / ISO/IEC 23002-1 specification. The variable i = 1, ..., Q indicates the index of a pseudo-random input (an 8x8 matrix) used in a test, and Q = 10000 (or, in some tests, Q = 100000) denotes the total number of sample matrices. The tolerance for each metric is provided in the last column of the table.

TABLE 1 THE IEEE 1180 / ISO/IEC PSEUDO-RANDOM IDCT TEST METRICS AND CONDITIONS [P1]

where $f^i_{yx}$ and $\hat{f}^i_{yx}$ denote the theoretical and practical IDCT outputs for the i-th input block, respectively, and the metrics are computed from the pointwise error $d^i_{yx} = \hat{f}^i_{yx} - f^i_{yx}$ and the squared error $e^i_{yx} = (\hat{f}^i_{yx} - f^i_{yx})^2$.

In addition, a so-called linearity test [49][50] is provided as an additional (informative) test in the ISO/IEC 23002-1/FPDAM1 specification [51]. This test requires the reconstructed pixel values $f_{yx}$ produced by an IDCT implementation under test to be symmetric with respect to sign reversal of its input coefficients $F_{vu}$:

$$f_{yx}(-F_{vu}) = -f_{yx}(F_{vu}), \qquad
F_{vu} = \begin{cases} z, & v = t,\ u = s,\ z = 1, 3, \ldots, 527\\ 0, & \text{otherwise} \end{cases}
\qquad v, u, t, s = 0, \ldots, 7. \qquad (2.1)$$

2.2 FIXED-POINT APPROXIMATIONS OF DCT/IDCT

There are many efficient and well-known factorizations of the 1-D DCT, including the scaled factorization proposed by Arai et al. (AAN) [52] (only 5 multiplications and 29 additions), and the non-scaled factorizations of Chen et al. [53], Lee [54], Vetterli and Ligtenberg (VL) [55] and Loeffler et al. (LLM) [56]. The VL and LLM algorithms are the least complex among known non-scaled designs, requiring only 11 multiplications and 26 additions. The suitability of each of these factorizations for the design of fixed-point IDCT algorithms, and their drift characteristics, were extensively analyzed in the course of the work on the ISO/IEC standard; publications [P1][P2][P3] provide more details.

To implement the DCT/IDCT in fixed-point arithmetic, one of the most common and practical techniques is to approximate the irrational factors $\alpha_i$ by dyadic fractions:

$$\alpha_i \approx a_k / 2^k \qquad (2.2)$$

where both $a_k$ and $k$ are integers.

In this way, multiplication of x by a factor $\alpha_i$ admits a very simple approximation in integer arithmetic:

$$x\alpha_i \approx (x \cdot a_k) \gg k \qquad (2.3)$$

where $\gg$ denotes the bit-wise right shift operation.

The exact multiplier-less implementation of the multipliers in the factorization is crucial to the accuracy of the IDCT implementation. If F(i) and I(i) denote the outputs of a theoretical multiplier and of a practical multiplier for input i, then one straightforward approach is to find a multiplier-less implementation that minimizes an approximation error measure such as the sum of squared approximation errors (SSAE) or the sum of absolute errors (SAE), calculated as in (2.4) and (2.5), respectively:

$$\mathrm{SSAE} = \sum_{i=0}^{2^{\mathrm{SCALE}}} \left(I(i) - F(i)\right)^2 \qquad (2.4)$$

$$\mathrm{SAE} = \sum_{i=0}^{2^{\mathrm{SCALE}}} \left|I(i) - F(i)\right| \qquad (2.5)$$

However, this approach has some disadvantages:

1. The statistics of the input data differ from multiplier to multiplier. Moreover, the statistics are difficult to estimate, since they may vary greatly between sequences, coding methods, etc.

2. The implementation of each multiplier is optimized independently, whereas in fact the truncation errors of the different multipliers interact with one another through the additions/subtractions in the factorization.

Experiments also show that independent optimization of each multiplier using criteria such as (2.4) and (2.5) does not yield the best implementation. Normally, a brute-force search is conducted for the best implementation: typically, only the multiplier-less implementation with the minimum number of additions/subtractions is used for each multiplier, and the Overall Mean Square Error (OMSE) is selected as the criterion for the optimization problem; see [P3]. Several fixed-point approximations of the DCT/IDCT are derived, and their drift performance analyzed, in [P3].
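As an illustration of (2.2)-(2.3), the following sketch approximates multiplication by cos(π/4). The dyadic fraction 181/2^8 is chosen here purely for illustration and is not taken from the standard:

```python
import math

def dyadic_multiply(x: int, a: int, k: int) -> int:
    """Approximate x * alpha via (2.3): integer multiply by a, then right-shift by k."""
    return (x * a) >> k

# Example: alpha = cos(pi/4) ~ 0.70711, approximated by 181 / 2**8 ~ 0.70703.
a, k = 181, 8
alpha = math.cos(math.pi / 4)
for x in (100, 1000, 12345):
    print(x, dyadic_multiply(x, a, k), x * alpha)

# Note: >> truncates toward minus infinity; such rounding effects accumulate
# through the factorization, which is why joint optimization (OMSE) tends to
# beat optimizing each multiplier in isolation.
```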

2.3 THE ISO/IEC IDCT

The overall architecture used in the design of the ISO/IEC 23002-2 IDCT algorithm is shown in Figure 3; it can be characterized by its separable and scaled features. The scaling stage is performed with a single 8x8 matrix that is precomputed by factoring the 1-D scale factors for the row transform with the 1-D scale factors for the column transform. The scaling stage is also used to pre-allocate P bits of precision to each of the input DCT coefficients. The underlying basis for the scaled 1-D transform design is a variant of the well-known factorization of Loeffler et al. [56], with 3 planar rotations and 2 independent factors $\gamma = \sqrt{2}$, as shown in Figure 4. This choice was made empirically, based on an extensive analysis of fixed-point designs derived from other known algorithms, including variants of the AAN, VL and LLM factorizations. For more details of the ISO/IEC 23002-2 IDCT, please see [P1].

FIGURE 3 FIXED-POINT 8X8 IDCT ARCHITECTURE. [P1]

FIGURE 4 LOEFFLER-LIGTENBERG-MOSCHYTZ IDCT FACTORIZATION (LEFT), AND ITS GENERALIZED SCALED FORM (RIGHT). [P1]

CHAPTER 3
DIRECTIONAL TRANSFORMS

The traditional 2-D DCT/ICT/PIT [P4] is typically implemented as separable 1-D transforms in the horizontal and vertical directions. However, this approach does not take the orientation features of the image into account, and thus may not be the most appropriate one. This well-known fact has recently triggered several attempts to develop directional transforms that better preserve directional information. Some of these directional transforms have been applied in video coding, demonstrating significant coding gain. Basically, the following properties are desired for a good directional transform in image coding [27]:

- High efficiency in representing various directional signals. To handle the different contents in an image, the transform should be able to exploit correlations along different directions, and thus the basis vectors should be directional and anisotropic.

- Low implementation complexity. An ordinary video frame may contain tens of millions of pixels, so the transform must be easy enough to implement to handle such a huge number of pixels.

- Non-redundancy. The transform coefficients should not be redundant. Although the transform itself need not be critically sampled, very careful judgment is needed to decide whether or not it should be over-complete.

- Ability to take full advantage of existing coding tools. Image/video coding has evolved for several decades; many coding tools are heavily tuned and already very mature, and it is very difficult to improve on the current state-of-the-art from scratch. For example, a new transform that generates coefficients with a structure similar to that of the transforms used in current coding schemes would be preferred.

Recently, several directional transforms have been proposed and applied in image coding. They can basically be divided into three categories [27]:

- Reorganization-based directional transforms: pixels in an image block are reorganized according to a selected direction, and a conventional transform is then applied.

- Lifting-based directional transforms: the directional transforms are obtained by factorizing a conventional transform into lifting operators and performing each lifting operator directionally.

- Data-dependent directional transforms: the directional transform is constructed by a direction prediction followed by a corresponding data-dependent transform.

More details about each category are provided in the following subsections. A new transform technique called the spatially varying transform, which addresses the same shortcomings of the traditional 2-D DCT, is also proposed in this thesis [P5][P6][P7].

3.1 REORGANIZATION-BASED DIRECTIONAL TRANSFORM

The basic idea behind the first category of directional transforms is to reorganize the pixels along a certain direction for each 1-D transform [8]. The idea is well suited to block transforms such as the DCT. Similar to the directional intra-prediction modes defined in H.264/AVC, eight directional modes are usually used in the directional DCT. Two modes are the same as the conventional 2-D DCT; the other modes cover diagonal down-left, diagonal down-right, vertical-right, horizontal-down, vertical-left and horizontal-up, whereas the DC mode of H.264 is not used here. These modes are depicted in Figure 5 below to explain how the directional DCT works. An example of the directional DCT is also shown in Figure 6 and Figure 7.

FIGURE 5 SIX DIRECTIONAL MODES DEFINED IN A SIMILAR WAY AS IN H.264/AVC FOR THE BLOCK SIZE 8×8. THE VERTICAL AND HORIZONTAL MODES ARE NOT INCLUDED HERE. [9]

FIGURE 6 N×N IMAGE BLOCK IN WHICH THE FIRST 1-D DCT WILL BE PERFORMED ALONG THE DIAGONAL DOWN-LEFT DIRECTION. [9]

FIGURE 7 EXAMPLE OF N=8: ARRANGEMENT OF COEFFICIENTS AFTER THE FIRST DCT (LEFT) AND AFTER THE SECOND DCT, WITH THE MODIFIED ZIGZAG SCANNING (RIGHT). [9]

3.2 LIFTING-BASED DIRECTIONAL TRANSFORMS

There are various transforms that can be factorized into lifting operators. The lifting-based implementation of the 8-point DCT is illustrated in Figure 8. The inverse transform can be implemented by reversing the order of the lifting operators and the sign of each operator.

FIGURE 8 FACTORIZING THE 8-POINT DCT INTO PRIMARY OPERATIONS. x[n] AND y[n] (n = 0, 1, ..., 7) ARE THE INPUT SIGNAL AND OUTPUT DCT COEFFICIENTS, RESPECTIVELY. Oi (i = 1, 2, ..., 35) IS A PRIMARY OPERATION. [28]

The transform can be turned into a corresponding directional transform by performing each lifting operator directionally, as shown in Figure 9. The inverse of a directional lifting operator is also a directional operator, along the same direction but with the reversed lifting weight. By concatenating all the inverse lifting operators, the exact inverse directional transform is obtained. Based on the lifting approach, a directional DCT [28] has also been proposed to bring directional basis functions into these transforms. The directional basis functions are generated along the given direction that each lifting operator follows; at the same time, they still preserve the frequency properties of the DCT. Thus, the directional DCT is a direction-frequency analysis of the video signal.

FIGURE 9 EXEMPLIFIED PRIMARY OPERATIONS: (A) NONDIRECTIONAL AND (B) DIRECTIONAL, WHERE WHITE CIRCLES DENOTE INTEGER PIXELS AND GRAY CIRCLES HALF PIXELS. [28]

In practical coding using lifting-based directional transforms, the image to be coded is divided into blocks. For each block, one direction from a set of predefined directions is assigned, and the transform is performed along that direction within the block.

3.3 DATA-DEPENDENT DIRECTIONAL TRANSFORMS

In the intra-frame coding of H.264/AVC, intra prediction only removes the directional redundancy among neighboring blocks; it does not exploit the directional correlation within the current block. Thus, after intra prediction, directional correlations remain in the prediction residues. To fully remove the directional redundancy within the image, the mode-dependent directional transform (MDDT) [29] was proposed in the framework of H.264/AVC. Instead of using just one transform as in H.264/AVC, MDDT uses different transforms according to the prediction direction of the image blocks. The transform for each direction, which can be designed to be separable or non-separable, is obtained by training the KLT on the residual signals of the same mode; the transform is thus data-dependent. By applying intra prediction to remove inter-block directional redundancy and MDDT to remove the remaining directional redundancy within the current block, the combined prediction and transform provide an efficient way to exploit the directional correlation within the H.264/AVC framework.
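To make the training step concrete, here is a minimal sketch of deriving a non-separable KLT from residual blocks of one intra mode, as done conceptually in MDDT. The function name is hypothetical and random data stands in for a real per-mode training set:

```python
import numpy as np

def train_mode_klt(residual_blocks: np.ndarray) -> np.ndarray:
    """Derive a KLT basis from vectorized residual blocks (rows) of one intra mode."""
    X = residual_blocks - residual_blocks.mean(axis=0)
    cov = X.T @ X / len(X)                    # sample covariance of the residuals
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric eigendecomposition
    return eigvecs[:, ::-1]                   # columns sorted by decreasing variance

# Toy usage: 1000 random 4x4 residual blocks standing in for one mode's training set.
blocks = np.random.randn(1000, 16)
K = train_mode_klt(blocks)
coeffs = blocks @ K                           # forward (non-separable) transform
```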


CHAPTER 4
PRE-SCALED INTEGER TRANSFORM

The Integer Cosine Transform (ICT) was introduced by Cham in 1989 [39] and has been further developed in recent years. It has been proved that some ICTs have almost the same compression efficiency as the Discrete Cosine Transform (DCT) but much simpler implementations, because only addition and shift operations are needed [17]. Moreover, the ICT avoids the inverse transform mismatch problem of the DCT. Due to these advantages, the latest international video coding standard, H.264/AVC, adopted order-4 and order-8 ICTs [40][41]. In order to reduce the computational complexity, H.264/AVC merges the forward/inverse scaling and the quantization/dequantization operations into single steps. Figure 10 shows the block diagram of the conventional ICT scheme.

FIGURE 10 BLOCK DIAGRAM OF CONVENTIONAL ICT SCHEME IN H.264/AVC

The main idea of the pre-scaled integer transform (PIT) is that the inverse scaling is moved to the encoder side and combined with the forward scaling and quantization into a single process. The fact that no scaling is needed at the decoder side distinguishes PIT from the conventional ICT, and this is exactly why PIT can reduce both the required memory size and the computational complexity on the decoder side, as explained below. The block diagram of the PIT scheme is shown in Figure 11.

FIGURE 11 BLOCK DIAGRAM OF PROPOSED PIT SCHEME

To show the advantages of PIT over ICT, consider the order-4 transform first. In H.264/AVC, the QP period is 6, and the decoder uses a dequantization-scaling matrix $QS_{ict}$ for the 4×4 ICT [10][17][40].

$$QS_{ict} = \begin{bmatrix}
10 & 16 & 13\\
11 & 18 & 14\\
13 & 20 & 16\\
14 & 23 & 18\\
16 & 25 & 20\\
18 & 29 & 23
\end{bmatrix} \qquad (4.1)$$

For the (i, j)-th transform coefficient in a block, the rule for selecting the corresponding dequantization-scaling element of $QS_{ict}$ is:

$$\begin{cases}
QS_{ict}(QP\%6,\ 0) & \text{for } (i\%2,\ j\%2) \text{ equal to } (0,0)\\
QS_{ict}(QP\%6,\ 1) & \text{for } (i\%2,\ j\%2) \text{ equal to } (1,1)\\
QS_{ict}(QP\%6,\ 2) & \text{otherwise}
\end{cases} \qquad (4.2)$$

where % is the modulus operator: x%y is the remainder of x divided by y, defined only for integers x and y with x ≥ 0 and y > 0.

The main problem here is that, for every transform coefficient, a lookup operation using QP and the coordinates of the coefficient within the block must be conducted to find the corresponding element of $QS_{ict}$, and some computation as in (4.2) is also needed. To reduce the computational complexity and, at the same time, facilitate parallel processing and take advantage of the efficient multiply/accumulate architecture of many processors, $QS_{ict}$ can be fully expanded to $QS'_{ict}$, as in (4.3), with the lookup rule correspondingly simplified to (4.4). However, the storage then increases from 6×3 = 18 bytes (see (4.2)) to 6×4×4 = 96 bytes (see (4.3)). The memory footprint becomes much larger when the 8×8 ICT is also used: in that case, a total memory of 6×4×4 + 6×8×8 = 480 bytes must be allocated.

$$QS'_{ict} = \begin{bmatrix}
10 & 13 & 10 & 13 & 13 & 16 & 13 & 16 & 10 & 13 & 10 & 13 & 13 & 16 & 13 & 16\\
11 & 14 & 11 & 14 & 14 & 18 & 14 & 18 & 11 & 14 & 11 & 14 & 14 & 18 & 14 & 18\\
13 & 16 & 13 & 16 & 16 & 20 & 16 & 20 & 13 & 16 & 13 & 16 & 16 & 20 & 16 & 20\\
14 & 18 & 14 & 18 & 18 & 23 & 18 & 23 & 14 & 18 & 14 & 18 & 18 & 23 & 18 & 23\\
16 & 20 & 16 & 20 & 20 & 25 & 20 & 25 & 16 & 20 & 16 & 20 & 20 & 25 & 20 & 25\\
18 & 23 & 18 & 23 & 23 & 29 & 23 & 29 & 18 & 23 & 18 & 23 & 23 & 29 & 23 & 29
\end{bmatrix} \qquad (4.3)$$

where each row holds the 16 dequantization-scaling values of one QP%6 value in raster order, and the lookup rule becomes

$$QS'_{ict}(QP\%6,\ i,\ j) \qquad (4.4)$$

When PIT is used, the dequantization-scaling matrix (or, more accurately, the dequantization matrix, since no scaling is included any more) $QS_{pit}$ is:

$$QS_{pit} = [10, 11, 13, 14, 16, 18]^{T} \qquad (4.5)$$

and for the (i, j)-th transform coefficient in a block, the lookup rule for the corresponding element of $QS_{pit}$ is further simplified to:

$$QS_{pit}(QP\%6). \qquad (4.6)$$

Comparing (4.5) and (4.6) with (4.1), (4.2) and with (4.3), (4.4), respectively, it is clear that PIT reduces the decoding complexity. First, if the dequantization-scaling matrix is fully

expanded, then when the order-4 ICT is used, a total memory of 6×4×4 = 96 bytes is replaced by a memory of only 6 bytes; the memory required when using PIT is much lower than that required by the conventional ICT. Otherwise, if the dequantization-scaling matrix is not expanded, a total memory of 6×3 = 18 bytes is reduced to 6 bytes. Though this saving is trivial when only the order-4 ICT is considered, keep in mind that in this case every non-zero coefficient needs a lookup operation plus some extra computation when the ICT is used; the computational complexity of PIT is lower because the 3-D lookup operation is replaced by a 1-D lookup that uses only QP, and no extra operations as in (4.2) are needed. In either case, on the decoder side the required memory is reduced, and the 3-D lookup operation can always be replaced by a 1-D lookup, so pipelining and parallel processing are facilitated and extra computation or storage memory can be saved. On the encoder side, since the forward scaling, inverse scaling and quantization are combined into a single process and the original quantization-scaling matrix is replaced by a combined scaling-quantization matrix of the same size, both the computational and the storage complexity remain unchanged.

Only the order-4 transform is considered in the discussion above. When both order-4 and order-8 transforms are used, a total memory of 6×4×4 + 6×8×8 = 480 bytes can be saved, with only 6 bytes needed instead, assuming that the order-4 and order-8 PITs employed are compatible.
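The contrast between the two lookup schemes can be sketched as follows. This is an illustration only: the matrices are taken from (4.1) and (4.5), the function names are hypothetical, and details of the real H.264/AVC dequantization (such as the QP/6 left-shift) are omitted for brevity:

```python
import numpy as np

QS_ict = np.array([[10, 16, 13], [11, 18, 14], [13, 20, 16],
                   [14, 23, 18], [16, 25, 20], [18, 29, 23]])
QS_pit = np.array([10, 11, 13, 14, 16, 18])

def dequant_ict(levels: np.ndarray, qp: int) -> np.ndarray:
    """Conventional ICT: per-coefficient class lookup, as in (4.2)."""
    out = np.empty_like(levels)
    for i in range(4):
        for j in range(4):
            cls = 0 if (i % 2, j % 2) == (0, 0) else 1 if (i % 2, j % 2) == (1, 1) else 2
            out[i, j] = levels[i, j] * QS_ict[qp % 6, cls]
    return out

def dequant_pit(levels: np.ndarray, qp: int) -> np.ndarray:
    """PIT: a single scalar per block, as in (4.6) -- no position-dependent scaling."""
    return levels * QS_pit[qp % 6]

levels = np.random.randint(-8, 9, (4, 4))
print(dequant_ict(levels, qp=28))
print(dequant_pit(levels, qp=28))
```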

4.1 DESIGN RULES OF PIT KERNELS

Just like the ICT, not every PIT kernel performs well, so it is practically important to find design rules that lead to good PIT kernels. The following two factors should be considered when choosing a PIT kernel.

Compression ability: It is proved in [P4] that PIT kernels are identical to the corresponding ICT kernels in terms of transform coding gain [42] and DCT frequency distortion [43]. So, basically, good ICT kernels can and should be used to obtain good PIT kernels.

Weighting factors of transform coefficients: It is shown in [P4] that the PIT coefficient matrix can be regarded as a weighted ICT coefficient matrix, where $S_i$ is the weighting matrix. While all coefficients of an orthogonal ICT have the same maximum and minimum values, those of a PIT do not, because the weighting matrix $S_i$ changes the relative magnitudes of the transform coefficients. The Weighting Factor Difference (WFD) of a PIT represents the spread of the weighting factors applied to the different transform coefficients:

$$\mathrm{WFD}_i = \frac{\max_{k,l} S_i(k,l)}{\min_{k,l} S_i(k,l)}. \qquad (4.7)$$

A WFD deviating from unity may cause a problem: it may result in the truncation of some transform coefficients that would be retained if $S_i = \alpha I_f$, where α is a scalar and $I_f$ is a matrix with all its elements equal to 1. Unfortunately, the condition $S_i = \alpha I_f$ cannot always hold when PIT is used. In order not to change the transform coefficients too much, and so retain the compression ability of the corresponding ICT kernel, the elements of the weighting matrix $S_i$ should have values close to one another. Another advantage of this consideration is that the change from an ICT to the corresponding PIT remains small, so other parts of the codec need not be re-designed: the scan order of the transform coefficients need not be changed, and the entropy coding table designed for the ICT can be re-used for the PIT without major changes.

Based on the discussion above, a good PIT kernel can be obtained by the following steps:

1) Obtain a good ICT kernel. To derive good ICT kernels, a systematic computer search can be carried out based on transform coding gain and DCT frequency distortion.

2) Choose a good ICT kernel whose WFD is not larger than 2, a bound that is empirically derived in [P4].

4.2 TYPES OF PRE-SCALED INTEGER TRANSFORM

Besides the design rules derived above, different types of PITs have different characteristics and are suitable for different applications, which is also an important practical issue. There are generally two types of PITs, classified by the Frequency Scale Factor (FSF). For an order-N PIT, the FSF measures the scaling effect of $S_i$ on the transformed coefficients and is defined as:

$$\mathrm{FSF} = \frac{1}{N^2} \sum_{i,j} \frac{S_i(i,j)}{S_i(0,0)}. \qquad (4.8)$$

1) Type-I PITs, whose FSF is less than 1, have the characteristic that more high-frequency components may be quantized out compared to the corresponding ICT. Type-I PITs may be more suitable for video streaming and conferencing applications, where low-resolution (such as QCIF and CIF) video sequences are usually used and coded at relatively low bit-rates, because Type-I PITs lead to bit savings without degrading

the subjective quality significantly in this situation.

2) Type-II PITs, whose FSF is larger than or equal to 1, have the characteristic that more high-frequency components may be retained after quantization compared to the corresponding ICT. Type-II PITs may be more suitable for entertainment-quality and other professional applications, where higher-resolution (such as HD) video sequences are often used and coded at relatively high bit-rates.

Experiments in [P4] show that, when the PIT scheme is carefully designed, no coding efficiency penalty is observed compared to H.264/AVC, and even some improvement can be obtained. Because of this, and due to its reduced implementation complexity and good performance, PIT has been adopted in the Audio Video coding Standard (AVS), the Chinese National Coding standard, with PIT8[10,9,6,2;10,4;8] in AVS Part 2 and PIT4[3,1;2] in AVS Part 7, respectively [P4][6].
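The two design quantities above are straightforward to compute. The following sketch evaluates (4.7) and (4.8) for a weighting matrix; the matrix values here are made-up stand-ins for illustration, not an actual AVS kernel:

```python
import numpy as np

def wfd(S: np.ndarray) -> float:
    """Weighting Factor Difference, (4.7): spread of the weighting factors."""
    return float(S.max() / S.min())

def fsf(S: np.ndarray) -> float:
    """Frequency Scale Factor, (4.8): average weight relative to the DC weight."""
    n = S.shape[0]
    return float(np.sum(S / S[0, 0]) / n**2)

# Stand-in 4x4 weighting matrix (illustrative values only).
S = np.array([[1.00, 0.92, 1.00, 0.92],
              [0.92, 0.85, 0.92, 0.85],
              [1.00, 0.92, 1.00, 0.92],
              [0.92, 0.85, 0.92, 0.85]])

print(f"WFD = {wfd(S):.3f}")   # design rule: keep WFD <= 2
print(f"FSF = {fsf(S):.3f}")   # < 1 -> Type-I PIT; >= 1 -> Type-II PIT
```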

CHAPTER 5
SPATIALLY VARYING TRANSFORM

The typical block-based transform design in most existing video coding standards does not align the underlying transform with possible edge locations, which decreases coding efficiency. As introduced in Chapter 3, directional transforms have been proposed to improve the efficiency of transform coding for directional edges. However, the efficient coding of horizontal, vertical and non-directional edges was not tackled, even though such edges are typically more common in natural video sequences than directional edges. This is the main reason why only very marginal overall gain has been achieved in some previous work by other researchers [9]. Simple examples given in [P6] show that by aligning the transform to the edge location, more energy in the transform domain is concentrated in the low-frequency part, which facilitates entropy coding and improves coding efficiency.

Furthermore, coding the entire prediction error signal may not be the best choice in terms of the rate-distortion tradeoff. An example is the SKIP mode in H.264/AVC [10], which does not code the prediction error at all. This observation is useful in improving coding efficiency, especially for HD video coding, since a given image area in an HD video sequence generally contains less detail than in a lower-resolution sequence, and the prediction error tends to have lower energy and to be more noise-like after better prediction.

A new Spatially Varying Transform (SVT) was developed in [P5] to take advantage of the facts stated above and improve coding efficiency. The basic idea of SVT is that transform coding is not restricted to regular block boundaries but is adjusted to the characteristics of the prediction error. With this flexibility, a coding efficiency improvement can be achieved by selecting and coding the best portion of the prediction error in terms of the rate-distortion tradeoff. Generally, this is done by searching, inside a certain residual region after intra prediction or motion compensation, for a sub-region, and coding only this sub-region. Finally, the location parameter (LP) indicating the position of the sub-region inside the region is coded into the bitstream if there are non-zero transform coefficients. There are two ways to interpret SVT:

1. SVT can be considered a region-of-interest (ROI) coding technique [37], where the region of interest is selected in the prediction error domain and only the region of interest is coded.

2. When the region refers to a macroblock, the proposed algorithm can be considered a special SKIP mode in which part of the macroblock is skipped.

In our case, a single block is selected and coded inside a macroblock when applying SVT; the extension to multiple blocks is straightforward. The selection is due to the fact that

coding on a macroblock basis reduces complexity and delay while facilitating the implementation, especially for hardware. However, it should be noted that there is no restriction on the size and shape of a sub-region or a region, and thus the proposed idea could easily be extended to cover arbitrary sizes and shapes. For instance, SVT can be applied to an extended macroblock [15] of size larger than 16x16. Directional spatially varying transforms, in which a directionally oriented block is selected and coded with a corresponding directional transform (e.g., [9]), can also be used.

5.1 DESIGN OF SVT

In the following sub-sections, three key issues of SVT are discussed in more detail: the selection of the SVT block-size, the selection and coding of candidate LPs, and the filtering of SVT block boundaries.

5.1.1 SELECTION OF SVT BLOCK-SIZE

In our implementation, an MxN SVT is applied on a selected MxN block inside a macroblock of size 16x16, and only this MxN block is coded. It is easy to see that there are in total (17-M)x(17-N) possible LPs if the MxN block is coded by an MxN transform. The following factors are taken into account when choosing suitable M and N: 1) Generally, larger M and N result in fewer possible LPs and less overhead, and vice versa. 2) Larger M and N also result in lower distortion of the reconstructed macroblock, but need more bits for coding the transform coefficients, and vice versa. 3) Finally, a larger block-size transform is more suitable for coding relatively flat areas or non-sharp edges, without introducing coding artifacts such as ringing, while a smaller block-size transform is more suitable for coding areas with details or sharp edges. To facilitate the transform design, it can further be assumed that M=2^m and N=2^n, where m and n are integers ranging from 0 to 4 inclusive in our context(1). In our case, 8x8 SVT, 16x4 SVT and 4x16 SVT are used, as illustrated in Figure 12 and Figure 13 respectively, to show the effectiveness of the proposed algorithm. Nevertheless, it should be noted that, for different sequences with different characteristics, other block-size SVTs can be more suitable in terms of coding efficiency, complexity, etc.

(1) Note that SKIP mode is a special case where m and n are both equal to 0 and the macroblock partition is 16x16 (and the motion vector equals the predicted value in the context of H.264/AVC), and the 16x16 transform is another special case where m and n are both equal to 4.
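As an illustration of the (17-M)x(17-N) location count above, the following Python sketch (a hypothetical helper, not part of any reference software) enumerates all possible positions of an MxN block inside a 16x16 macroblock:

```python
def enumerate_lps(m, n):
    """All (dx, dy) positions of an m x n block inside a 16x16 macroblock."""
    return [(dx, dy) for dy in range(17 - n) for dx in range(17 - m)]

# (17-M) x (17-N) possible locations, as stated in the text:
assert len(enumerate_lps(8, 8)) == 81    # 9 x 9 candidates for 8x8 SVT
assert len(enumerate_lps(16, 4)) == 13   # only a vertical offset for 16x4 SVT
assert len(enumerate_lps(4, 16)) == 13   # only a horizontal offset for 4x16 SVT
```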

Using variable block-size transforms (VBT) [16] in the framework of SVT, namely variable block-size SVT (VBSVT), the prediction error is better localized and the underlying correlations are better exploited; the coding efficiency is thus improved compared to a fixed block-size SVT.

FIGURE 12 ILLUSTRATION OF 8x8 SPATIALLY VARYING TRANSFORM. (The 8x8 transform block is located at offset (Δx, Δy) inside the 16x16 macroblock.)

FIGURE 13 ILLUSTRATION OF 16x4 AND 4x16 SPATIALLY VARYING TRANSFORM. (The 16x4 and 4x16 transform blocks are located at vertical offset Δy and horizontal offset Δx, respectively.)

The transforms used for the 8x8, 16x4 and 4x16 SVTs are described below. In general, a separable forward and inverse orthonormal 2-D transform of a 2-D signal can be written as

C = Tv X Th^t, (5.1)
Xr = Tv^t C Th, (5.2)

respectively, where X denotes a matrix representing an MxN pixel block, C is the transform coefficient matrix, and Xr denotes a matrix representing the reconstructed signal block. Tv and Th are the MxM and NxN transform kernels in the vertical and the horizontal direction, respectively. The superscript t denotes matrix transposition.
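A minimal numerical sketch of (5.1) and (5.2), assuming orthonormal floating-point kernels for illustration (the standard kernels discussed next are integer approximations):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II kernel of size n x n (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    T = np.cos((2 * x + 1) * k * np.pi / (2 * n)) * np.sqrt(2.0 / n)
    T[0, :] /= np.sqrt(2.0)
    return T

def forward_2d(X, Tv, Th):
    # Eq. (5.1): C = Tv X Th^t
    return Tv @ X @ Th.T

def inverse_2d(C, Tv, Th):
    # Eq. (5.2): Xr = Tv^t C Th
    return Tv.T @ C @ Th

# Round-trip test on a random 4x16 residual block:
Tv, Th = dct_matrix(4), dct_matrix(16)
X = np.random.randn(4, 16)
assert np.allclose(inverse_2d(forward_2d(X, Tv, Th), Tv, Th), X)
```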

For the 8x8 SVT, we simply reuse the 8x8 transform kernel of H.264/AVC [10]. For the 16x4 SVT and 4x16 SVT, we use the 4x4 transform kernel of H.264/AVC [10][17] and the 16x16 transform kernel in [18], because the latter is simple and can reuse the butterfly structure of the existing 8x8 transform in H.264/AVC. In all cases, the normal zig-zag scan (for frame-based coding, which is used in our experiments) is used to represent the transform coefficients as input symbols to the entropy coding.

5.1.2 SELECTION AND CODING OF CANDIDATE LPS

When the selected SVT block has non-zero transform coefficients, its location inside the macroblock needs to be coded and transmitted to the decoder. For 8x8 SVT, as shown in Figure 12, the location of the selected 8x8 block inside the current macroblock can be denoted by (Δx, Δy), which can be selected from the set Φ8x8 = {(Δx, Δy): Δx, Δy = 0,...,8}; there are in total 81 candidates for 8x8 SVT. For 16x4 SVT and 4x16 SVT, as shown in Figure 13, the locations of the selected 16x4 and 4x16 blocks inside the current macroblock can be denoted as Δy and Δx, which can be selected from the sets Φ16x4 = {Δy: Δy = 0,...,12} and Φ4x16 = {Δx: Δx = 0,...,12}, respectively; there are in total 26 candidates for 16x4 and 4x16 SVT. The best LP is then selected according to a given criterion. In our case, rate-distortion optimization (RDO) [19] is used to select the best LP in terms of RD tradeoff by minimizing

J = D + λ·R, (5.3)

where J is the RD cost of the selected configuration, D is the distortion, R is the bit rate and λ is the Lagrangian multiplier. The reconstructed residual for the remaining part of the macroblock is simply set to 0 in our implementation, but different values could be used and might be beneficial in certain cases (luminance change, etc.).

The set of candidate LPs is also important, since it directly affects the encoding complexity and the performance of SVT. A larger number of candidate LPs provides more room for coding efficiency improvement and is more robust for sequences with different characteristics, but adds more encoding complexity and more overhead. Experimentally, according to criterion (5.3), we choose to use Φ8x8 = {(Δx, Δy): Δx=0,...,8, Δy=0; Δx=0,...,8, Δy=8; Δx=0, Δy=1,...,7; Δx=8, Δy=1,...,7} for 8x8 SVT, and Φ16x4 = {Δy: Δy=0,...,12} and Φ4x16 = {Δx: Δx=0,...,12} for 16x4 SVT and 4x16 SVT, which shows good RD performance over a large test set [P5]. There are in total 58 candidate LPs, and statistics show that the measured entropy of the LP index is 5.73 bits over all test sequences used in the experiments. Accordingly, the LP index is represented by a 6-bit fixed-length code in our implementation. As the overhead bits needed to code the LPs become significant at low bitrates, it would be useful to choose different LP sets at different QPs and bitrates.
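The reduced candidate set above can be made concrete with a small sketch (hypothetical helper names; the set definitions are those given in the text):

```python
def candidate_lps():
    """Build the 58-candidate LP set used in our implementation."""
    # Border ring of 8x8 SVT positions (32 candidates).
    phi_8x8 = [(dx, 0) for dx in range(9)] + [(dx, 8) for dx in range(9)] \
            + [(0, dy) for dy in range(1, 8)] + [(8, dy) for dy in range(1, 8)]
    phi_16x4 = list(range(13))   # Delta-y candidates for 16x4 SVT
    phi_4x16 = list(range(13))   # Delta-x candidates for 4x16 SVT
    return phi_8x8, phi_16x4, phi_4x16

p8, p16x4, p4x16 = candidate_lps()
assert len(p8) + len(p16x4) + len(p4x16) == 58   # entropy 5.73 bits -> 6-bit FLC
```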

Similarly, an indication of the available LPs can also be coded and transmitted in the slice header in the H.264/AVC framework.

The LP for the luma component can be used to derive the corresponding LPs for the chroma components. Considering the 4:2:0 chroma format as an example, the LPs for the chroma components are derived from the LP for the luma component as follows:

ΔxC = (ΔxL + 1) >> 1, ΔyC = (ΔyL + 1) >> 1, (5.4)

where ΔxL, ΔyL are the LPs for the luma component, while ΔxC, ΔyC are the LPs for the chroma components.
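A one-line sketch of the chroma LP derivation (5.4) for 4:2:0 content (the helper name is illustrative):

```python
def chroma_lp(dx_luma, dy_luma):
    """Derive the chroma LPs from the luma LP for 4:2:0, Eq. (5.4)."""
    return (dx_luma + 1) >> 1, (dy_luma + 1) >> 1

assert chroma_lp(8, 8) == (4, 4)
assert chroma_lp(5, 0) == (3, 0)
```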

5.1.3 FILTERING OF SVT BLOCK BOUNDARIES

Due to the coding (transform and quantization) of the selected SVT block, coding artifacts may appear around its boundary with the remaining non-coded part of the macroblock. A deblocking filter can be designed and applied to improve both the subjective and the objective quality. The basic idea is to adapt the deblocking filter to the different possible locations of the SVT block. An example in the framework of H.264/AVC is described in detail in [P6].

5.2 IMPLEMENTING SPATIALLY VARYING TRANSFORM IN THE H.264/AVC FRAMEWORK

The proposed technique is implemented and tested in the H.264/AVC framework. Figure 14 shows the block diagram of the extended H.264/AVC coder with SVT, compared to the original H.264/AVC coder shown in Figure 2. The encoder first performs motion estimation and decides the optimal mode to be used, and then searches for the best LP (illustrated as "SVT Search" in the diagram) if SVT is used for the macroblock. The encoder then calculates the RD cost of using SVT with (5.3), and if a lower RD cost can be achieved with SVT, the macroblock is coded with SVT. For macroblocks that use SVT, the LP is coded and transmitted in the bitstream when the selected SVT block has non-zero transform coefficients. The LPs of the macroblocks that use SVT are decoded in the box marked "SVT L.P. Decoding" in the diagram. However, the normal criteria used in these encoding processes for macroblocks that do not use SVT, e.g., the sum of absolute differences (SAD) or the sum of squared differences (SSD), may not be optimal for macroblocks that use SVT; better encoding algorithms are under study.

FIGURE 14 BLOCK DIAGRAM OF THE EXTENDED H.264/AVC ENCODER WITH SPATIALLY VARYING TRANSFORM. (Compared to Figure 2, the encoder adds "SVT Search" and "SVT Selection" stages around transform/quantization, and an "SVT L.P. Decoding" stage on the reconstruction path.)

Several key parts of the H.264/AVC standard [10], for instance macroblock types, the coded block pattern (CBP), entropy coding, and deblocking, need to be adjusted. The proposed modifications, which aim at good compatibility with H.264/AVC, are described in [P6].

5.3 FAST ALGORITHMS FOR SPATIALLY VARYING TRANSFORM

As described above, even though the motion estimation and sub-macroblock partition decision processes are not changed for macroblocks that use SVT, the encoding complexity of SVT is higher due to the brute-force search process in RDO. Thus, for applications with strict encoding complexity requirements, fast algorithms for SVT should be developed and used. In this section, a simple yet efficient fast algorithm operating on the macroblock and block levels is proposed to reduce the encoding complexity of SVT. The basic idea is to reduce the number of candidate LPs tested in RDO. The proposed fast algorithm first skips testing SVT for macroblocks for which SVT is unlikely to be useful, by examining the RD cost of the macroblock modes without SVT on the macroblock level. For the remaining macroblocks, for which SVT may be useful, the proposed fast algorithm selects the available candidate LPs based on the motion difference and

utilizes a hierarchical search algorithm to select the best available candidate LP on the block level. The macroblock-level and block-level algorithms are described in the following sections.

5.3.1 MACROBLOCK LEVEL FAST ALGORITHM

The basic idea of the macroblock-level fast algorithm is to skip testing SVT for macroblock modes for which SVT is unlikely to be useful. This is done by examining the RD cost of macroblock modes that do not use SVT and are already available prior to SVT coding. In the proposed implementation, SVT is only tested for a macroblock mode in the RDO process when the following two criteria are met:

min(Jinter, Jskip) <= Jintra, (5.5)
Jmode <= min(Jinter, Jskip) + th, (5.6)

where Jinter and Jintra are the minimum RD costs of the available inter and intra macroblock modes in regular (without SVT) coding, respectively, Jskip is the RD cost of the SKIP mode, and Jmode is the RD cost of the current macroblock mode to be tested with SVT (e.g., if SVT is being tested for the INTER_16x8 mode, then Jmode refers to the RD cost of the regular INTER_16x8 mode without SVT). The threshold th in (5.6) represents the empirical upper limit of the bit-rate reduction when SVT is used, assuming the reconstruction of the macroblock remains the same as when SVT is not used. It is calculated as

th = λ·max(23 - QP/2, 0), (5.7)

where λ is the Lagrangian multiplier used in (5.3) and the max function returns the maximum of its arguments. More details can be found in [P6].
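A sketch of this macroblock-level early-skip test; the cost variables are hypothetical inputs that would come from the regular mode decision:

```python
def test_svt_for_mode(j_mode, j_inter, j_skip, j_intra, qp, lam):
    """Decide whether SVT is worth testing for the current macroblock mode,
    following Eqs. (5.5)-(5.7)."""
    th = lam * max(23 - qp / 2.0, 0)          # Eq. (5.7)
    best_non_intra = min(j_inter, j_skip)
    return best_non_intra <= j_intra and j_mode <= best_non_intra + th

# Example: at QP 32 the bit-rate margin th shrinks to 7 * lambda.
print(test_svt_for_mode(j_mode=105.0, j_inter=100.0, j_skip=120.0,
                        j_intra=130.0, qp=32, lam=1.0))
```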

5.3.2 BLOCK LEVEL FAST ALGORITHM

The proposed block-level fast algorithm consists of two steps: the selection of available candidate LPs based on the motion difference, followed by a hierarchical search algorithm. These two steps are described in detail next.

1) Selection of available candidate LPs based on motion difference: As described in Section 5.1.1, SVT is used for the 16x16, 16x8, 8x16 and 8x8 macroblock partitions. One straightforward approach to reduce the encoding complexity is to restrict the transform block of a candidate SVT block to lie inside a single motion compensation block, since applying the transform across motion compensation block boundaries is usually considered inefficient. However, some coding efficiency penalty was observed for sequences with slow motion and rich detail, for instance Preakness and Night. Taking this into account, a more general approach is used in this work: the testing of a candidate LP is skipped when and only when the transform block(s) of the SVT block at that position overlap different motion compensation blocks whose motion difference is significant according to a certain criterion. In order to measure the motion difference we use a method similar to the one used for deriving the boundary strength parameter of the deblocking filter in H.264/AVC [10][20]. Specifically, in our implementation, we skip testing a candidate LP and mark it unavailable if at least one of the following conditions is true:

- The transform applied to the SVT block at that position overlaps at least two neighboring motion compensation blocks, and the difference between the motion vectors of these blocks is larger than or equal to a pre-defined threshold, which is set to one integer pixel in this work.
- The transform applied to the SVT block at that position overlaps at least two neighboring motion compensation blocks, and the reference frames of these neighboring blocks are different.

Since the number of available candidate LPs varies from one macroblock to another, and the information needed to derive the available candidate LPs can be obtained by both the encoder and the decoder, the index of the selected LP is coded as follows (a code sketch is given below). Assume there are N (N>0) available candidate LPs and the index of the selected candidate is n (0 <= n < N); then it is coded as

V = n, L = floor(log2(N)), if n < 2^(floor(log2(N))+1) - N,
V = n + 2^(floor(log2(N))+1) - N, L = floor(log2(N)) + 1, otherwise, (5.8)

where V represents the binary value of the coded bit string and L represents the number of bits coded. This is a near fixed-length (truncated binary) code, chosen based on the observation that all available candidate LPs are approximately equally likely to be used. This approach shows stable coding efficiency for sequences with different characteristics and achieves a similar gain over a wide range of test sets compared to the original algorithm that does not use the motion difference information. The additional complexity introduced by this approach is marginal when it is carefully implemented, because: a) the decision is only conducted when necessary and can be skipped for macroblock modes with 16x16 partition and for some LPs, e.g., the LP (0,0) for 8x8 SVT; b) the decision is simple and only uses the motion vector and reference frame information; c) generally, several candidate LPs representing spatially consecutive blocks can be marked available or unavailable together in one decision.
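A sketch of the near fixed-length (truncated binary) index code of (5.8); the helper name is illustrative:

```python
def encode_lp_index(n, N):
    """Truncated binary code, Eq. (5.8): return (value, length) for index n of N."""
    k = N.bit_length() - 1            # floor(log2 N)
    u = (1 << (k + 1)) - N            # number of short (k-bit) codewords
    return (n, k) if n < u else (n + u, k + 1)

# With N = 5 available LPs: indices 0..2 use 2 bits, indices 3..4 use 3 bits.
assert [encode_lp_index(n, 5) for n in range(5)] == \
       [(0, 2), (1, 2), (2, 2), (6, 3), (7, 3)]
```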

2) Hierarchical search algorithm: The basic idea of the hierarchical search algorithm is similar to that of a motion estimation algorithm, i.e., to first find the best LP at a relatively coarse resolution and then refine the result at a finer resolution. In our implementation, we define the candidate LPs at the coarser resolution as key candidate LPs, which are marked as squares in Figure 15. The other candidate LPs are called non-key candidate LPs and are marked as circles in Figure 15. The hierarchical search algorithm can be summarized as follows.

1. Let Φ1 denote the set of all available key candidate LPs. Select the one in Φ1 with the lowest RD cost, and let Φ2 denote its available neighboring candidate LPs, which are marked as triangles in Figure 15.
2. The key LPs are divided into 14 LP zones as shown in Figure 15. Select the best LP zone that is available and has the lowest RD cost. An LP zone is available if and only if all three key candidate LPs in that zone are available, and the RD cost of an LP zone is defined as the sum of the RD costs of the three key candidate LPs in that zone. Let Φ3 denote the additional available non-key candidate LPs inside the best LP zone (marked as stars in Figure 15).
3. Select the best LP with the lowest RD cost among all the candidates in Φ1, Φ2 and Φ3.

FIGURE 15 ILLUSTRATION OF THE HIERARCHICAL SEARCH ALGORITHM. (The key candidate LPs are grouped into 14 zones, shown for 8x8, 16x4 and 4x16 SVT.)
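A compact sketch of the three-step search; rd_cost, neighbors and nonkey_in_zone are assumed helpers (the actual zone layout follows Figure 15):

```python
def hierarchical_search(key_lps, zones, rd_cost, neighbors, nonkey_in_zone):
    """Hierarchical LP search sketch (steps 1-3 above).
    key_lps: available key candidate LPs; zones: 3-tuples of key LPs;
    rd_cost(lp) -> float; neighbors(lp) -> available neighboring LPs;
    nonkey_in_zone(zone) -> available non-key LPs inside a zone."""
    phi1 = list(key_lps)
    best_key = min(phi1, key=rd_cost)                 # step 1
    phi2 = list(neighbors(best_key))
    avail = [z for z in zones if all(lp in key_lps for lp in z)]
    phi3 = []
    if avail:                                         # step 2
        best_zone = min(avail, key=lambda z: sum(rd_cost(lp) for lp in z))
        phi3 = list(nonkey_in_zone(best_zone))
    return min(phi1 + phi2 + phi3, key=rd_cost)       # step 3
```

In practice the RD costs of the key LPs would be cached between steps 1 and 2, since each evaluation involves a full transform/quantization/entropy-coding pass.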

5.4 EXPERIMENTAL RESULTS

We implemented the proposed algorithm VBSVT, and its fast version with the proposed fast algorithm, denoted FVBSVT, in the KTA1.8 reference software [22] in order to evaluate their effectiveness. The most important coding parameters used in the experiments are listed below:

- High Profile; QPI = 22, 27, 32, 37 and QPP = QPI + 1; fixed QP is used and rate control is disabled, according to the common conditions set up by the ITU-T VCEG and ISO/IEC MPEG community for coding efficiency experiments [23].
- CAVLC/CABAC is used as the entropy coding.
- Frame structure is IPPP, with 4 reference frames.
- Motion vector search range: ±64/±32 pels for 720p/CIF sequences, with ¼-pel resolution.
- RDO in the High Complexity Mode.

Two configurations are tested. 1) Low complexity configuration: motion estimation block sizes are 16x16, 16x8, 8x16 and 8x8, and the 4x4 transform is not used for macroblocks coded with the regular transform or SVT. This represents a low complexity codec with the most effective tools for HD video coding. 2) High complexity configuration: motion estimation block sizes are 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4, and the 4x4 transform is also used as an optional transform for macroblocks coded with the regular transform or SVT. This represents a high complexity codec with full usage of the tools provided in H.264/AVC.

We measure the average bitrate reduction (ΔBD-RATE) compared to H.264/AVC using the Bjontegaard tool [24] for both the low and high complexity configurations. The results are shown in Table 2; the results of 8x8 SVT and 16x4/4x16 SVT are also shown for comparison. Figure 16 shows the R-D curves for the sequences Night and Panslow. We note that for many tested sequences the gain of VBSVT comes at medium to high bitrates and can be much more significant than the average gain over all bitrates reported above; this is true for most test sequences used in our experiments.

FIGURE 16 RATE-DISTORTION CURVES FOR THE NIGHT SEQUENCE (CABAC) AND THE PANSLOW SEQUENCE (CAVLC). (Each plot compares H.264 with and without VBSVT, in both the low and high complexity configurations; PSNR (dB) versus bitrate (kbit/s).)

We also measure the percentages of 8x8 SVT and 16x4/4x16 SVT selected in VBSVT. The results are shown in Table 3. We can see that in the VBSVT scheme, 16x4/4x16 SVT is generally used more often than 8x8 SVT; however, 8x8 SVT is still used in a significant portion of the cases. In Figure 17 and Figure 18, we show the distribution of the macroblocks coded with SVT and the corresponding SVT blocks in a prediction error frame for the Night and Sheriff sequences, respectively. It can be observed that, as expected, SVT is mostly useful for macroblocks where residual signals of larger magnitude gather at certain regions, which happens frequently, as also observed by other researchers [26]. We also viewed the reconstructed video sequences in order to analyze the impact of SVT on subjective quality. We conclude that using SVT does not introduce any special

visual artifacts and that the overall subjective quality remains similar, while the coded bits are reduced.

Similarly, we measure the average bitrate reduction (ΔBD-RATE) of FVBSVT compared to H.264/AVC, as well as the average reduction of candidate LPs tested in RDO over all tested QPs when FVBSVT is used. These results are also shown in Table 2. We can see that the proposed fast algorithm reduces the number of candidate LPs tested in RDO by slightly more than 80% while retaining most of the coding efficiency, for both the low and high complexity configurations. This means that, of the 58 candidate LPs, on average only about 10 are tested in RDO (each RD cost calculation includes transform, quantization, entropy coding, inverse transform, inverse quantization, etc.) when FVBSVT is used. We also measured the execution time of FVBSVT to estimate the computational complexity. Compared to H.264/AVC, the encoding and decoding times of FVBSVT increase on average by 2.49% and 2.05%, respectively, over the 720p test set and all tested QPs in the high complexity configuration. The encoding and decoding time increases vary for different test sequences; detailed results are shown in Table 4.

TABLE 2 AVERAGE RESULTS OF ΔBD-RATE COMPARED TO H.264/AVC

Average Results | 8x8 SVT | 16x4/4x16 SVT | VBSVT | FVBSVT | % Reduction of candidate LPs
CAVLC, 720p, low complexity configuration | -2.64% | -3.65% | -4.11% | -3.69% | 83.4
CAVLC, CIF, low complexity configuration | -2.44% | -3.68% | -4.14% | -3.55% | 83.6
CAVLC, 720p, high complexity configuration | -1.42% | -2.33% | -2.43% | -2.34% | 81.1
CAVLC, CIF, high complexity configuration | -1.28% | -1.88% | -2.08% | -1.72% | 81.4
CABAC, 720p, low complexity configuration | -4.29% | -4.93% | -5.85% | -5.31% | 84.3
CABAC, CIF, low complexity configuration | -3.04% | -3.87% | -4.66% | -3.94% | 83.6
CABAC, 720p, high complexity configuration | -3.26% | -3.59% | -4.40% | -4.03% | 82.8
CABAC, CIF, high complexity configuration | -1.52% | -2.03% | -2.45% | -2.01% | 81.9

TABLE 3 AVERAGE PERCENTAGE OF 8x8 SVT AND 16x4/4x16 SVT SELECTED IN VBSVT FOR DIFFERENT SEQUENCES

Average Results | Low complexity configuration (8x8 SVT / 16x4+4x16 SVT) | High complexity configuration (8x8 SVT / 16x4+4x16 SVT)
CAVLC, 720p | 45.8% / 54.2% | 43.9% / 56.1%
CAVLC, CIF | 44.4% / 55.6% | 43.1% / 56.9%
CABAC, 720p | 50.2% / 49.8% | 52.3% / 47.7%
CABAC, CIF | 47.5% / 52.5% | 47.2% / 52.8%

TABLE 4 ENCODING/DECODING TIME INCREASE OF FVBSVT COMPARED TO H.264/AVC (HIGH COMPLEXITY CONFIGURATION, CABAC, 720P)

Sequence | % Encoding time increase of FVBSVT | % Decoding time increase of FVBSVT
BigShips | ... | ...
ShuttleStart | ... | ...
City | ... | ...
Night | ... | ...
Optis | ... | ...
Spincalendar | ... | ...
Cyclists | ... | ...
Preakness | ... | ...
Panslow | ... | ...
Sheriff | ... | ...
Sailormen | ... | ...
Average | 2.49 | 2.05

FIGURE 17 (A) DISTRIBUTION OF THE MACROBLOCKS CODED WITH SVT AND (B) CORRESPONDING SVT BLOCKS, FOR THE 32ND PREDICTION ERROR FRAME OF THE NIGHT SEQUENCE.

FIGURE 18 (A) DISTRIBUTION OF THE MACROBLOCKS CODED WITH SVT AND (B) CORRESPONDING SVT BLOCKS, FOR THE 65TH PREDICTION ERROR FRAME OF THE SHERIFF SEQUENCE.


CHAPTER 6 PREDICTION SIGNAL AIDED SPATIALLY VARYING TRANSFORM

This chapter is based on Publication [P7] and is dedicated to the design and implementation of the prediction signal aided spatially varying transform (PSASVT), which reduces the encoding complexity and improves the coding efficiency of the original SVT algorithm with only a small increase in decoding complexity. As mentioned in Chapter 5, the encoding complexity of SVT is increased because the encoder needs to search for the best LP among a number of candidate LPs. The basic idea for reducing the encoding complexity of SVT is to reduce the number of candidate LPs tested in RDO. In this chapter, we propose to select the candidate LPs based on the prediction signal. The motivations leading to this design are two-fold:
1. Statistics show that the selected SVT block is normally located at positions of the macroblock where the prediction error has a higher magnitude [P5][P6]. These positions are probably boundary positions [30][38]. One example is illustrated in Figure 19 below [P5].
2. The gradient of the prediction signal is assumed to be positively correlated with the magnitude of the prediction error, i.e., a larger gradient indicates a higher prediction error magnitude. In other words, when the gradient of the prediction signal is high, the texture of the original signal is probably more complex and hard to predict well, and the prediction error thus has a relatively high magnitude. A similar assumption is also used in [31].

FIGURE 19 DISTRIBUTION OF (ΔX, ΔY) OF 8x8 SVT FOR THE BIGSHIPS SEQUENCE [P5].

6.1 IMPLEMENTATION OF PREDICTION SIGNAL AIDED SPATIALLY VARYING TRANSFORM

Based on the above motivations, we may utilize the gradient information of the prediction signal to select the candidate LPs. Corresponding to (5.3), a general approach would be to use, for each candidate LP, an estimated RD cost of the whole macroblock derived from the prediction signal, by minimizing

Je = De + λe·Re, (6.1)

where Je is the estimated RD cost of the selected configuration; De is the estimated distortion, which can be estimated from the gradient outside the SVT block (i.e., assuming that there is no, or relatively negligible, quantization effect inside the SVT block and thus no, or relatively negligible, distortion(2) introduced therein); Re is the estimated rate, which can be obtained from the gradient inside the SVT block; and λe is the Lagrangian multiplier used in this case, which can be derived experimentally using the methods in [32][33]. It turns out that for each candidate LP, the RD cost Je depends only on Gsvt, the gradient of the prediction signal inside the SVT block [P7]. In our implementation, we use the sum of the gradient amplitudes of the prediction signal inside the SVT block as the estimated RD cost; this was found experimentally to be efficient in terms of the tradeoff between complexity and performance. Based on the above analysis, the proposed PSASVT algorithm is summarized below (a code sketch follows the algorithm description):

Step 1: Calculate the gradient map of the prediction signal of the current macroblock. In our implementation, we use the Sobel operator, which is widely considered to be a good edge detector, to calculate the gradient as follows:

G(x, y) = |S(x-1, y-1) + 2·S(x, y-1) + S(x+1, y-1) - S(x-1, y+1) - 2·S(x, y+1) - S(x+1, y+1)|
        + |S(x-1, y-1) + 2·S(x-1, y) + S(x-1, y+1) - S(x+1, y-1) - 2·S(x+1, y) - S(x+1, y+1)|, (6.2)

where G(x, y) is the gradient amplitude and S(x, y) is the prediction signal value at position (x, y).

Step 2: If the gradient map is an all-zero map, we assume the prediction quality is very good and only try the 8 corner positions, at which the prediction error is assumed to be larger than at other positions, i.e., the set Φc = Φ8x8 + Φ16x4 + Φ4x16, where Φ8x8 = {(Δx, Δy): Δx=0, Δy=0; Δx=8, Δy=0; Δx=0, Δy=8; Δx=8, Δy=8}, Φ16x4 = {Δy: Δy=0, 12} and Φ4x16 = {Δx: Δx=0, 12}; then go to Step 5.

(2) This rough assumption is based on the following: 1. the SVT block is coded and the remaining part of the macroblock is not coded; 2. the size of the SVT block is only 1/3 of the size of the remaining non-coded part of the macroblock.

Step 3: Otherwise, for each candidate LP, calculate the sum of the gradient (denoted SoG) of the prediction signal inside the SVT block at that position.

Step 4: Select the candidate LPs to be tested in RDO following these two sub-steps:
(a) Calculate a threshold SoGt as follows:

SoGt = (SoGmax + SoGmin + 1) >> 1, (6.3)

where SoGmax and SoGmin are the maximum and minimum values of SoG among all candidate LPs, respectively. In general, the calculation of the threshold SoGt can depend on the statistics of the SoG of all candidate LPs and/or on other characteristics of the current (and neighboring) macroblock. A larger threshold SoGt removes more candidate LPs from RDO testing, but may degrade the coding efficiency because the selected candidate LPs may not be among the best ones in terms of rate-distortion tradeoff, and vice versa. Eq. (6.3) for calculating the threshold SoGt was derived from our experience and shows a good tradeoff between performance and complexity.
(b) A candidate LP is selected to be tested if its SoG is larger than or equal to SoGt.

Besides the above method, we also tried the following variations for selecting the candidate LPs to test; their results are also given for comparison. The basic idea is to utilize different statistics of the SoG of all candidate LPs to derive the threshold SoGt.
(i) We use the average value of the SoG of all candidate LPs as the threshold:

SoGt = (1/N) Σ(i=1..N) SoGi, (6.4)

where N is the number of candidate LPs. We then test the candidate LPs whose SoG is larger than or equal to SoGt.
(ii) We use the median value of the SoG of all candidate LPs as the threshold SoGt and test the candidate LPs whose SoG is larger than or equal to SoGt.
(iii) We first sort the SoG of all candidate LPs into non-increasing order and calculate the differences between neighboring SoG values in the ordered list. We then search for the largest difference from the beginning of the ordered list, say at the i-th position (1 <= i <= N, where N is the number of candidate LPs), and set the threshold SoGt to the SoG at the i-th position of the ordered list. In other words, we use the largest difference of the ordered SoG values as the criterion for deriving the threshold SoGt.

Step 5: End.
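A condensed sketch of Steps 1-4 (the Sobel gradient map of (6.2), SoG accumulation, and the mid-range threshold of (6.3)); function and variable names are illustrative, not taken from the reference software:

```python
import numpy as np

def gradient_map(S):
    """Sobel gradient amplitude of prediction signal S, Eq. (6.2);
    borders are left at zero for simplicity."""
    S = S.astype(np.int64)
    G = np.zeros_like(S)
    G[1:-1, 1:-1] = (
        np.abs(S[:-2, :-2] + 2 * S[:-2, 1:-1] + S[:-2, 2:]
               - S[2:, :-2] - 2 * S[2:, 1:-1] - S[2:, 2:])
        + np.abs(S[:-2, :-2] + 2 * S[1:-1, :-2] + S[2:, :-2]
                 - S[:-2, 2:] - 2 * S[1:-1, 2:] - S[2:, 2:]))
    return G

def select_lps(G, lps, block_w, block_h):
    """Steps 3-4: keep LPs whose sum-of-gradient >= mid-range threshold (6.3)."""
    sog = {lp: int(G[lp[1]:lp[1] + block_h, lp[0]:lp[0] + block_w].sum())
           for lp in lps}
    t = (max(sog.values()) + min(sog.values()) + 1) >> 1   # Eq. (6.3)
    return [lp for lp in lps if sog[lp] >= t]
```

The variations (i)-(iii) would differ only in how the threshold t is computed from the SoG values (mean, median, or largest gap in the sorted list).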

6.2 EXPERIMENTAL RESULTS

We implemented the proposed PSASVT scheme on top of the Tandberg, Ericsson and Nokia test model (TENTM) [34][35] in order to evaluate its effectiveness. TENTM was a joint proposal to the High Efficiency Video Coding (HEVC) standardization effort; it achieves visual quality, measured using the Mean Opinion Score (MOS), similar to that of the H.264/AVC High Profile anchors, and has a better tradeoff between complexity and coding efficiency than the H.264/AVC Baseline Profile [34][35]. For these reasons, it was selected as the platform for our experiments. The important coding parameters are as follows:

- Quad-tree based coding structure with support for macroblocks of size up to 64x64 pixels;
- Frame structure is IPPP;
- QPI = 11, 16, 21, 26 and QPP = QPI + 2;
- Low complexity variable length coding scheme with improved context adaptation [34][35];
- The proposed algorithm is only used for the 16x16 motion compensation partition. For other motion compensation partitions, the previously developed algorithm of selecting available candidate LPs based on the motion difference [P5][P6] is used.

The anchor is TENTM with SVT turned off. We measure the average bitrate reduction (ΔBD-RATE) compared to the anchor according to the Bjontegaard metric [24], recommended by VCEG, which is an average difference between two RD curves. The set of sequences tested in this chapter corresponds to the Joint Collaborative Team on Video Coding (JCT-VC) test set [36], extended by 6 additional sequences. The experimental results are summarized in Table 5 and Table 6; the detailed results are reported in [P7].

TABLE 5 EXPERIMENTAL RESULTS OF PSASVT (BD-RATE (%) AND REDUCTION OF LPS TESTED IN RDO (%), PER SEQUENCE CLASS)

Sequences | SVT BD-rate | PSASVT_MID BD-rate / LP reduction | PSASVT_AVG BD-rate / LP reduction | PSASVT_MED BD-rate / LP reduction | PSASVT_DIFF BD-rate / LP reduction
Avg 2560x1600 | ... | ... | ... | ... | ...
Avg 1080p | ... | ... | ... | ... | ...
Avg 720p | ... | ... | ... | ... | ...
Avg 832x480 | ... | ... | ... | ... | ...

Avg 416x240 | ... | ... | ... | ... | ...
Avg 720p extra | ... | ... | ... | ... | ...
Avg all | ... | ... | ... | ... | ...

TABLE 6 PERCENTAGE OF MACROBLOCKS CODED WITH SVT FOR DIFFERENT SEQUENCES (ALL PARTITIONS / 16x16 PARTITION ONLY)

Sequences | SVT | PSASVT_MID | PSASVT_AVG | PSASVT_MED | PSASVT_DIFF
Avg 2560x1600 | ... | ... | ... | ... | ...
Avg 1080p | ... | ... | ... | ... | ...
Avg 720p | ... | ... | ... | ... | ...
Avg 832x480 | ... | ... | ... | ... | ...
Avg 416x240 | ... | ... | ... | ... | ...
Avg 720p extra | ... | ... | ... | ... | ...
Avg all | ... | ... | ... | ... | ...

We tested four variations of the proposed algorithm, namely: (i) PSASVT_MID, which uses (6.3) to calculate the threshold SoGt; (ii) PSASVT_AVG, which uses (6.4) to calculate the threshold SoGt; (iii) PSASVT_MED, which uses the median-based variation (ii) of Section 6.1 to calculate the threshold SoGt; and (iv) PSASVT_DIFF, which uses the largest-difference variation (iii) of Section 6.1 to calculate the threshold SoGt. From the experimental results, we can see that an encoding complexity reduction and a slight coding efficiency improvement are achieved in all four cases. In general, PSASVT_MID performs better than PSASVT_AVG and PSASVT_MED: it reduces on average 21.70% of the candidate LPs tested in RDO while also achieving an average bitrate reduction of 0.18%. PSASVT_DIFF removes the most candidate LPs from RDO testing (24.46% on average) among all the cases, while achieving a slightly smaller bitrate reduction (0.13% on average). The complexity of PSASVT_DIFF is higher than that of the others, mainly because it needs to sort the SoG of all candidate LPs into non-increasing order and calculate the differences between neighboring SoG values in the ordered list. By contrast, the complexity of PSASVT_MID is (one of) the lowest among the four cases. For all cases, the reduction in encoding complexity and the slight coding efficiency improvement are achieved with only a small complexity increase in the decoder. The decoding complexity increase is mainly due to the calculation of the gradient map of the macroblock and of the sum of the gradient of the prediction signal inside the SVT block for each candidate LP. This needs to be done only when the 16x16 motion compensation partition and SVT are used, which is typically the case for only a small percentage (less than 5% on average)

of all the macroblocks in a sequence. The detailed results on the percentage of macroblocks coded with SVT for different sequences in each case, both for all partitions and for the 16x16 partition only, are given in [P7].

CHAPTER 7 CONCLUSION

This thesis presented novel methods for improving the compression capability of video coders while reducing their implementation complexity. The goal of this thesis was to develop novel algorithms that could be useful in video coding standards to improve the coding efficiency of the video codecs used in future mobile video applications.

In the first part of the thesis we discussed efficient fixed-point approximations of the 8x8 inverse discrete cosine transform [P1][P2][P3]. Fixed-point design methodologies and several resulting implementations of the inverse discrete cosine transform were presented. The overall architecture used in the design of the ISO/IEC 23002-2 IDCT algorithm, which can be characterized by its separable and scaled features, was also described, and a drift analysis for integer IDCT approximations was given.

The second part of the thesis discussed the pre-scaled integer transform (PIT) [P4], which reduces the implementation complexity of the conventional integer cosine transform (ICT) while maintaining all its merits, such as bit-exact implementation and good coding efficiency. Design rules that lead to good PIT kernels were developed, and different types of PIT and their target applications were examined. The PIT kernels used in the Audio Video coding Standard (AVS), the Chinese national coding standard, were also introduced.

In the third part of the thesis we discussed a novel algorithm called the spatially varying transform (SVT) [P5][P6], which improves the coding efficiency of video coders. SVT enables video coders to vary the position of the transform block, unlike state-of-the-art video codecs, where the position of the transform block is fixed. In addition to changing the position of the transform block, the size of the transform can also be varied within the SVT framework, to better localize the prediction error so that the underlying correlations are better exploited. It was shown that by varying the position and the size of the transform block, the characteristics of the prediction error are better localized and the coding efficiency is thus improved. The proposed algorithm achieves a 5.85% average bitrate reduction compared to H.264/AVC over a wide test set. This gain in coding efficiency is achieved with a similar decoding complexity, which makes the proposed algorithm easy to incorporate into video codecs. However, the encoding complexity of SVT can be relatively high because of the need to perform a number of rate-distortion optimization (RDO) steps to select the best location parameter (LP), which indicates the position of the transform. A novel low complexity algorithm operating on the macroblock and block levels was therefore proposed to reduce the encoding complexity of SVT [P6]. Experimental results show that the proposed low complexity algorithm can reduce the number of LPs tested in RDO by about 80% with only a marginal penalty in coding efficiency.

An extension of SVT, called the Prediction Signal Aided Spatially Varying Transform (PSASVT) [P7], was also presented in the thesis. It utilizes the gradient of the prediction signal to eliminate unlikely LPs. As the number of candidate LPs is reduced, a smaller number of LPs is searched by the encoder, which reduces the encoding complexity. In addition, fewer overhead bits are needed to code the selected LP, so the coding efficiency can be improved. Experimental results show that the number of LPs tested in RDO can be reduced on average by up to nearly 30%. This reduction in encoding complexity is achieved together with a slight coding efficiency improvement, as the number of candidate LPs is reduced, and the decoding complexity increase is only small.

At the time of writing this thesis, VCEG and MPEG, the two international standardization organizations, have joined forces and formed the Joint Collaborative Team on Video Coding (JCT-VC). JCT-VC has embarked on a new project to develop the next generation video coding standard, called High Efficiency Video Coding (HEVC). Many of the techniques presented in this thesis are promising and worth further study in the HEVC framework. In particular, the proposed SVT algorithm can potentially further improve the coding efficiency of video codecs with an acceptable increase in implementation complexity. It should also be noted that SVT could potentially be used in many other settings, for instance, in inter-layer prediction in scalable video coding (SVC) and inter-view prediction in multiview video coding (MVC). In addition to traditional 2-D video coding, 3D video coding could also be a new area for studying the proposed algorithm.

REFERENCES

[1] ISO/IEC JTC1/SC29/WG11, "Vision, Applications and Requirements for High-Performance Video Coding," MPEG Document N11096, January 2010.
[2] ISO/IEC 14496-10:2003, "Coding of Audiovisual Objects - Part 10: Advanced Video Coding," 2003; also ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services."
[3] I. Richardson, H.264 and MPEG-4 Video Compression, Chichester, England: John Wiley & Sons, 2003.
[4] N. Cho and S. Lee, "Fast Algorithm and Implementation of 2-D Discrete Cosine Transform," IEEE Transactions on Circuits and Systems, vol. 38, no. 3, Mar. 1991.
[5] A. Hung and T.-Y. Meng, "A Comparison of Fast DCT Algorithms," Multimedia Systems, vol. 2, no. 5, Dec. 1994.
[6] L. Yu, F. Yi, J. Dong, and C. Zhang, "Overview of AVS-Video: Tools, performance and complexity," in Proc. VCIP, Beijing, China, Jul. 2005.
[7] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi, "Video Coding with H.264/AVC: Tools, Performance, and Complexity," IEEE Circuits and Systems Magazine, vol. 4, no. 1, pp. 7-28, April 2004.
[8] A. Robert, I. Amonou, and B. Pesquet-Popescu, "Improving Intra Mode Coding in H.264/AVC through Block Oriented Transforms," in IEEE 8th Workshop on Multimedia Signal Processing, Oct. 2006.
[9] B. Zeng and J. Fu, "Directional discrete cosine transforms - A new framework for image coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 3, March 2008.
[10] ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services," Mar. 2005.

[11] A. Nosratinia, "Denoising JPEG images by re-application of JPEG," in Proceedings of the 1998 IEEE Workshop on Multimedia Signal Processing (MMSP), Dec. 1998.
[12] R. Samadani, A. Sundararajan, and A. Said, "Deringing and deblocking DCT compression artifacts with efficient shifted transforms," in Proceedings of the 2004 IEEE International Conference on Image Processing (ICIP), Oct. 2004.
[13] J. Katto, J. Suzuki, S. Itagaki, S. Sakaida, and K. Iguchi, "Denoising intra-coded moving pictures using motion estimation and pixel shift," in Proceedings of the 2008 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Mar. 2008.
[14] O. G. Guleryuz, "Weighted averaging for denoising with overcomplete dictionaries," IEEE Transactions on Image Processing, vol. 16, no. 12, Dec. 2007.
[15] S. Naito and A. Koike, "Efficient coding scheme for super high definition video based on extending H.264 high profile," SPIE Visual Communications and Image Processing, vol. 6077, Jan. 2006.
[16] M. Wien, "Variable block-size transforms for H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, Jul. 2003.
[17] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, "Low-complexity transform and quantization in H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, Jul. 2003.
[18] S. Ma and C.-C. Kuo, "High-definition video coding with super-macroblocks," SPIE Visual Communications and Image Processing, vol. 6508, Jan. 2007.
[19] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, "Rate-constrained coder control and comparison of video coding standards," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, Jul. 2003.
[20] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz, "Adaptive deblocking filter," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, Jul. 2003.

[21] K. Vermeirsch, J. De Cock, S. Notebaert, P. Lambert, and R. Van de Walle, "Evaluation of transform performance when using shape-adaptive partitioning in video coding," Picture Coding Symposium (PCS), May 2009.
[22] KTA reference model [online].
[23] T. K. Tan, G. Sullivan, and T. Wedi, "Recommended simulation common conditions for coding efficiency experiments," VCEG Doc. VCEG-AE10, Jan. 2007.
[24] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," VCEG Doc. VCEG-M33, Mar. 2001.
[25] S. Pateux and J. Jung, "An excel add-in for computing Bjontegaard metric and its evolution," VCEG Doc. VCEG-AE07, Jan. 2007.
[26] F. Kamisli and J. S. Lim, "Transforms for the motion compensation residual," in Proceedings of the 2009 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr. 2009.
[27] J. Xu, B. Zeng, and F. Wu, "An Overview of Directional Transforms in Image Coding," in IEEE International Symposium on Circuits and Systems (ISCAS), 2010.
[28] H. Xu, J. Xu, and F. Wu, "Lifting-based directional DCT-like transform for image coding," IEEE Transactions on Circuits and Systems for Video Technology, Oct. 2007.
[29] Y. Ye and M. Karczewicz, "Improved H.264 intra coding based on bi-directional intra prediction, directional transform, and adaptive coefficient scanning," in Proceedings IEEE Int. Conf. Image Processing, San Diego, USA, Oct. 2008.
[30] M. T. Orchard and G. J. Sullivan, "Overlapped block motion compensation: An estimation-theoretic approach," IEEE Trans. Image Processing, vol. 3, Sept. 1994.
[31] H. Schiller, "Prediction signal controlled scans for improved motion compensated video coding," Electronics Letters, vol. 29, no. 5, 1993.

[32] G. J. Sullivan and T. Wiegand, "Rate-distortion optimization for video compression," IEEE Signal Processing Magazine, vol. 15, Nov. 1998.
[33] T. Wiegand and B. Girod, "Lagrangian multiplier selection in hybrid video coder control," in Proc. ICIP, Thessaloniki, Greece, Oct. 2001.
[34] K. Ugur, K. R. Andersson, A. Fuldseth, et al., "High Performance, Low Complexity Video Coding and the Emerging HEVC Coding Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 12, Dec. 2010.
[35] K. Ugur, K. R. Andersson, and A. Fuldseth, "Video coding technology proposal by Tandberg, Nokia, and Ericsson," JCTVC-A119, Dresden, Germany, April 2010.
[36] ISO/IEC JTC1/SC29/WG11, "Joint Call for Proposals on Video Compression Technology," MPEG Doc. N11113, Jan. 2010.
[37] Y. Liu, Z. G. Li, and Y. C. Soh, "Region-of-Interest Based Resource Allocation for Conversational Video Communication of H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 1, Jan. 2008.
[38] W. Zheng, Y. Shishikui, M. Naemura, Y. Kanatsugu, and S. Itoh, "Analysis of space-dependent characteristics of motion-compensated frame differences based on a statistical motion distribution model," IEEE Trans. Image Processing, vol. 11, April 2002.
[39] W. K. Cham, "Development of integer cosine transforms by the principle of dyadic symmetry," Proc. IEE, Part I, vol. 136, no. 4, Aug. 1989.
[40] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, July 2003.
[41] G. J. Sullivan, P. N. Topiwala, and A. Luthra, "The H.264/AVC advanced video coding standard: overview and introduction to the fidelity range extensions," in Proc. SPIE Int. Soc. Opt. Eng. 5558, Aug. 2004.
[42] J. Liang and T. D. Tran, "Fast multiplierless approximations of the DCT with the lifting scheme," IEEE Trans. Signal Processing, vol. 49, Dec. 2001.

[43] M. Wien and S. Sun, "ICT comparison for adaptive block transform," Doc. VCEG-L12, Eibsee, Germany, Jan. 2001.
[44] J. L. Mitchell, W. B. Pennebaker, D. LeGall, and C. Fogg, MPEG Video Compression Standard, Chapman & Hall: New York, 1996.
[45] W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard, Van Nostrand Reinhold: New York, 1993.
[46] ANSI/IEEE 1180-1990, "Standard Specifications for the Implementations of 8x8 Inverse Discrete Cosine Transform," December 1990 (withdrawn by ANSI 9 September 2001; withdrawn by IEEE 7 February 2003).
[47] G. J. Sullivan, "Standardization of IDCT approximation behavior for video compression: the history and the new MPEG-C parts 1 and 2 standards," in SPIE Applications of Digital Image Processing XXX, vol. 6696, August 28-31, 2007.
[48] ISO/IEC 23002-1, "Information technology - MPEG video technologies - Part 1: Accuracy requirements for implementation of integer-output 8x8 inverse discrete cosine transform," December 2006.
[49] M. A. Isnardi, "Description of Sample Bitstream for Testing IDCT Linearity," Moving Picture Experts Group (MPEG) document M13375, Montreux, Switzerland, April 2006.
[50] C. Zhang and L. Yu, "On IDCT Linearity Test," Moving Picture Experts Group (MPEG) document M13528, Klagenfurt, Austria, July 2006.
[51] ISO/IEC JTC1/SC29/WG11 (MPEG), "ISO/IEC 23002-1/FPDAM1: Information technology - MPEG video technologies - Part 1: Accuracy requirements for implementation of integer-output 8x8 inverse discrete cosine transform. Amendment 1: Software for Integer IDCT Accuracy Testing," MPEG output document N8981, San Jose, CA, April 2007.

72 56 [53] W. Chen, C.H. Smith and S.C. Fralick, A Fast Computational Algorithm for the Discrete Cosine Transform, IEEE Trans. Comm., vol. com-25, No. 9, pp , September [54] B.G. Lee, A new algorithm to compute the discrete cosine transform, IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp , May [55] M. Vetterli and A. Ligtenberg, A Discrete Fourier-Cosine Transform Chip, IEEE Journal on Selected Areas in Communications, Vol. 4, No. 1, pp , January [56] C. Loeffler, A. Ligtenberg, and G.S. Moschytz, Practical fast 1-D DCT algorithms with 11 multiplications, in Proc. IEEE Int. Conf. Acoust., Speech, and Sig. Proc. (ICASSP 89), vol. 2, pp , February 1989.

Publications


Publication 1

Yuriy A. Reznik, Arianne T. Hinds, Cixun Zhang, Lu Yu, and Zhibo Ni, "Efficient Fixed-Point Approximations of the 8x8 Inverse Discrete Cosine Transform," Proceedings SPIE Applications of Digital Image Processing XXX, Vol. 6696, San Diego, 28 Aug. 2007.


Invited Paper

Efficient Fixed-Point Approximations of the 8x8 Inverse Discrete Cosine Transform

Yuriy A. Reznik, Arianne T. Hinds, Cixun Zhang, Lu Yu, and Zhibo Ni

QUALCOMM Incorporated, 5775 Morehouse Dr., San Diego, CA 92121, USA
IBM Incorporated, 6300 Diagonal Hwy, Boulder, CO 80301, USA
Zhejiang University, Hangzhou, Zhejiang, China

ABSTRACT
This paper describes fixed-point design methodologies and several resulting implementations of the Inverse Discrete Cosine Transform (IDCT) contributed by the authors to MPEG's work on defining the new 8x8 fixed-point IDCT standard ISO/IEC 23002-2. The algorithm currently specified in the Final Committee Draft (FCD) of this standard is also described herein.

Keywords: DCT, IDCT, factorizations, diophantine approximations, multiplierless algorithms, MPEG, JPEG, ITU-T, H.261, H.263, MPEG-1, MPEG-2, MPEG-4

1. INTRODUCTION
The Discrete Cosine Transform (DCT) [1] is a fundamental operation used by the vast majority of today's image and video compression standards, such as JPEG, MPEG-1, MPEG-2, MPEG-4 (Part 2), H.261, H.263, and others [2-8]. Encoders in these standards apply such transforms to each 8x8 block of pixels to produce DCT coefficients, which are then subject to quantization and encoding. The Inverse Discrete Cosine Transform (IDCT) is used in both the encoder and the decoder to convert DCT coefficients back to the spatial domain.

At the time when the first image and video compression standards were defined, the implementation of DCT and IDCT algorithms was considered a major technical challenge, and therefore, instead of defining a specific algorithm, the ITU-T H.261, JPEG, and MPEG standards included precision specifications that must be met by IDCT implementations conforming to these standards [9]. This decision allowed manufacturers to use the best optimized designs for their respective platforms. However, the drawback of this approach is the impossibility of guaranteeing exact decoding of MPEG-encoded videos across different decoder implementations.

In early 2005, prompted by the expiration and withdrawal of a related IDCT precision specification (IEEE Standard 1180-1990), MPEG decided to improve its handling of this matter, and to produce a new precision specification, ISO/IEC 23002-1, replacing the IEEE 1180 standard and harmonizing the treatment of precision requirements across different MPEG standards [10], as well as a new voluntary standard, ISO/IEC 23002-2, providing specific (deterministically defined) examples of fixed-point 8x8 IDCT and DCT implementations.

The Call for Proposals [11] for the ISO/IEC 23002-2 standard was issued at the August 2005 meeting in Poznań, Poland, and during several subsequent meetings MPEG received a number of contributions with specific algorithm proposals, as well as general optimization techniques, analyses of different factorizations and of the drift problem, implementation studies, and other informative documents. Many of the successive submissions have benefited from earlier ideas contributed to MPEG by other proponents, converging on key design aspects [10].

Further author information (please send correspondence to Yuriy A. Reznik): Yuriy A. Reznik: yreznik@ieee.org, phone: +1 (858); Arianne T. Hinds: arianne@us.ibm.com; Cixun Zhang: cixunzhang@hotmail.com; Lu Yu: yul@zju.edu.cn; Zhibo Ni: hzjimmy@hotmail.com. The work of Lu Yu, Cixun Zhang, and Zhibo Ni was supported by NSFC and by the Project of Science and Technology Plan of Zhejiang Province under contract No.
2005C.

The Committee Draft (CD) of this standard, containing a single algorithm, was issued at the October 2006 meeting in Hangzhou, China. The Final Committee Draft (FCD) of this standard was reached at the April 2007 meeting in San Jose, CA [13], and the Final Draft International Standard (FDIS) is now expected to be issued in October 2007.

This paper describes fixed-point design methodologies and several resulting IDCT implementations contributed by the authors to this MPEG project. The algorithm currently specified in the Final Committee Draft (FCD) of this standard is also described. Our paper is organized as follows. In Section 2, we provide background information, including definitions of the DCT and IDCT, examples of their factorizations, and a review of some basic techniques used for their fixed-point implementations. In Section 2, we also explain several ideas that we have proposed for improving the performance of fixed-point designs: the introduction of floating factors between sub-transforms, the use of fast algorithms for computing products by groups of factors, and techniques for minimizing rounding errors in algorithms using right-shift operations. In Section 3, we show how these techniques were applied to design our proposed IDCT approximations. Finally, in Section 4, we provide a detailed description of the algorithm in the FCD of the ISO/IEC 23002-2 standard. Appendices A and B contain supplemental information and proofs of our claims.

2. BACKGROUND INFORMATION & MAIN IDEAS USED IN THIS WORK

2.1 Definitions
The order-8, one-dimensional (1D) type-II [1] Discrete Cosine Transform (DCT) and its corresponding Inverse DCT (IDCT) are defined as follows:

Fu = (cu/2) Σ(x=0..7) fx cos((2x+1)uπ/16), u = 0,...,7, (1)
fx = Σ(u=0..7) (cu/2) Fu cos((2x+1)uπ/16), x = 0,...,7, (2)

where cu = 1/√2 when u = 0, and cu = 1 otherwise. The definitions of the two-dimensional (2D) versions of these transforms are:

Fvu = (cu cv/4) Σ(x=0..7) Σ(y=0..7) fyx cos((2x+1)uπ/16) cos((2y+1)vπ/16), v,u = 0,...,7, (3)
fyx = Σ(u=0..7) Σ(v=0..7) (cu cv/4) Fvu cos((2x+1)uπ/16) cos((2y+1)vπ/16), y,x = 0,...,7, (4)

where fyx (y,x = 0,...,7) denote input spatial-domain values (in image and video coding, values of pixels or prediction residuals), and Fvu (u,v = 0,...,7) denote transform-domain values, or transform coefficients. When used in image or video coding applications, the pixel values are normally assumed to be in the range [-256, 255], while the transform coefficients are in the range [-2048, 2047].

Mathematically, these are linear, orthogonal, and separable transforms. That is, a 2D transform can be decomposed into a cascade of 1D transforms applied successively to all rows and then to all columns of the matrix. This property of separability is often exploited by implementors to reduce the complexity of an entire 2D transform to a much simpler set of 1D operations.

2.2 Precision Requirements for IDCT Implementations in MPEG and ITU-T Standards

As described previously, the specifications of MPEG, JPEG, and several ITU-T video coding standards do not require IDCTs to be implemented exactly as specified in (4). Rather, they require practical IDCT implementations to produce integer output values f'yx that fall within certain specified tolerances of the outputs of an ideal IDCT rounded to the nearest integer:

f̂yx = floor(fyx + 1/2). (5)

The precise specification of these error tolerances, and of how they are to be measured for a given IDCT implementation under test, is given by the MPEG IDCT precision standard ISO/IEC 23002-1 [14]. This standard includes several tests using a pseudo-random input generator originating from the former IEEE Standard 1180-1990 [9], as well as additional tests required by MPEG standards. A summary of the error metrics defined by the IEEE 1180 / ISO/IEC 23002-1 specification is provided in Table 1. Here, the variable i = 1,...,Q indicates the index of a pseudo-random input 8x8 matrix used in a test, and Q = 10000 (or, in some tests, Q = 1000000) denotes the total number of sample matrices. The tolerance for each metric is provided in the last column of this table.

Table 1. The IEEE 1180 / ISO/IEC 23002-1 pseudo-random IDCT test metrics and conditions

Error metric | Description | Test condition
p = max(y,x,i) |f̂yx(i) - f'yx(i)| | Peak pixel error | p <= 1
dyx = (1/Q) Σi (f̂yx(i) - f'yx(i)) | Pixel mean error | max(y,x) |dyx| <= 0.015
m = (1/64) Σ(y,x) dyx | Overall mean error | |m| <= 0.0015
eyx = (1/Q) Σi (f̂yx(i) - f'yx(i))² | Mean square error | max(y,x) eyx <= 0.06
n = (1/64) Σ(y,x) eyx | Overall mean square error | n <= 0.02

It should be noted that the IEEE 1180 / ISO/IEC 23002-1 error metrics satisfy the following chain of inequalities

max(y,x) eyx >= max(y,x) |dyx| >= |m|, max(y,x) eyx >= n, (6)

which implies that max(y,x) eyx, the peak mean square error (pmse) metric, is the strongest one in this set.

Among the additional (informative) tests provided in the ISO/IEC 23002-1/FPDAM1 specification [15], the so-called linearity test is of notable interest. This test requires the reconstructed pixel values produced by the IDCT implementation under test, f'yx, to be symmetric with respect to sign reversal of its input coefficients:

f'yx(-Fvu) = -f'yx(Fvu), Fvu = { z, if v = t, u = s; 0, otherwise }, z = 1, 3, ..., 527, v,u,t,s = 0,...,7. (7)

This test was motivated by the observation that, in the decoding of static regions within consecutive video frames, the decoder reconstructs small (zero-mean, symmetrically distributed) differences, which normally negate each other over time (across the sequence of frames). If an IDCT implementation does not satisfy property (7), then the mismatch between IDCT outputs may instead accumulate, eventually producing a remarkably visible degradation in the quality of the reconstructed video [16,17].

* Considering the general definition of linearity, f(αx + βy) = αf(x) + βf(y) for some operator f(.), this test considers only the case α = 0, β = -1. Therefore, it is rather a sign-symmetry test.
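A sketch of how the Table 1 metrics could be computed for an implementation under test (a hypothetical harness; the normative test procedure is the one defined in ISO/IEC 23002-1):

```python
import numpy as np

def ieee1180_metrics(ref, test):
    """Compute the Table 1 error metrics.
    ref, test: arrays of shape (Q, 8, 8) holding the rounded ideal IDCT
    outputs and the outputs of the implementation under test."""
    err = test.astype(np.int64) - ref.astype(np.int64)
    p = np.abs(err).max()              # peak pixel error
    d = err.mean(axis=0)               # per-position mean error d_yx
    m = d.mean()                       # overall mean error
    e = (err ** 2).mean(axis=0)        # per-position mean square error e_yx
    n = e.mean()                       # overall mean square error
    return dict(p=p, pme=np.abs(d).max(), ome=abs(m), pmse=e.max(), omse=n)

def passes(metrics):
    return (metrics['p'] <= 1 and metrics['pme'] <= 0.015 and
            metrics['ome'] <= 0.0015 and metrics['pmse'] <= 0.06 and
            metrics['omse'] <= 0.02)
```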

2.3 DCT/IDCT Factorizations

Much of the original research for designing fast implementations of DCT transforms was focused on finding DCT factorizations resulting in the minimum number of multiplications by irrational factors. Many factorizations have been derived by utilizing other known fast algorithms, such as the classic Cooley-Tukey FFT algorithm, or by applying systematic approaches, such as a decimation in time or a decimation in frequency [1]. The formal setting of this problem, and an upper bound for the multiplicative complexity of transforms of orders 2^n, can be found in E. Feig and S. Winograd [20].

In one special case of the order-8 two-dimensional DCT/IDCT, the least complex direct 2D factorization is described by E. Feig and S. Winograd [21]. Their implementation requires 96 multiplication and 454 addition operations for the computation of the complete set of 2D outputs. The same paper further describes an efficient scaled 8x8 DCT implementation that requires only 54 multiplication, 462 addition, and 6 shift operations [21]. The latter of these transforms is referred to as a scaled transform because all of its outputs must be scaled (i.e., multiplied by fixed, possibly irrational, constants) so that each output will match the corresponding output of a non-scaled DCT. In some applications, such as JPEG and several video coding algorithms, this process of scaling can be implemented jointly with the process of quantization (by factoring together the scale constants with the corresponding quantization values), thereby resulting in significant computational savings.

Some of the most efficient and well-known 1D DCT factorizations include the scaled factorization of Y. Arai, T. Agui, and M. Nakajima (AAN) [25] (only 5 multiplication and 29 addition operations), and the non-scaled factorizations of W. Chen, C. H. Smith, and S. C. Fralick [23], B. G. Lee [24], M. Vetterli and A. Ligtenberg (VL) [26], and C. Loeffler, A. Ligtenberg, and G. Moschytz (LLM) [27]. The VL and LLM algorithms are the least complex among known non-scaled designs, and require only 11 multiplication and 26 addition operations. We note that the suitability of each of these factorizations to the design of fixed-point IDCT algorithms has been extensively analyzed in the course of work for the ISO/IEC 23002-2 standard.

2.4 Fixed-Point Approximations

As described previously, implementations of the DCT/IDCT require multiplication operations with irrational constants (i.e., the cosines). Clever factorizations can only reduce the number of such essential multiplications, but not eliminate them altogether. Hence, in the design of implementations of the DCT/IDCT, one is usually tasked with finding ways of approximately computing products by these irrational factors using fixed-point arithmetic.

One of the most common and practical techniques for converting floating-point to fixed-point values is based on the approximation of irrational factors α_i by dyadic fractions:

α_i ≈ a_i / 2^k,   (8)

where both a_i and k are integers. In this way, multiplication of x by the factor α_i permits a very simple approximate implementation in integer arithmetic:

x α_i ≈ (x * a_i) >> k,   (9)

where >> denotes the bit-wise right shift operation. In some transform designs, the right shift operations in approximations (9) can be delayed to later stages of the implementation, or done at the very end of the transform, but the more complex operations, such as multiplications for each non-trivial constant α_i, still need to be performed in the algorithm.

The key parameter that affects the precision and complexity of these dyadic rational approximations (8) is the number of precision bits k. In software designs, this parameter is often constrained by the width of registers (e.g., 16 or 32), and the consequence of not satisfying such a design constraint can easily result in the doubling of execution time for the transform. In hardware designs, the parameter k affects the number of gates needed to implement adders and multipliers.
Hence, one of the basic goals in fixed-point designs is to minimize the total number of bits k, while maintaining sufficient accuracy of the approximations.

2.5 Improving Precision of Dyadic Rational Approximations

Without placing any specific constraints on the values α_i, and assuming that for any given k the corresponding values of the numerators a_i are chosen such that

|α_i - a_i/2^k| = 2^{-k} |2^k α_i - a_i| = 2^{-k} \min_{z∈Z} |2^k α_i - z|,

we can conclude that the absolute error of the approximations in (8) should be inversely proportional to 2^k:

|α_i - a_i/2^k| <= 2^{-k-1}.

That is, each extra bit of precision (i.e., incrementing k) should reduce the error by half. Nevertheless, it turns out that this rate can be significantly improved if the values α_1,...,α_n that we are trying to approximate can be simultaneously scaled by some additional parameter ξ. We claim the following (the proof is provided in Appendix A):

Lemma 2.1. Let α_1,...,α_n be a set of n irrational numbers (n >= 2). Then, there exist infinitely many (n+2)-tuples a_1,...,a_n, k, ξ, with a_1,...,a_n ∈ Z, k ∈ N, and ξ ∈ Q, such that

max { |ξα_1 - a_1/2^k|, ..., |ξα_n - a_n/2^k| } < \frac{n}{n+1} ξ^{-1/n} 2^{-k(1+1/n)}.   (10)

In other words, if the algorithm can be altered such that all of its irrational factors α_1,...,α_n can be pre-scaled by some parameter ξ, then we might be able to find approximations whose absolute error decreases as fast as 2^{-k(1+1/n)}. For example, when n = 2, this means 50% higher effectiveness in the usage of bits. For large sets of factors α_1,...,α_n, however, this gain will be smaller.

These observations suggest that we can significantly improve the precision of a fixed-point IDCT design by splitting it into a set of smaller blocks (or sub-transforms) with alterable common factors, and then adjusting these factors such that they yield the high-accuracy solutions predicted by Lemma 2.1.

2.6 Reducing Complexity of Multiplications

The dyadic approximations shown in (8, 9) already reduce the problem of computing products by irrational constants to multiplications by integers. However, integer multiplications can still be computationally expensive on many existing platforms, and in such cases it becomes desirable to find ways to compute these products without using general-purpose multipliers.

To illustrate this idea, consider a multiplication by the irrational factor 1/\sqrt{2}, using its 5-bit dyadic approximation 23/32. By looking at the binary bit pattern of 23 = 10111 and substituting each 1 with an addition operation, we can compute a product by 23 as follows:

x * 23 = (x << 4) + (x << 2) + (x << 1) + x.

This approximation requires 3 addition and 3 shift operations. By further noting that the last 3 digits form a run of 1's, we can instead use:

x * 23 = (x << 4) + (x << 3) - x,   (11)

which reduces the complexity to just 2 shift and 2 addition operations.

In the engineering literature, the sequences of operations "+" associated with isolated digits 1, or "+" and "-" associated with the beginnings and ends of runs, are commonly referred to as a Canonical Signed Digit (CSD) decomposition [34]. This is a well-known and frequently used tool in the design of multiplierless circuits [38]. However, CSD decompositions do not always produce results with the lowest numbers of operations. For example, considering the 8-bit approximation of the same factor, 1/\sqrt{2} ≈ 181/256 = 0.10110101, we find that its CSD decomposition,

x * 181 = (x << 7) + (x << 5) + (x << 4) + (x << 2) + x,

needs 4 addition and 4 shift operations. But, by rearranging the computations and reusing intermediate results, a more efficient algorithm can be constructed:

x2 = x + (x << 2);        // 101      = 5x
x3 = x2 + (x << 4);       // 10101    = 21x
x4 = x3 + (x2 << 5);      // 10110101 = x * 181

This algorithm requires only 3 addition and 3 shift operations.
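The identities above are easy to self-check. The following small C harness (our own, not from the paper's software) verifies the shift-add sequences for 23 and 181 against plain multiplications, and reports the error of the dyadic approximation (9) for 1/\sqrt{2} ≈ 23/32:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double maxerr = 0.0;
    for (int x = -256; x <= 255; x++) {
        int m23 = (x << 4) + (x << 3) - x;   /* (11): 2 shifts + 2 additions */
        int x2 = x + (x << 2);               /* 101      = 5x   */
        int x3 = x2 + (x << 4);              /* 10101    = 21x  */
        int m181 = x3 + (x2 << 5);           /* 10110101 = 181x */
        if (m23 != 23 * x || m181 != 181 * x) return 1;

        /* Dyadic approximation (9): x / sqrt(2) ~ (x * 23) >> 5. */
        double err = fabs((double)(m23 >> 5) - x / sqrt(2.0));
        if (err > maxerr) maxerr = err;
    }
    printf("identities hold; max |(23x>>5) - x/sqrt(2)| = %.3f\n", maxerr);
    return 0;
}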

[Figure 1 plots the errors of three multiplierless implementations of a product by 23/32: (x>>1) + (x>>2) - (x>>5); x - (x>>2) - (x>>5); and x + ((-x)>>6) - ((x + (x>>4)) >> 2), as functions of x.]

Figure 1. Errors in multiplierless implementations of a product by 23/32.

An even more dramatic reduction in complexity can be achieved by performing a joint factorization of simultaneous products by multiple integer constants. For example, consider the task of computing products by two constants: 99 = 1100011 and 239 = 11101111. The use of CSD decompositions,

x * 99 = (x << 6) + (x << 5) + (x << 1) + x;
x * 239 = (x << 8) - (x << 4) - x;

results in a total complexity of 5 addition and 5 shift operations. At the same time, by using a joint factorization of these two products, the same task can be simplified by the following implementation:

x2 = x + (x << 5);        // 100001   = 33x
x3 = x2 << 2;             //            132x
x4 = x3 - x2;             // 1100011  = x * 99
x5 = x3 + x4 + (x << 3);  // 11101111 = x * 239

which needs only 4 addition and 3 shift operations. In the context of the IDCT design, such algorithms can be used for the simultaneous computation of products by pairs of factors in transform butterflies. Moreover, since in each butterfly there are typically two variables that need to be multiplied by the same factors, such computations can easily be done in parallel.

In passing, we should note that finding optimal (i.e., with the fewest numbers of additions and/or shifts) algorithms for computing multiplications by integer constants has been an area of active and fruitful research during the last few decades. It has been established that this problem is NP-complete [36], and numerous fast heuristic algorithms have been proposed for solving it approximately [34, 37, 39, 40].

2.7 Minimizing Errors in Multiplierless Algorithms using Right Shifts

Another family of techniques for the computation of products by dyadic fractions (8) can be derived by allowing the use of right shifts as elementary operations. For example, considering the factor 1/\sqrt{2} ≈ 23/32 = 0.10111, and using right shift and addition operations according to its CSD decomposition, we obtain†:

x * 23/32 ≈ (x >> 1) + (x >> 2) - (x >> 5),   (12)

† Hereafter, by writing a(x) ≈ b(x) for some functions a(.) and b(.), we imply that there exists a constant δ >= 0 such that for all x: |a(x) - b(x)| <= δ.

or (by further noting that 1/2 + 1/4 = 1 - 1/4):

x * 23/32 ≈ x - (x >> 2) - (x >> 5).   (13)

Yet another (although somewhat less obvious) way of computing a product by the same factor is:

x * 23/32 ≈ x - ((x + (x >> 4)) >> 2) + ((-x) >> 6).   (14)

We present plots of the values produced by these algorithms in Figure 1. It can be noted that they all compute values that approximate products by the fraction 23/32; however, the errors in each of these approximations are different. For example, the algorithm (13) produces all-positive errors, with a maximum magnitude of 55/32. The algorithm (12) has more balanced errors, with the magnitude of oscillations within ±65/64. Finally, the algorithm (14) produces perfectly sign-symmetric errors with oscillations in ±7/8.

The sign-symmetry property of an algorithm A_{a_i,b}(x) ≈ x a_i/2^b means that for any x ∈ Z:

A_{a_i,b}(-x) = -A_{a_i,b}(x),   (15)

and it also implies that for any N:

\sum_{x=-N}^{N} [ A_{a_i,b}(x) - x a_i/2^b ] = 0,   (16)

that is, a zero-mean error on any symmetric interval. This property is very important in the design of signal processing algorithms, as it minimizes the probability that rounding errors introduced by fixed-point approximations will accumulate. Below we establish the existence of right-shift-based sign-symmetric algorithms for computing products by dyadic fractions, and provide upper bounds for their complexity.

Given a set of dyadic fractions a_1/2^b, ..., a_m/2^b, we define an algorithm

A_{a_1,...,a_m,b}(x) ≈ ( x a_1/2^b, ..., x a_m/2^b )   (17)

as the following sequence of steps:

x_1, x_2, ..., x_t,   (18)

where x_1 := x, and where subsequent values x_k (k = 2,...,t) are produced by using one of the following elementary operations:

x_k := x_i >> s_k,   1 <= i < k, s_k >= 1;   or
x_k := x_i,          1 <= i < k;             or     (19)
x_k := x_i + x_j,    1 <= i, j < k;          or
x_k := x_i - x_j,    1 <= i, j < k, i ≠ j.

The algorithm terminates when there exist indices j_1,...,j_m <= t such that:

x_{j_1} ≈ x a_1/2^b, ..., x_{j_m} ≈ x a_m/2^b.   (20)

We state the following (the proofs are provided in Appendix B):

Theorem 2.2. For any m, b ∈ N and a_1,...,a_m ∈ Z, there exist algorithms A_{a_1,...,a_m,b} (17-20) which are sign-symmetric; that is, for any x ∈ Z: A_{a_1,...,a_m,b}(-x) = -A_{a_1,...,a_m,b}(x).

Theorem 2.3. The lowest possible number of shifts in algorithms (17-20) satisfying the sign-symmetry property is at most twice the lowest possible number of shifts in algorithms without this property.

Theorem 2.4. The lowest possible total number of instructions in algorithms (17-20) satisfying the sign-symmetry property is at most four times the lowest possible total number of instructions in algorithms without this property.
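To make the error behavior of (12)-(14) and properties (15)-(16) concrete, the following C sketch (our own illustration) evaluates the three algorithms and checks the sign-symmetry and zero-mean-error properties of (14):

#include <stdio.h>

static int alg12(int x) { return (x >> 1) + (x >> 2) - (x >> 5); }
static int alg13(int x) { return x - (x >> 2) - (x >> 5); }
static int alg14(int x) { return x - ((x + (x >> 4)) >> 2) + ((-x) >> 6); }

int main(void)
{
    const int N = 2048;
    double sum14 = 0;
    int sym_ok = 1;
    for (int x = -N; x <= N; x++) {
        sum14 += alg14(x) - 23.0 * x / 32.0;      /* accumulates to 0 per (16) */
        if (alg14(-x) != -alg14(x)) sym_ok = 0;   /* property (15) */
    }
    printf("alg14: sign-symmetric=%s, mean error over [-N,N] = %g\n",
           sym_ok ? "yes" : "no", sum14 / (2 * N + 1));
    printf("at x=100: alg12=%d alg13=%d alg14=%d exact=%.2f\n",
           alg12(100), alg13(100), alg14(100), 23.0 * 100 / 32);
    return 0;
}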

We should point out that these are very simple and rather coarse complexity bounds, and that in many cases the complexity overhead for achieving sign-symmetry is not that high. Moreover, when complexity considerations are paramount, one can pick algorithms that are sign-symmetric for most, but not all, values of x in the expected range of this variable. In many cases, such "almost symmetric" algorithms can also be the least complex for a given set of factors.

In the design of our IDCTs we have used an exhaustive enumeration process for searching for the best algorithms (17-20) with symmetric (or at least well-balanced) rounding errors. As additional criteria for the selection of such algorithms, we have used estimates of the mean, variance, and magnitude (maximum values) of the errors that they produce. In assessing their complexity, we have counted the numbers of operations, as well as the longest execution path and the maximum number of intermediate registers needed for computations.

[Figure 2 shows the fixed-point 8x8 IDCT architecture: pre-scaling of transform coefficients, F'_vu = (F_vu * S_vu) >> (S - P); addition of a DC bias, F'_00 += 1 << (P + 2); 1D row transforms; 1D column transforms; and final right shifts, f_yx >>= P + 3. The annotated bit widths grow from 12 bits at the input to 12+P bits after scaling and 14+P bits inside the 1D stages, returning to 11 bits at the output.]

Figure 2. Fixed-point 8x8 IDCT architecture.

[Figure 3 shows the flowgraph of the Loeffler-Ligtenberg-Moschytz IDCT factorization (left), with butterflies driven by the constants \sqrt{2}\cos(3\pi/8), \sqrt{2}\sin(3\pi/8), \cos(\pi/16), \sin(\pi/16), \cos(3\pi/16), \sin(3\pi/16), and its generalized scaled form (right).]

Figure 3. Loeffler-Ligtenberg-Moschytz IDCT factorization (left), and its generalized scaled form (right). Parameters (ξ, ζ) represent additional floating factors that we use for finding efficient fixed-point approximations.

3. DESIGN OF FIXED-POINT APPROXIMATIONS OF THE 8X8 IDCT

The overall architecture used in the design of the proposed fixed-point IDCT algorithms is shown in Figure 2; it can be characterized by its separable and scaled features. The scaling stage is performed with a single 8x8 matrix that is precomputed by factoring the 1D scale factors for the row transform with the 1D scale factors for the column transform. The scaling stage is also used to pre-allocate P bits of precision to each of the input DCT coefficients, thereby providing a fixed-point "mantissa" for use throughout the rest of the transform. Other key features of this architecture include simplicity, compactness / cache-efficiency, and the flexibility of its interface, which allows the potential merging of scaling and quantization logic in video and image codec implementations.

As the underlying basis for the scaled 1D transform design, we use a variant of the well-known factorization of C. Loeffler, A. Ligtenberg, and G. S. Moschytz [27] with 3 planar rotations and 2 independent factors γ = \sqrt{2} (see Figure 3). This choice has been made empirically, based on an extensive analysis of fixed-point designs derived from other known algorithms, including variants of the AAN [25], VL [26], and LLM [27] factorizations [31]. In order to allow efficient rational approximations of the constants α, β, δ, ɛ, η, and θ within the LLM factorization, we introduce two floating factors ξ and ζ, and apply them to two sub-groups of these constants as follows (see also Figure 3, right flowgraph):

ξ: α' = ξα, β' = ξβ;
ζ: δ' = ζδ, ɛ' = ζɛ, η' = ζη, θ' = ζθ.   (21)

We invert these multiplications by ξ and ζ in the scaling stage by multiplying each input DCT coefficient by the respective reciprocal of ξ and ζ. That is, we pre-compute a vector of scale factors for use in the scaling stage prior to the first in the cascade of 1D transforms:

σ = (1, 1/ζ, 1/ξ, γ/ζ, 1, γ/ζ, 1/ξ, 1/ζ)^T.   (22)

These factors are subsequently merged into a scaling matrix, which is precomputed as follows:

Σ = σ σ^T 2^S =
[ A B C D A D C B ]
[ B E F G B G F E ]
[ C F H I C I H F ]
[ D G I J D J I G ]
[ A B C D A D C B ]
[ D G I J D J I G ]
[ C F H I C I H F ]
[ B E F G B G F E ]   (23)

where A-J denote the unique values in this product:

A = 2^S, B = 2^S/ζ, C = 2^S/ξ, D = γ 2^S/ζ, E = 2^S/ζ^2, F = 2^S/(ξζ), G = γ 2^S/ζ^2, H = 2^S/ξ^2, I = γ 2^S/(ξζ), J = γ^2 2^S/ζ^2,

and S denotes the number of fixed-point precision bits allocated for scaling. This parameter S is chosen such that it is greater than or equal to the number of bits P for the mantissa of each input coefficient. This allows the scaling of the coefficients F_vu to be implemented as follows:

F'_vu = (F_vu * S_vu) >> (S - P),   (24)

where S_vu ≈ Σ_vu denote integer approximations of the values in the matrix of scale factors (23). At the end of the last transform in the series of 1D transforms, the P fixed-point mantissa bits (plus 3 extra bits accumulated during the execution of each of the 1D stages†) are simply shifted out of the transform outputs by right shift operations:

f_yx = f'_yx >> (P + 3).   (25)

To ensure proper rounding of the computed values in (25), we add a bias of 2^{P+2} to the values f'_yx prior to the shifts. This rounding bias is implemented by perturbing the DC coefficient prior to executing the first 1D transform:

F''_00 = F'_00 + 2^{P+2}.

Using this architecture, the task of finding fixed-point IDCT implementations is now reduced to finding sets of integer approximations of the factors A, B, C, D, E, F, G, H, I, J (the coefficients in the matrix of scale factors (23)) and α', β', δ', ɛ', η', θ' (the factors inside the 1D transforms), and algorithms for computing products by them. Global parameters that can also be adjusted are:

P - the number of fixed-point mantissa bits;
S - the number of bits used to implement the scaling stage, such that S >= P;
k - the number of bits used for the approximations of factors within the 1D transforms.

† The LLM factorization naturally causes all quantities on the output to be multiplied by a factor of 2\sqrt{2}. This results in 1.5 bits of mantissa expansion during the row and column passes, and 3 bits accumulated at the end of the 2D transform.
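Putting the pieces of Figure 2 together, the following C skeleton shows how the scaling stage (24), the DC rounding bias, and the final shifts (25) surround the two 1D passes. The butterfly bodies are omitted, and all identifiers are our own hypothetical names, not those of the reference code.

enum { P = 10, S = 10 };                /* mantissa and scaling bits */

extern const int Svu[64];               /* integer scale factors per (23) */
extern void row_transform_1d(int *blk); /* scaled 1D pass over the 8 rows    */
extern void col_transform_1d(int *blk); /* scaled 1D pass over the 8 columns */

void scaled_idct_8x8(const int F[64], int f[64])
{
    int t[64];
    for (int j = 0; j < 64; j++)
        t[j] = (F[j] * Svu[j]) >> (S - P);   /* (24); a no-op shift when S == P */
    t[0] += 1 << (P + 2);                    /* DC bias for the rounding in (25) */
    row_transform_1d(t);
    col_transform_1d(t);
    for (int j = 0; j < 64; j++)
        f[j] = t[j] >> (P + 3);              /* (25) */
}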

Table 2. Fixed-Point IDCT Approximations Derived by Using our Design Framework.

Algorithm            | L16         | L1          | L0          | L2          | Z0a         | Z1          | Z4
Scale factors A-J    | ...         | ...         | ...         | ...         | ...         | ...         | ...
α                    | 519/...     | ...         | ...         | .../128     | 41/128      | ...         | .../16384
β                    | 413/...     | .../32      | 45/256      | 45/256      | 99/128      | 67/...      | .../32768
δ                    | 55/...      | ...         | ...         | ...         | 113/128     | ...         | .../16384
ɛ                    | 147/...     | .../256     | 1/2         | 1/2         | 719/4096    | ...         | .../8192
η                    | 99/256      | 13/32       | 111/256     | 29/...      | 1533/2048   | ...         | .../32768
θ                    | 239/...     | .../256     | 67/64       | 35/32       | 1/2         | 21/...      | .../8192
S                    | ...         | ...         | ...         | ...         | 10          | ...         | ...
P                    | 3           | ...         | ...         | ...         | 10          | ...         | ...
k                    | ...         | 8           | ...         | ...         | ...         | ...         | ...
1D complexity        | 50a, 20s    | 44a, 18s    | 42a, 20s    | 48a, 12s    | 44a, 20s    | 48a, 26s    | 56a, 32s
2D complexity        | 801a, 384s  | 705a, 352s  | 673a, 384s  | 769a, 256s  | 705a, 384s  | 769a, 480s  | 897a, 576s
2D+S complexity      | 1017a, 624s | 901a, 552s  | 829a, 576s  | 925a, 448s  | 865a, 576s  | 925a, 680s  | 1125a, 792s
Precision (p, pmse, omse, pme, ome) | ... | ...  | ...         | ...         | ...         | ...         | ...
Linearity test       | fail        | fail        | fail        | fail        | pass        | pass        | pass
Ext. dynamic range   | fail        | fail        | fail        | fail        | pass        | pass        | pass

Notably, this list of parameters does not include the values for our floating factors ξ and ζ. The reason for their exclusion is that these factors are needed only for establishing the relationship between the values of the factors inside the transform (21) and the values of the scale factors (22). The actual values for ξ and ζ are absorbed by the rational fractions assigned to each factor [30, 31].

This design framework has been used for the design of several IDCT approximations submitted to MPEG. The search for the above parameters and algorithms has been organized such that for each candidate transform approximation we were able to measure: (a) the IDCT precision in terms of the accuracy metrics, and (b) the number of operations needed for its implementation. This approach allowed us to identify transforms with the best achievable complexity and precision tradeoffs.

3.1 Examples of IDCT Designs

We summarize the values of the parameters and performance characteristics of several algorithms designed using this framework in Table 2. These algorithms have the following particular features:

L16 - an algorithm passing all normative ISO/IEC 23002-1 precision tests using the lowest achievable number of mantissa bits: P = 3. This implies that this algorithm is implementable on 16-bit platforms.

L1 - an ISO/IEC 23002-1 compliant IDCT approximation with the lowest achievable number of bits in the approximations of transform factors: k = 8.

[Figure 4 shows two drift plots for the sequence News: PSNR(Y) versus frame number for an MPEG-2 codec (quant_scale=1, W[i,j]=16, no I-MBs) and for an H.263 codec (QP=1, Annex T, no I-MBs), each comparing the reference IDCT with the L16, L1, L0, L2, Z0a, Z1, Z4, XVID, MPEG-2 TM5, and H.263 Annex W implementations.]

Figure 4. Drift performance of our IDCT approximations using sequence News (CIF, 300 frames).

L0 - an ISO/IEC 23002-1 compliant IDCT approximation with the lowest achievable number of additions: 42 additions per 1D transform. Since the underlying factorization contains 26 additions + 12 multiplications, this means that each multiplication in algorithm L0 is implemented by using, on average, only (42 - 26)/12 ≈ 1.33 additions.

L2 - an ISO/IEC 23002-1 compliant IDCT approximation with the lowest achievable number of shifts (12 shifts per 1D transform). Since the underlying factorization contains 12 multiplications, this means that each multiplication in algorithm L2 is implemented by using only 1 shift operation.

Z0a - a higher-accuracy (linearity-test compliant) algorithm, selected for the Final Committee Draft (FCD) of the ISO/IEC 23002-2 standard [13].

Z1 - an algorithm that was originally selected for the Committee Draft (CD) of the ISO/IEC 23002-2 standard [12]. This algorithm is considerably more complex than the FCD design (Z0a).

Z4 - an ultra-high precision IDCT approximation.

In characterizing IDCT precision, Table 2 lists the worst-case values of the ISO/IEC 23002-1 metrics, collected over all normative pseudo-random tests [14]. In describing complexity, the letter "a" is used to denote the number of additions and the letter "s" to denote the number of shifts necessary to implement these algorithms. The "1D complexity" row provides the numbers of operations necessary to implement each scaled one-dimensional transform. The "2D complexity" row shows the total numbers of operations necessary to implement the scaled 2D transform. Finally, the "2D+S complexity" row shows the total numbers of operations necessary to implement the complete 2D IDCT transform, including scaling (assuming that all input coefficients are non-zero).

The collection of algorithms L16, L0, L1, and L2 illustrates the extremes that can be reached if the goal is simply to pass the basic set of precision requirements for IDCT implementations in MPEG standards. Algorithms Z0a, Z1, and Z4 strive to go beyond this basic goal and have some additional desirable properties. For example, they all pass the linearity test [16, 17], pass the extended dynamic range tests [15], and perform better in the so-called IDCT-drift tests described in the next section.

3.2 Drift Performance Analysis

The IEEE 1180 | ISO/IEC 23002-1 tests define mandatory requirements for IDCT implementations in MPEG and ITU-T video coding standards. However, passing them does not always guarantee high quality of the decoded video, particularly in situations with low quantization noise and long runs of predicted (P-type) frames or macroblocks [42]. This is why, in evaluating an IDCT design, it is important to use additional tests, such as those measuring drift (the difference between reconstructed video frames in the encoder and decoder) caused by the use of this approximate IDCT design in the decoder.

In order to measure the drift performance of our IDCTs, we have used reference software encoders (employing floating-point DCTs and IDCTs) of the H.263, MPEG-2, and MPEG-4 Part 2 standards. In order to emphasize IDCT drift effects, we have also:

- forced all frames after the first one to be P-frames;
- disabled Intra-macroblock refreshes;
- forced QP = 1 (quant_scale = 1, and w[i,j] = 16 in MPEG-2/4) for all frames.

In the decoder we have used our IDCT approximations, and for comparison, we have also run tests for the following existing IDCT implementations:

- MPEG-2 TM5 IDCT - the fixed-point implementation included in the MPEG-2 reference software [43],
- XVID IDCT - a high-accuracy fixed-point implementation of the IDCT in the XVID (MPEG-4 P2) codec [44], and
- H.263 Annex W IDCT - a 16-bit IDCT algorithm specified in Annex W of ITU-T Recommendation H.263.

The results of our tests for the sequence News, using H.263 and MPEG-2 codecs, are shown in Figure 4. It can be observed that the high-precision algorithm Z4 has virtually no drift. Algorithms Z1 and Z0a follow, with their worst-case accumulated drift contained approximately within 0.5 dB in the H.263 tests, and within 2 dB in the MPEG-2 tests. Algorithms L0, L2, and L1 then follow, with their worst-case drift being slightly worse (approximately 0.625 dB in H.263 and 2.25 dB in MPEG-2 tests). The rest of the algorithms, however, perform much worse. The MPEG-2 TM5 and XVID implementations show approximately 3 dB drift in the H.263 test, and almost 12 dB drift in the MPEG-2 environment. Even worse is the drift behavior of the 16-bit algorithms in our tests, L16 and H.263 Annex W: they both show approximately 4 dB drift in the H.263 test, and 18-20 dB drift in the MPEG-2 test.

These results illustrate that IDCT drift performance can be significantly affected by the choice of the fixed-point architecture and its parameters. In particular, in testing numerous implementations produced using our scaled, LLM-based framework, we have observed that drift performance is most significantly affected by our mantissa parameter P. For the majority of the algorithms (L0, L1, L2, Z0a, and Z1), reducing the mantissa by 2, 3, or sometimes even 4 bits had almost no effect on most of the IEEE 1180 | ISO/IEC 23002-1 precision metrics, and yet each such bit had a major effect (about a 1-2 dB per-bit difference) in the drift tests. The algorithm L16 is an extreme example of such a mantissa-reduction process (leaving only P = 3), and it is obviously unacceptable in terms of drift performance. For this reason, we have retained at least P = 10 bits of mantissa in the design of most of our algorithms proposed to MPEG.

4. THE ISO/IEC 23002-2 FCD FIXED-POINT IDCT ALGORITHM

The overall architecture and 1D factorization flowgraph used by the ISO/IEC 23002-2 FCD algorithm are depicted in Figure 2 and Figure 3, respectively. All integer factors and parameters used in this algorithm are listed in Table 2 under the column Z0a. This transform allocates P = 10 bits for the fixed-point mantissa, and uses the same number of bits for specifying the scale factors: S = 10. This cancels out the right shifts in the processing of the input coefficients (24), and makes the scaling stage of this transform particularly simple:

F'_vu = F_vu * S_vu,  v, u = 0,...,7,   (26)

F''_00 = F'_00 + 2^{12},   (27)

where F_vu are the input coefficients, and the DC-term adjustment (27) is done to ensure proper rounding at the end of the transform:

f_yx = f'_yx >> 13,  y, x = 0,...,7.

[Figure 5 shows the ISO/IEC 23002-2 FDCT architecture: left shifts, f'_yx = f_yx << (P - 3); 1D column transforms; 1D row transforms; and scaling, rounding, and right-shifting of the transform coefficients, F_vu = (F''_vu * S_vu - (2^{P+S} - 1) sgn(F''_vu)) >> (P + S). The annotated bit widths grow from 9 bits at the input through 6+P, 9+P, and 12+P bits, returning to 12 bits at the output.]

Figure 5. ISO/IEC 23002-2 FDCT architecture.

[Figure 6 shows the flowgraph of the factorization employed in the ISO/IEC 23002-2 FDCT design, mirroring the IDCT flowgraph of Figure 3.]

Figure 6. Factorization employed in the ISO/IEC 23002-2 FDCT design.

The maximum total number of bits needed by all variables in this transform is 26 bits, which assumes a full 12-bit dynamic range of reconstructed pixel values; this is sufficient to cover even extreme cases of quantization noise expansion, as described in [41].

There are three groups of rational dyadic factors processed by this algorithm (see Figure 3 and Table 2):

- α = 41/128 and β = 99/128 in the butterfly with coefficients X_2 and X_6,
- δ = 113/128 and ɛ = 719/4096 in the butterfly with coefficients X_3 and X_5, and
- η = 1533/2048 and θ = 1/2 in the butterfly with coefficients X_1 and X_7.

The computation of products by these factors is performed as follows:

x2 = x + (x >> 5);    // 33x/32
x3 = x2 >> 2;         // 33x/128
x4 = x3 + (x >> 4);   // 41x/128 = x * α
x5 = x2 - x3;         // 99x/128 = x * β

x2 = (x >> 3) - (x >> 7);   // 15x/128
x3 = x2 - (x >> 11);        // 239x/2048
x4 = x2 + (x3 >> 1);        // 719x/4096 = x * ɛ
x5 = x - x2;                // 113x/128 = x * δ

x2 = (x >> 9) - x;          // -511x/512
x3 = x >> 1;                // x/2 = x * θ
x4 = (x2 >> 2) - x2;        // 1533x/2048 = x * η

The combined complexity of all these operations is only 9 addition and 10 shift operations. Therefore, the average complexity for computing a single multiplication in this algorithm is only 9/6 = 1.5 addition and 10/6 ≈ 1.67 shift operations. Comparing this with a traditional fixed-point implementation of a product by a factor, e.g.,

x * η ≈ (x * 1533 + 2^{10}) >> 11,

which includes an addition (for proper rounding) and a shift, we conclude that the effective extra cost of each integer multiplication in the ISO/IEC 23002-2 FCD algorithm is only 0.5 addition and 0.67 shift operations.
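The rounding behavior of these sequences can be inspected against a plain multiply-and-shift. The following small C harness (our own illustration, not normative code) does this for the η factor; the other groups can be checked analogously:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long maxdev = 0;
    for (long x = -4096; x <= 4095; x++) {
        long z2 = (x >> 9) - x;           /* -511x/512               */
        long h  = (z2 >> 2) - z2;         /* ~ 1533x/2048 = x * eta  */
        long ref = (x * 1533) >> 11;      /* plain multiply + shift  */
        long dev = labs(h - ref);
        if (dev > maxdev) maxdev = dev;
    }
    printf("max |multiplierless - multiply-and-shift| for x*eta: %ld\n", maxdev);
    return 0;
}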

The total complexity of computing each scaled 1D transform in this algorithm is 44 addition and 20 shift operations. The description of a complete 1D transform in the C programming language requires only 50 lines [13]. The extra C code needed to describe the full 2D version takes only 20 lines. The scaling of the transform coefficients can be done either outside of the transform (e.g., in the quantization stage, thereby taking advantage of the sparseness of the input matrix of coefficients) or inside the transform, by executing the multiplications (26).

This algorithm passes all normative ISO/IEC 23002-1 precision tests [14], as well as many additional tests that have been created in the process of evaluating fixed-point designs in MPEG. These additional tests include MPEG-2, MPEG-4, and T.83 (JPEG) conformance tests; drift tests with H.263, MPEG-2, and MPEG-4 encoders and decoders; as well as the linearity test and extended dynamic range tests.

4.1 ISO/IEC 23002-2 FCD FDCT Design

The design of the corresponding fixed-point forward ISO/IEC 23002-2 DCT is fully symmetric relative to the IDCT design. Its overall architecture and 1D factorization are presented in Figure 5 and Figure 6, respectively. All integer factors and algorithms for computing products in this FDCT design are exactly the same as in the IDCT, the only difference being the order in which they are executed.

The two elements in the FDCT design that are implemented differently compared to the IDCT design are the reservation of mantissa bits and the scaling. The allocation of mantissa bits is done at the very beginning of the FDCT transform as follows:

f'_yx = f_yx << 7,  y, x = 0,...,7,   (28)

and the scaling is done at the very end, by using

F_vu = (F''_vu * S_vu - (2^{20} - 1) * sgn(F''_vu)) >> 20,  v, u = 0,...,7,   (29)

where

sgn(x) = { 0, if x >= 0; -1, if x < 0 }.   (30)

The use of the term (30) in the rounding (29) ensures that the FDCT scaling is done in a sign-symmetric fashion, with a slightly wider deadzone around 0. We note that the scaled architecture of the ISO/IEC 23002-2 FDCT design also makes it possible to combine the final scaling stage (29-30) with the quantization process in video or image encoders, thereby enabling further complexity reductions.

5. CONCLUSIONS

In this paper we have described our proposed fixed-point IDCT design methodologies and several resulting algorithms achieving different precision/complexity characteristics. We have explained the choices of the parameters in such designs, and their connection to IDCT precision and drift performance. The fixed-point 8x8 IDCT and DCT algorithms adopted in the ISO/IEC 23002-2 FCD standard are also described. Their architecture has benefited from the ideas contributed to the MPEG standardization process by multiple proponents, and it yields a remarkably efficient implementation: surpassing all IEEE 1180 | ISO/IEC 23002-1 precision requirements with low implementation complexity (requiring only 44 addition and 20 shift operations per scaled 1D transform), and performing very well in linearity, extended dynamic range, and IDCT drift tests.

6. ACKNOWLEDGEMENTS

The authors wish to thank Honggang Qi, Wen Gao, Debin Zhao, Siwei Ma, and other participants in the MPEG IDCT project for their contributions influencing the design of the ISO/IEC 23002-2 FCD algorithm. The authors also wish to thank Joan Mitchell and Gary Sullivan for their helpful comments on the manuscript of this paper.

REFERENCES

1. K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, San Diego, 1990.
2. W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Compression Standard, Van Nostrand Reinhold, New York, 1993.
3. J. L. Mitchell, W. B. Pennebaker, D. Le Gall, and C. Fogg, MPEG Video Compression Standard, Chapman & Hall, New York, 1996.
4. ITU-T Recommendation T.81 | ISO/IEC 10918-1: Information Technology - Digital Compression and Coding of Continuous-Tone Still Images - Requirements and Guidelines, 1992.
5. ISO/IEC 11172: Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s - Part 2: Video, August 1993.
6. ITU-T Recommendation H.262 | ISO/IEC 13818-2: Information technology - Generic coding of moving pictures and associated audio information: Video.
7. ITU-T Recommendation H.263: Video Coding for Low Bit Rate Communication.
8. ISO/IEC 14496-2: Information technology - Coding of audio-visual objects - Part 2: Visual.
9. ANSI/IEEE 1180-1990: Standard Specifications for the Implementations of 8x8 Inverse Discrete Cosine Transform, December 1990 (withdrawn by ANSI 9 September 2001; withdrawn by IEEE 7 February 2003).
10. G. J. Sullivan, "Standardization of IDCT approximation behavior for video compression: the history and the new MPEG-C parts 1 and 2 standards," SPIE Applications of Digital Image Processing XXX, Proc. SPIE, Vol. 6696, August 28-31, 2007 (this conference).
11. ISO/IEC JTC1/SC29/WG11 (MPEG), "Call for Proposals on Fixed-Point IDCT and DCT Standard," MPEG output document N7335, Poznan, Poland, July 2005.
12. ISO/IEC JTC1/SC29/WG11 (MPEG), "ISO/IEC CD 23002-2: Fixed-point IDCT and DCT," MPEG output document N8479, Hangzhou, China, October 2006.
13. ISO/IEC JTC1/SC29/WG11 (MPEG), "ISO/IEC FCD 23002-2: Information technology - MPEG video technologies - Part 2: Fixed-point 8x8 IDCT and DCT," MPEG output document N8983, San Jose, CA, April 2007.
14. ISO/IEC 23002-1: Information technology - MPEG video technologies - Part 1: Accuracy requirements for implementation of integer-output 8x8 inverse discrete cosine transform, December 2006.
15. ISO/IEC JTC1/SC29/WG11 (MPEG), "ISO/IEC 23002-1/FPDAM1: Information technology - MPEG video technologies - Part 1: Accuracy requirements for implementation of integer-output 8x8 inverse discrete cosine transform, Amendment 1: Software for Integer IDCT Accuracy Testing," MPEG output document N8981, San Jose, CA, April 2007.
16. M. A. Isnardi, "Description of Sample Bitstream for Testing IDCT Linearity," MPEG document M13375, Montreux, Switzerland, April 2006.
17. C. Zhang and L. Yu, "On IDCT Linearity Test," MPEG document M13528, Klagenfurt, Austria, July 2006.
18. Z. Ni and L. Yu, "Drift Problem of Fixed-Point IDCT on News Sequence," MPEG document M13912, Hangzhou, China, October 2006.
19. G. Sullivan, "Video IDCT static-content pathological drift: Analysis and encoder techniques," MPEG document M14077, Marrakech, Morocco, January 2007.
20. E. Feig and S. Winograd, "On the multiplicative complexity of discrete cosine transforms," IEEE Trans. Info. Theory, vol. IT-38, July 1992.
21. E. Feig and S. Winograd, "Fast algorithms for the discrete cosine transform," IEEE Trans. Signal Processing, vol. 40, September 1992.
22. N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. Comput., vol. C-23, January 1974.
23. W. Chen, C. H. Smith, and S. C. Fralick, "A Fast Computational Algorithm for the Discrete Cosine Transform," IEEE Trans.
Comm., vol. COM-25, No. 9, September 1977.
24. B. G. Lee, "A new algorithm to compute the discrete cosine transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, May 1984.

25. Y. Arai, T. Agui, and M. Nakajima, "A Fast DCT-SQ Scheme for Images," Transactions of the IEICE, E71(11):1095, November 1988 (in Japanese).
26. M. Vetterli and A. Ligtenberg, "A Discrete Fourier-Cosine Transform Chip," IEEE Journal on Selected Areas in Communications, Vol. 4, No. 1, January 1986.
27. C. Loeffler, A. Ligtenberg, and G. S. Moschytz, "Practical fast 1-D DCT algorithms with 11 multiplications," in Proc. IEEE Int. Conf. Acoust., Speech, and Sig. Proc. (ICASSP'89), vol. 2, February 1989.
28. Y. A. Reznik, A. T. Hinds, and N. Rijavec, "Low-Complexity Fixed-Point Approximation of Inverse Discrete Cosine Transform," in Proc. IEEE Int. Conf. Acoust., Speech, and Sig. Proc. (ICASSP'07), Honolulu, HI, April 15-20, 2007.
29. A. Navarro, A. Silva, and Y. Reznik, "A Full 2D IDCT with Extremely Low Complexity," SPIE Applications of Digital Image Processing XXX, Proc. SPIE, Vol. 6696, August 28-31, 2007 (this conference).
30. Y. Reznik, A. Hinds, C. Zhang, L. Yu, and Z. Ni, "Response to CE on Convergence of Scaled and Non-Scaled IDCT Architectures," MPEG document M13650, Klagenfurt, Austria, July 2006.
31. Y. Reznik, "Analysis of hybrid scaled/non-scaled IDCT architectures," MPEG document M13705, Klagenfurt, Austria, July 2006.
32. J. W. S. Cassels, An Introduction to Diophantine Approximation, Cambridge University Press, 1957.
33. D. Knuth, The Art of Computer Programming: Seminumerical Algorithms, vol. 2, Addison-Wesley.
34. A. Avizienis, "Signed-digit number representations for fast parallel arithmetic," IRE Transactions on Electronic Computers, Vol. EC-10, 1961.
35. A. Karatsuba and Y. Ofman, "Multiplication of multidigit numbers on automata," Soviet Phys. Doklady, Vol. 7, No. 7, January 1963.
36. P. R. Cappello and K. Steiglitz, "Some complexity issues in digital signal processing," IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-32(5), 1984.
37. R. Bernstein, "Multiplication by Integer Constants," Software - Practice and Experience, Vol. 16, No. 7, 1986.
38. R. M. Hewlitt and E. S. Swartzlander, "Canonical signed digit representation for FIR digital filters," in Proc. IEEE Workshop on Signal Processing Systems (SiPS 2000), 2000.
39. V. Lefèvre, "Moyens arithmétiques pour un calcul fiable" (Arithmetic methods for reliable computation), PhD thesis, École Normale Supérieure de Lyon, Lyon, France, January 2000.
40. Y. Voronenko and M. Püschel, "Multiplierless multiple constant multiplication," ACM Trans. Algorithms, 3(2), Article 11, May 2007.
41. M. Zhou and J. De Lameillieure, "IDCT output range in MPEG video coding," Signal Processing: Image Communication, Vol. 11, No. 2, December 1997.
42. A. T. Hinds, Z. Ni, C. Zhang, L. Yu, and Y. A. Reznik, "On IDCT Drift Problem," SPIE Applications of Digital Image Processing XXX, Proc. SPIE, Vol. 6696, August 28-31, 2007 (this conference).
43. MPEG-2 TM5 source code and documentation.
44. XVID open source implementation of MPEG-4 ASP.

APPENDIX A. SOME FACTS FROM DIOPHANTINE APPROXIMATION THEORY AND PROOF OF LEMMA 2.1

Let θ be a real number, and let us try to approximate it by a rational fraction θ ≈ p/q. Here, p and q are integers and, without loss of generality, it can be assumed that q > 0. Given a fixed q, it can be noted that the precision of the best approximation p/q satisfies:

|θ - p/q| = q^{-1} |qθ - p| = q^{-1} \min_{z∈Z} |qθ - z| = q^{-1} ||qθ||,

where ||θ|| denotes the distance from θ to the nearest integer. Based on the above, it appears that the magnitude of the error should decrease inversely proportionally to q. Nevertheless, such approximations can be much more precise. We quote the following result from [32, p. 11, Theorem V] (in this context, the term "equivalent" means multiple by any integer factor).

Theorem A.1. Let θ be irrational. Then there are infinitely many q such that

q ||qθ|| < 5^{-1/2}.

If θ is equivalent to (\sqrt{5} - 1)/2, then the constant 5^{-1/2} cannot be replaced by any smaller constant. If θ is not equivalent to (\sqrt{5} - 1)/2, then there are infinitely many q such that:

q ||qθ|| < 2^{-3/2}.

Even more notable is the result concerning the achievable precision of simultaneous approximations of irrational numbers θ_1,...,θ_n by fractions p_1/q,...,p_n/q with a common denominator q (see [32, p. 14, Theorem III]):

Theorem A.2. There are infinitely many integers q such that

q^{1/n} max { ||qθ_1||, ..., ||qθ_n|| } < \frac{n}{n+1}.

This means that there exist sets of integers (p_1,...,p_n,q) such that:

max { |θ_1 - p_1/q|, ..., |θ_n - p_n/q| } < \frac{n}{n+1} q^{-1-1/n}.

It can be seen that our Lemma 2.1 is a simple consequence of the above fact, where we additionally introduce the parameter k ∈ N, and set ξ := q/2^k. That is, by multiplying both sides of the last formula by ξ, we obtain:

max { |ξθ_1 - p_1/2^k|, ..., |ξθ_n - p_n/2^k| } < \frac{n}{n+1} ξ^{-1/n} 2^{-k(1+1/n)}.

APPENDIX B. SIGN-SYMMETRIC RIGHT SHIFT OPERATOR AND PROOFS OF THEOREMS

For the purpose of our analysis we will need to introduce the following operation.

Definition B.1. The Sign-Symmetric Right Shift (SSRS) x >>_sym b of an integer variable x by b >= 1 bits is computed as follows:

x >>_sym b := (x >> (b+1)) - ((-x) >> (b+1)),   (31)

where >> denotes the ordinary (arithmetic) right shift operation. Based on its definition, it is easy to see that

(-x) >>_sym b = -(x >>_sym b),   (32)

which implies that it satisfies the sign-symmetry property.

The proof of Theorem 2.2 follows by construction: we take any existing non-sign-symmetric algorithm and replace all of its right shifts with SSRS operators. The complexity of the SSRS operator is 2 shifts, 1 addition, and 1 negation, i.e., a total of 4 operations. Theorems 2.3 and 2.4 follow.
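The SSRS operator is a one-liner in C; the following sketch (our own illustration) implements (31) and checks property (32) exhaustively over a range of inputs, assuming the usual arithmetic behavior of ">>" on negative integers:

#include <stdio.h>

static int ssrs(int x, int b)
{
    return (x >> (b + 1)) - ((-x) >> (b + 1));   /* (31) */
}

int main(void)
{
    for (int x = -1000; x <= 1000; x++)
        for (int b = 1; b <= 8; b++)
            if (ssrs(-x, b) != -ssrs(x, b))      /* (32) must hold */
                return 1;
    printf("SSRS is sign-symmetric; e.g. ssrs(5,1)=%d, ssrs(-5,1)=%d\n",
           ssrs(5, 1), ssrs(-5, 1));
    return 0;
}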


Publication 2

Arianne T. Hinds, Yuriy A. Reznik, Lu Yu, Zhibo Ni, Cixun Zhang, "Drift Analysis for Integer IDCT," Proceedings SPIE Applications of Digital Image Processing XXX, Vol. 6696, San Diego, 28 Aug. 2007.


Invited Paper

Drift Analysis for Integer IDCT

Arianne T. Hinds (a), Yuriy A. Reznik (b), Lu Yu (c), Zhibo Ni (c), Cixun Zhang (c)
(a) InfoPrint Solutions Company, 6300 Diagonal Highway, Boulder, CO, USA 80301
(b) QUALCOMM Incorporated, 5775 Morehouse Drive, San Diego, CA, USA 92121
(c) Zhejiang University, Hangzhou, China

ABSTRACT

This paper analyzes the drift phenomenon that occurs between video encoders and decoders that employ different implementations of the Inverse Discrete Cosine Transform (IDCT). Our methodology utilizes MPEG-2, MPEG-4 Part 2, and H.263 encoders and decoders to measure drift occurring at low QP values for CIF-resolution video sequences. Our analysis is conducted as part of the effort to define specific implementations for the emerging ISO/IEC 23002-2 Fixed-Point 8x8 IDCT and DCT standard. Various IDCT implementations submitted as proposals for the new standard are used to analyze drift. Each of these implementations complies with both the IEEE Standard 1180 and the new MPEG IDCT precision specification ISO/IEC 23002-1. Reference implementations of the IDCT/DCT, and implementations from well-known video encoders/decoders, are also employed. Our results indicate that drift is eliminated entirely only when the implementations of the IDCT in both the encoder and decoder match exactly. In this case, the precision of the IDCT has no influence on drift. In cases where the implementations are not identical, the use of a highly precise IDCT in the decoder will reduce drift in the reconstructed video sequence only to the extent that the IDCT used in the encoder is also precise.

Keywords: IDCT, DCT, drift, artifacts, video standardization, MPEG, MPEG-2, MPEG-4, H.263, MPEG-C

1. INTRODUCTION

The Discrete Cosine Transform (DCT) [1] is a fundamental operation used by the vast majority of today's image and video compression standards, such as JPEG, MPEG-1, MPEG-2, MPEG-4 (P.2), H.261, H.263, and others [1-5]. Encoders in these standards apply such transforms to each 8x8 block of pixels to produce DCT coefficients, which are then subject to quantization and encoding. The Inverse Discrete Cosine Transform (IDCT) is used in both the encoder and decoder to convert DCT coefficients back to the spatial domain.

At the time when the first image and video compression standards were defined, the implementation of DCT and IDCT algorithms was considered a major technical challenge, and therefore, instead of defining a specific algorithm for computing them, the ITU-T H.261, JPEG, and MPEG standards have included precision specifications that must be met by IDCT implementations conforming to these standards [8]. This decision has allowed manufacturers to use the best optimized designs for their respective platforms. However, the drawback of this approach is the impossibility of guaranteeing exact decoding of MPEG-encoded videos across different decoder implementations. It was further observed that in some cases the mismatch caused by different IDCT implementations in the encoder and decoder can accumulate, leading to degradations in the quality of reconstructed video. This effect has become well known in the video compression/distribution industry as "IDCT drift", and it has catalyzed the development of various modifications and work-arounds in MPEG encoder implementations (such as periodic Intra-macroblock refreshes, the John Morris test [8], and others) aiming at reducing it.
Further author information: (please send correspondence to Yuriy A. Reznik) Yuriy A. Reznik: yreznik@ieee.org; Arianne T. Hinds: arianne@us.ibm.com; Cixun Zhang: cixunzhang@hotmail.com; Lu Yu: yul@zju.edu.cn; Zhibo Ni: hzjimmy@hotmail.com. The work of Lu Yu, Cixun Zhang, and Zhibo Ni was supported by NSFC and by the Project of Science and Technology Plan of Zhejiang Province under contract No. 2005C...

Still, in some modes/profiles of today's video coding standards (such as when using the deblocking filter in the H.263 standard, or the quarter-sample motion supported in the MPEG-4 Part 2 standard) the IDCT drift

remains a significant influence on video quality [9-11], and its further study is likely to be of interest to the community of engineers implementing such video coding standards.

This paper presents the results of an experimental study to analyze the drift that occurs between video encoders and decoders that employ different implementations of the IDCT. Our experiments and associated results are collected using MPEG-2, H.263, and MPEG-4 Part 2 testbeds as described below, in the course of the efforts to design a fixed-point IDCT implementation for ISO/IEC 23002-2.

Our paper is organized as follows. Section 2 analyzes the IEEE 1180 and ISO/IEC 23002-1 metrics used to measure the precision of IDCT approximations (with respect to the ideal integer-valued IDCT), and the relationship of these metrics to the drift phenomenon. Section 3 reviews the results of our experimental analysis to measure drift with multiple fixed-point IDCT implementations with respect to the above precision metrics. Section 4 presents our conclusions.

2. IDCT PRECISION SPECIFICATIONS AND DRIFT PROBLEM

The mathematical definition of the Inverse Discrete Cosine Transform (IDCT) of size 8x8 is provided by the following equation:

f(x,y) = \sum_{u=0}^{7} \sum_{v=0}^{7} \frac{c(u) c(v)}{4} F(u,v) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16},  (x, y = 0,...,7),   (1)

where c(u) = 1/\sqrt{2} for u = 0 and 1 otherwise, and where F(u,v) is the matrix containing the forward DCT coefficients computed for a block of sample pixel values, and f(x,y) is the corresponding matrix of reconstructed sample pixel values.

As evident from (1), the direct computation of the IDCT involves products by cosine values of various angles, most of which are irrational and cannot be computed exactly using today's conventional (finite-precision) computers. Hence, in order to realize efficient and practical hardware and software implementations of the IDCT, fixed-point approximations of the irrational values are often identified, so that the required bit-depth for the final implementation can be bounded in exchange for some amount of tolerable error. The amount of acceptable error for fixed-point IDCT implementations for use in the MPEG-1, MPEG-2, and MPEG-4 Part 2 standards was originally defined by the IEEE Standard 1180. The current version of this specification is provided by the ISO/IEC 23002-1 (also known as MPEG-C Part 1) standard.

2.1 IEEE 1180 and ISO/IEC 23002-1 Basics

The IEEE 1180 standard and its replacement ISO/IEC 23002-1 specify a procedure for producing a test sequence of Q sample blocks of DCT coefficients, which are subsequently converted into the spatial domain using the reference (double-precision floating-point) IDCT, with its outputs rounded to integer values. In these tests, Q is defined to be either of two prescribed block counts, and the DCT coefficients are generated assuming that pixel values are in the ranges [-5,5], [-256,255], [-300,300], or [-384,383]. The outputs of the IDCT under test are then compared to the corresponding integer outputs of the reference IDCT produced using the same set of inputs. The output matrices from these transforms are denoted by h_z[][] and g_z[][] for the reference and approximate IDCTs, respectively.
The IEEE 1180 and ISO/IEC 23002-1 precision specifications define the following set of accuracy criteria that are to be collected across the series of tests:

Peak pixel error (ppe), p:
p = max_{z,y,x} | h''_z[y][x] - g''_z[y][x] |;

Pixel mean error (pme), d[y][x]:
d[y][x] = (1/Q) \sum_{z=0}^{Q-1} ( h''_z[y][x] - g''_z[y][x] );

Pixel mean square error (pmse), e[y][x]:
e[y][x] = (1/Q) \sum_{z=0}^{Q-1} ( h''_z[y][x] - g''_z[y][x] )^2;

Overall mean error (ome):
m = (1/64) \sum_{x=0}^{7} \sum_{y=0}^{7} d[y][x];

Overall mean square error (omse):
n = (1/64) \sum_{x=0}^{7} \sum_{y=0}^{7} e[y][x].

We note that only three of the above criteria satisfy metric properties (i.e., they are non-negative, commutative, and satisfy the triangle inequality). These are the ppe, pmse, and omse metrics, which can also be considered variants of the well-known L_2 and L_∞ measures. In contrast, the other two criteria, pme and ome, do not satisfy metric properties, and hence do not offer an indication of how precise the approximate IDCT is with respect to the reference algorithm. Instead, they measure how well the errors produced by an approximate transform are balanced around zero. Hence, these are complementary, and in some cases redundant, test criteria. For instance, due to

d[y][x]^2 <= e[y][x],   m^2 <= n,

it becomes obvious that designs minimizing pmse and omse should simultaneously minimize pme and ome, and that passing suitably chosen thresholds in the pmse and omse dimensions would automatically satisfy the corresponding pme and ome thresholds, rendering the pme and ome tests unnecessary.

The IEEE 1180 and ISO/IEC 23002-1 specifications define the following thresholds for these criteria:

- p: maximum absolute difference between reconstructed pixels (required: p <= 1);
- d[x,y]: average differences between pixels (required for all [x,y]: |d[x,y]| <= 0.015);
- m: average of all pixel-wise differences (required: |m| <= 0.0015);
- e[x,y]: average square difference between pixels (required for all [x,y]: e[x,y] <= 0.06);
- n: average of all pixel-wise square differences (required: n <= 0.02).

It can be noted that the thresholds set for pme and ome are much smaller than those for the square-error-type metrics pmse and omse.

2.2 Errors between ISO/IEC 23002-1 compliant IDCT implementations

Given a sample sequence x_1, ..., x_Q of test matrices (assumed to be generated by some stationary stochastic process) and two IDCT implementations F_1 and F_2, we are interested in studying the distance between the outputs produced by these algorithms:

Δ(x) = M(F_1(x), F_2(x))  or  Δ = M(F_1, F_2),

assuming that the sequence Δ(x_1), ..., Δ(x_Q) converges to some quantity. The operator M in both expressions is a given IEEE 1180 | ISO/IEC 23002-1 criterion with metric properties (such as ppe, pmse, or omse). Then, by the triangle inequality:

Δ = M(F_1, F_2) <= M(F_1, F_ref) + M(F_2, F_ref) <= 2T,

where T is a threshold defined for the distance, in terms of the metric M, to the reference transform F_ref. Hence, by upper-bounding ppe, pmse, and omse, the IEEE 1180 | ISO/IEC 23002-1 precision specifications provide a means for controlling the maximum pair-wise error caused by any two (precision-compliant) implementations of the IDCT transform.

2.3 Relation of IDCT Precision Metrics to Drift

An illustration explaining the processing of data in a video encoder and decoder is provided in Fig. 1.

[Fig. 1 depicts the encoder and decoder data paths: in the encoder, the residual x_e(n) passes through the forward DCT and quantization to produce y_e(n), which is inverse-quantized and inverse-transformed by IDCT F_1 (fixed-point) in the reconstruction loop; in the decoder, the same y_e(n) is inverse-quantized and inverse-transformed by IDCT F_2 (fixed-point), yielding x'_e(n) + Δ, which is added to the prediction to form the reconstruction x'_r(n).]

Fig. 1. A simplified view of the processes in video encoder and decoder modules for inter-coded blocks.

In this figure, the implementation of IDCT F_1 in the encoder's reconstruction loop does not exactly match the implementation of IDCT F_2 in the decoder. For the inter-coding of blocks, the residual block x_e(n) is first transformed by the forward DCT and then quantized to produce the coefficient block y_e(n) in the encoder. y_e(n) is then provided as input to both the decoder and the encoder's reconstruction loop, where it undergoes the reverse process of inverse quantization followed by inverse transformation. In the encoder, IDCT F_1 computes the reconstructed residual block x'_e(n), while in the decoder, IDCT F_2 computes the reconstructed residual block x'_e(n) + Δ. These two blocks differ because the implementations of F_1 and F_2 used to compute them are not exactly the same; both are compliant with the IEEE 1180 and/or ISO/IEC 23002-1 specifications, but each computes its corresponding reconstructed residual block with some amount of error. The difference between these two blocks, denoted by Δ, is the error introduced into the reconstructed frame at the decoder.

In turn, the current reconstructed block x_r(n) = x_p(n) + x'_e(n) is obtained as the sum of the current predicted block x_p(n) and the reconstructed residual. The same processing is done in the decoder, but the result differs by the IDCT-introduced error Δ:

x'_r(n) = x'_p(n) + x'_e(n) + Δ.

In the processing of subsequent frames, the reconstructed blocks are used to derive new predicted blocks x_p(n+1) and x'_p(n+1) in the encoder and decoder, respectively. Hence, the IDCT error introduced in the previous frame will now reappear in the predicted block in the decoder:

x'_p(n+1) = f(x'_r(n)) = f(x_r(n)) + f'(Δ, n),

where f'(Δ, n) is a function describing the accumulation of IDCT error in the predictor after the processing of n frames. Moreover, if the predictor function f(.) is linear, then it can be shown that

f'(Δ, n) = n f(Δ).

In general, we can assume that the drift function f'(Δ, n) is monotonic and increasing with respect to the point-wise IDCT error, and hence, if Δ is bounded,

Δ = M(F_1, F_2) <= M(F_1, F_ref) + M(F_2, F_ref) <= 2T,

then, correspondingly,

f'(Δ, n) = f'(M(F_1, F_2), n) <= f'(M(F_1, F_ref) + M(F_2, F_ref), n) <= f'(2T, n),

where T is the IEEE 1180 | ISO/IEC 23002-1 threshold set for the metric M.
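The linear-growth case f'(Δ, n) = n f(Δ) can be illustrated with a deliberately simplified model (entirely our own construction, not any codec's actual data path): one pixel in a static region, an identity predictor, and a constant per-frame encoder/decoder mismatch delta.

#include <stdio.h>

int main(void)
{
    /* One tracked "pixel" in a static region; each P-frame reuses the
     * previous reconstruction, so the encoder/decoder mismatch delta
     * re-enters through the predictor every frame and accumulates linearly. */
    const double delta = 0.01;   /* per-frame IDCT mismatch, bounded by 2T */
    double drift = 0.0;
    for (int n = 1; n <= 300; n++) {
        drift += delta;          /* identity predictor: f'(delta, n) = n * delta */
        if (n % 100 == 0)
            printf("frame %3d: accumulated drift = %.2f\n", n, drift);
    }
    return 0;
}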

Based on the above discussion, it can be noted that the only way to ensure that drift is eliminated entirely is to make the IDCT implementations F_1 and F_2 identical. Making only one of these IDCTs very precise with respect to the reference IDCT will reduce the drift only to

M(F_1, F_ref) → 0:  f'(Δ, n) → f'(M(F_2, F_ref), n),

or

M(F_2, F_ref) → 0:  f'(Δ, n) → f'(M(F_1, F_ref), n),

which we will call the "self-drift" of the IDCT on the other end of the system. Hence, the IDCT drift problem is generally not solvable by changing the design of only one IDCT (either in the encoder or the decoder).

3. EXPERIMENTAL STUDY OF THE IDCT DRIFT PROBLEM

This section summarizes the results of our experimental study [10] of the precision performance of the IDCTs submitted in response to the MPEG call for proposals [11]. In this study, we explore the relative drift observed when multiple IDCT implementations are used in both the encoder and decoder, including cases when some existing well-known IDCT implementations are used in the encoder. We also compare the resulting measured drift with the IEEE 1180 metrics to further analyze the relation of these metrics to the reduction of drift.

3.1 Methodology

The following describes our testing methodology, and provides the rationale for selecting particular operating modes for the encoder, quantization parameter (QP) values, and test sequences.

Testbed Encoders/Decoders and Precision Testbed

For our experiments, we have used the reference software of the MPEG-2 [12], MPEG-4 [13], and H.263 [14] video coding standards to serve as the basis for the encoder and decoder testbeds. These testbeds disable rate control, as well as some other standard features such as the forced insertion of intra macroblocks, in order to amplify IDCT drift effects. Also, for the same purpose, the H.263 testbed employs Annex T of this ITU-T standard [15], which allows the transmission of transform coefficients outside the normal clipping range of [-2048, 2047].

We also employ a precision testbed to measure the conformance of IDCT implementations to the precision metrics of IEEE 1180. This particular testbed was also developed jointly by several members of the MPEG ad hoc group responsible for the development of parts one and two of ISO/IEC 23002. Its primary purpose is to perform the requisite series of tests defined in IEEE 1180 for each IDCT under test, and to report any failures of the corresponding IDCT with respect to the thresholds defined for each precision metric. Notably, each of these software testbeds is provided in an amendment to the original ISO/IEC 23002-1. The aim of this amendment is to provide the same software tools used by MPEG to aid developers of fixed-point IDCT implementations with their analysis of IDCT drift and precision performance.

IDCT Algorithms

We substitute the IDCT implementations in the corresponding reference software for each testbed with several IDCT implementations submitted for consideration for the new standard. Each of these proposed algorithms has been verified to conform to the metrics specified in IEEE 1180 using the above precision testbed. In addition to the proposed algorithms, we have also included several examples of well-known existing IDCT implementations:

- MPEG-2 TM5 IDCT: the fixed-point implementation included in the MPEG-2 reference software [18],
- XVID IDCT: a high-accuracy fixed-point implementation of the IDCT in the XVID open-source implementation of the MPEG-4 P2 codec [19],

3.1.2 IDCT Algorithms

We substitute the IDCT implementations in the corresponding reference software for each testbed with several IDCT implementations submitted for consideration for the new standard. Each of these proposed algorithms has been verified to conform to the metrics specified in IEEE 1180 using the above precision testbed. In addition to the proposed algorithms, we have also included several examples of well-known existing IDCT implementations:

- MPEG-2 TM5 IDCT: the fixed-point implementation included in the MPEG-2 reference software [18],
- XVID IDCT: a high-accuracy fixed-point implementation of the IDCT in the XVID open-source implementation of the MPEG-4 Part 2 codec [19],
- H.263 Annex W IDCT: the IDCT algorithm specified in Annex W of ITU-T Recommendation H.263, and
- a 64-bit floating-point IDCT implementation.

The full list of all algorithms used in this study is presented in Table 1, with their names denoting the characteristics identified in the list below. In these descriptions, LLM11 denotes the Loeffler, Ligtenberg, and Moschytz algorithm requiring 11 multiplication operations, while LLM12 denotes the Loeffler, Ligtenberg, and Moschytz algorithm with 12 multiplication operations [20].

- Annn: an algorithm based on the Arai, Agui, and Nakajima factorization [21]
- Bnnn: an algorithm based on the factorization by B. G. Lee [22], implemented with right-shift and addition operations
- Cnnn: an algorithm based on the Chen factorization [23], implemented with right-shift and addition operations
- CWnnn: an algorithm based on the Chen-Wang factorization [24], implemented with fixed-point multipliers or left-shift and addition operations
- FWnnn: a non-separable two-dimensional algorithm based on the Feig-Winograd factorization [25], implemented with right-shift and addition operations
- Mnnn: an algorithm based on LLM12 and fixed-point multiplication or left-shift and addition operations
- Nnnn: an algorithm based on LLM12 and right-shift and addition operations
- Tnnn: an algorithm based on LLM11 and lifting steps implemented with shift and addition operations
- Rnnn: an algorithm based on LLM11 with a scaling stage implemented in each one-dimensional transform
- Snnn: an algorithm based on LLM11 and lifting steps, designed specifically to achieve high precision
- Innn: an algorithm that approximates the cosines in the IDCT matrix by scaling them by powers of two, and then performs matrix-vector multiplication operations to compute the outputs

3.1.3 Test Sequences

In order to conduct our tests, we have used several standard CIF-resolution sequences: Foreman, Mobile, News, and Football. The Foreman sequence is 300 frames in length and is characterized by low motion with little static region content. The Mobile sequence is 300 frames in length, also with low motion, but with slightly more static region content than Foreman. News is also 300 frames, but can be characterized by its balanced content of high motion (the dancers), low motion (the news anchors), and static content (the newsroom background). Football consists of only 150 frames, with content that can be characterized as high motion with little static content.

3.1.4 QP Values

In order to understand at which QP level the difference between various IDCT implementations becomes noticeable, we conducted a series of pretests with the H.263+ testbed, using the floating-point IDCT in the encoder and a subset of IDCT proposals operating at various degrees of precision in the decoder. The CIF sequence Foreman was used for this purpose. The results of our pretests for QP=1, 2, and 3 are presented in Figures 2-4. Based on these results, it can be seen that IDCT drift with respect to the floating-point reference IDCT is most clearly visible at QP=1 (with Annex T enabled), but the differences between the various IDCTs decline sharply with the use of higher QP values. In fact, even at QP=3, these differences are within 0.2 dB. Hence, in our subsequent experiments we choose to consider only the case of QP=1, as this is the mode in which drift effects are most amplified*.
* Nevertheless, all subsequent results should be understood as extreme illustrations of the drift phenomenon, which in practical video applications (ones operating in higher QP ranges and using normal intra-refreshes) may not be present, or may have a much smaller influence on the quality of the decoded video.

[Table 1. IDCT/DCT algorithms used in the H.263/MPEG-4/MPEG-2 testbeds: T001-T002, R001-R004, S001, CW01, C001, A001-A011, I001, M001-M003, N001-N002, FW001, B001, MPEG2_TM5, XVID, H263_Annex_W (* excluding two test ranges), and the 64-bit floating-point IDCT, whose metrics are all approximately 0. The numeric metric values are not reproduced here.]

The above table also lists the worst-case IEEE 1180 precision metrics (ppe, pmse, omse, pme, ome) obtained by the precision testbed for each of the algorithms.

3.1.5 IDCT Algorithms Used in the Encoder

In addition to an idealized case where the floating-point reference IDCT is employed by the encoder, we also utilize the following fixed-point algorithms in our tests:

- H.263 Annex W IDCT: an IDCT algorithm specified in Annex W of ITU-T Recommendation H.263,
- MPEG-2 TM5 IDCT: a fixed-point implementation included in the MPEG-2 reference software,
- XVID IDCT: a high-accuracy fixed-point implementation of the IDCT in the XVID open-source implementation of the MPEG-4 Part 2 encoder/decoder.

These algorithms cover a wide range of possible precision characteristics, and the three fixed-point ones are well known and used in various implementations of MPEG and ITU-T encoders/decoders.

[Figure 2. Drift under the reference IDCT in the encoder at QP=1. PSNR (Y) versus frame number, Foreman CIF, 300 frames, Annex T; decoder IDCTs S001, N002, N001, A003, A004, A006, and others.]

[Figure 3. Drift under the reference IDCT in the encoder at QP=2. Same sequence and decoder IDCTs as Figure 2.]

[Figure 4. Drift under the reference IDCT in the encoder at QP=3. Same sequence and decoder IDCTs as Figure 2.]

3.2 Results

In this section we provide a subset of the exhaustive set of results that we have produced using all variations of testbeds, encoder IDCT implementations, and test sequences.

3.2.1 Results Produced Using the H.263 Encoder/Decoder

Figures 5-8 provide results that are produced using the H.263 encoder/decoder with the Foreman video sequence.

[Figure 5. Drift with respect to the H.263 Annex W IDCT implemented in the encoder of the H.263 testbed. PSNR (Y) versus frame number, Foreman CIF, 300 frames, QP=1, Annex T; decoder IDCTs: H263 Annex W, T001, T002, R001-R004, S001, CW01, C001, A001-A011, I001, M001-M003, N001, N002, FW001, B001.]

[Figure 6. Drift with respect to the MPEG-2 TM5 IDCT implemented in the encoder of the H.263 testbed. Same test conditions and decoder IDCTs as Figure 5.]

[Figure 7. Drift with respect to the XVID high-accuracy IDCT implemented in the encoder of the H.263 testbed. Same test conditions and decoder IDCTs as Figure 5.]

[Figure 8. Drift with respect to the 64-bit floating-point IDCT implemented in the encoder of the H.263 testbed. Same test conditions and decoder IDCTs as Figure 5.]

It can be seen in Figure 5 that the highest video fidelity is obtained when the H.263 IDCT is used in both the encoder and the decoder, and that the remaining IDCT implementations yield results considerably lower in PSNR, regardless of their precision performance with respect to the metrics specified in IEEE 1180. It can also be noted that the curves corresponding to IDCTs with higher-precision performance tend to be clustered together, ultimately centered around a curve corresponding to the use of the ideal IDCT in the decoder. We refer to this curve as the self-induced drift of the fixed-point IDCT in the encoder. Notably, we observe the same behavior when other (more precise) IDCTs are employed in the encoder. Referring to Figures 6 and 7, we also note that with an IDCT operating at higher precision in the encoder, the differences between the curves produced by the various IDCTs used in the decoder become larger. Still, there is a clear concentration around a self-induced drift curve. However, in Figure 8, when a 64-bit floating-point IDCT is used in the encoder, the results are dramatically different. Here, all curves corresponding to the use of a fixed-point IDCT in the decoder simply fall below the top curve (for the case where a floating-point IDCT is used in both the encoder and decoder). There is also a significant variation between the performances of the different fixed-point IDCT implementations.

3.2.2 Results Produced Using the MPEG-4 Encoder

Figures 9 through 12 provide a subset of the complete results that are produced using the MPEG-4 Part 2 Advanced Simple Profile encoder/decoder with the Foreman video sequence. In these results, we also illustrate the effects of enabling and disabling quarter-pixel motion compensation.

[Figure 9. Drift with respect to the MPEG-2 TM5 IDCT employed in the encoder of the MPEG-4 testbed, quarter-pixel motion compensation (QMC) enabled. PSNR (Y) versus frame number, Foreman CIF, 300 frames, QP=1, progressive, MPEG-2 quantization; decoder IDCTs as in Figure 5, plus the floating-point IDCT.]

As in the case of the H.263 drift results, here we see a concentration of drift curves around the self-induced drift curve of the IDCT in the encoder. We also note that the overall magnitude of drift when quarter-pixel interpolation is utilized is significantly larger than the corresponding drift produced using half-pixel interpolation and prediction.

[Figure 10. Drift with respect to the MPEG-2 TM5 IDCT employed in the encoder of the MPEG-4 testbed, QMC disabled. Same test conditions and decoder IDCTs as Figure 9.]

[Figure 11. Drift with respect to the 64-bit floating-point IDCT implemented in the encoder of the MPEG-4 testbed, QMC enabled. Same test conditions and decoder IDCTs as Figure 9.]

[Figure 12. Drift with respect to the 64-bit floating-point IDCT implemented in the encoder of the MPEG-4 testbed, QMC disabled. Same test conditions and decoder IDCTs as Figure 9.]

3.3 Connection Between the Drift and IEEE 1180 Accuracy Metrics

This section attempts to connect the observed drift performance of the various existing IDCT implementations with their IEEE 1180 precision characteristics. The horizontal lines in Figures 13 through 16 below mark the average self-induced drift of an IDCT used in the MPEG-4 encoder, i.e., the average drift values measured by using a particular fixed-point IDCT in the encoder and a 64-bit floating-point IDCT in the decoder.

[Figure 13. Average self-drift in terms of OMSE with respect to the 64-bit floating-point IDCT implemented in the decoder of the MPEG-4 testbed. Average drift (dB) in the first 131 frames versus omse; curves for drift with respect to the H.263 Annex W, MPEG-2 TM5, XVID, CW001, and floating-point IDCTs in the encoder.]

[Figure 14. Average self-drift in terms of PMSE with respect to the 64-bit floating-point IDCT implemented in the decoder of the MPEG-4 testbed; same configuration as Figure 13.]

[Figure 15. Average self-drift in terms of OME with respect to the 64-bit floating-point IDCT implemented in the decoder of the MPEG-4 testbed; same configuration as Figure 13.]

[Figure 16. Average self-drift in terms of PME with respect to the 64-bit floating-point IDCT implemented in the decoder of the MPEG-4 testbed; same configuration as Figure 13.]

It can be seen that, in all cases when an integer IDCT is used in the encoder, the drift results tend to concentrate around the self-induced drift of the IDCT employed in the encoder. This effect is observed for all metrics.

4. CONCLUSION

This paper presents the results of our experimental drift analysis conducted in conjunction with the development of ISO/IEC 23002-2 (MPEG-C Part 2), Fixed-Point 8x8 IDCT and DCT. Our main conclusion from this study is that drift is caused by mismatches between the IDCT implementations in the encoder and decoder, and that such drift is eliminated entirely only when these IDCT implementations are identical. Moreover, we present results that suggest that the

relationship between the precision of an IDCT approximation (in terms of IEEE 1180 metrics with respect to the ideal IDCT) and the prevention of drift is not easily identified. That is, our evidence suggests that the use of a highly precise IDCT in the decoder can eliminate drift only to the extent that the corresponding IDCT implementation used in the encoder is also precise. Hence, the goal of completely drift-free operation is not attainable by employing a highly precise IDCT in only the decoder (or only the encoder). Our study also suggests that the drift performance of a particular IDCT approximation should be analyzed by empirically measuring the drift resulting between the IDCT approximation implemented in the decoder (or encoder) and a typical or common set of IDCT implementations in the corresponding encoder (or decoder) system. This analysis is especially useful for assessing IDCT performance across a variety of existing bitstreams, encoders, and decoders. To aid in this analysis, ISO/IEC 23002 provides a set of drift testbeds with the goal of measuring drift in MPEG-2, H.263, and MPEG-4 (Part 2) under common test conditions.

REFERENCES

1. ITU-T Recommendation T.81 | ISO/IEC 10918-1: Information technology -- Digital compression and coding of continuous-tone still images -- Requirements and guidelines.
2. ISO/IEC 11172-2: Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s -- Part 2: Video.
3. ITU-T Recommendation H.262 | ISO/IEC 13818-2: Information technology -- Generic coding of moving pictures and associated audio information: Video.
4. ISO/IEC 14496-2: Information technology -- Coding of audio-visual objects -- Part 2: Visual.
5. ITU-T Recommendation H.263: Video Coding for Low Bit Rate Communication.
6. ANSI/IEEE 1180-1990, Standard Specifications for the Implementations of 8x8 Inverse Discrete Cosine Transform, December 1990 (withdrawn by ANSI 9 September 2001; withdrawn by IEEE 7 February 2003).
7. ISO/IEC 23002-1: Information technology -- MPEG video technologies -- Part 1: Accuracy requirements for implementation of integer-output 8x8 inverse discrete cosine transform, December 2006.
8. G. J. Sullivan, "Standardization of IDCT approximation behavior for video compression: the history and the new MPEG-C parts 1 and 2 standards," SPIE Applications of Digital Image Processing XXX, Proc. SPIE, Vol. 6696, August 2007 (this conference).
9. Yi-Shin Tung, Ja-Ling Wu, Jim Shiu, Zhixiong Wu, "Quality Degradation Problem: IDCT Mismatch Propagated and Scaled by MPEG-4 Quarter-pel Filtering," Moving Picture Experts Group (MPEG) document M9822, July 2003.
10. Z. Ni and L. Yu, "On the Problem of Quarter Pixel Motion Compensation," Moving Picture Experts Group (MPEG) document M14544, April 2007.
11. A. T. Hinds, Y. A. Reznik, P. Sagetong, "On IDCT Exactness, Precision, and Drift Problem," Moving Picture Experts Group (MPEG) document M13657, July 2006.
12. Z. Ni and L. Yu, "The Drift Problem of Fixed-Point IDCT on News Sequence," Moving Picture Experts Group (MPEG) document M13912, Oct 2006.
13. ISO/IEC JTC1/SC29/WG11 (MPEG), "Call for Proposals on Fixed-Point IDCT and DCT Standard," Moving Picture Experts Group (MPEG) output document N7335, Poznan, Poland, July 2005.
14. MPEG-2 TM5 source code and documentation.
15. XVID open source implementation of MPEG-4 ASP.

Publication 3

Cixun Zhang, Lu Yu, "Multiplier-less Approximation of the DCT/IDCT with Low Complexity and High Accuracy," Proceedings of SPIE Applications of Digital Image Processing XXX, Vol. 6696, San Diego, 28 Aug 2007.


Invited Paper

Multiplier-less Approximation of the DCT/IDCT with Low Complexity and High Accuracy

Cixun Zhang and Lu Yu
Institute of Signal Processing, Tampere University of Technology, Tampere, Finland
Institute of Information and Communication Engineering, Zhejiang University, Hangzhou, China

ABSTRACT

This paper presents a straightforward multiplier-less approximation of the forward and inverse Discrete Cosine Transform (DCT) with low complexity and high accuracy. The implementation, design methodology, and complexity and performance tradeoffs are discussed. In particular, the proposed IDCT implementations, in spite of their simplicity, comply with and can reach far beyond the MPEG IDCT accuracy specification ISO/IEC 23002-1, and also reduce drift favorably compared to other existing IDCT implementations.

Keywords: DCT, IDCT, transform, drift, fixed point, multiplier-less

1. INTRODUCTION

The Discrete Cosine Transform (DCT) is widely used in video and image coding applications, as it is the transform used in both the MPEG [1] and JPEG [2] standards. The Inverse Discrete Cosine Transform (IDCT) is the inverse process of the DCT. Theoretically, the DCT/IDCT is defined in terms of real-number operations, and its implementation complexity is high; thus, reducing the complexity is one of the most important issues when designing a DCT/IDCT implementation. Another important issue, specific to the IDCT implementation, is to reduce the drift problem, which is caused by different IDCT implementations in the encoder and decoder. In this paper, a straightforward multiplier-less approximation of the DCT/IDCT is proposed with low complexity and high accuracy. The proposed IDCT implementations comply with and can reach far beyond the MPEG IDCT accuracy specification ISO/IEC 23002-1 [3], and also reduce drift favorably compared to other existing IDCT implementations. The rest of the paper is organized as follows. In Section 2, a detailed description of the proposed implementation and design methodology is presented. Section 3 gives the experimental results, and Section 4 gives the complexity and accuracy analysis of the proposed implementations. Section 5 concludes the paper.

2. IMPLEMENTATION AND DESIGN

2.1. Implementation

For simplicity, we focus most of our discussion on the IDCT in the remainder of this paper; a similar design approach can be applied to the DCT. Similar to our previous work [6] [7], the fixed-point IDCT matrix is derived by (1); the fixed-point

This research is supported by NSFC and by the Project of Science and Technology Plan of Zhejiang Province under contract No. 2005C. This work was done when the first author was with the Institute of Information and Communication Engineering, Zhejiang University. Please send correspondence to Lu Yu (yul@zju.edu.cn).

DCT matrix is its transpose.

    IDCT_fp = round(2^SCALE · √8 · IDCT) / 2^SCALE

            = [ G  A  E  B  G  C  F  D
                G  B  F -D -G -A -E -C
                G  C -F -A -G  D  E  B
                G  D -E -C  G  B -F -A                                (1)
                G -D -E  C  G -B -F  A
                G -C -F  A -G -D  E -B
                G -B  F  D -G  A -E  C
                G -A  E -B  G -C  F -D ] / 2^SCALE

where the integer constants are A = round(2^SCALE·√2·cos(π/16)), B = round(2^SCALE·√2·cos(3π/16)), C = round(2^SCALE·√2·cos(5π/16)), D = round(2^SCALE·√2·cos(7π/16)), E = round(2^SCALE·√2·cos(2π/16)), F = round(2^SCALE·√2·cos(6π/16)), and G = 2^SCALE.

Such a fixed-point IDCT can be easily implemented based on an efficient 12-multiply-32-add butterfly structure first proposed by Loeffler et al. [8], as shown in Figure 1 (IDCT butterfly structure, not including rounding) and Figure 2 (DCT butterfly structure, including rounding) below. The apparent rounding offset used before the right shift in the final 1-D IDCT can be implemented by simply adding a constant to the DC term near the beginning of the IDCT process. The DCT butterfly structure is obtained by simply reversing the signal flow in the IDCT butterfly structure. Similarly, the rounding offset used before the right shift in the final 1-D DCT can be implemented by 3 additions in the middle of the process, as shown in Figure 2.

[Figure 1. IDCT butterfly structure (not including rounding).]
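Assuming the reconstruction of (1) above (entries A-G as rounded values of 2^SCALE·√2·cos(kπ/16), with G = 2^SCALE), the short C program below prints the integer constants for a given SCALE. It is a sketch of the derivation only; the original proposal may normalize slightly differently.

    #include <math.h>
    #include <stdio.h>
    #define PI 3.14159265358979323846

    int main(void)
    {
        const int  SCALE = 13;
        const char *name = "ABCDEFG";               /* per the matrix in (1) */
        const int  k[7]  = { 1, 3, 5, 7, 2, 6, 4 }; /* A B C D E F G         */
        for (int i = 0; i < 7; i++) {
            double v = sqrt(2.0) * cos(k[i] * PI / 16.0);
            printf("%c = round(2^%d * %.8f) = %ld\n",
                   name[i], SCALE, v, lround(v * (double)(1 << SCALE)));
        }
        return 0;
    }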

[Figure 2. DCT butterfly structure (including rounding).]

For a fixed-point IDCT matrix derived by (1), the multipliers in the butterfly structure in Figure 1 and Figure 2 can be obtained by (2), and are assured to be solvable and unique. This is mainly due to the fact that there is at most one multiplier in every path inside the butterfly structure. Actually, this characteristic also helps to achieve high accuracy.

    e0 = (E + F) / 2^SCALE
    e1 = F / 2^SCALE
    e2 = (E - F) / 2^SCALE
    d0 = (-A + B + C - D) / 2^SCALE
    d1 = ( A + B - C + D) / 2^SCALE
    d2 = ( A + B + C - D) / 2^SCALE                                   (2)
    d3 = ( A + B - C - D) / 2^SCALE
    d4 = ( B - D) / 2^SCALE
    d5 = ( A + B) / 2^SCALE
    d6 = ( B + C) / 2^SCALE
    d7 = ( B - C) / 2^SCALE
    d8 = B / 2^SCALE
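The following sketch evaluates the twelve multipliers of (2) numerically from the constants A-F at SCALE = 13, which is one way to sanity-check the reconstruction: the printed values should agree, to within the truncation of the shift-and-add forms, with the dyadic values realized by Table 1 later in this paper (e.g., e0 ≈ 1.8477, d8 ≈ 1.1759). Variable names are ours.

    #include <math.h>
    #include <stdio.h>
    #define PI 3.14159265358979323846

    int main(void)
    {
        const int S = 13;
        long A = lround((1 << S) * sqrt(2.0) * cos(1 * PI / 16));
        long B = lround((1 << S) * sqrt(2.0) * cos(3 * PI / 16));
        long C = lround((1 << S) * sqrt(2.0) * cos(5 * PI / 16));
        long D = lround((1 << S) * sqrt(2.0) * cos(7 * PI / 16));
        long E = lround((1 << S) * sqrt(2.0) * cos(2 * PI / 16));
        long F = lround((1 << S) * sqrt(2.0) * cos(6 * PI / 16));
        double s = (double)(1 << S);                     /* 2^SCALE */
        printf("e0=%.6f e1=%.6f e2=%.6f\n",
               (E + F) / s, F / s, (E - F) / s);
        printf("d0=%.6f d1=%.6f d2=%.6f d3=%.6f\n",
               (-A + B + C - D) / s, (A + B - C + D) / s,
               (A + B + C - D) / s,  (A + B - C - D) / s);
        printf("d4=%.6f d5=%.6f d6=%.6f d7=%.6f d8=%.6f\n",
               (B - D) / s, (A + B) / s, (B + C) / s, (B - C) / s, B / s);
        return 0;
    }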

By using right shifts rather than left shifts, the proposed methods can effectively reduce the bit widths needed in the implementation of the fixed-point IDCT compared to other IDCT implementations [6] [7] [13] [14]. However, in order to achieve high accuracy, it is necessary to pre-multiply all coefficients by a certain constant C, which is often chosen such that it can be implemented by a simple left shift before the transform process. That is, we have C = 2^UP_SCALE, where UP_SCALE denotes the number of reserved bits. The 1-D IDCT can then be operated directly, without spending any cycles on internal pre- or post-scaling, and finally, when both rows and columns are processed, we simply need to shift all resulting quantities by UP_SCALE+3 bits to the right. As mentioned above, the apparent rounding operation of the IDCT process can be achieved by simply adding the constant (1<<(UP_SCALE+2)) to the DC coefficient right after scaling, since there is no right shift after the first 1-D IDCT in the proposed methods. The whole process is denoted by (SCALE, UP_SCALE) in this paper. The block diagram of the 2-D IDCT is shown in Figure 3.

[Figure 3. 2-D IDCT.]

Note that the butterfly structures in Figure 1 and Figure 2 can possibly be adjusted to different (mathematically equivalent or not) structures according to different needs. For example, there are four 3-multiplication structures in the butterfly structures: 1) e0, e1, e2; 2) d0, d3, d4; 3) d1, d2, d5; 4) d6, d7, d8. Indeed, if we replace one or all of them with a simple plane rotation, a further reduction of the operation count can possibly be achieved with almost the same accuracy, using the technique of [12]. At the same time, this will facilitate other implementations and optimizations, e.g., changing non-scaled implementations into scaled implementations [11] [17].
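A minimal C sketch of the (SCALE, UP_SCALE) processing order described above is given below. The two 1-D butterfly routines are assumed and not shown, and arithmetic right shifts on negative values are assumed to behave as on typical two's-complement targets; the only points illustrated are the pre-scaling by 2^UP_SCALE, the rounding constant folded into the DC term, and the single final shift by UP_SCALE+3.

    /* Assumed: in-place 1-D butterflies of Fig. 1 on one row/column. */
    void idct_1d_row(int *blk, int row);
    void idct_1d_col(int *blk, int col);

    void idct_2d(int blk[64], int UP_SCALE)
    {
        for (int i = 0; i < 64; i++)     /* pre-multiply by C = 2^UP_SCALE   */
            blk[i] <<= UP_SCALE;
        blk[0] += 1 << (UP_SCALE + 2);   /* rounding offset on the DC term   */
        for (int r = 0; r < 8; r++)      /* rows, no intermediate scaling    */
            idct_1d_row(blk, r);
        for (int c = 0; c < 8; c++)      /* columns, no intermediate scaling */
            idct_1d_col(blk, c);
        for (int i = 0; i < 64; i++)     /* single final scaling of results  */
            blk[i] >>= UP_SCALE + 3;
    }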

2.2. Design Methodology

The exact multiplier-less implementation (using binary additions and shifts) of the multipliers in the butterfly structure is crucial to the accuracy of the IDCT implementation. If we use F(i) and I(i) to represent the output of a theoretical multiplier and of a practical multiplier when the input is i, then, from (2), one straightforward approach is to find a multiplier-less implementation that minimizes the approximation error, which, for example, can be calculated as (3) or (4) below:

    Approximation Error = Σ_{i=0}^{2^SCALE} (I(i) - F(i))^2           (3)

    Approximation Error = Σ_{i=0}^{2^SCALE} |I(i) - F(i)|             (4)

However, there may be some disadvantages to this approach:

1. The statistics of the input data of the different multipliers differ from one another. Moreover, it is difficult to estimate these statistics, since they may vary greatly under different situations: different sequences, coding methods, etc.
2. The optimization of the implementation of each multiplier is independent of the others. In fact, however, the truncation errors of the different multipliers interact with one another through the additions/subtractions in the butterfly structure.

Experiments also show that independent optimization of each multiplier using criteria like (3) and (4) does not yield the best implementation. On the other hand, exhaustive search is simply impossible because of the huge number of possible implementations. Therefore, the following methods are introduced to reduce the optimization complexity.

1. Only the multiplier-less implementations with the minimum number of adders are used [5] for each multiplier. Clearly, this also reduces the implementation complexity.
2. The first step of the multiplier-less implementation is restricted to the form 1±(1>>x). That is, the shift operations are delayed in the multiplier-less implementations of the multipliers. This not only reduces the number of multiplier-less implementations being considered, but also reduces the truncation error caused by the shifts, especially when the range of the input data is small, which is critical in reducing drift. At the same time, this restriction obviously does not affect the worst-case results (regardless of the input range of the random test or the number of samples in the random test) for the metrics measured in the MPEG IDCT accuracy specification ISO/IEC 23002-1 [3], but it improves the results when the input range is small.
3. Early termination is necessary in spite of methods 1 and 2. In practice, we can stop the search when the Overall Mean Square Error (OMSE) becomes steady. Different starting points are used to avoid ending in a bad local minimum as far as possible. The reason why OMSE is selected as the criterion is given in subsection 2.3 below.

Note that it is also possible to consider only multiplier-less implementations whose approximation error is less than a specified threshold. However, this is not as necessary as the above methods and was not used in our optimization. We used the above methods to optimize the original (13,10) proposed in [4]. Our optimized multiplier-less implementation of the multipliers in the improved (13,10) is given in Table 1. The reason why (13,10) is selected is that it is one of the most efficient methods among others, as will be detailed in Section 4. Note that this optimization method can also be used for other IDCT approximations with different butterfly structures.

2.3. Optimization Criterion Selection

There are basically six metrics defined in the MPEG IDCT accuracy specification ISO/IEC 23002-1 [3]: near-DC inversion, Peak Pixel Error (PPE), Pixel Mean Error (PME), Pixel Mean Square Error (PMSE), Overall Mean Error (OME), and Overall Mean Square Error (OMSE). Generally speaking, the near-DC inversion test result will be 0 when a suitable combination of SCALE and UP_SCALE is selected [4], while PPE will be 1 for fixed-point IDCTs; so they are not suitable as criteria for optimization. Among the other four, OMSE turns out to be the most important, because PME and OME do not really say how far away the approximate IDCT is from a reference one, but instead how well the errors produced by an approximate transform are balanced around 0 [9]. In fact, PME and OME can be effectively reduced by simply using a different rounding offset instead of 1/2 for the whole transform. At the same time, OMSE appears more important than PMSE because it takes the errors of all the coefficients into account. Therefore, we use OMSE as the criterion in our optimization process. In practice, we can use the OMSE in the range [-256, 255] instead of the maximum value over all ranges for accuracy considerations, and the OMSE in the range [-5, 5] (or an even smaller range like [-1, 1]) for drift-reduction considerations, because drift is generally most significant when a small QP is used, and the input into the transform is therefore small.
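The criteria (3) and (4) can be evaluated directly for any candidate shift-and-add form. The C sketch below does so for one multiplier, the e1 implementation from Table 1 below, rewritten on an explicit integer input, against the theoretical product F(i) = i·√2·cos(6π/16); the loop bound and both error sums follow (3) and (4). The function names are ours.

    #include <math.h>
    #include <stdio.h>
    #define PI 3.14159265358979323846

    /* Table 1 row for e1, with "1" replaced by the integer input i:
     * x0=i+(i>>4); x1=i+(x0>>2); x2=x0+(x1>>6); e1=x2>>1 */
    static long cand_e1(long i)
    {
        long x0 = i + (i >> 4);
        long x1 = i + (x0 >> 2);
        long x2 = x0 + (x1 >> 6);
        return x2 >> 1;
    }

    int main(void)
    {
        const double m = sqrt(2.0) * cos(6 * PI / 16);  /* ideal multiplier */
        double sq = 0, ab = 0;
        for (long i = 0; i <= (1L << 13); i++) {        /* 0..2^SCALE       */
            double e = (double)cand_e1(i) - i * m;      /* I(i) - F(i)      */
            sq += e * e;                                /* criterion (3)    */
            ab += fabs(e);                              /* criterion (4)    */
        }
        printf("sum of squared errors = %.1f, sum of abs errors = %.1f\n", sq, ab);
        return 0;
    }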
Table 1. Multiplier-less implementation of the multipliers in the improved (13,10)

  e0:  x0=1+(1>>2), e0=(1<<1)-(x0>>3)+(1>>8)                    4 shifts + 3 additions
  e1:  x0=1+(1>>4), x1=1+(x0>>2), x2=x0+(x1>>6), e1=(x2>>1)     4 shifts + 3 additions
  e2:  x0=1-(1>>4), e2=1-(x0>>2)-(1>>12)                        3 shifts + 3 additions
  d0:  x0=1+(1>>3), x1=x0+(x0>>4)-(1>>10), d0=(x1>>2)           4 shifts + 3 additions
  d1:  x0=1+(1>>3), d1=(1<<1)+(x0>>5)+(x0>>6)+(1>>11)           5 shifts + 4 additions
  d2:  x0=1+(1>>4), x1=x0+(1>>2), d2=(1<<1)+x0+(x1>>7)          4 shifts + 4 additions
  d3:  x0=1+(1>>10), d3=x0+(x0>>1)                              2 shifts + 2 additions
  d4:  x0=1+(1>>4), x1=1+(x0>>4), d4=1-(x1>>3)+(x1>>5)          4 shifts + 4 additions
  d5:  x0=1+(1>>7), x1=1+(x0>>3), d5=(1<<1)+(x1>>1)             4 shifts + 3 additions
  d6:  x0=1+(1>>2), d6=(1<<1)-(x0>>5)+(x0>>11)                  4 shifts + 3 additions
  d7:  x0=1+(1>>1), x1=1-(x0>>6), d7=(x1>>6)+(x0>>2)            4 shifts + 3 additions
  d8:  x0=1-(1>>5), x1=(x0>>6)+(1>>2), x2=x0-x1, d8=1+(x2>>2)   4 shifts + 4 additions
  Total                                                         46 shifts + 39 additions
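Each row of Table 1 maps mechanically onto C, with "1" standing for the input sample. Two rows are written out below as a sketch (function names are ours). At SCALE = 13, mul_e0(x) approximates 1.84766·x and mul_d8(x) approximates 1.17590·x, the dyadic forms of (E+F)/2^SCALE and B/2^SCALE from (2); with x = 2^13 the results are exact integers.

    #include <stdio.h>

    static int mul_e0(int x)   /* x0=1+(1>>2); e0=(1<<1)-(x0>>3)+(1>>8) */
    {
        int x0 = x + (x >> 2);
        return (x << 1) - (x0 >> 3) + (x >> 8);
    }

    static int mul_d8(int x)   /* x0=1-(1>>5); x1=(x0>>6)+(1>>2);
                                  x2=x0-x1;    d8=1+(x2>>2)            */
    {
        int x0 = x - (x >> 5);
        int x1 = (x0 >> 6) + (x >> 2);
        int x2 = x0 - x1;
        return x + (x2 >> 2);
    }

    int main(void)
    {
        printf("mul_e0(8192) = %d (expect 15136)\n", mul_e0(8192));
        printf("mul_d8(8192) = %d (expect 9633)\n",  mul_d8(8192));
        return 0;
    }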

3. EXPERIMENTAL RESULTS

The proposed improved (13,10) turns out to be very efficient. In Table 2 we list the MPEG IDCT accuracy specification ISO/IEC 23002-1 compliance results, based on the corresponding testbed in [10]. To show the effectiveness of our optimization method, we also compared the drift of the improved (13,10) with the original implementation of (13,10) in [4]; the results are shown in Figure 4 and Figure 5, where the suffixes ed, e, and d mean that the proposed IDCT was used in both encoder and decoder, only in the encoder, and only in the decoder, respectively, while for the latter two cases the 64-bit floating-point IDCT was used on the other end. In Figure 6 and Figure 7, we also compare the improved (13,10) to other existing IDCT implementations, such as the high-accuracy IDCT (13,9,20) proposed in [6] [7], the MPEG-2 TM5 IDCT [13] and the XVID high-accuracy IDCT [14]. It is quite clear that the improved (13,10) performs much better than all these IDCTs, and up to several dB of improvement in PSNR can be achieved. It actually also performs quite well compared to many other IDCTs, in spite of its simplicity, and more results can be found in [15] [16]. Note that the drift tests are carried out under an extreme worst case: only the first frame was coded as an intra frame, there were no intra macroblocks in P frames, and QP was set to 1.

[Table 2. MPEG IDCT accuracy specification ISO/IEC 23002-1 compliance results of the improved (13,10), for various numbers of input blocks Q, input data ranges [-L, H], and input data signs S. Columns: PPE (<=1), PMSE (<=0.06), OMSE (<=0.02), PME (<=0.015), OME (-0.0015<=OME<=0.0015). The numeric values are not reproduced here.]

Note: Q is the number of input blocks; L, H define the input data range [-L, H]; S is the sign of the input data. The near-DC inversion test result is 0.

[Figure 4. MPEG-2 test results, (13,10) vs. improved (13,10). PSNR (Y) versus frame number, Mobile CIF, 300 frames, QP=1, encoder's IDCT = 64-bit floating-point IDCT; curves: floating_ed, (13,10)e, (13,10)d, (13,10)ed, Improved(13,10)e, Improved(13,10)d, Improved(13,10)ed.]

[Figure 5. MPEG-4 test results, (13,10) vs. improved (13,10). PSNR (Y) versus frame number, Bus CIF, 150 frames, QP=1, MPEG-2 quantization, 1/4 pel enabled, encoder's IDCT = 64-bit floating-point IDCT; same curve set as Figure 4.]

[Figure 6. MPEG-2 test results. PSNR (Y) versus frame number, Foreman CIF, 300 frames, QP=1, encoder's IDCT = 64-bit floating-point IDCT; curves: floating-point IDCT, (13,9,20), improved (13,10), MPEG-2 TM5.]

[Figure 7. MPEG-4 test results. PSNR (Y) versus frame number, Foreman CIF, 300 frames, QP=1, MPEG-2 quantization, 1/4 pel disabled, encoder's IDCT = 64-bit floating-point IDCT; curves: floating-point IDCT, (13,9,20), improved (13,10), XVID IDCT.]

4. COMPLEXITY AND ACCURACY ANALYSIS

Different methods, with different accuracy and complexity tradeoffs, can easily be obtained by simply adjusting UP_SCALE, with all multiplier-less implementations of the multipliers in the butterfly structures unchanged. This allows designers to choose the implementation that best meets the requirements of their applications. For example, high-accuracy transforms (with more operations) can be used for medical image compression, and low-accuracy transforms (with fewer operations) can be used for printed images. In Table 3 we summarize the operation counts for SCALE equal to 11, 13, 14 and 16. Note that SCALE=11 is the minimum integer value that passes all the accuracy tests, and that SCALE=12 and SCALE=15 are not as effective as the others in terms of operation count [7]. In Figures 8-11 we plot PMSE, OMSE, PME, and OME (the worst-case absolute error, regardless of the input range of the random test or the number of samples in the random test in ISO/IEC 23002-1) versus UP_SCALE for the different SCALEs (the implementations in [4] were used). From these results we can see that, generally speaking, both parameters, SCALE and UP_SCALE, affect the accuracy, and higher accuracy needs a larger, or at least not smaller, SCALE or UP_SCALE. Also, the parameter SCALE determines the best level of accuracy that can be achieved (while a large UP_SCALE can improve the accuracy when the range of the input data is small). Moreover, for a certain SCALE, there is a best corresponding UP_SCALE in terms of the accuracy and complexity tradeoff, and from Figures 8-11 we can see that (13,10) is one of the most efficient methods among others. An interesting observation is that increasing SCALE can significantly reduce OMSE, and indeed very high accuracy can be achieved with a large SCALE, as shown in [4].

Table 3. Operation counts of different SCALEs

  SCALE   1-D (not including rounding)   2-D (including rounding)
  11      69 additions + 44 shifts       1105 additions + ... shifts
  13      71 additions + 46 shifts       1137 additions + ... shifts
  14      76 additions + 51 shifts       1217 additions + ... shifts
  16      82 additions + 58 shifts       1313 additions + ... shifts

[Figure 8. PMSE (<=0.06) vs. UP_SCALE for SCALE = 11, 13, 14, 16.]

[Figure 9. OMSE (<=0.02) vs. UP_SCALE for SCALE = 11, 13, 14, 16.]

[Figure 10. PME (<=0.015) vs. UP_SCALE for SCALE = 11, 13, 14, 16.]

[Figure 11. OME (<=0.0015) vs. UP_SCALE for SCALE = 11, 13, 14, 16.]

5. CONCLUSION

In this paper, a straightforward multiplier-less approximation of the forward and inverse DCT with low complexity and high accuracy is proposed. The implementation, design methodology, and complexity and performance tradeoffs are discussed. In particular, the proposed IDCT implementations, in spite of their simplicity, comply with and can reach far beyond the MPEG IDCT accuracy specification ISO/IEC 23002-1 when suitable parameters are selected. Among others, (13,10) was identified as one of the most efficient methods. It also reduces drift favorably compared to other existing IDCT implementations, such as the ones used in MPEG-2 TM5 and XVID, with improvements of up to several dB in PSNR.

REFERENCES

[1] J. L. Mitchell, W. B. Pennebaker, D. LeGall, and C. Fogg, MPEG Video Compression Standard, Chapman & Hall: New York, 1996.
[2] W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard, Van Nostrand Reinhold: New York, 1993.
[3] ISO/IEC JTC1/SC29/WG11 N7815 [23002-1 FDIS], Information technology -- MPEG video technologies -- Part 1: Accuracy requirements for implementation of integer-output 8x8 inverse discrete cosine transform.
[4] C.-X. Zhang, L. Yu, "Low Complexity and High Fidelity Fixed-Point Multiplier-less DCT/IDCT Implementation Scheme," MPEG Doc. M12936, Bangkok, Thailand, Jan 2006.
[5] O. Gustafsson, A. G. Dempster and L. Wanhammar, "Extended results for minimum-adder constant integer multipliers," Proc. IEEE International Symposium on Circuits and Systems (ISCAS), vol. 1, May 2002.
[6] C.-X. Zhang, J. Wang, L. Yu, "Extended Results for Fixed-Point 8x8 DCT/IDCT Design and Implementation," MPEG Doc. M12935, Bangkok, Thailand, Jan 2006.
[7] C.-X. Zhang, J. Wang, L. Yu, "Systematic Approach of Fixed Point 8x8 IDCT and DCT Design and Implementation," Picture Coding Symposium (PCS), April 2006.
[8] C. Loeffler, A. Ligtenberg, G. S. Moschytz, "Practical Fast 1-D DCT Algorithms with 11 Multiplications," Proc. ICASSP, 1989.
[9] Y. A. Reznik, "Considerations for Choosing Precision of MPEG Fixed Point 8x8 IDCT Standard," MPEG Doc. M13005, Bangkok, Thailand, Jan 2006.
[10] ISO/IEC JTC1/SC29/WG11 N8981 [23002-1 FPDAM1], Information technology -- MPEG video technologies -- Part 1: Accuracy requirements for implementation of integer-output 8x8 inverse discrete cosine transform, AMENDMENT 1: Software for integer IDCT accuracy testing.
[11] Y. A. Reznik, A. T. Hinds, C. Zhang, L. Yu, Z. Ni, "Response to CE on Convergence of Scaled and Non-Scaled IDCT Architectures," MPEG Doc. M13650, Klagenfurt, Austria, July 2006.
[12] Y. Voronenko and M. Puschel, "Multiplierless Multiple Constant Multiplication," ACM Transactions on Algorithms, vol. 3, issue 2, May 2007.
[13] MPEG-2 TM5 source code: ftp://ftp.mpegtv.com/pub/mpeg/mssg/mpeg2vidcodec_v12.tar.gz
[14] XVID open source implementation of MPEG-4 ASP.
[15] A. T. Hinds, Y. A. Reznik, P. Sagetong, "On IDCT Exactness, Precision, and Drift Problem," MPEG Doc. M13657, Klagenfurt, Austria, July 2006.
[16] Z. Ni, C. Zhang, L. Yu, "Updated MPEG-2 Testbed and Drift Test Results," MPEG Doc. M13599, Klagenfurt, Austria, July 2006.
[17] A. T. Hinds and J. L. Mitchell, "A Fast and Accurate Inverse Discrete Cosine Transform," Proceedings of the IEEE Workshop on Signal Processing Systems Design and Implementation, November 2005.

Publication 4

Cixun Zhang, Lu Yu, Jian Lou, Wai-Kuen Cham, Jie Dong, "The Technique of Pre-scaled Integer Transform: Concept, Design and Applications," IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), Vol. 18, Issue 1, Jan 2008.


The Technique of Prescaled Integer Transform: Concept, Design and Applications

Cixun Zhang, Lu Yu, Member, IEEE, Jian Lou, Student Member, IEEE, Wai-Kuen Cham, Senior Member, IEEE, and Jie Dong

Abstract -- Integer cosine transform (ICT) is adopted by H.264/AVC for its bit-exact implementation and significant complexity reduction compared to the discrete cosine transform (DCT), with an impact in peak signal-to-noise ratio (PSNR) of less than 0.02 dB. In this paper, a new technique, named prescaled integer transform (PIT), is proposed. With PIT, while all the merits of ICT are kept, the implementation complexity of the decoder is further reduced compared to the corresponding conventional ICT, which is especially important and beneficial for implementation on low-end processors. Since not all PIT kernels are good in respect of coding efficiency, design rules that lead to good PIT kernels are considered in this paper. Different types of PIT and their target applications are examined. Both fixed block-size transform and adaptive block-size transform (ABT) schemes of PIT are also studied. Experimental results show that no penalty in performance is observed with PIT when the PIT kernels employed are derived from the design rules. Up to 0.2 dB of improvement in PSNR for all-intra-frame coding compared to H.264/AVC can be achieved, and the subjective quality is also slightly improved when the PIT scheme is carefully designed. Using the same concept, a variation of PIT, the post-scaled integer transform, can also potentially be designed to simplify the encoder in some special applications. PIT has been adopted in the Audio Video coding Standard (AVS), the Chinese National Coding standard.

Index Terms -- Adaptive block-size transform (ABT), audio video coding standard (AVS), complexity reduction, discrete cosine transform (DCT), H.264/AVC, integer cosine transform (ICT), prescaled integer transform (PIT), standard, transform, video coding.

I. INTRODUCTION

INTEGER cosine transform (ICT) was first introduced by W. K. Cham in 1989 [1] and has been further developed in recent years. It has been proved that some ICTs have almost the same compression efficiency as the discrete cosine transform (DCT) but much simpler implementation, because only addition and shift operations are needed [2]. Moreover, ICT can avoid inverse

Manuscript received September 6, 2005; revised February 14. This work was supported by the Natural Science Foundation of China. This paper was recommended by Associate Editor I. Ahmad. C. Zhang was with the Institute of Information and Communication Engineering, Zhejiang University, Hangzhou, China. He is now with the Institute of Signal Processing, Tampere University of Technology, Tampere FIN-33101, Finland. L. Yu is with the Institute of Information and Communication Engineering, Zhejiang University, Hangzhou, China (yul@zju.edu.cn). J. Lou was with the Institute of Information and Communication Engineering, Zhejiang University, Hangzhou, China. He is now with the Department of Electrical Engineering, University of Washington, Seattle, WA, USA. W.-K. Cham is with the Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong. J. Dong was with the Institute of Information and Communication Engineering, Zhejiang University, Hangzhou, China. She is now with the Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

transform mismatch problems of the DCT. Due to these advantages, the latest international video coding standard, H.264/AVC, adopted order-4 and order-8 ICT transforms [3], [4]. In this paper, a technique called prescaled integer transform (PIT) is proposed. PIT can further reduce the implementation complexity of the corresponding ICT, while no penalty in performance is observed. The paper is organized as follows. Fundamentals of the DCT and ICT are first reviewed in Section II. Then the concept of PIT is introduced and examined in detail in Section III; the proposed PIT scheme and its benefits compared to a conventional ICT scheme, such as that in H.264/AVC, are elaborated. In Section IV, the design rules that lead to good PIT kernels in respect of performance are considered. We also find that different types of PITs have different characteristics and are suitable for different applications. Experimental results and analysis of the fixed block-size transform (FBT) scheme of PIT are given in Section V. In Section VI, adaptive block-size transform (ABT) schemes of PIT are also studied, and the corresponding experimental results and analysis are presented. Section VII concludes the paper.

II. FUNDAMENTALS OF THE DISCRETE COSINE TRANSFORM AND INTEGER COSINE TRANSFORM

The forward and inverse DCT [5] are defined as

    Y = C X C^T                                                       (1)
    X = C^T Y C                                                       (2)

where X and Y stand for the original input matrix and the DCT coefficient matrix, while C serves as both the forward and the inverse DCT kernel. ICT originates from the DCT and was derived using the principle of dyadic symmetry [1]. The forward and inverse ICT are defined as

    Y = T X T^T                                                       (3)
    X = T^T Y T                                                       (4)

where Y is the ICT coefficient matrix and T is the ICT kernel. The ICT kernel includes two parts, E and K [1]. For the same ICT, the choice of E and K is not unique, and can be represented as

    T = E K                                                           (5)

where the E, K used in the forward transform are denoted as E_f, K_f, and the E, K used in the inverse transform are denoted as E_i, K_i, respectively. The properties of E_f, K_f, E_i, K_i and their relationships with each other are described in the following paragraphs and will be frequently used in this paper. In this paper, we will concentrate on the order-4 and order-8 transforms, since they are the most useful ones in practical applications, and the discussions can easily be extended to transforms

other than order-4 and order-8. Traditionally, order-8 transforms have been used for image and video coding. The order-8 size has the advantage of being large enough to capture redundancy, while being small enough to provide good adaptation to the inhomogeneity inside an image. However, a small-size transform such as the order-4 transform has the advantage of reducing ringing artifacts at edges and discontinuities [2]-[4]. Because of these reasons, H.264/AVC uses both order-4 and order-8 ICTs [4].

For an order-8 ICT kernel specified as [a,b,c,d;e,f;g] in this paper, K_8 and E_8 in (5) are defined as follows, where the subscript 8 indicates that the transform size is 8x8:

    K_8 = [ g  g  g  g  g  g  g  g
            a  b  c  d -d -c -b -a
            e  f -f -e -e -f  f  e
            b -d -a -c  c  a  d -b
            g -g -g  g  g -g -g  g                                    (6)
            c -a  d  b -b -d  a -c
            f -e  e -f -f  e -e  f
            d -c  b -a  a -b  c -d ]

and E_8 is an 8x8 diagonal matrix whose jth diagonal element is 1/||k_j||, where k_j is the jth row vector of K_8. The values of a, b, c, d, e, f, and g in an ICT should be integers. However, they are sometimes expressed as rational numbers by suitably adjusting E. For example, in H.264/AVC, the choice of a, b, c, d, e, f, and g is 12/8, 10/8, 6/8, 3/8, 1, 4/8, and 1, which can be regarded as the same as 12, 10, 6, 3, 8, 4, and 8, respectively. Similarly, for an order-4 ICT kernel specified as [a,b;c] in this paper, K_4 and E_4 in (5) are defined as follows, where the subscript 4 indicates that the transform size is 4x4:

    K_4 = [ c  c  c  c
            a  b -b -a                                                (7)
            c -c -c  c
            b -a  a -b ]

where a, b, and c are integers, and E_4 is a 4x4 diagonal matrix whose jth diagonal element is 1/||k_j||, where k_j is the jth row vector of K_4. In H.264/AVC, the values of a, b, and c in the inverse transform are expressed as 1, 1/2, 1, which can be regarded as the same as choosing a, b, and c as 2, 1, 1, respectively.

In the rest of this paper, for convenience, we use the notation v(M), which is a column vector, to represent the main diagonal of an N x N matrix M, i.e.,

    v(M) = [m_00, m_11, ..., m_(N-1)(N-1)]^T                          (8)

Two operators, ⊗ and ⊘, used in this paper are defined as follows. When A, B, and C are all N x N matrices, C = A ⊗ B means

    c_ij = a_ij · b_ij                                                (9)

and, similarly, C = A ⊘ B means

    c_ij = a_ij / b_ij                                                (10)

III. CONCEPT OF PRESCALED INTEGER TRANSFORM

A. Conventional Integer Cosine Transform Scheme

Unlike the popular order-8 DCT used in previous standards such as MPEG-1/2/4, H.261 and H.263, H.264/AVC employs order-4 and order-8 ICTs, thus avoiding inverse transform mismatch problems. In H.264/AVC, on the encoder side, using (5), the forward transform and quantization process can be represented as (11). For convenience, we do not take the practical rounding operation in the quantization process into account, and here the quantized coefficient matrix is a real-number matrix rather than an integer matrix:

    Y_Q = ((K_f X K_f^T) ⊗ S_f) ⊘ Q = (K_f X K_f^T) ⊗ M_f             (11)

where X is the input matrix; Y_Q is the quantized transformed-coefficient matrix; Q is the quantization matrix on the encoder side and the dequantization matrix on the decoder side, which depends on the quantization parameter (QP) and, when uniform quantization is used, can simply be replaced by QP; S_f = v(E_f) v(E_f)^T is the forward scaling matrix; and M_f = S_f ⊘ Q is the quantization-scaling matrix. On the decoder side, the corresponding inverse transform and dequantization process is

    X' = K_i^T (Y_Q ⊗ Q ⊗ S_i) K_i = K_i^T (Y_Q ⊗ M_i) K_i            (12)

where X' is the reconstructed matrix, which is theoretically equal to X; S_i = v(E_i) v(E_i)^T is the inverse scaling matrix; and M_i = Q ⊗ S_i is the dequantization-scaling matrix. Equations (11) and (12) represent the merging of the forward/inverse scaling and the quantization/dequantization operations into one step, so as to reduce the computational complexity, as in H.264/AVC.

[Fig. 1. Block diagram of the conventional ICT scheme in H.264.]

Fig. 1 is the block diagram of the conventional ICT scheme. Note that, theoretically, M_f in (11) and M_i in (12) normally contain non-integer elements. However, in actual video coding standards, these are usually implemented using multiplications and shifts to reduce the complexity. For example, in H.264/AVC, M_f in (11) and M_i in (12) contain only integer elements, and a right-shift is applied to the results of (11) and (12).
B. Proposed Prescaled Integer Transform Scheme

Let us take the order-4 transform as an example first. In H.264/AVC, the QP period is 6, and a dequantization-scaling matrix V is used for the 4x4 ICT [2], [3], [32]:

    V = [ 10 16 13
          11 18 14
          13 20 16                                                    (13)
          14 23 18
          16 25 20
          18 29 23 ]

where row m = QP%6 holds the three distinct scaling values v(m,0), v(m,1), v(m,2).

For the (i,j)th transformed coefficient in a block, the search rule for the corresponding dequantization-scaling element in V is defined by

    V_ij = v(QP%6, 0)   for (i,j) with i%2 = 0 and j%2 = 0
    V_ij = v(QP%6, 1)   for (i,j) with i%2 = 1 and j%2 = 1            (14)
    V_ij = v(QP%6, 2)   else

where % is the modulus operator, and x%y means the remainder of x divided by y, defined only for integers x and y with x >= 0 and y > 0. The main problem here is that, for every transformed coefficient, we need to conduct a 3-D operation, which uses QP and the coordinates of the coefficient in the block, to search for the corresponding element in V. Some computations, as in (14), are also needed. In order to reduce the computational complexity, and at the same time facilitate parallel processing and take advantage of the efficient multiply/accumulate architecture of many processors, we can fully expand V to a 6 x 16 table V' holding, for each QP%6, the scaling value of each of the 16 coefficient positions (15), so that the search rule is simplified to

    V'_ij = v'(QP%6, 4i+j)                                            (16)

However, the storage will then increase from 6 x 3 = 18 bytes to 6 x 16 = 96 bytes. This memory size will be much larger when we also take the 8x8 ICT into account; in that case, a total memory of 96 + 6 x 64 = 480 bytes should be allocated. Fortunately, the required memory size and the computational complexity can in fact be reduced at the same time if the technique of PIT is used. The concept of PIT was first proposed by the authors in [6] and is elaborated below. Substituting (11) into (12), we get

    X' = K_i^T (((K_f X K_f^T) ⊗ S_f ⊘ Q) ⊗ Q ⊗ S_i) K_i
       = K_i^T ((K_f X K_f^T) ⊗ S_f ⊗ S_i) K_i                        (17)

Let

    M_CSQ = S_f ⊗ S_i ⊘ Q                                             (18)

which will be called the combined-scaling-quantization matrix in the following. Then the forward transform and quantization process can be represented as

    Y_Q = ((K_f X K_f^T) ⊗ S_f ⊗ S_i) ⊘ Q = (K_f X K_f^T) ⊗ M_CSQ     (19)

In order to derive the PIT kernel, we have

    S_f ⊗ S_i = v(E_f E_i) v(E_f E_i)^T                               (20)

and, therefore, the forward PIT kernel is

    T_Pf = E_f E_i K_f                                                (21)

Correspondingly, the inverse PIT kernel is

    T_Pi = K_i                                                        (22)

From (17)-(19) above, the inverse transform and dequantization process can be represented as

    X' = K_i^T (Y_Q ⊗ Q) K_i                                          (23)

From (23), we can also derive the theoretical inverse PIT kernel, which is the same as in (22). Based on the derivation above, we can theoretically define the forward and inverse PIT as

    Y = T_Pf X T_Pf^T                                                 (24)
    X = T_Pi^T Y T_Pi                                                 (25)

However, it should be noted that, similar to the case of the ICT shown in (11) and (12), we always implement PIT using (19) and (23), instead of using (24) and (25) directly. In the rest of this paper, for order-4 and order-8 ICT kernels [a,b;c] and [a,b,c,d;e,f;g], the corresponding order-4 and order-8 PIT kernels are denoted as [a,b;c] and [a,b,c,d;e,f;g], respectively. The main idea of PIT is that the inverse scaling is moved to the encoder side and combined with the forward scaling and quantization into one single process. The fact that no scaling is needed on the decoder side anymore distinguishes PIT from the conventional ICT, and this is exactly the reason why PIT can reduce the required memory size and the computational complexity on the decoder side at the same time.
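The complexity difference between the rule (14) and the QP-only rule that PIT enables (made explicit in (27) below) is easy to see in code. In the hedged C sketch that follows, the 6x3 table is the H.264 4x4 dequantization-scaling matrix V of (13), while the 6-entry PIT table q_pit[] is purely illustrative, since its actual values depend on the chosen PIT kernel and quantizer design.

    /* H.264 4x4 dequantization-scaling matrix V of (13). */
    static const int V[6][3] = {
        {10,16,13},{11,18,14},{13,20,16},{14,23,18},{16,25,20},{18,29,23}
    };

    /* Conventional ICT: 3-D rule of (14), using QP and the position (i,j). */
    static int dequant_ict(int qp, int i, int j)
    {
        int m = qp % 6;
        if ((i % 2 == 0) && (j % 2 == 0)) return V[m][0];
        if ((i % 2 == 1) && (j % 2 == 1)) return V[m][1];
        return V[m][2];
    }

    /* PIT: the position dependence is gone, so a 6-byte table indexed by
     * QP%6 alone suffices (example values only). */
    static const int q_pit[6] = { 10, 11, 13, 14, 16, 18 };
    static int dequant_pit(int qp) { return q_pit[qp % 6]; }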

[Fig. 2. Block diagram of the proposed PIT scheme.]

The block diagram of the PIT scheme is shown in Fig. 2. When PIT is used, the dequantization-scaling matrix (or, more accurately, the dequantization matrix, since no scaling is included any more) reduces to one entry per value of QP%6:

    Q_PIT = [q_0, q_1, q_2, q_3, q_4, q_5]                            (26)

and, for the (i,j)th transformed coefficient in a block, the search rule for the corresponding element is further simplified to

    V_ij = q_(QP%6)                                                   (27)

Comparing (26) and (27) to (13) and (14), and to (15) and (16), respectively, we can clearly see that the decoding complexity is reduced with PIT. First, if the dequantization-scaling matrix is fully expanded, when the order-4 ICT is used, a total memory of 96 bytes can be replaced by a memory of only 6 bytes; the memory required when using PIT is much lower than that required using the conventional ICT. Otherwise, if the dequantization-scaling matrix is not expanded, a total memory of 18 bytes is reduced to 6 bytes. Though this saving is trivial considering only the order-4 ICT, keeping in mind that in this case every non-zero coefficient needs a lookup operation and some extra operations when using the ICT, the computational complexity of PIT will be lower, because the 3-D lookup operation is replaced by a 1-D lookup operation using only QP, and at the same time no other extra operations, as in (14), are needed. In either case, on the decoder side the required memory size can be reduced, and the 3-D lookup operation can always be replaced by a 1-D lookup operation; thus pipelining and parallel processing are facilitated, and extra computation or storage memory can be saved. On the encoder side, comparing (11) to (19), we can find that, since the forward scaling, inverse scaling and quantization are combined into one single process and the original quantization-scaling matrix is replaced by the combined-scaling-quantization matrix of the same size, both the computational and the storage complexity remain unchanged.

We only considered the order-4 transform in the discussion above. When both order-4 and order-8 transforms are used, a total memory of 480 bytes can be saved, and a memory of only 6 bytes is needed instead, assuming that the order-4 and order-8 PITs employed are compatible, which will be discussed in more detail in Section VI. Further, besides H.264/AVC, in some other video coding standards, such as AVS Part 2 [12], which is the Chinese national coding standard for digital TV broadcasting and HD-DVD, the scheme of periodic QP is not used, in order to reduce computational complexity. In this case, PIT can provide an even larger saving of memory: only a memory of 64 bytes is needed, instead of a memory of 4096 bytes, because the QP range is 0-63 and the order-8 PIT is employed in AVS Part 2. In order to save memory, the scaling and the quantization/dequantization are separated in AVS Part 2. In this case, a memory of 64 bytes for the inverse scaling matrix can be saved, and at the same time the computational complexity is reduced when PIT is used, because no scaling is needed on the decoder side any more.

C. Post-Scaled Integer Transform

A variation of PIT might be the post-scaled integer transform, in which the scaling of the forward transform is moved to the decoder side. This can potentially be used to simplify the encoder in some kinds of applications. The issues of the post-scaled integer transform are similar to those of PIT. In this paper, we consider only PIT; similar analysis approaches can be carried out for the post-scaled integer transform.

D. Fixed-Point Error Analysis of Prescaled Integer Transform

In this part, we first summarize our previous work on the fixed-point error analysis of ICT and PIT, and then give some comparison between them. In [7], an analysis of the statistical error behavior due to rounding of the transformed coefficients for the DCT and ICT is presented. The analysis considers a system which forward transforms pixels into coefficients and then inverse transforms them back into reconstructed pixels. The transform kernel is assumed to be implemented precisely, but the transformed coefficients are represented using a finite number of bits, and so errors are generated. Both theoretical and experimental results show that [10,9,6,2;9,3;8] generates a smaller mean square error between the original and reconstructed pixels than the DCT when the same number of bits is used for the representation of the transformed coefficients. When the number of bits increases, this mean square error decreases, and it approaches zero faster in the case of the ICT than in the case of the DCT. In [8], a similar theoretical analysis and experiments are reported for [10,9,6,2;10,4;8], which is also found to generate a smaller mean square error (MSE) between the original and reconstructed pixels than the DCT when the same number of bits is used for the representation of the coefficients. Fig. 3 shows the experimental results for 1-D and 2-D transform systems. The image data are integers generated randomly with uniform distribution in [-128, 127]. Also, it is assumed that the 2-D transform introduces rounding error due to the use of a finite number of bits for the 2-D transform coefficients at the output only, and that no rounding error is introduced inside the 2-D transform.

The results in [7] and [8] were obtained under the assumption that the scaling matrices are implemented precisely, without error. Under this condition, we would expect PIT and the corresponding ICT to have the same performance. However, it is interesting to note that the implementation of S_i cannot be exact, because it contains irrational numbers, whereas when PIT is used the implementation of S_f ⊗ S_i can be exact, especially when K_f = K_i, which is usually true, because it then contains only rational numbers. Furthermore, if the PIT kernel is selected carefully, e.g., [2,1;1] or [3,1;2], then S_f ⊗ S_i can be represented using few bits. In this case, lossless coding of the PIT coefficients can be achieved easily. For the case of ICT, however, lossless coding of the transformed coefficients is difficult; for example, in H.264/AVC, only lossless coding of the residue is supported [4].
Fixed-Point Error Analysis of Prescaled Integer Transform In this part, we first summarize our previous work on fixedpoint error analysis of ICT and PIT, and then give some comparison between them. In [7], an analysis of the statistical error behavior due to rounding of the transformed coefficients for the DCT and ICT is presented. The analysis considers a system which forwardly transforms pixels into coefficients and then inversely transforms back into reconstructed pixels. The transform kernel is assumed to be implemented precisely but the transformed coefficients are represented using finite number of bits and so errors are generated. Both theoretical and experimental results show that [10,9,6,2;9,3;8] generates a smaller mean square error between the original and reconstructed pixels than the DCT when the same number of bits is used for the representation of the transformed coefficients. When the number of bits increases, such mean square error decreases. It approaches zero faster in the case of ICT than the DCT. In [8], similar theoretical analysis and experiments are reported for [10,9,6,2;10,4;8] which is also found to generate a smaller mean square error (MSE) between the original and reconstructed pixels than the DCT when the same number of bits is used for the representation of the coefficients. Fig. 3 shows the experiment results for 1-D and 2-D transform systems. The image data are integers generated randomly with uniform distribution between [ 128,127]. Also, it is assumed that the 2-D transform introduces rounding error due to the use of finite number of bits for the 2-D transform coefficients at the output only and no rounding error is introduced inside the 2-D transform. The results in [7] and [8] were obtained under the assumption that and are implemented precisely without error. Under this condition, we would expect that PIT and the corresponding ICT have the same performance. However, it is interesting to note that the implementation of cannot be exact because it contains irrational numbers but when PIT is used the implementation of can be exact especially when which is usually true because it contains rational numbers. Furthermore, if the PIT kernel is selected carefully, e.g., [2,1;1], [3,1;2], when, can be represented using few bits. In this case, lossless coding of PIT coefficients can be achieved easily. For the case of ICT, however, lossless coding of transformed coefficients is difficult. For example, in H.264/AVC, only lossless coding of residue is supported [4].

133 88 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 18, NO. 1, JANUARY 2008 Fig. 3. MSE of reconstructed pixels versus coefficient bit length in 1-D image vector and 2-D image data block [8]. Left: MSE of reconstructed pixels versus coefficient bit length in 1-D image vector. Right: MSE of reconstructed pixels versus coefficient bit length in 2-D image data block. IV. DESIGN OF PRESCALED INTEGER TRANSFORM Just like ICT, not every PIT kernels performs well. It is practically important to find design rules that lead to good PIT kernels. For convenience of discussion, (21) and (22) are rewritten here as (28) and (29) (28) (29) We can see that is constructed by and while is constructed by and. To choose good PIT kernels, we should take these two factors into account. It turns out that considering compression ability, we should choose a good and correspondingly while considering bit representation of transformed coefficients we should choose a good and correspondingly. In the following subsections, these two factors are examined in detail and based on the analysis, design rules are obtained. A. Consideration of Compression Ability In (28) and (29), and can be regarded as scaling factors applied to basis vectors of corresponding forward and inverse ICT kernels to construct corresponding PIT kernels. Since is diagonal, we can see that the normalized basis vectors of forward and inverse PIT kernels are the same as the corresponding normalized basis vectors of forward and inverse ICT kernels, respectively. This observation suggests that PIT kernels and corresponding ICT kernels have similar compression ability. In the following, transform coding gain [9] and DCT frequency distortion [10] are studied. 1) Transform Coding Gain: Strictly speaking, PIT is not an orthogonal transform, thus we should use biorthogonal transform coding gain [9] here. Biorthogonal transform coding gain measures the energy compacting ability of the transform in the transform coding system. Assume that the input vector is transformed into coefficient vector by an order-n forward transform, i.e., (30) in the transform do- then the covariance matrix of the vector main is given as (31) where is the expectation operator. If the input signal is modeled by a zero mean, first-order autoregressive (AR(1)) process which is characterized by the correlation coefficient, then the covariance matrix of the input sampled vector x has a form of Toeplitz matrix, i.e., is equal to (32) In this situation, the biorthogonal transform coding gain of a given transform is a function of the correlation coefficient and can be calculated analytically as follows [9]: (33) where is the norm of the th inverse transform basis vector. Note that (33) can also be used for orthogonal transform such as conventional ICT. Although and are often used as a measure of the compression ability of the given transform kernel, in this paper, we define (34) as the measure of the compression ability of a given transform, in order to take into account different correlation coefficients of the input data. This is because in advanced video coding standards such as H.264/AVC and AVS, the input data fed into the transform is the residue data after intra or inter prediction, and the correlation will be less than 0.9 in most cases [2]. Generally, larger indicates better compression ability of the given transform kernel.

2) DCT Frequency Distortion: DCT frequency distortion [10] measures the frequency distortion of a given transform with respect to the DCT. For an order-N transform with kernel T, a first-order frequency distortion and a second-order frequency distortion are defined in [10] as (35) and (36), respectively, computed from the projection of the basis vectors of T onto the DCT basis, as given by (37). (The explicit expressions did not survive transcription; see [10] for the definitions.)

3) Comparison of Prescaled Integer Transform and Integer Cosine Transform: In this subsection we discuss the relationship between a PIT kernel and its corresponding ICT kernel in terms of transform coding gain and DCT frequency distortion. For transform coding gain, from (33) we find that the coding gains of a certain ICT kernel and of its corresponding PIT kernel are given by (38) and (39), respectively. Because the diagonal scaling in (28) and (29) cancels between the coefficient variances and the inverse basis-vector norms, we have (40), and therefore (41). Equation (41) shows that PIT kernels are the same as the corresponding ICT kernels in terms of transform coding gain. For DCT frequency distortion, from (28) and (37) we have (42); since D_f is a diagonal matrix, it is easy to see (43) and (44). Equations (43) and (44) show that PIT kernels have the same DCT frequency distortion as the corresponding ICT kernels. So, basically, we can and should use good ICT kernels to obtain good PIT kernels.

B. Consideration of Weighting Factors of Transformed Coefficients

Besides compression ability, because the forward PIT kernel is not normalized, it is important to analyze the effect of D_f, regarded as scaling factors applied to the basis vectors of the corresponding forward ICT kernel. From (24) and (21) we obtain (45), which shows that the PIT coefficient matrix can be regarded as a weighted ICT coefficient matrix, with W the weighting matrix. While all coefficients of an orthogonal ICT have the same maximum and minimum values, those of a PIT do not, because the weighting matrix changes the relative magnitudes of the transformed coefficients. We define the weighting factor difference (WFD) of a PIT to represent the spread of the weighting factors applied to different transformed coefficients,

WFD = max(W) / min(W) (46)

where max(W) and min(W) are the largest and smallest elements of W. Deviation of the WFD of a PIT away from unity may cause a problem: it may result in truncation of some transformed coefficients that would be retained if W = wJ, where w is a scalar and J is the all-ones matrix. Unfortunately, this condition cannot always hold when PIT is used. In order not to change the transformed coefficients too much, so as to retain the compression ability of the corresponding ICT kernel, the elements of the weighting matrix

should have values close to one another. Another advantage of this consideration is that the change from an ICT to the corresponding PIT will be small, so that other parts of a codec need not be redesigned: the scan order of the transformed coefficients need not be changed, and the entropy coding table designed for the ICT can be reused for the PIT without much change. It is also worth noting that compensation for the scaling effect of D_f could be achieved with a customized quantization matrix, but additional complexity would be introduced. Generally, according to our experience, the following condition should be met:

WFD ≤ 2^(1/2). (47)

That is, the bit-width difference between the largest and smallest elements of W should be no larger than 1/2. One could expect that more sophisticated measures using all the elements of W, rather than only the maximum and minimum as in (46), would give a more precise evaluation; however, (47) already works rather well, as justified by the experimental results in Section V. Also, since many PITs with WFD less than 2^(1/2) have already been obtained and shown to perform well, there may be no need to search for PITs with WFD larger than 2^(1/2), because the smaller the WFD, the closer the PIT is to the corresponding ICT.

C. Design Rules

Based on the discussion above, we conclude that a good PIT kernel can be obtained by the following steps.
1) Obtain a good ICT kernel. To derive good ICT kernels, a computer search can be carried out systematically based on transform coding gain and DCT frequency distortion.
2) Among these, choose an ICT kernel whose corresponding WFD is not larger than 2^(1/2).

D. Types of Prescaled Integer Transform

Besides the design rules derived above, we have found that different types of PIT have different characteristics and are suitable for different applications, which is also an important issue in practice. There are generally two types of PIT, distinguished by the frequency scale factor (FSF). For an order-N PIT, the FSF measures the overall scaling effect of the weighting matrix W on the transformed coefficients and is defined in (48).
1) Type-I PITs, whose FSF is less than 1, have the characteristic that more high-frequency components may be quantized out compared to the corresponding ICT. Type-I PITs may be more suitable for video streaming and conferencing applications, where low-resolution (such as QCIF and CIF) video sequences are usually coded at relatively low bit rates, because Type-I PITs lead to bit savings without significantly degrading subjective quality in this situation.
2) Type-II PITs, whose FSF is larger than or equal to 1, have the characteristic that more high-frequency components may be retained after quantization compared to the corresponding ICT. Type-II PITs may be more suitable for entertainment-quality and other professional applications, where higher resolution (such as HD) video sequences are often coded at relatively high bit rates, because the chance is higher that more detailed texture will be preserved.

V. EXPERIMENTAL RESULTS OF FIXED BLOCK-SIZE TRANSFORM SCHEME WITH PRESCALED INTEGER TRANSFORM

In order to test the performance of different PIT kernels, extensive experiments have been carried out based on JM 7.6 and JM 9.3 [28]. The test conditions are listed in Tables I and II.

TABLE I: TEST CONDITIONS AND ENCODING PARAMETERS FOR ORDER-4 ICT AND PIT
TABLE II: TEST CONDITIONS AND ENCODING PARAMETERS FOR ORDER-8 ICT AND PIT
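Before turning to the experiments, note that the design checks of Section IV are trivial to automate once the weighting matrix W of a candidate PIT has been computed. In the sketch below W is simply an input (its derivation from the kernel follows (45) and is not reproduced), and the geometric mean of the weights is used as a stand-in for the paper's exact FSF definition (48), which is an assumption of this sketch.

```python
import numpy as np

def wfd(w):
    """Weighting factor difference (46): spread of the weighting factors."""
    w = np.asarray(w, dtype=float)
    return float(w.max() / w.min())

def meets_design_rule(w):
    """Condition (47): bit-width spread of W no larger than 1/2."""
    return wfd(w) <= 2 ** 0.5

def fsf(w):
    """Frequency scale factor stand-in: geometric mean of the weights
    (assumption; the exact definition (48) is not reproduced here)."""
    w = np.asarray(w, dtype=float)
    return float(np.exp(np.log(w).mean()))

def pit_type(w):
    """Type-I (< 1): high frequencies tend to be quantized out;
    Type-II (>= 1): high frequencies tend to survive quantization."""
    return "Type-I" if fsf(w) < 1.0 else "Type-II"

# Hypothetical weighting matrix, for illustration only.
w_example = np.array([[1.00, 0.95],
                      [0.95, 0.90]])
print(wfd(w_example), meets_design_rule(w_example), pit_type(w_example))
```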
The experiments on order-4 ICTs and PITs target video streaming and conferencing applications, so relatively large QPs and QCIF and CIF sequences are used. The experiments on order-8 ICTs and PITs target entertainment-quality and other professional applications, so relatively small QPs and HD sequences are used. In all experiments, context-based adaptive binary arithmetic coding (CABAC) [3], [4], [29] is used and the number of reference frames is set to two. Figs. 4 and 5 plot the transform coding gain, and the coding gain difference with respect to the DCT, of different order-4 and order-8 ICTs and PITs for correlation coefficients in the range [0, 1], while Tables III and IV give the transform coding gain, its average over ρ ∈ [0, 1], and the DCT frequency distortion of the same kernels. All the order-4 and order-8 ICTs and PITs chosen in our experiments have transform coding gains comparable to that of the DCT and little DCT frequency distortion, so they are expected to have good compression performance, as confirmed by the experimental results given in Tables V–VIII. The experimental results are presented as average peak signal-to-noise ratio (PSNR) gains computed with the method proposed in [11]. Some rate-distortion curves are given in Figs. 6 and 7. Among the ICTs and PITs in our experiments, [1,1/2;1] (equivalently [2,1;1]) and the order-8 kernel [12,10,6,3;8,4;8] are used in H.264/AVC. [10,9,6,2;9,3;8] was proposed in [1] because of its high decorrelation ability and relatively low complexity. [10,9,6,2;10,4;8] is adopted in AVS part 2 [12] because its implementation complexity is similar to

Fig. 4. Left: transform coding gain of the order-4 DCT and different order-4 ICTs. Right: transform coding gain difference of different order-4 ICTs compared with the order-4 DCT (a difference larger than 0 means that the ICT has better energy-compacting ability than the DCT, and vice versa). Note that the transform coding gains of these ICTs and the DCT are nearly the same, and the curves in the left figure overlap.
Fig. 5. Left: transform coding gain of the order-8 DCT and different order-8 ICTs. Right: transform coding gain difference of different order-8 ICTs compared with the order-8 DCT (a difference larger than 0 means that the ICT has better energy-compacting ability than the DCT, and vice versa). Note that the transform coding gains of these ICTs and the DCT are nearly the same, and the curves in the left figure overlap.
TABLE III: TRANSFORM CODING GAIN AND DCT FREQUENCY DISTORTION OF DIFFERENT ORDER-4 ICTS AND PITS
TABLE IV: TRANSFORM CODING GAIN AND DCT FREQUENCY DISTORTION OF DIFFERENT ORDER-8 ICTS AND PITS

[10,9,6,2;9,3;8], but it has a slightly higher transform coding gain, as shown in Fig. 5 and Table IV. [3,1;2], [5,2;4], and [9,4;7] were proposed in AVS part 7 (also known as AVS-M, where M stands for mobility), which targets video communication on mobile devices, with [3,1;2] finally adopted for its simplicity and satisfactory

TABLE V: COMPRESSION PERFORMANCE OF DIFFERENT ORDER-4 TRANSFORM MATRICES USING ICT COMPARED TO H.264/AVC
TABLE VI: COMPRESSION PERFORMANCE OF DIFFERENT ORDER-4 TRANSFORM MATRICES USING PIT COMPARED TO H.264/AVC
TABLE VII: COMPRESSION PERFORMANCE OF DIFFERENT ORDER-8 TRANSFORM MATRICES USING ICT COMPARED TO H.264/AVC
TABLE VIII: COMPRESSION PERFORMANCE OF DIFFERENT ORDER-8 TRANSFORM MATRICES USING PIT COMPARED TO H.264/AVC

performance [12]. [17,7;13] is used in the original H.264 design (also known as H.26L) [2] and has the advantage that all its basis vectors have the same norm; in this case, the PIT [17,7;13] is in fact identical to the ICT [17,7;13]. A technique similar to PIT, with the kernels [22,10;17] and [16,15,9,4;16,6;12], is used in WMV9/VC-1 [30], [31]. However, the elements of the scaling component of these two transform kernels are relatively large in order to meet the conditions mentioned in [30], which are much

stricter than (47); thus complexity is increased and some accuracy is lost when they are implemented in 16-bit arithmetic. Moreover, [22,10;17] and [16,15,9,4;16,6;12] are not completely compatible. Here, compatible means that different transforms can be implemented in a single transform unit rather than several: they can share the same forward/inverse transform butterfly structure and quantization-scaling/dequantization-scaling matrix [21]. The following conclusions can be drawn from the experimental results.
1) From Tables III and IV, we can see that the coding gain averaged over ρ ∈ [0, 1] gives a very good estimate of the energy-compacting ability of a given ICT or PIT, which is justified by the experimental results in Tables V–VIII. The coding gain at a fixed correlation coefficient may also be a reasonable measure, but Figs. 4 and 5 show that relying on a single correlation coefficient may not be appropriate, because the coding gain difference varies across correlation coefficients. DCT frequency distortion may also serve as an estimate of the compression performance of a given ICT or PIT, but it treats the distortion of every basis vector equally, where intuitively a weighting should be considered; furthermore, deviation from the DCT is not necessarily bad. Besides these two theoretical criteria, we should note that larger elements in the scaling component of an ICT or PIT kernel cause additional PSNR loss when the transform is implemented in 16-bit arithmetic.
2) From Tables V–VIII, we can see that a PIT generally performs as well as the corresponding ICT if (47) is met. The smaller the WFD of a given PIT, the smaller its PSNR difference from the corresponding ICT, as expected.
3) According to the experimental results in Tables VI and VIII, at the same QP Type-I PITs lead to lower bit rate and lower PSNR, while Type-II PITs result in higher bit rate and higher PSNR, compared to the corresponding ICTs. This is due to the weighting of the frequency components: the larger the FSF of a given PIT, the higher the bit rate and PSNR compared to the corresponding ICT.
4) From Table VI, we find that [3,1;2] performs as well as or even better than some other PITs, even though its transform coding gain is a little lower and its DCT frequency distortion a little larger. This is mainly because no right shift is needed when it is implemented in 16-bit arithmetic, so precision is retained. [3,1;2] is a Type-I PIT and is suitable for low-fidelity video coding. It is also the simplest PIT kernel we could find that meets (47). Due to its low complexity and satisfactory performance, [3,1;2] has been adopted in AVS part 7.
5) [10,9,6,2;10,4;8] has high transform coding gain and approximates the DCT very well; it outperforms the other PITs, as shown in Table VIII. [10,9,6,2;10,4;8] is a Type-II PIT and is suited to high-fidelity video coding. Due to its good performance and favorable complexity reduction, [10,9,6,2;10,4;8] has been adopted in AVS part 2.

VI. ADAPTIVE BLOCK-SIZE TRANSFORM SCHEME WITH PRESCALED INTEGER TRANSFORM

The fixed block-size transform scheme of PIT has been examined in detail in the previous sections. In this section, we study the ABT scheme of PIT.
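Before looking at the PIT-specific details, the ABT principle itself can be sketched in a few lines: for each block, the encoder codes the residual with each candidate transform size and keeps the one with the smaller RD cost. The sketch below is generic, not the reference-software logic; the quantization step, the Lagrange multiplier, and the rate proxy (nonzero-level count) are placeholder models.

```python
import numpy as np

def dct_mat(n):
    """Orthonormal order-n DCT-II matrix."""
    k, i = np.meshgrid(np.arange(n), np.arange(n))
    t = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k + 1) * i / (2 * n))
    t[0, :] = np.sqrt(1.0 / n)
    return t

def code_block(res, n, q):
    """Tile `res` with n x n DCTs and quantize with step q.
    Returns (distortion, rate proxy = number of nonzero levels)."""
    t = dct_mat(n)
    dist, bits = 0.0, 0
    for r in range(0, res.shape[0], n):
        for c in range(0, res.shape[1], n):
            blk = res[r:r + n, c:c + n]
            lev = np.round(t @ blk @ t.T / q)
            rec = t.T @ (lev * q) @ t
            dist += float(((blk - rec) ** 2).sum())
            bits += int(np.count_nonzero(lev))
    return dist, bits

def abt_choose(res, lam=10.0, q=8.0):
    """Pick the transform size minimizing J = D + lambda * R."""
    costs = {n: (lambda d_b: d_b[0] + lam * d_b[1])(code_block(res, n, q))
             for n in (4, 8)}
    return min(costs, key=costs.get)

rng = np.random.default_rng(1)
residual = rng.normal(0.0, 4.0, size=(16, 16))
print("chosen transform size:", abt_choose(residual))
```

A compatible transform pair, in the sense used above, lets both branches of such a loop share one butterfly structure and one scaling table, which is exactly what makes the PIT-based scheme attractive.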
ABT was introduced in H.26L but later removed because of its high implementation complexity. However, ABT not only improves coding efficiency significantly [13], but can also provide subjective bene-

fits, especially for HD movie sequences, from the point of view of subtle texture preservation [14], such as keeping film details and grain noise, which are crucial to subjective quality, especially for film content [15]. For this reason, ABT was reconsidered and adopted in the fidelity range extensions (FRExt) of H.264/AVC, with complexity reduction as the major concern [16]–[20]. Compatible ABT was one major consideration during the H.264/AVC FRExt standardization work, and the order-8 ICT adopted in H.264/AVC was specifically chosen to be compatible with the existing order-4 ICT [17]. While it is well known that the ICT [a,b,c,d;e,f;g] is compatible with [e,f;g], it is also easy to see that the corresponding PIT [a,b,c,d;e,f;g] is compatible with the PIT [e,f;g]: there is no need to change the butterfly structures when PIT is used, and it is shown in [21] that [e,f;g] can also reuse the combined scaling-quantization matrix of [a,b,c,d;e,f;g].

TABLE IX: COMPRESSION PERFORMANCE OF DIFFERENT ORDER-4 AND ORDER-8 TRANSFORM MATRICES USING PIT COMPARED TO H.264/AVC

It has been mentioned that keeping film grain in film content is crucial to subjective quality, especially for HD sequences. This also received much attention during the H.264/AVC FRExt standardization work [22]–[25]. As analyzed in [22], film grain preservation is always a challenge for encoders, because film grain is temporally uncorrelated by nature, so the large compression gains brought by temporal prediction cannot be exploited efficiently. As a result, most of the film grain remains in the prediction error and appears as small high-frequency coefficients in the DCT domain, and it is therefore typically quantized out together with other noise over a wide range of QP values. Even at high bit rates, film grain can only be encoded and preserved at a high compression cost. To allow encoding film grain more efficiently at lower bit rates, H.264/AVC standardizes a film grain characteristics supplemental enhancement information (SEI) message, which allows the encoder to generate a parameterized model of the film grain statistics instead of encoding the exact film grain, and to send it along with the video data to the decoder as information for film grain synthesis in a post-process after decoding [4], [22], [23], [25], [32]. At lower bit rates, the film grain characteristics SEI message has proved to be a good tool to improve picture quality for sequences with film grain. However, it was reported that the SEI message is not accepted by the movie industry, where higher target bit rates are used, mainly because it is very difficult, even impossible, to reach transparent picture quality for arbitrary input sequences and to have full control of the decoded sequence [26]. From the viewpoint of keeping details, Type-II PITs are more suitable, because film grain and other subtle texture that would otherwise be quantized out have a higher chance of survival. From the experimental results in Section V, it can be inferred that a compatible ABT scheme using [5,2;4] and [10,9,6,2;10,4;8]/2 is promising, because both have high compression efficiency and are Type-II PITs. Extensive experiments have been carried out to evaluate the performance of [5,2;4] and [10,9,6,2;10,4;8]/2. The objective results, compared with the H.264/AVC high profile and the ICT counterpart (ICT [5,2;4] and ICT [10,9,6,2;10,4;8]/2), are given in Table IX. The operational rate-distortion curves are shown in Fig.
8, and the subjective quality comparison for the stockholm sequence¹ is presented in Fig. 9. The test conditions are the same as those listed in Table II, except that both order-4 and order-8 transforms are used. For the subjective quality comparison, rate control is used and the sequence is encoded at 6 Mbit/s with an IBBP structure. Based on the experimental results in Table IX, the following conclusions can be drawn. 1) Both ABT schemes, ICT [5,2;4] + ICT [10,9,6,2;10,4;8]/2 and PIT [5,2;4] + PIT [10,9,6,2;10,4;8]/2, outperform the one in H.264/AVC. Three GOP structures have been tested: IPPP, IBBP, and intra-frame only. We can see that the PSNR gain for IPPP and IBBP is less than 0.05 dB on average, mainly because of the high efficiency of the motion compensation; the gain can be expected to be larger if the number of reference frames is reduced from two to one and a smaller search range for motion

¹ [Online]. Available: ftp://ftp.ldv.e-technik.tu-muenchen.de/pub/test_sequences

estimation is used in our experiments. However, the PSNR gain for intra-frame-only coding is not trivial: it reaches almost 0.2 dB on average, keeping in mind that H.264/AVC does not rely much on the transform for decorrelation [27]. Noting that in real entertainment-quality and other professional applications, such as DVD-Video systems and HDTV, where HD sequences are generally used, frequent periodic intra-coded pictures are typical in order to enable fast random access, the ABT schemes ICT [5,2;4] + ICT [10,9,6,2;10,4;8]/2 and PIT [5,2;4] + PIT [10,9,6,2;10,4;8]/2 are good in terms of coding efficiency. 2) The subjective results in Fig. 9 show that the ABT scheme of PIT [5,2;4] and PIT [10,9,6,2;10,4;8]/2 is slightly better than both its ICT counterpart and that of H.264/AVC: more details are preserved. It should be noted that in our implementation of this ABT scheme, on the decoder side, a memory of only 6 bytes is allocated to store the matrix used for dequantization (49), and therefore only a 1-D lookup operation is needed. Thus, good performance and complexity reduction are achieved at the same time when the ABT scheme of PIT [5,2;4] and PIT [10,9,6,2;10,4;8]/2 is used.

Fig. 8. Rate-distortion curves of H.264/AVC, ICT [5,2;4] + ICT [10,9,6,2;10,4;8]/2, and PIT [5,2;4] + PIT [10,9,6,2;10,4;8]/2 for the harbour sequence: (1) HD, IBBP, 60 fps; (2) HD, IPP, 30 fps; (3) HD, intra-frame only, 30 fps.
Fig. 9. Subjective quality for the stockholm sequence (720p, IBBP, 30 fps, 6 Mbit/s, 26th frame, local detail). Top-left: original; top-right: PIT [5,2;4] + PIT [10,9,6,2;10,4;8]/2; bottom-left: ICT [5,2;4] + ICT [10,9,6,2;10,4;8]/2; bottom-right: H.264/AVC.
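The 6-byte dequantization table works because the positions of an 8×8 coefficient block fall into only a few symmetry classes, so a full 8×8 matrix never needs to be stored. The sketch below illustrates the idea; the class map mirrors the row/column symmetry of H.264/AVC-style order-8 kernels, and the table values are placeholders, not the AVS ones.

```python
# Row/column symmetry classes for an order-8 kernel:
# 0 if i % 4 == 0, 1 if i is odd, 2 otherwise (i = 2, 6).
def pos_class(i):
    return 0 if i % 4 == 0 else (1 if i % 2 else 2)

# 3 x 3 unordered class pairs collapse to 6 distinct scales, hence a
# 6-byte table. The values below are placeholders, for illustration only.
DEQUANT_SCALE = bytes([16, 18, 20, 22, 25, 28])

def table_index(i, j):
    """Map coefficient position (i, j) to its index in the 1-D table."""
    lo, hi = sorted((pos_class(i), pos_class(j)))
    return lo * 3 - lo * (lo - 1) // 2 + (hi - lo)   # pairs (0,0)..(2,2) -> 0..5

# Sanity check: all 64 positions hit exactly 6 distinct table entries.
assert len({table_index(i, j) for i in range(8) for j in range(8)}) == 6

level = -3                                           # hypothetical quantized level
print(level * DEQUANT_SCALE[table_index(1, 4)])      # 1-D lookup, then scale
```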

VII. CONCLUSION

In this paper, a new technique, named PIT, is proposed to further reduce the implementation complexity of conventional ICT schemes such as the one adopted in H.264/AVC, without any sacrifice in performance. Since not all PIT kernels perform well, design rules that lead to good PIT kernels are developed in this paper. It is also found that different types of PIT have different characteristics and are suitable for different applications. PIT has been adopted in AVS, the Chinese national coding standard, with [10,9,6,2;10,4;8] in AVS part 2 and [3,1;2] in AVS part 7, due to its reduced implementation complexity and good performance. Besides the fixed block-size transform scheme, a compatible ABT scheme of PIT is also studied in this paper, because of its higher coding efficiency and subjective benefits. A compatible ABT scheme using [5,2;4] and [10,9,6,2;10,4;8]/2 is proposed, and it is shown that a gain of up to 0.2 dB in PSNR for all-intra coding can be achieved compared to the counterpart in H.264/AVC. Besides, subjective quality is also slightly improved, because more subtle texture is preserved. Using the same concept, a variation of PIT, the post-scaled integer transform, could potentially be used to simplify the encoder in some kinds of applications.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their useful comments, which improved this paper.

REFERENCES

[1] W. K. Cham, "Development of integer cosine transforms by the principle of dyadic symmetry," Proc. IEE, Part I, vol. 136, no. 4, Aug. 1989.
[2] H. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, "Low-complexity transform and quantization in H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, Jul. 2003.
[3] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, Jul. 2003.
[4] G. J. Sullivan, P. N. Topiwala, and A. Luthra, "The H.264/AVC advanced video coding standard: Overview and introduction to the fidelity range extensions," in Proc. SPIE, Appl. Dig. Image Process. XXVII, vol. 5558, Aug. 2004.
[5] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. Comput., vol. C-23, no. 1, Jan. 1974.
[6] C.-X. Zhang, J. Lou, L. Yu, J. Dong, and W. K. Cham, "The technique of pre-scaled integer transform," in Proc. IEEE ISCAS, May 2005.
[7] W. K. Cham, "Integer sinusoidal transform," Adv. Electron. Electron Phys., vol. 88.
[8] W. K. Cham and C. K. Fong, "ICT fixed point error performance analysis," in Proc. Int. Symp. Intell. Multimedia, Video Speech Process., Oct. 2004.
[9] J. Liang and T. D. Tran, "Fast multiplierless approximations of the DCT with the lifting scheme," IEEE Trans. Signal Process., vol. 49, Dec. 2001.
[10] M. Wien and S. Sun, "ICT comparison for adaptive block transform," Doc. VCEG-L12, Eibsee, Germany, Jan. 2001.
[11] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," Doc. VCEG-M33, Austin, TX, Apr. 2001.
[12] L. Yu, F. Yi, J. Dong, and C. Zhang, "Overview of AVS-Video: Tools, performance and complexity," in Proc. VCIP, Beijing, China, Jul. 2005.
[13] M. Wien, "Variable block size transforms for H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, Jul. 2003.
[14] S. Gordon, "Adaptive block transform for film grain reproduction in high definition sequences," Doc. JVT-H029, Geneva, Switzerland, May 2003.
[15] T. Wedi, Y. Kashiwagi, and T.
Takahashi, H.264/AVC for next generation optical disc: A proposal on FRExt profiles, Doc. JVT-K025. Munich, Germany, Mar [16] M. Wien, Clean-up and improved design consistency for ABT, Doc. JVT-E025. Geneva, Switzerland, Oct [17] F. Bossen, ABT cleanup and complexity reduction, Doc. JVT-E087. Geneva, Switzerland, Oct [18] S. Gordon, Simplified use of transform, Doc. JVT-I022. San Diego, CA, Sep [19] S. Gordon, D. Marpe, and T. Wiegand, Simplified use of transform Proposal, Doc. JVT-J029. Waikoloa, HI, Dec [20] S. Gordon, D. Marpe, and T. Wiegand, Simplified use of transform Results, Doc. JVT-J030. Waikoloa, HI, Dec [21] J. Dong, J. Lou, C.-X. Zhang, and L. Yu, A new approach to compatible adaptive block-size transforms, in Proc. VCIP, Beijing, China, Jul. 2005, pp [22] C. Gomila and A. Kobilansky, SEI message for film grain encoding, Doc. JVT-H022. Geneva, Switzerland, May [23] C. Gomila, SEi message for film grain encoding: Syntax and results, Doc. JVT-I013. San Diego, CA, Sep [24] M. Schlockermann, S. Wittmann, and T. Wedi, Film grain coding in H.264/AVC, Doc. JVT-I034. San Diego, CA, Sep [25] C. Gomila and J. Llach, Film grain modeling versus encoding, Doc. JVT-K036. Munich, Germany, Mar [26] T. Wedi and S. Wittmann, Quantization with an adaptive dead zone size for H.264/AVC FRExt, Doc. JVT-K026. Munich, Germany, Mar [27] A. Puri, X. Chen, and A. Luthra, Video coding using the H.264/ MPEG-4 AVC compression standard, Signal Process.: Image Commun., vol. 19, no. 9, pp , Oct [28] JM7.6 and JM9.3 H.264 reference software [Online]. Available: iphome.hhi.de/suehring/tml [29] D. Marpe, H. Schwarz, and T. Wiegand, Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard, IEEE Trans. Circuits Syst. Video Technol, vol. 13, no. 7, pp , Jul [30] S. Srinivasan, P. J. Hsu, T. Holcomb, K. Mukerjee, S. L. Regunathan, B. Lin, J. Liang, M. C. Lee, and J. Ribas-Corbera, Windows media video 9: Overview and Applications, Signal Process.: Image Commun., vol. 19, no. 9, pp , Oct [31] S. Srinivasan and S. L. Regunathan, An overview of VC-1, in Proc. VCIP, Beijing, China, Jul. 2005, pp [32] Joint Video Team of ITU-T VCEG and ISO/IEC MPEG, Advanced Video Coding (AVC) ITU-T Rec. T.81 and ISO/IEC (MPEG-4 Part 10), Mar Cixun Zhang received the B.Eng. and M.Eng. degrees in information engineering from Zhejiang University, Hangzhou, China, in 2004 and 2006, respectively. He is currently working toward the Ph.D. degree in Department of Information Technology at Tampere University of Technology, Tampere, Finland. Since August 2006, he has been a Researcher in the Institute of Signal Processing, Tampere University of Technology, Tampere, Finland. His research interests include video compression and communication. Mr. Zhang was the recipient of the AVS special award from the AVS working group of China in 2005.

142 ZHANG et al.: TECHNIQUE OF PRESCALED INTEGER TRANSFORM 97 Lu Yu (M 00) received the B.Eng. degree (hons.) in radio engineering and the Ph.D. degree in communication and electronic systems from Zhejiang University, Hangzhou, China, in 1991 and 1996, respectively. Since 1996, she has been with the faculty of Zhejiang University, Hangzhou, China, and is presently Professor of information and communication engineering. She was a Senior Visiting Scholar in University Hannover and the Chinese University of Hong Kong in 2002 and 2004, respectively. Her research areas include video coding, multimedia communication, and relative ASIC design, in which she is principal investigator of a number of national research and development projects and inventor or co-inventor of 14 granted and 13 pending patents. She published more than 80 academic papers and contributed 119 proposals to international and national standards in the recent years. Dr. Yu acts as the chair of the video subgroup of Audio Video coding Standard (AVS) of China and she was also the co-chair of implementation subgroup of AVS from 2003 to She organized the 15th International Workshop on Packet Video as a General Chair in She organized two special sessions and gave five invited talks and tutorials in international conferences. Now she serves as a member of Technical Committee of Visual Signal Processing and Communication of IEEE Circuits and Systems Society. Wai-Kuen Cham (S 77 M 79 SM 91) graduated from the Chinese University of Hong Kong in 1979 in electronics. He received the M.Sc. and Ph.D. degrees from Loughborough University of Technology, Loughborough, U.K., in 1980 and 1983, respectively. From 1984 to 1985, he was a Senior Engineer with Datacraft Hong Kong Limited and a Lecturer in the Department of Electronic Engineering, Hong Kong Polytechnic. Since May 1985, he has been with the Department of Electronic Engineering, the Chinese University of Hong Kong. His research interests include image coding, image processing and video coding. Jie Dong received the B.Eng. and M.Eng. degrees in Information Engineering from Zhejiang University, Hangzhou, China, in 2002 and 2005, respectively. She is currently working toward the Ph.D. degree in electronic engineering at the Chinese University of Hong Kong. Her research interests include HD video compression and processing. Jian Lou (S 07) was born in Hangzhou, Zhejiang, China. He received the B.E. and M.E. degrees in information science and electronic engineering from Zhejiang University, Hangzhou, China. He is currently working toward the Ph.D. degree in electrical engineering at the University of Washington, Seattle. His research interests include video processing and video compression. He has interned with several research laboratories including Microsoft Corporation, IBM T.J. Watson Research Center, Thomson R&D Laboratory, and Mitsubishi Electric Research Laboratories.

143 Publication 5 Cixun Zhang, Kemal Ugur, Jani Lainema, Antti Hallapuro, Moncef Gabbouj, Video Coding Using Spatially Varying Transform IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), Vol. 21, Issue. 2, Feb. 2011, pp


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 2, FEBRUARY 2011

Video Coding Using Spatially Varying Transform

Cixun Zhang, Student Member, IEEE, Kemal Ugur, Jani Lainema, Antti Hallapuro, and Moncef Gabbouj, Fellow, IEEE

Abstract—In this paper, a novel algorithm called spatially varying transform (SVT) is proposed to improve the coding efficiency of video coders. SVT enables video coders to vary the position of the transform block, unlike state-of-the-art video codecs, where the position of the transform block is fixed. In addition to changing the position of the transform block, the size of the transform can also be varied within the SVT framework, to better localize the prediction error so that the underlying correlations are better exploited. It is shown in this paper that by varying the position of the transform block and its size, the characteristics of the prediction error are better localized, and the coding efficiency is thus improved. The proposed algorithm is implemented and studied in the H.264/AVC framework. We show that the proposed algorithm achieves 5.85% bitrate reduction compared to H.264/AVC on average over a wide test set. Gains become more significant at medium to high bitrates for most tested sequences, and the bitrate reduction may reach 13.50%, which makes the proposed algorithm very suitable for future video coding solutions focusing on high-fidelity video applications. The gain in coding efficiency is achieved with similar decoding complexity, which makes the proposed algorithm easy to incorporate in video codecs. However, the encoding complexity of SVT can be relatively high, because of the need to perform a number of rate-distortion optimization (RDO) steps to select the best location parameter (LP), which indicates the position of the transform. In this paper, a novel low-complexity algorithm, operating on a macroblock and a block level, is also proposed to reduce the encoding complexity of SVT. Experimental results show that the proposed low-complexity algorithm can reduce the number of LPs to be tested in RDO by about 80%, with only a marginal penalty in coding efficiency.

Index Terms—H.264/AVC, spatially varying transform (SVT), transform, variable block-size transforms (VBT), video coding.

Manuscript received April 16, 2009; revised September 29, 2009, February 1, 2010, and May 10, 2010; accepted September 11, 2010. Date of current version March 2, 2011. This work was supported by the Academy of Finland (Finnish Program for Centers of Excellence in Research). This paper was recommended by Associate Editor C.-W. Lin. C. Zhang and M. Gabbouj are with the Tampere University of Technology, Tampere, Finland (e-mail: cixun.zhang@tut.fi; moncef.gabbouj@tut.fi). K. Ugur, J. Lainema, and A. Hallapuro are with Nokia Research Center, Tampere, Finland (e-mail: kemal.ugur@nokia.com; jani.lainema@nokia.com; antti.hallapuro@nokia.com).

I. INTRODUCTION

H.264/AVC is the latest international video coding standard, providing up to 50% gain in coding efficiency compared to previous standards. However, this is achieved at the cost of increased encoding and decoding complexity. It is estimated in [1] that the encoder complexity increases by more than one order of magnitude between Moving Picture Experts Group (MPEG)-4 Part 2 (Simple Profile) and H.264/AVC (Main Profile), and by a factor of 2 for the decoder.
For mobile video services (video telephony, mobile TV, and others) and handheld consumer electronics (digital still cameras, camcorders, and others), the additional complexity of H.264/AVC becomes an issue due to the limited resources of these devices. On the other hand, as display resolutions and available bandwidth/storage increase rapidly, high-definition (HD) video is becoming more popular and commonly used, making the implementation of video codecs even more challenging. To better satisfy the requirements of the increased usage of HD video in resource-constrained applications, two key issues should be addressed: coding efficiency and implementation complexity. In this paper, we propose a novel algorithm, namely, spatially varying transform (SVT), which provides coding efficiency gains over H.264/AVC. The concept of SVT was first introduced by the authors in [2] and later extended in [3]. The technique was developed and studied mainly for coding HD resolution video, but it also proved useful for coding lower resolution video. The motivations leading to the development of SVT are twofold.
1) The typical block-based transform design in most existing video coding standards does not align the underlying transform with possible edge locations, and coding efficiency decreases as a result. In [4], directional discrete cosine transforms (DCTs) were proposed to improve the efficiency of transform coding for directional edges. However, efficient coding of horizontal, vertical, and nondirectional edges was not addressed, although in natural video sequences such edges are typically more common than directional edges. This is the main reason why only a very marginal overall gain was achieved, as stated in [4]. Toy examples are given in (1)–(3), representing a vertical edge, a nondirectional edge, and a directional edge, respectively: E1, E3, and E5 are original prediction error matrices; E2, E4, and E6 are the corresponding prediction error matrices after the transform is aligned (by shifting it horizontally and vertically) with the edge location; and Ci (i = 1, 2, ..., 6) is the 2-D DCT coefficient matrix of Ei. [The numeric matrices of (1)–(3) are omitted here.] Fig. 1 shows the energy distribution of the transform coefficients, where the x-axis denotes the sum of the horizontal and vertical frequency indices of the coefficients in the transform coefficient matrix, and the y-axis denotes the percentage of the energy of the transform coefficients.
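The effect the toy examples illustrate is easy to reproduce numerically: place an edge inside a larger residual region, then compare how much coefficient energy stays at low frequencies when the 8×8 DCT window sits at the conventional fixed position versus when it is shifted so the discontinuity falls on the block boundary. The sketch below is illustrative only; the edge pattern and the "low frequency" cut-off (frequency-index sum below 4) are arbitrary choices, not the paper's.

```python
import numpy as np

def dct_mat(n=8):
    """Orthonormal order-n DCT-II matrix."""
    k, i = np.meshgrid(np.arange(n), np.arange(n))
    t = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k + 1) * i / (2 * n))
    t[0, :] = np.sqrt(1.0 / n)
    return t

def low_freq_energy_fraction(block):
    """Fraction of 2-D DCT energy at coefficients with index sum < 4."""
    t = dct_mat(block.shape[0])
    c = t @ block @ t.T
    i, j = np.indices(c.shape)
    return float((c[i + j < 4] ** 2).sum() / (c ** 2).sum())

# A 16x16 residual containing a vertical step edge at column 5.
res = np.zeros((16, 16))
res[:, 5:] = 10.0

fixed = res[0:8, 0:8]      # fixed window: the step cuts through the block
shifted = res[0:8, 5:13]   # window shifted so the step lies on its boundary

print("fixed  :", low_freq_energy_fraction(fixed))
print("shifted:", low_freq_energy_fraction(shifted))   # flat content -> pure DC
```

With the window shifted onto the edge, the windowed residual is constant and all of its energy collapses into the DC coefficient, which is the concentration effect shown in Fig. 1.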

146 128 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 2, FEBRUARY 2011 We can see that by aligning the transform to the edge location, more energy in the transform domain concentrates in the low frequency part and this facilitates the entropy coding and improves the coding efficiency. It is interesting that this is true not only for horizontal, vertical, and nondirectional edges but also for directional edges as follows: E 3 = C 3 = E 1 = C 1 = E 2 = C 2 = E 4 = C 4 = E 5 = C 5 = E 6 = C 6 = after 2 D DCT after 2 D DCT after 2 D DCT after 2 D DCT after 2 D DCT (1) after 2 D DCT (2).(3) Fig. 1. Energy distribution of the transform coefficients before and after shifting the transform. 2) Coding the entire prediction error signal may not be the best in terms of rate distortion (RD) tradeoff. An example is the SKIP mode in H.264/AVC [5], which does not code the prediction error at all. We note that this fact is useful in improving the coding efficiency especially for HD video coding since, generally, a certain image area in HD video sequences contains less detail than in lower resolution video sequences, and the prediction error tends to have lower energy and more noise like after better prediction. More importantly, as we will see, utilizing this fact, our proposed algorithm can be implemented elegantly, e.g., on a macroblock basis, without boundary problem. Both of these two factors contribute to the coding efficiency improvement achieved by the SVT [2]. The basic idea of SVT is that we do not restrict the transform coding inside regular block boundaries but adjust it to the characteristics of the prediction error. With this flexibility, we are able to achieve coding efficiency improvement by selecting and coding the best portion of the prediction error in terms of RD tradeoff. Generally, this can be done by searching inside a certain residual region after intra prediction or motion compensation, for a subregion and only coding this subregion. Note that, when the region refers to a macroblock, the proposed algorithm can be considered as a special SKIP mode, where part of the macroblock is skipped. Finally, the location parameter (LP) indicating the position of the subregion inside the region is coded into the bitstream if there are nonzero transform coefficients. It is worth noting that the idea of shifting the transform, as used in the SVT, has been used in video and image denoising, where typically several denoised estimates produced for each shift of the transform are combined in a certain manner to produce the final denoised output [6] [9]. It is often used in post-processing of a decoded image or a video frame since its complexity is generally too high to be incorporated in an image or video decoder, for instance, as an in-loop filter. Also, the boundary problem of this algorithm has not received much attention, and the performance tends to become worse at the

147 ZHANG et al.: VIDEO CODING USING SPATIALLY VARYING TRANSFORM 129 boundary and when it is applied to a smaller area such as a macroblock rather than a whole image or a video frame. By contrast, our proposed algorithm has no boundary problem and can be applied elegantly, e.g., on a macroblock basis. Also, using the proposed algorithm, there is no complexity penalty to the decoder since the LP of the subregion inside the region is coded in the bitstream and the decoder can simply use this information to reconstruct the region. For these reasons, the proposed algorithm can be easily incorporated in an image or a video decoder. In this case, various denoising algorithms [6] [9] may still be efficiently used as post-processing algorithms. In this paper, the proposed algorithm is studied and implemented in H.264/AVC framework. We show that 5.85% bitrate reduction is achieved compared to H.264/AVC on average over a wide range of test sets. Gains become more significant at medium to high bitrates for most tested sequences and the bitrate reduction can reach 13.50%, which makes the proposed algorithm very suitable for future video coding solutions focusing on high fidelity video applications. The main drawback of the proposed algorithm is that the encoding complexity is higher mainly due to the brute force search process to select the best LP for the coded subregion. Nevertheless, the additional encoding complexity of the proposed algorithm is acceptable, for offline encoding application cases. For other cases with strict requirement of encoding complexity, a fast algorithm should be used. For such applications, we propose a fast algorithm operating on macroblock and block level to reduce the encoding complexity of SVT [10]. The proposed fast algorithm first skips testing SVT for macroblock modes for which SVT is unlikely to be useful by examining the RD cost of macroblock modes without SVT on a macroblock level. For the remaining macroblocks that SVT may be useful, the proposed fast algorithm selects available candidate LP based on motion difference and utilizes a hierarchical search algorithm to select the best available candidate LP at the block level. Experimental results show that the proposed fast algorithm can reduce the number of candidate LPs that need to be tested in search process by about 80% with only a marginal penalty in the coding efficiency. The remainder of this paper is organized as follows. The proposed algorithm is introduced in Section II and its integration into H.264/AVC framework is described in Section III. The fast algorithm is introduced in Section IV. Experimental results are given in Section V. Section VI concludes this paper. II. SVT Transform coding is widely used in video coding standards to decorrelate the prediction error and achieve high compression rates. Typically, in video coding, a transform is applied to the prediction error at fixed locations. However, this has several drawbacks that may hurt the coding efficiency and decrease the visual quality. First of all, if the localized prediction error at a fixed location has a structure that is not suitable for the underlying transform, many high frequency transform coefficients will be generated thus requiring a large number of bits to code. Consequently, the coding efficiency decreases. Moreover, notorious visual artifacts such as ringing may appear when these high frequency coefficients get quantized. In this paper, SVT is proposed to reduce these drawbacks of transform coding. 
The basic idea of SVT is that transform coding is not restricted to regular block boundaries but can be applied to a portion of the prediction error, according to the characteristics of the prediction error. In this paper, we select and code a single block inside a macroblock when applying SVT; the extension to multiple blocks is straightforward. This choice reflects the fact that coding on a macroblock basis reduces complexity, delay, and throughput requirements, which eases implementation, especially in hardware. This means that the position and shape of the transform block within a macroblock are variable, and information about its shape and location is signaled to the decoder when there are nonzero transform coefficients. However, it should be noted that there is no restriction on the size and shape of a subregion and a region, and thus the proposed idea could easily be extended to cover arbitrary sizes and shapes. For instance, SVT could be applied to an extended macroblock [11] of size larger than 16×16. Directional SVT, with a directionally oriented block selected and coded with a corresponding directional transform [4], could also be used. In the following subsections, we discuss three key issues of SVT in more detail: selection of the SVT block size, selection and coding of candidate LPs, and filtering of SVT block boundaries.

A. Selection of SVT Block-Size

In this paper, an M×N SVT is applied on a selected M×N block inside a macroblock of size 16×16, and only this M×N block is coded. It is easy to see that there are in total (17−M)×(17−N) possible LPs if the M×N block is coded by an M×N transform. The following factors are taken into account when choosing suitable M and N.
1) Generally, larger M and N result in fewer possible LPs and less overhead, and vice versa.
2) Larger M and N also result in lower distortion of the reconstructed macroblock but require more bits for coding the transform coefficients, and vice versa.
3) Finally, a larger transform is more suitable for coding relatively flat areas or non-sharp edges without introducing coding artifacts such as ringing, while a smaller transform is more suitable for coding areas with details or sharp edges.
To facilitate the transform design, we can further assume M = 2^m, N = 2^n, where m and n are integers ranging from 0 to 4 inclusive in our context.¹ In this paper, we use 8×8 SVT, 16×4 SVT, and 4×16 SVT, illustrated in Figs. 2 and 3, to show the effectiveness of the proposed algorithm. Nevertheless, it should be noted that for sequences with different characteristics, other SVT block sizes may be more suitable in terms of coding efficiency, complexity, and so on. For instance, larger block-size SVT,

¹ Note that SKIP mode is the special case where m and n are both equal to 0 (and, in the context of H.264/AVC, the macroblock partition is 16×16 and the motion vector equals its predicted value), and the 16×16 transform is another special case, where m and n are both equal to 4.

Fig. 2. Illustration of 8×8 SVT.
Fig. 3. Illustration of 16×4 and 4×16 SVT.

such as 16×8 SVT or 8×16 SVT, may be preferable for coding flat areas and when a higher quantization parameter (QP) is used. It is well established in the video coding community that variable block-size transforms (VBT) can improve coding efficiency [12]. Using VBT within the SVT framework, namely, variable block-size SVT (VBSVT), we will show that the prediction error is better localized and the underlying correlations are better exploited; thus, the coding efficiency is improved compared to a fixed block-size SVT. However, when the number of different SVT block sizes becomes large, the coding gain of VBSVT tends to saturate, since more overhead bits are needed to code the LPs. In this paper, we adaptively choose among 8×8 SVT, 16×4 SVT, and 4×16 SVT in VBSVT. It may also prove beneficial to choose different SVT block sizes at different QPs and bitrates. It is also possible, and beneficial, for the encoder to code SVT parameters such as M and N (or m and n) in the bitstream and transmit them in the slice header in the H.264/AVC framework.

The transforms used for the 8×8, 16×4, and 4×16 SVTs are described in the following. In general, the separable forward and inverse 2-D transforms of a 2-D signal can be written, respectively, as

C = T_v X T_h^t (4)
X_r = T_v^t C T_h (5)

where X is a matrix representing an M×N pixel block, C is the transform coefficient matrix, and X_r is a matrix representing the reconstructed signal block. T_v and T_h are the M×M and N×N transform kernels in the vertical and horizontal directions, respectively, and the superscript t denotes matrix transposition. For 8×8 SVT, we simply reuse the 8×8 transform kernel of H.264/AVC [5]. For 16×4 SVT and 4×16 SVT, a 4×4 and a 16×16 transform kernel need to be specified. Here, we use the 4×4 transform kernel of H.264/AVC [5], [13] and the 16×16 transform kernel of [14], because the latter is simple and can reuse the butterfly structure of the existing 8×8 transform in H.264/AVC. Different transform kernels could of course be used; the choice of kernel is not the main focus of the proposed algorithm. In all cases, the normal zig-zag scan (for frame-based coding, which is used in our experiments) is used to order the transform coefficients as input symbols to the entropy coding. Different scan patterns were also tested, but without much gain in coding efficiency. The SVT block size for the luma component can be used to derive the corresponding SVT block size for the chroma components: taking the 4:2:0 chroma format as an example, if an M×N SVT is used for the luma component, then an M/2×N/2 SVT can be used for the chroma components.

B. Selection and Coding of Candidate LPs

When the selected SVT block has nonzero transform coefficients, its location inside the macroblock needs to be coded and transmitted to the decoder. For 8×8 SVT, as shown in Fig. 2, the location of the selected 8×8 block inside the current macroblock can be denoted by (Δx, Δy), selected from the set Δ_{8×8} = {(Δx, Δy) : Δx, Δy = 0, ..., 8}. There are in total 81 candidates for 8×8 SVT. For 16×4 SVT and 4×16 SVT, as shown in Fig. 3, the locations of the selected 16×4 and 4×16 blocks inside the current macroblock can be denoted by Δy and Δx, selected from the sets Δ_{16×4} = {Δy : Δy = 0, ..., 12} and Δ_{4×16} = {Δx : Δx = 0, ..., 12}, respectively.
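The candidate sets are small enough to enumerate directly. The sketch below lists them and verifies the counts quoted above (81 positions for 8×8 SVT, 13 each for 16×4 and 4×16); the tuple encoding of an LP is ours, not the bitstream syntax.

```python
# Candidate location parameters (LPs) inside a 16x16 macroblock:
# an M x N block fits at (17 - M) x (17 - N) positions.
LP_8x8 = [(dx, dy) for dx in range(9) for dy in range(9)]
LP_16x4 = [("dy", dy) for dy in range(13)]    # 16x4: only a vertical offset
LP_4x16 = [("dx", dx) for dx in range(13)]    # 4x16: only a horizontal offset

assert len(LP_8x8) == (17 - 8) * (17 - 8) == 81
assert len(LP_16x4) == len(LP_4x16) == 17 - 4 == 13

def lp_count(m, n):
    """Number of possible LPs for an M x N SVT block."""
    return (17 - m) * (17 - n)

print(lp_count(8, 8), lp_count(16, 4), lp_count(4, 16))
```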
There are in total 26 candidates for 16×4 SVT and 4×16 SVT. The best LP is then selected according to a given criterion. In this paper, rate-distortion optimization (RDO) [15] is used to select the best LP in terms of RD tradeoff, by minimizing

J = D + λR (6)

where J is the RD cost of the selected configuration, D is the distortion, R is the bitrate, and λ is the Lagrangian multiplier. In our implementation, the reconstruction residual for the remaining part of the macroblock is simply set to 0, but different values could be used and might be beneficial in certain cases (luminance change, and others). It might also be useful to utilize the reconstructed SVT block and neighboring macroblocks to derive the reconstructed values for the remaining part of the macroblock. The set of candidate LPs is also important, since it directly affects the encoding complexity and the performance of SVT. A larger number of candidate LPs provides more room for coding efficiency improvement, and is more robust across sequences with different characteristics, but adds encoding complexity and overhead. Experimentally, according to criterion (6), we choose to use Δ_{8×8} = {(Δx, Δy) : Δx = 0, ..., 8, Δy = 0; Δx = 0, ..., 8, Δy = 8; Δx = 0, Δy = 1, ..., 7; Δx = 8, Δy = 1, ..., 7} for 8×8 SVT, and Δ_{16×4} = {Δy : Δy = 0, ..., 12} and Δ_{4×16} = {Δx : Δx = 0, ..., 12} for 16×4 SVT and 4×16 SVT, which shows

good RD performance over a large test set [2]. There are in total 58 candidate LPs, and statistics over all test sequences used in the experiments show that the measured entropy of the LP index is 5.73 bits. Accordingly, the LP index is represented by a 6-bit fixed-length code in our implementation; more advanced representations could be designed for better compression. These LPs are shown to offer a good tradeoff between coding efficiency and complexity across different test sequences. As the overhead bits for coding the LPs become significant at low bitrates, it would be useful to choose different LP sets at different QPs and bitrates. Similarly, an indication of the available LPs can be coded and transmitted in the slice header in the H.264/AVC framework. The LP for the luma component can be used to derive the corresponding LPs for the chroma components. Taking the 4:2:0 chroma format as an example, the chroma LPs can be derived from the luma LP as

Δx_C = (Δx_L + 1) >> 1, Δy_C = (Δy_L + 1) >> 1 (7)

where Δx_L, Δy_L are the LPs for the luma component, and Δx_C, Δy_C are the LPs for the chroma components.

C. Filtering of SVT Block Boundaries

Due to the coding (transform and quantization) of the selected SVT block, coding artifacts may appear around its boundary with the remaining, non-coded part of the macroblock. A deblocking filter can be designed and applied to improve both the subjective and objective quality. An example in the H.264/AVC framework is described in detail in Section III-D.

III. Implementing SVT in the H.264/AVC Framework

The proposed technique is implemented and tested in the H.264/AVC framework. Fig. 4 shows the block diagram of the extended H.264/AVC encoder with SVT. The encoder first performs motion estimation and decides the optimal mode to be used, and then searches for the best LP (illustrated as "SVT Search" in the diagram) if SVT is used for the macroblock. The encoder then calculates the RD cost of using SVT with (6), and if a lower RD cost can be achieved with SVT, the macroblock is coded with SVT. For macroblocks that use SVT, the LPs are coded and transmitted in the bitstream when the selected SVT block has nonzero transform coefficients; they are decoded in the box marked "SVT L.P. Decoding" in the diagram. We note, however, that the normal criteria, e.g., sum of absolute differences or sum of squared differences, used in these encoding processes for macroblocks that do not use SVT may not be optimal for macroblocks that do; better encoding algorithms are under study. Several key parts of the H.264/AVC standard [5], for instance macroblock types, coded block pattern (CBP), entropy coding, and deblocking, need to be adjusted. The proposed modifications, aimed at good compatibility with H.264/AVC, are described in the following subsections.

Fig. 4. Block diagram of the extended H.264 encoder with SVT.

TABLE I: EXTENDED MACROBLOCK TYPES FOR P SLICES IN H.264/AVC WITH SVT
mb_type 0: P_16×16
mb_type 1: P_16×16_SVT
mb_type 2: P_16×8
mb_type 3: P_16×8_SVT
mb_type 4: P_8×16
mb_type 5: P_8×16_SVT
mb_type 6: P_8×8
mb_type 7: P_8×8_SVT
mb_type 8: P_8×8ref0 (a)
mb_type 9: P_8×8ref0_SVT
Inferred (b): P_Skip

(a) P_8×8ref0 is a macroblock mode defined in H.264/AVC with 8×8 macroblock partition, for which the reference index shall be inferred to be equal to 0 for all sub-macroblocks of the macroblock [5].
(b) No further data is present for a P_Skip macroblock in the bitstream; the mode is inferred [5].

A. Macroblock Type

We focus our study of SVT on coding the inter prediction error in P slices, although the idea can easily be extended to I and B slices. Table I shows the extended macroblock types for P slices in H.264/AVC [5] (the original intra macroblock types of H.264/AVC are not included). We use SVT for the 16×16, 16×8, 8×16, and 8×8 macroblock partitions, based on the observation that SVT is used in a considerable portion of macroblocks for each of these partitions. This is shown in Fig. 5, where the statistics are gathered over all the test sequences and all test QPs used in our experiments, with context-adaptive variable-length coding (CAVLC) as the entropy coding. The macroblock type index is coded in a similar way as in H.264/AVC. The sub-macroblock types are unchanged and therefore not shown in the table.

B. CBP

In this implementation, SVT is used for coding only the luma component. As shown in Figs. 2 and 3, since only one block is selected and coded in macroblocks that use SVT, we can either use 1 bit for the luma CBP or code it jointly with the chroma CBP, as is done in H.264/AVC [5] when CAVLC is used as the entropy coding. However, in our experiments with

150 132 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 2, FEBRUARY 2011 many test sequences, we found that luma CBP is often equal to 1 in high fidelity video coding, which is our main focus in this paper. Based on this observation, we restrict the new macroblock modes to have luma CBP equal to 1 and thus there is no need to code it. Chroma CBP is not changed and it is represented in the same way as in H.264/AVC [5]. An alternative way would be to infer the luma CBP according to QP or bitrate. C. Entropy Coding In H.264/AVC [5], when CAVLC is used as the entropy coding, a different coding table for the total number of nonzero transform coefficients and trailing ones of current block is selected depending on the characteristics (the number of nonzero transform coefficients) of the neighboring blocks. For macroblocks that use SVT, a fixed coding table is used for simplicity without loss of coding efficiency. Besides, we may also need to derive the information about the number of nonzero transform coefficients contained in each 4 4 luma block. When the selected SVT block aligns with the regular block boundaries, no special scheme is needed. Otherwise, the following two-step scheme with negligible complexity increase is used in our implementation as follows. Step 1: A 4 4 luma block is marked to have nonzero transform coefficients if it overlaps with a coded block that has nonzero transform coefficients in the selected SVT block, and marked not to have nonzero transform coefficients otherwise. This information may also be used in other process, e.g., deblocking. Step 2: The number of nonzero transform coefficients, nc B, for each 4 4 block that is marked to have nonzero transform coefficients, is empirically set to nc B = (nc MB + n B /2 )/n B (8) where nc MB is the total number of nonzero transform coefficients in the current macroblock and n B is the number of blocks marked to have nonzero transform coefficients, and. denotes the floor operator. In (8), we (approximately) distribute the total number of nonzero transform coefficients evenly to all the blocks that are marked as having nonzero transform coefficients. This is easy to calculate and it is shown experimentally to have similar coding efficiency compared to a more sophisticated scheme, for instance, distributing the total number of nonzero transform coefficients to all the blocks that are marked as having nonzero transform coefficients according to the overlapping size of the SVT block and each underlying 4 4 block. Note that due to the floor operation in (8), a 4 4 block may be marked to have nonzero transform coefficients according to Step 1, but has zero nonzero transform coefficients according to (8) in Step 2. 2 In our implementation, when only the information about whether a block has nonzero transform 2 Alternatively, one can set the number of nonzero transform coefficients to be 1 in this case. However, experimentally this does not show benefit over (8) in terms of coding efficiency. 
In our implementation, when only the information about whether a block has nonzero transform coefficients or not is needed (e.g., in deblocking), we use the result of Step 1; when the information about how many nonzero transform coefficients a block has is needed (e.g., in CAVLC), we use the result of Step 2. The coding of the transform coefficients in CAVLC and in context-based adaptive binary arithmetic coding (CABAC) is not changed, and thus no additional complexity is introduced.

D. Deblocking

For macroblocks that use SVT, the deblocking process in H.264/AVC [5] needs to be adjusted because the selected SVT block may not align with the regular block boundaries. Two cases are shown in Fig. 6: one where the 4×4 transform is not used as an optional transform, and one where the 4×4 transform is also used as an optional transform. The edges of the selected SVT block and of the macroblock shown in Fig. 6 may be filtered. The filtering criteria and process for these edges are similar to those used in H.264/AVC [5], [16]: we first derive the necessary information about the blocks on both sides of the boundary, such as whether they have nonzero transform coefficients (using Step 1 in Section III-C), the motion difference, and so on, with minor modifications.

Fig. 6. Illustration of edges that may be filtered when SVT is used. Bold lines represent the edges of the SVT block. (a), (c), (e) When the 4×4 transform is not used as an optional transform. (b), (d), (f) When the 4×4 transform is also used as an optional transform.

The existing regular structure of deblocking in H.264/AVC is made less regular by SVT, because the locations of the edges of the selected SVT block are not fixed. This could be an important issue especially in hardware implementations, because it affects the data flow. Meanwhile, when SVT is used, the number of edges to be checked in deblocking increases, due to the fact that some of the macroblocks originally coded in SKIP mode are now coded with SVT. According to our preliminary results, for instance, when the 4×4 transform is also used as an optional transform, the number of edges to be checked in deblocking increases on average by 27.3% for the luma component over the 720p test set and all the tested QPs used in our experiment. Detailed results are shown in Table II.

TABLE II
Increase of Number of Edges to Be Checked in Deblocking for Luma Component When SVT Is Used (CABAC, 720p)

Sequence       Increase of Number of Edges to Be Checked in Deblocking (%)
BigShips       32.8
ShuttleStart   31.3
City           39.5
Night          15.9
Optis          26.6
Spincalendar   27.6
Cyclists       20.2
Preakness      21.7
Panslow        33.6
Sheriff        23.6
Sailormen      27.2
Average        27.3

Fig. 5. Percentage of the use of SVT for different macroblock partitions when CAVLC is used. (a) Without using the 4×4 transform. (b) Using the 4×4 transform.

IV. Fast Algorithms for SVT

As described in Section III, even though we do not change the motion estimation and sub-macroblock partition decision processes for the macroblocks that use SVT, the encoding complexity of SVT is higher due to the brute-force search process in RDO. For example, in the case of VBSVT described in Section II, there are in total 58 candidate LPs for one macroblock mode, and we need to conduct RDO for each candidate LP to select the best one. Note that typically we need to conduct transform, quantization, entropy coding, inverse transform, inverse quantization, and other operations to calculate the RD cost in (6), and this complexity is high.³ Thus, for applications that have strict requirements on encoding complexity, a fast algorithm for SVT should be developed and used. In this section, we propose a simple yet efficient fast algorithm operating on the macroblock and block levels to reduce the encoding complexity of SVT.

³ Of course, many fast algorithms can be used to reduce the complexity of the transform, quantization, and entropy coding, and of estimating the RD cost.

The basic idea for reducing the encoding complexity of SVT is to reduce the number of candidate LPs tested in RDO. The proposed fast algorithm first skips testing SVT for macroblocks for which SVT is unlikely to be useful, by examining the RD cost of macroblock modes without SVT on the macroblock level. For the remaining macroblocks, for which SVT may be useful, the proposed fast algorithm selects the available candidate LPs based on the motion difference and utilizes a hierarchical search algorithm to select the best available candidate LP on the block level. The macroblock-level and block-level algorithms are described in Sections IV-A and IV-B, respectively.

A. Macroblock Level Fast Algorithm

The basic idea of the macroblock-level fast algorithm is to skip testing SVT for macroblock modes for which SVT is unlikely to be useful. This is done by examining the RD costs of macroblock modes that do not use SVT and are already available prior to SVT coding. In the proposed implementation, SVT is only applied for a macroblock mode in the RDO process when the following two criteria are met:

min(J_inter, J_skip) ≤ J_intra (9)

J_mode ≤ min(J_inter, J_skip) + th (10)

where J_inter and J_intra are the minimum RD costs of the available inter and intra macroblock modes in regular (without SVT) coding, respectively, J_skip is the RD cost of the SKIP coding mode, and J_mode is the RD cost of the current macroblock mode to be tested with SVT (e.g., if SVT is being tested for the INTER 16×8 mode, then J_mode refers to the RD cost of the regular INTER 16×8 mode without SVT). In (9), by comparing the inter, intra, and SKIP modes, we assume that if the RD cost of the intra modes is the lowest, then the probability that an inter mode with SVT has a lower RD cost is very small, so there is no need to check SVT for this macroblock. The basis of this assumption is that intra modes are often used for coding detailed areas, for which SVT is unlikely to be useful since part of the macroblock is left uncoded when SVT is used. In (10), we assume that if the RD cost of the current mode is much larger than the RD cost of the best mode, then the probability that this mode with SVT has a lower RD cost is also expected to be very small.

The threshold th in (10) represents the empirical upper limit of the bitrate reduction when SVT is used, assuming the reconstruction of the macroblock remains the same as when SVT is not used. It is calculated using (11), which is derived empirically through the following statistical analysis. We used four HD sequences, Crew, Harbour, Raven, and Jets, as the training set. The sequences are first encoded, and the values J_mode, J_inter, J_intra, and J_skip are recorded for each mode in each macroblock that is coded with SVT. From the resulting costs, a threshold value th in (10) is calculated for each QP such that a certain portion of the macroblocks satisfies the corresponding inequalities in (9) and (10). This results in a different threshold value for each QP, and (11) is derived from these threshold values by fitting a first-order polynomial to the data. This simple derivation is also illustrated in Fig. 7:

th = λ · max(23 − QP/2, 0) (11)

where λ is the Lagrangian multiplier used in (6) and the max function returns the maximum of its arguments.

Fig. 7. Derivation of (11).

As reflected in (11), the threshold value th, which represents the upper limit of the bitrate reduction, tends to get lower when QP is larger, mainly because details/edges are likely to be quantized away, and thus the benefit of SVT in aligning the transform to the underlying details/edges is compromised.
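A minimal sketch of this macroblock-level test, combining (9)-(11) (Python; all names are illustrative, not from the reference software):

```python
def svt_worth_testing(j_inter, j_intra, j_skip, j_mode, qp, lam):
    """Macroblock-level fast decision for SVT, following (9)-(11).

    j_inter, j_intra, j_skip: minimum RD costs of the regular inter
        modes, the intra modes, and the SKIP mode, respectively.
    j_mode: RD cost of the current macroblock mode without SVT.
    qp:     quantization parameter; lam: Lagrangian multiplier of (6).
    Returns True if SVT should still be tested for this mode in RDO.
    """
    # (11): empirical upper limit of the bitrate reduction SVT can bring.
    th = lam * max(23 - qp / 2.0, 0)
    # (9): skip SVT when intra coding is already the cheapest choice.
    if min(j_inter, j_skip) > j_intra:
        return False
    # (10): skip SVT when this mode is already far from the best mode.
    if j_mode > min(j_inter, j_skip) + th:
        return False
    return True
```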
B. Block-Level Fast Algorithm

The proposed block-level fast algorithm includes two steps: the selection of available candidate LPs based on the motion difference in the first step, and a hierarchical search algorithm in the second step. These two steps are described in detail below.

1) Selection of Available Candidate LPs Based on Motion Difference: SVT is used here for the 16×16, 16×8, 8×16, and 8×8 macroblock partitions, as described in Section III-A. One straightforward approach to reduce the encoding complexity is to restrict the transform block in a candidate SVT block to lie inside a single motion compensation block, since applying the transform across motion compensation block boundaries is usually considered inefficient. In other words, it is typically assumed that edges exist at motion compensation block boundaries and that the transform should align to them. This approach is simple because only the macroblock partition information is used. However, some penalty in coding efficiency was observed for sequences with slow motion and rich detail, for instance, Preakness and Night. This is because prediction from different motion compensation blocks is not the major cause of the blocking effect [16] that is likely to degrade the efficiency of the transform coding. Actually, it may not cause a blocking effect at all if the motion difference between the motion compensation blocks is small enough and/or the prediction is good.⁴ In this case, edges do not necessarily appear at the motion compensation block boundaries that the transform should align to, and SVT would still be useful to improve coding efficiency.

Taking this into account, we use a more general approach in this paper, skipping the testing of a candidate LP when and only when the transform block(s) in the SVT block at that position overlap with different motion compensation blocks whose motion difference is significant according to a certain criterion. To measure the motion difference, we use a method similar to the one used for deriving the boundary strength parameter in the deblocking filter of H.264/AVC [5], [16]. Specifically, in our implementation, we skip testing a candidate LP and mark it unavailable if at least one of the following conditions is true.

a) The transform applied to the SVT block at that position overlaps with at least two neighboring motion compensation blocks, and the difference between the motion vectors of these blocks is larger than or equal to a predefined threshold, which is set to one integer pixel in this paper.

b) The transform applied to the SVT block at that position overlaps with at least two neighboring motion compensation blocks, and the reference frames of these two neighboring blocks are different.

Since the number of available candidate LPs varies from one macroblock to another and the information needed to derive the available candidate LPs can be obtained by both the encoder and the decoder, the index of the selected LP is coded as follows. Assume there are N (N > 0) available candidate LPs and the index of the selected candidate is n (0 ≤ n < N); then it is coded as

V = n, L = ⌊log₂N⌋ + 1, if n < 2(N − 2^⌊log₂N⌋)
V = n − (N − 2^⌊log₂N⌋), L = ⌊log₂N⌋, otherwise (12)

where V represents the binary value of the coded bit string and L represents the number of bits coded. This is a near fixed-length code, chosen based on the observation that all available candidate LPs are similarly likely to be used.

⁴ We would also like to point out that a recent paper shows that, despite the residual block being composed of two parts predicted from different areas of the reference picture, the correlation within such mixed blocks is very similar to that of normal blocks, and the DCT is only marginally suboptimal with respect to the Karhunen-Loeve transform [17].
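Equation (12) is a truncated binary (near fixed-length) code over the N available candidates. A sketch of the mapping and of its inverse follows, under the assumption that V is written to the bitstream MSB-first (helper names are illustrative):

```python
from math import floor, log2

def encode_lp_index(n, num_lps):
    """Map the selected index n (0 <= n < N) to (V, L) as in (12)."""
    k = floor(log2(num_lps))        # floor(log2 N)
    short = num_lps - (1 << k)      # N - 2**floor(log2 N)
    if n < 2 * short:
        return n, k + 1             # longer codeword: k + 1 bits
    return n - short, k             # shorter codeword: k bits

def decode_lp_index(read_bit, num_lps):
    """Inverse mapping; read_bit() delivers the bitstream MSB-first."""
    k = floor(log2(num_lps))
    short = num_lps - (1 << k)
    v = 0
    for _ in range(k):
        v = (v << 1) | read_bit()
    if v < short:                   # prefix of one of the longer codewords
        return (v << 1) | read_bit()
    return v + short                # a shorter codeword
```

For example, with N = 58 available candidate LPs, ⌊log₂58⌋ = 5, so indices 0-51 are coded with 6 bits and indices 52-57 with 5 bits; the resulting code is prefix-free.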

This approach shows stable coding efficiency for sequences with different characteristics and achieves similar gain over a wide range of test sets compared to the original algorithm that does not use the motion difference information. Finally, we note that the additional complexity introduced by this approach is marginal when it is carefully implemented, because: 1) the decision is only conducted when necessary and can be skipped for macroblock modes with 16×16 partition and for some LPs, e.g., the LP (0, 0) for 8×8 SVT; 2) the decision is simple and only uses the motion vector and reference frame information; and 3) generally, several candidate LPs representing spatially consecutive blocks can be marked available or unavailable at the same time in one decision.

2) Hierarchical Search Algorithm: The basic idea of the hierarchical search algorithm is similar to that of motion estimation algorithms, i.e., to first find the best LP at a relatively coarse resolution and then refine the result at a finer resolution. This assumes that the prediction error structure remains similar at the coarse resolution, which is often the case for a 16×16 macroblock, especially in HD video material. In our implementation, we define the candidate LPs at the coarser resolution as key candidate LPs, which are marked as squares in Fig. 8. The other candidate LPs are called non-key candidate LPs and are marked as circles in Fig. 8. The hierarchical search algorithm can be summarized as follows.

a) Let Ω₁ denote the set of all available key candidate LPs. Select the one in Ω₁ with the lowest RD cost, and let Ω₂ denote its available neighboring candidate LPs, which are marked as triangles in Fig. 8.

b) The key LPs are divided into 14 LP zones, as shown in Fig. 8. Select the best LP zone that is available and has the lowest RD cost. An LP zone is available if and only if all three key candidate LPs in that zone are available, and the RD cost of an LP zone is defined as the sum of the RD costs of the three key candidate LPs in that zone. Let Ω₃ denote the additional available non-key candidate LPs inside the best LP zone (marked as stars in Fig. 8).

c) Select the LP with the lowest RD cost among all the candidates in Ω₁, Ω₂, and Ω₃.

Fig. 8. Illustration of hierarchical search algorithm.

More sophisticated algorithms using the same basic idea can be designed, for instance, by using different definitions of key positions and zones, and/or by examining a different number of good key candidate LPs and zones.
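The search of steps a)-c) can be sketched as below; the concrete key-LP grid, the 14 three-LP zones, and the neighbor relations are defined by Fig. 8, so here they appear only as assumed inputs (all names are illustrative):

```python
def hierarchical_lp_search(available, rd_cost, key_lps, zones,
                           lp_neighbors, zone_interior):
    """Sketch of the hierarchical LP search, steps a)-c).

    available:     set of candidate LPs left available by the
                   motion-difference test of Section IV-B-1; assumed
                   to contain at least one key candidate LP.
    rd_cost:       callback running full RDO for one LP (assumed).
    key_lps:       list of key candidate LPs (squares in Fig. 8).
    zones:         list of 3-tuples of key LPs (the 14 LP zones).
    lp_neighbors:  callback: non-key LPs neighboring a key LP (triangles).
    zone_interior: callback: non-key LPs inside a zone (stars).
    """
    cost = {}
    def j(lp):                      # cache so each LP is RDO-tested once
        if lp not in cost:
            cost[lp] = rd_cost(lp)
        return cost[lp]

    # a) Test all available key LPs; collect the best one's neighbors.
    omega1 = [lp for lp in key_lps if lp in available]
    best_key = min(omega1, key=j)
    omega2 = [lp for lp in lp_neighbors(best_key) if lp in available]

    # b) A zone is available only if all three of its key LPs are; its
    #    RD cost is the sum of their (already cached) RD costs.
    avail_zones = [z for z in zones if all(lp in available for lp in z)]
    omega3 = []
    if avail_zones:
        best_zone = min(avail_zones, key=lambda z: sum(j(lp) for lp in z))
        omega3 = [lp for lp in zone_interior(best_zone) if lp in available]

    # c) The final answer is the cheapest LP among everything tested.
    return min(omega1 + omega2 + omega3, key=j)
```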
V. Experimental Results

We implemented the proposed algorithm VBSVT, and its fast version with the proposed fast algorithm, which we denote FVBSVT, in the KTA1.8 reference software [18] in order to evaluate their effectiveness. Although our primary interest is in HD video coding, we also test the proposed algorithm on lower resolution [common intermediate format (CIF)] video. The most important coding parameters used in the experiments are listed as follows.

1) High Profile.
2) QP_I = 22, 27, 32, 37, QP_P = QP_I + 1; fixed QP is used and rate control is disabled, according to the common conditions set up by the ITU-T VCEG and ISO/IEC MPEG community for coding efficiency experiments [19].
3) CAVLC/CABAC is used as the entropy coding.
4) Frame structure is IPPP, four reference frames.
5) Motion vector search range: ±64/±32 pels for 720p/CIF sequences, 1/4-pel resolution.
6) RDO in the high complexity mode.

Two configurations are tested as follows.

1) Low complexity configuration: motion estimation block sizes are 16×16, 16×8, 8×16, and 8×8, and the 4×4 transform is not used for macroblocks coded with the regular transform or SVT. This represents a low complexity codec with the most effective tools for HD video coding.

2) High complexity configuration: motion estimation block sizes are 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4, and the 4×4 transform is also used as an optional transform for macroblocks coded with the regular transform or SVT. This represents a high complexity codec with full usage of the tools provided in H.264/AVC.

We measure the average bitrate reduction (BD-rate) compared to H.264/AVC using the Bjontegaard tool [20] for both the low complexity and high complexity configurations. The Bjontegaard tool measures the numerical average of the data rate or quality differences between two rate-distortion curves. This is a compact way to present a performance comparison, and it is the method required by the MPEG and VCEG communities. The results are shown in Tables III-VII. We also show the results of 8×8 SVT and 16×16 SVT for comparison. As we can see, 16×16 SVT performs better than 8×8 SVT, while VBSVT performs best, in both the low and high complexity configurations. VBSVT achieves on average 4.11% (up to 6.19%) bitrate reduction in the low complexity configuration and on average 2.43% (up to 3.48%) bitrate reduction in the high complexity configuration when CAVLC is used, and on average 5.85% (up to 7.92%) bitrate reduction in the low complexity configuration and on average 4.40% (up to 6.25%) bitrate reduction in the high complexity configuration when CABAC is used, compared to H.264/AVC. The gain of VBSVT is lower in the high complexity configuration because the overhead of coding the LP becomes more significant and the 4×4 transform in H.264/AVC (together with the smaller motion estimation block sizes) is more efficient in certain cases.
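For reference, the computation behind the Bjontegaard measurement of [20] can be sketched as follows (a third-order polynomial fit of log-rate over PSNR, integrated over the overlapping quality range; this is our own restatement of the published method, not the add-in of [21]):

```python
import numpy as np

def bd_rate(rates_ref, psnrs_ref, rates_test, psnrs_test):
    """Average bitrate difference (%) between two RD curves.

    Fits a third-order polynomial to log10(rate) as a function of PSNR
    for each codec and integrates the gap between the two fits over
    the overlapping quality range; a negative output means the test
    codec needs fewer bits than the reference on average.
    """
    p_ref = np.polyfit(psnrs_ref, np.log10(rates_ref), 3)
    p_test = np.polyfit(psnrs_test, np.log10(rates_test), 3)

    lo = max(min(psnrs_ref), min(psnrs_test))   # overlapping PSNR interval
    hi = min(max(psnrs_ref), max(psnrs_test))

    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)

    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100
```

A negative BD-rate from such a computation corresponds to the bitrate reductions reported in Tables III-VII.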

TABLE III. BD-Rate compared to H.264/AVC (low complexity configuration) (CABAC, 720p).

TABLE IV. BD-Rate compared to H.264/AVC (low complexity configuration) (CABAC, CIF).

TABLE V. BD-Rate compared to H.264/AVC (high complexity configuration) (CABAC, 720p).

TABLE VI. BD-Rate compared to H.264/AVC (high complexity configuration) (CABAC, CIF).

[Tables III-VI report, per sequence (720p: BigShips, ShuttleStart, City, Night, Optis, Spincalendar, Cyclists, Preakness, Panslow, Sheriff, Sailormen; CIF: Bus, Container, Foreman, Hall, Mobile, News, Paris, Tempete) and on average, the BD-rate (%) of 8×8 SVT, 16×16 SVT, VBSVT, and FVBSVT, and the reduction of candidate LPs (%); the numerical entries were lost in transcription.]

TABLE VII. Average results of BD-rate compared to H.264/AVC: the BD-rate (%) of 8×8 SVT, 16×16 SVT, VBSVT, and FVBSVT, and the reduction of candidate LPs (%), averaged per configuration (CAVLC/CABAC; 720p/CIF; low/high complexity). [Numerical entries lost in transcription.]

TABLE VIII. Percentage of 8×8 SVT and 16×16 SVT selected in VBSVT for different sequences (CABAC, 720p), in the low and high complexity configurations. [Numerical entries lost in transcription.]

TABLE IX. Percentage of 8×8 SVT and 16×16 SVT selected in VBSVT for different sequences (CABAC, CIF), in the low and high complexity configurations. [Numerical entries lost in transcription.]

The gain of VBSVT is higher in HD video coding because typically less detail is contained in a given image area in HD video sequences than in lower resolution video sequences, and thus the non-coded part of the macroblock can be skipped more safely without introducing much degradation in reconstructed quality. Fig. 9 shows the R-D curves for the Night and Panslow sequences. We note that the gain of VBSVT comes at medium to high bitrates for many tested sequences and can there be much more significant than the average gain over all bitrates reported above. This is true for most sequences we tested. Taking the Panslow sequence as an example, by using the performance evaluation tool provided in [21], we are able to show that VBSVT achieves 13.50% and 6.90% bitrate reduction at 37 dB compared to H.264/AVC, in the low and high complexity configurations, respectively, when CAVLC is used.

We also measure the percentage of 8×8 SVT and 16×16 SVT selected in VBSVT. The results are shown in Tables VIII-X. We can see that in the VBSVT scheme, 16×16 SVT is generally used more often than 8×8 SVT. However, 8×8 SVT is also used in a significant portion of the cases. This shows that SVT with different block sizes suits different situations, and that VBSVT is therefore a more preferable algorithm than fixed block-size SVT for coding prediction error with different characteristics. In Figs. 10 and 11, we show the distribution of the macroblocks coded with SVT and the corresponding SVT blocks in a prediction error frame for the Night and Sheriff sequences, respectively. It can be observed that, as expected, SVT is mostly useful for macroblocks in which residual signals of larger magnitude gather in certain regions, which frequently happens, as also observed by other researchers [22]. However, when using RDO, the selection of SVT depends on the distortion of the whole macroblock and the coded bits of the selected SVT block, so it is also possible that SVT is chosen for macroblocks with a relatively evenly distributed residual signal of medium magnitude. Macroblocks with a residual signal of low magnitude are often coded in SKIP mode.

Fig. 9. R-D curves for (a) the Night sequence (CABAC) and (b) the Panslow sequence (CAVLC).

Fig. 10. Distribution of the macroblocks coded with SVT and the corresponding SVT blocks for the 32nd prediction error frame (Night sequence).

Fig. 11. Distribution of the macroblocks coded with SVT and the corresponding SVT blocks for the 65th prediction error frame (Sheriff sequence).

We also viewed the reconstructed video sequences in order to analyze the impact of SVT on subjective quality. We conclude that using SVT does not introduce any special visual artifacts and that the overall subjective quality remains similar, while the coded bits are reduced.

Similarly, we measure the average bitrate reduction (BD-rate) of FVBSVT compared to H.264/AVC. We also measure the average reduction of candidate LPs tested in RDO over all the tested QPs when FVBSVT is used. The results are also shown in Tables III-VII. We can see that with the proposed fast algorithm we can reduce the number of candidate LPs tested in RDO by a little more than 80% while retaining most of the coding efficiency, for both the low and high complexity configurations. The reduction in the high complexity configuration is slightly smaller than in the low complexity configuration because, when the 4×4 transform is used, the number of candidate LPs is increased according to the selection method based on the motion difference described in Section IV-B. This means that, on average, only about one fifth of the candidate LPs are tested in RDO (each RD cost calculation includes transform, quantization, entropy coding, inverse transform, inverse quantization, and other operations) when FVBSVT is used. We also measured the execution time of FVBSVT to estimate the computational complexity. Compared to H.264/AVC, the encoding and decoding times of FVBSVT increase on average by 2.49% and 2.05%, respectively, over the 720p test set and all the tested QPs in the high complexity configuration. The encoding and decoding time increases vary for different test sequences, and detailed results are shown in Table XI. The running time increase is marginal because only a relatively small part of the whole coding process is affected. The memory usage remains similar when SVT is used, because storing the number of nonzero transform coefficients, accessing the pixels of neighboring blocks in deblocking, and storing the motion vector and reference frame information, etc., are similar to the case when SVT is not used. The simulation PC we use has an Intel(R) Core(TM)2 Quad CPU (2.39 GHz) and 2.00 GB of RAM, and the operating system is Microsoft Windows XP Professional Version 2002, Service Pack 2.

TABLE X. Average results of the percentage of 8×8 SVT and 16×16 SVT selected in VBSVT, per configuration (CAVLC/CABAC; 720p/CIF; low/high complexity). [Numerical entries lost in transcription.]

TABLE XI. Encoding/decoding time increase of FVBSVT compared to H.264/AVC (high complexity configuration) (CABAC, 720p), per sequence (BigShips, ShuttleStart, City, Night, Optis, Spincalendar, Cyclists, Preakness, Panslow, Sheriff, Sailormen) and on average. [Numerical entries lost in transcription.]

VI. Conclusion

In this paper, we proposed a novel algorithm, called SVT, to improve the coding efficiency of video coders. The main idea of SVT is to vary the position of the transform block. In addition to changing the position of the transform block, the size of the transform can also be varied within the SVT framework. We showed that by varying the position of the transform block and its size, the characteristics of the prediction error are better localized, and the coding efficiency is thereby improved. The proposed algorithm was studied and implemented in the H.264/AVC framework. We showed that the proposed algorithm achieves on average 5.85% bitrate reduction in the low complexity configuration, which represents a low complexity codec with the most effective tools for the HD video coding we focus on, and on average 4.40% bitrate reduction in the high complexity configuration, which represents a high complexity codec with full usage of the tools provided in the standard. Gains become more significant at medium to high bitrates for most tested sequences, and the bitrate reduction may reach up to 13.50%, which makes the proposed algorithm very suitable for future video coding solutions focusing on high fidelity video applications. The subjective quality remains similar when SVT is used, while the coded bits are reduced. The decoding complexity remains similar when incorporating SVT into video codecs with the proposed implementation. However, the encoding complexity of SVT can be relatively high because of the need to perform a number of RDO steps to select the best position and size for the transform. To address this issue, a novel fast algorithm operating on the macroblock and block levels was also proposed to reduce the encoding complexity of SVT. The proposed fast algorithm first skips testing SVT for macroblocks for which SVT is unlikely to be useful, by examining the RD cost of macroblock modes without SVT at the macroblock level. For the remaining macroblocks, for which SVT may be useful, the proposed fast algorithm selects the available candidate LPs based on the motion difference and utilizes a hierarchical search algorithm to select the best available candidate LP at the block level. A reduction of about 80% in the number of candidate LPs that need to be tested in RDO has been achieved, with only a marginal penalty in coding efficiency.
Compared to H.264/AVC, the fast SVT achieves on average 5.31% and 4.03% bitrate reduction in the low complexity and high complexity configurations, respectively. Nevertheless, we note that SVT may still influence the data flow and make hardware implementation more difficult than transforms aligned to regular block boundaries. We believe efficient hardware implementation could be studied further and is an interesting topic. The proposed method can also potentially be used in many other situations, for instance, in inter-layer prediction in scalable video coding (SVC) and in inter-view prediction in multiview video coding (MVC).

Acknowledgment

The authors would like to thank the reviewers and the Associate Editor for their valuable comments, which helped improve the manuscript.

References

[1] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi, "Video coding with H.264/AVC: Tools, performance, and complexity," IEEE Circuits Syst. Mag., vol. 4, no. 1, pp. 7-28, 2004.
[2] C. Zhang, K. Ugur, J. Lainema, and M. Gabbouj, "Video coding using spatially varying transform," in Proc. 3rd PSIVT, Jan. 2009.
[3] C. Zhang, K. Ugur, J. Lainema, and M. Gabbouj, "Video coding using variable block-size spatially varying transforms," in Proc. IEEE ICASSP, Apr. 2009.
[4] B. Zeng and J. Fu, "Directional discrete cosine transforms: A new framework for image coding," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 3, Mar. 2008.
[5] Advanced Video Coding for Generic Audiovisual Services, ITU-T Rec. H.264 | ISO/IEC 14496-10, v3, 2005.

[6] A. Nosratinia, "Denoising JPEG images by re-application of JPEG," in Proc. IEEE Workshop MMSP, Dec. 1998.
[7] R. Samadani, A. Sundararajan, and A. Said, "Deringing and deblocking DCT compression artifacts with efficient shifted transforms," in Proc. IEEE ICIP, Oct. 2004.
[8] J. Katto, J. Suzuki, S. Itagaki, S. Sakaida, and K. Iguchi, "Denoising intra-coded moving pictures using motion estimation and pixel shift," in Proc. IEEE ICASSP, Mar. 2008.
[9] O. G. Guleryuz, "Weighted averaging for denoising with overcomplete dictionaries," IEEE Trans. Image Process., vol. 16, no. 12, Dec. 2007.
[10] C. Zhang, K. Ugur, J. Lainema, A. Hallapuro, and M. Gabbouj, "Low complexity algorithms for spatially varying transform," in Proc. PCS, May 2009.
[11] S. Naito and A. Koike, "Efficient coding scheme for super high definition video based on extending H.264 high profile," in Proc. SPIE Vis. Commun. Image Process., vol. 6077, Jan. 2006.
[12] M. Wien, "Variable block-size transforms for H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, Jul. 2003.
[13] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, "Low-complexity transform and quantization in H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, Jul. 2003.
[14] S. Ma and C.-C. Kuo, "High-definition video coding with super-macroblocks," in Proc. SPIE Vis. Commun. Image Process., vol. 6508, Jan. 2007.
[15] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, "Rate-constrained coder control and comparison of video coding standards," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, Jul. 2003.
[16] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz, "Adaptive deblocking filter," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, Jul. 2003.
[17] K. Vermeirsch, J. D. Cock, S. Notebaert, P. Lambert, and R. V. de Walle, "Evaluation of transform performance when using shape-adaptive partitioning in video coding," in Proc. PCS, May 2009.
[18] KTA Reference Model 1.8 [Online]. Available: suehring/tml/download/kta/jm11.0kta1.8.zip
[19] T. K. Tan, G. Sullivan, and T. Wedi, "Recommended simulation common conditions for coding efficiency experiments," document VCEG-AE10, ITU-T Q.6/SG16 VCEG, Jan. 2007.
[20] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," document VCEG-M33, ITU-T SG16/Q6 VCEG, Mar. 2001.
[21] S. Pateux and J. Jung, "An excel add-in for computing Bjontegaard metric and its evolution," document VCEG-AE07, ITU-T Q.6/SG16 VCEG, Jan. 2007.
[22] F. Kamisli and J. S. Lim, "Transforms for the motion compensation residual," in Proc. IEEE ICASSP, Apr. 2009.

Cixun Zhang (S'09) received the B.E. and M.E. degrees in information engineering from Zhejiang University, Hangzhou, China, in 2004 and 2006, respectively. He is currently pursuing the Ph.D. degree at the Department of Signal Processing, Tampere University of Technology, Tampere, Finland. Since August 2006, he has been a Researcher with the Department of Signal Processing, Tampere University of Technology. His current research interests include video and image coding and communication. He is currently working on the standardization of the high efficiency video coding standard. Mr. Zhang was the recipient of the AVS Special Award from the AVS Working Group of China in 2005, the Nokia Foundation Scholarship in 2008, and the Chinese Government Award for Outstanding Self-Financed Students Abroad.
Kemal Ugur received the M.S. degree in electrical and computer engineering from the University of British Columbia, Vancouver, BC, Canada, in 2003, and the Ph.D. degree from the Tampere University of Technology, Tampere, Finland. He joined Nokia Research Center, Tampere, in 2004, where he is currently a Principal Researcher leading a project on next generation video coding technologies. Since joining Nokia, he has been actively participating in several standardization forums, such as the Joint Video Team for the standardization of the multiview video coding extension of H.264/AVC, the Video Coding Experts Group for the exploration toward a next generation video coding standard, 3GPP for mobile broadcast and multicast standards, and, recently, the Joint Collaborative Team on Video Coding for the standardization of the high efficiency video coding standard. He has more than 25 publications in academic conferences and journals and around 30 patent applications. Mr. Ugur is a member of the research team that won the highly prestigious Nokia Quality Award.

Jani Lainema received the M.S. degree in computer science from the Tampere University of Technology, Tampere, Finland. He joined the Visual Communications Laboratory, Nokia Research Center, Tampere; since then, he has contributed to the designs of ITU-T and MPEG video coding standards as well as to the evolution of different multimedia service standards in 3GPP, digital video broadcasting, and the digital living network alliance. Recently, he has been working as a Principal Scientist with Nokia Research Center. His current research interests include video, image, and graphics coding and communications, as well as practical applications of game theory.

Antti Hallapuro received the M.S. degree in computer science from Tampere University of Technology, Tampere, Finland. He joined Nokia Research Center, Tampere, in 1998, where he has been working on video coding related topics. He participated in H.264/AVC video codec standardization and is the author and co-author of several input documents and related academic papers. He has contributed significantly to the productization of high performance H.264/AVC codecs for various computing platforms. His current research interests include practical video coding and processing algorithms and their high performance implementations. He is currently working on topics of next generation video coding standardization.

Moncef Gabbouj (S'86-M'91-SM'95-F'11) received the B.S. degree in electrical engineering from Oklahoma State University, Stillwater, in 1985, and the M.S. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN, in 1986 and 1989, respectively. He is an Academy Professor with the Academy of Finland, Helsinki, Finland. He was a Professor with the Department of Signal Processing, Tampere University of Technology, Tampere, Finland, and served as the Head of the Department beginning in 2002. He was a Visiting Professor with the American University of Sharjah, Sharjah, UAE, from 2007 to 2008, and a Senior Research Fellow with the Academy of Finland from 1997 to 1998 and again from 2007. His current research interests include multimedia content-based analysis, indexing and retrieval, nonlinear signal and image processing and analysis, voice conversion, and video processing and coding.
Dr. Gabbouj has served as a Distinguished Lecturer for the IEEE Circuits and Systems Society. He served as an Associate Editor of the IEEE Transactions on Image Processing and was a Guest Editor of Multimedia Tools and Applications and of the European Journal of Applied Signal Processing. He is a past Chairman of the IEEE Finland Section, of the IEEE CAS Society Technical Committee on DSP, and of the IEEE SP/CAS Finland Chapter. He was the recipient of the 2005 Nokia Foundation Recognition Award. He was the co-recipient of the Myril B. Reed Best Paper Award from the 32nd Midwest Symposium on Circuits and Systems and the co-recipient of the NORSIG'94 Best Paper Award from the 1994 Nordic Signal Processing Symposium. He is a member of the IEEE Signal Processing Society and the IEEE Circuits and Systems Society.

Publication 6

Cixun Zhang, Kemal Ugur, Jani Lainema, and Moncef Gabbouj, "Video Coding Using Spatially Varying Transform," in Proceedings of the Pacific-Rim Symposium on Image and Video Technology (PSIVT), Tokyo, Japan, Jan. 13th-16th, 2009.


Video Coding Using Spatially Varying Transform

Cixun Zhang¹, Kemal Ugur², Jani Lainema², and Moncef Gabbouj¹
¹ Tampere University of Technology, Tampere, Finland {cixun.zhang,moncef.gabbouj}@tut.fi
² Nokia Research Center, Tampere, Finland {kemal.ugur,jani.lainema}@nokia.com

Abstract. In this paper, we propose a novel algorithm, named Spatially Varying Transform (SVT). The basic idea of SVT is that we do not restrict the transform coding inside normal block boundaries but adjust it to the characteristics of the prediction error. With this flexibility, we are able to achieve a coding efficiency improvement by selecting and coding the best portion of the prediction error in terms of rate-distortion tradeoff. The proposed algorithm is implemented and studied in the H.264/AVC framework. We show that the proposed algorithm achieves 2.64% bitrate reduction compared to H.264/AVC on average over a wide range of test sequences. Gains become more significant at high bitrates, where the bitrate reduction can be up to 10.22%, which makes the proposed algorithm very suitable for future video coding solutions focusing on high fidelity applications. The decoding complexity is expected to decrease, because only a portion of the prediction error needs to be decoded.

Keywords: H.264/AVC, video coding, transform, spatially varying transform (SVT).

1 Introduction

H.264/AVC (H.264 for short hereafter) is the latest international video coding standard, and it provides up to 50% gain in coding efficiency compared to previous standards. However, this is achieved at the cost of increased encoding and decoding complexity. It is estimated in [1] that the encoder complexity increases by more than one order of magnitude between MPEG-4 Part 2 (Simple Profile) and H.264 (Main Profile), and the decoder complexity by a factor of 2. For mobile video services (video telephony, mobile TV, etc.) and handheld consumer electronics (digital still cameras, camcorders, etc.), the additional complexity of H.264 becomes an issue due to the limited resources of these devices. On the other hand, as display resolutions and available bandwidth/storage increase rapidly, High-Definition (HD) video is becoming more popular and commonly used, making the implementation of video codecs even more challenging. To better satisfy the requirements of the increased usage of HD video in resource constrained applications, two key issues should be addressed: coding efficiency and implementation complexity. In this paper, we propose a novel algorithm, named Spatially Varying Transform (SVT), which provides coding efficiency gains over H.264 and is expected to lower the decoding complexity. The technique is developed and studied mainly for coding HD resolution video, but it could be extended to other resolutions as well.

The motivations leading to the design of SVT are two-fold:

1. The block based transform design in most existing video coding standards does not align the underlying transform with possible edge locations. In this case, the coding efficiency decreases. In [2], directional discrete cosine transforms are proposed to improve the efficiency of transform coding for directional edges. However, efficient coding of horizontal/vertical edges inside blocks and of non-directional edges was not addressed.

2. Coding the entire prediction error signal may not be the best choice in terms of rate-distortion tradeoff. An example is the SKIP mode in H.264 [3], which does not code the prediction error at all.

The basic idea of SVT is that we do not restrict the transform coding inside normal block boundaries but adjust it to the characteristics of the prediction error. With this flexibility, we are able to achieve a coding efficiency improvement by selecting and coding the best portion of the prediction error in terms of rate-distortion tradeoff. This is done by searching inside a certain residual region, after intra prediction or motion compensation, for a sub-region, and only coding this sub-region. The location parameter of the sub-region inside the region is coded into the bitstream if there are non-zero coefficients. The proposed algorithm is implemented and studied in the H.264 framework. Extensive experimental results show that it can improve the coding efficiency of H.264. In addition, the decoding complexity is expected to be lowered slightly, mainly because only a portion of the prediction error needs to be decoded. The encoding complexity of the proposed technique is higher, mainly due to the brute-force search process. Fast encoding algorithms are being studied to alleviate this aspect of the proposed technique.

The paper is organized as follows: the proposed algorithm is introduced in Section 2, and its integration into the H.264 framework is described in Section 3. Experimental results are given in Section 4. Section 5 concludes the paper and also presents future research directions.

2 Spatially Varying Transform

The basic idea of SVT is that the transform coding is not restricted inside normal block boundaries but is applied to a portion of the prediction error according to the characteristics of the prediction error. We only code a sub-region of a certain residual region after intra prediction or motion compensation. The sub-region is found by searching inside the region according to a certain criterion. Information about the location of the selected sub-region inside the region is coded into the bitstream if there are non-zero coefficients. Fig. 1 shows an illustrative example of the idea: one 8x8 block inside a 16x16 macroblock is selected, and only this 8x8 block is coded. In this paper, we focus our discussion on this particular configuration, which turns out to be promising, as we will see later. However, we note that there is no restriction on the sub-region and region, for example, on their size, shape, etc., when using the idea in a general sense.

Other possible configurations of the idea, to achieve further gains in coding efficiency, are under study. In the following, we discuss two key issues of SVT in more detail: the selection of location parameter candidates and the filtering of block boundaries.

Fig. 1. Illustration of spatially varying transform.

2.1 Selection of Location Parameter Candidates

When there are non-zero coefficients in the selected 8x8 block, its location inside the macroblock needs to be coded and transmitted to the decoder. As shown in Fig. 1, the location of the selected 8x8 block inside the current macroblock is denoted by (Δx, Δy), where Δx and Δy can each take an integer value from 0 to 8, if the selected block is restricted to have the same size (which facilitates the transform design) for all locations. There are in total 81 possible combinations, and we need to select the best one according to a certain criterion. In this paper, Rate-Distortion Optimization (RDO) is used to select the best (Δx, Δy) in terms of RD tradeoff by minimizing the following:

J = D + λ·R (1)

where J is the RD cost of the selected combination, D is the distortion, R is the bit rate, and λ is the Lagrangian multiplier. The reconstruction residue for the remaining part of the 16x16 residual macroblock is simply set to 0 in our implementation, but different values can be used and might be beneficial in certain cases (luminance change, etc.). Similarly, RDO can also be used to decide whether SVT should be used for a macroblock.

The selection of location parameter candidates is important since it directly affects the encoding complexity and the performance of SVT. We study the frequency distribution of (Δx, Δy), and it is observed that the most frequently selected (Δx, Δy) are (0..8, 0), (0..8, 8), (0, 1..7), and (8, 1..7),¹ which together take up a percentage of around 60% of all 81 combinations. According to extensive experiments, this is generally true for different sequences, macroblock partitions, and Quantization Parameters (QP).

¹ In this paper, the notation x..y is used to specify a range of integer values starting from x to y inclusive, with x and y being integer numbers.
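The RDO search described above amounts to evaluating (1) for every allowed (Δx, Δy). A brute-force sketch follows; the coding and distortion callbacks are placeholders for the actual transform/quantization/entropy-coding path, and all names are illustrative:

```python
def select_svt_block(residual_mb, lam, code_block, distortion):
    """Brute-force RDO over the 81 locations of an 8x8 SVT block.

    residual_mb: 16x16 prediction error block (list of 16 rows).
    lam:         Lagrangian multiplier of (1).
    code_block:  placeholder for transform/quantization/entropy coding;
                 returns (bits, reconstructed_8x8) for one candidate.
    distortion:  placeholder returning D over the whole macroblock,
                 with the residue outside the coded 8x8 block
                 reconstructed as zero.
    """
    best = None
    for dy in range(9):             # delta-y of the 8x8 block: 0..8
        for dx in range(9):         # delta-x of the 8x8 block: 0..8
            block = [row[dx:dx + 8] for row in residual_mb[dy:dy + 8]]
            bits, recon = code_block(block)
            j = distortion(residual_mb, recon, dx, dy) + lam * bits  # (1)
            if best is None or j < best[0]:
                best = (j, dx, dy)
    return best[1], best[2]         # the selected (delta-x, delta-y)
```

Restricting the loop to the subset of frequently selected locations identified above reduces the 81 evaluations accordingly.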

Fig. 2 below shows the distributions of (Δx, Δy) for different macroblock partitions of the BigShips sequence at QP equal to 23. As we will see in Section 4, using this subset of location parameters turns out to be an efficient configuration of the proposed algorithm.

Fig. 2. Frequency distribution of (Δx, Δy) for different macroblock partitions of the BigShips sequence, at QP=23 (the Z axis denotes the frequency).

2.2 Filtering of Block Boundaries

Due to the coding (transform and quantization) of the selected 8x8 block, blocking artifacts may appear around its boundary with the remaining non-coded part of the macroblock. A deblocking filter can be applied to improve the subjective quality and possibly also the objective quality. An example in the framework of H.264 will be described in detail later in Section 3.

3 Integration of Spatially Varying Transform into H.264 Framework

In this paper, we study the proposed technique in the H.264 framework. Fig. 3 below is the block diagram of the extended H.264 encoder with SVT. As shown in Fig. 3, the encoder needs to search for the best 8x8 block inside macroblocks that use SVT, which is marked as SVT Search in the diagram. Then the encoder decides whether to use SVT for the current macroblock, using RDO in our implementation. The location parameter is then coded into the bitstream if there are non-zero coefficients.


More information

Introduction to Video Compression

Introduction to Video Compression Insight, Analysis, and Advice on Signal Processing Technology Introduction to Video Compression Jeff Bier Berkeley Design Technology, Inc. info@bdti.com http://www.bdti.com Outline Motivation and scope

More information

Week 14. Video Compression. Ref: Fundamentals of Multimedia

Week 14. Video Compression. Ref: Fundamentals of Multimedia Week 14 Video Compression Ref: Fundamentals of Multimedia Last lecture review Prediction from the previous frame is called forward prediction Prediction from the next frame is called forward prediction

More information

10.2 Video Compression with Motion Compensation 10.4 H H.263

10.2 Video Compression with Motion Compensation 10.4 H H.263 Chapter 10 Basic Video Compression Techniques 10.11 Introduction to Video Compression 10.2 Video Compression with Motion Compensation 10.3 Search for Motion Vectors 10.4 H.261 10.5 H.263 10.6 Further Exploration

More information

Ch. 4: Video Compression Multimedia Systems

Ch. 4: Video Compression Multimedia Systems Ch. 4: Video Compression Multimedia Systems Prof. Ben Lee (modified by Prof. Nguyen) Oregon State University School of Electrical Engineering and Computer Science 1 Outline Introduction MPEG Overview MPEG

More information

Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding.

Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding. Project Title: Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding. Midterm Report CS 584 Multimedia Communications Submitted by: Syed Jawwad Bukhari 2004-03-0028 About

More information

High Efficiency Video Coding: The Next Gen Codec. Matthew Goldman Senior Vice President TV Compression Technology Ericsson

High Efficiency Video Coding: The Next Gen Codec. Matthew Goldman Senior Vice President TV Compression Technology Ericsson High Efficiency Video Coding: The Next Gen Codec Matthew Goldman Senior Vice President TV Compression Technology Ericsson High Efficiency Video Coding Compression Bitrate Targets Bitrate MPEG-2 VIDEO 1994

More information

Multimedia Decoder Using the Nios II Processor

Multimedia Decoder Using the Nios II Processor Multimedia Decoder Using the Nios II Processor Third Prize Multimedia Decoder Using the Nios II Processor Institution: Participants: Instructor: Indian Institute of Science Mythri Alle, Naresh K. V., Svatantra

More information

MPEG-4 Part 10 AVC (H.264) Video Encoding

MPEG-4 Part 10 AVC (H.264) Video Encoding June 2005 MPEG-4 Part 10 AVC (H.264) Video Encoding Abstract H.264 has the potential to revolutionize the industry as it eases the bandwidth burden of service delivery and opens the service provider market

More information

MPEG-4: Simple Profile (SP)

MPEG-4: Simple Profile (SP) MPEG-4: Simple Profile (SP) I-VOP (Intra-coded rectangular VOP, progressive video format) P-VOP (Inter-coded rectangular VOP, progressive video format) Short Header mode (compatibility with H.263 codec)

More information

Wireless Communication

Wireless Communication Wireless Communication Systems @CS.NCTU Lecture 6: Image Instructor: Kate Ching-Ju Lin ( 林靖茹 ) Chap. 9 of Fundamentals of Multimedia Some reference from http://media.ee.ntu.edu.tw/courses/dvt/15f/ 1 Outline

More information

TRANSCODING OF H264 BITSTREAM TO MPEG 2 BITSTREAM. Dr. K.R.Rao Supervising Professor. Dr. Zhou Wang. Dr. Soontorn Oraintara

TRANSCODING OF H264 BITSTREAM TO MPEG 2 BITSTREAM. Dr. K.R.Rao Supervising Professor. Dr. Zhou Wang. Dr. Soontorn Oraintara TRANSCODING OF H264 BITSTREAM TO MPEG 2 BITSTREAM The members of the Committee approve the master s thesis of Sreejana Sharma Dr. K.R.Rao Supervising Professor Dr. Zhou Wang Dr. Soontorn Oraintara Copyright

More information

Video Coding Standards: H.261, H.263 and H.26L

Video Coding Standards: H.261, H.263 and H.26L 5 Video Coding Standards: H.261, H.263 and H.26L Video Codec Design Iain E. G. Richardson Copyright q 2002 John Wiley & Sons, Ltd ISBNs: 0-471-48553-5 (Hardback); 0-470-84783-2 (Electronic) 5.1 INTRODUCTION

More information

Image Compression Algorithm and JPEG Standard

Image Compression Algorithm and JPEG Standard International Journal of Scientific and Research Publications, Volume 7, Issue 12, December 2017 150 Image Compression Algorithm and JPEG Standard Suman Kunwar sumn2u@gmail.com Summary. The interest in

More information

A Quantized Transform-Domain Motion Estimation Technique for H.264 Secondary SP-frames

A Quantized Transform-Domain Motion Estimation Technique for H.264 Secondary SP-frames A Quantized Transform-Domain Motion Estimation Technique for H.264 Secondary SP-frames Ki-Kit Lai, Yui-Lam Chan, and Wan-Chi Siu Centre for Signal Processing Department of Electronic and Information Engineering

More information

Multimedia Standards

Multimedia Standards Multimedia Standards SS 2017 Lecture 5 Prof. Dr.-Ing. Karlheinz Brandenburg Karlheinz.Brandenburg@tu-ilmenau.de Contact: Dipl.-Inf. Thomas Köllmer thomas.koellmer@tu-ilmenau.de 1 Organisational issues

More information

Chapter 11.3 MPEG-2. MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications:

Chapter 11.3 MPEG-2. MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications: Chapter 11.3 MPEG-2 MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications: Simple, Main, SNR scalable, Spatially scalable, High, 4:2:2,

More information

Homogeneous Transcoding of HEVC for bit rate reduction

Homogeneous Transcoding of HEVC for bit rate reduction Homogeneous of HEVC for bit rate reduction Ninad Gorey Dept. of Electrical Engineering University of Texas at Arlington Arlington 7619, United States ninad.gorey@mavs.uta.edu Dr. K. R. Rao Fellow, IEEE

More information

Performance Comparison between DWT-based and DCT-based Encoders

Performance Comparison between DWT-based and DCT-based Encoders , pp.83-87 http://dx.doi.org/10.14257/astl.2014.75.19 Performance Comparison between DWT-based and DCT-based Encoders Xin Lu 1 and Xuesong Jin 2 * 1 School of Electronics and Information Engineering, Harbin

More information

Advanced Encoding Features of the Sencore TXS Transcoder

Advanced Encoding Features of the Sencore TXS Transcoder Advanced Encoding Features of the Sencore TXS Transcoder White Paper November 2011 Page 1 (11) www.sencore.com 1.605.978.4600 Revision 1.0 Document Revision History Date Version Description Author 11/7/2011

More information

Fast Mode Decision for H.264/AVC Using Mode Prediction

Fast Mode Decision for H.264/AVC Using Mode Prediction Fast Mode Decision for H.264/AVC Using Mode Prediction Song-Hak Ri and Joern Ostermann Institut fuer Informationsverarbeitung, Appelstr 9A, D-30167 Hannover, Germany ri@tnt.uni-hannover.de ostermann@tnt.uni-hannover.de

More information

Transcoding from H.264/AVC to High Efficiency Video Coding (HEVC)

Transcoding from H.264/AVC to High Efficiency Video Coding (HEVC) EE5359 PROJECT PROPOSAL Transcoding from H.264/AVC to High Efficiency Video Coding (HEVC) Shantanu Kulkarni UTA ID: 1000789943 Transcoding from H.264/AVC to HEVC Objective: To discuss and implement H.265

More information

Motion Estimation. Original. enhancement layers. Motion Compensation. Baselayer. Scan-Specific Entropy Coding. Prediction Error.

Motion Estimation. Original. enhancement layers. Motion Compensation. Baselayer. Scan-Specific Entropy Coding. Prediction Error. ON VIDEO SNR SCALABILITY Lisimachos P. Kondi, Faisal Ishtiaq and Aggelos K. Katsaggelos Northwestern University Dept. of Electrical and Computer Engineering 2145 Sheridan Road Evanston, IL 60208 E-Mail:

More information

A NOVEL SCANNING SCHEME FOR DIRECTIONAL SPATIAL PREDICTION OF AVS INTRA CODING

A NOVEL SCANNING SCHEME FOR DIRECTIONAL SPATIAL PREDICTION OF AVS INTRA CODING A NOVEL SCANNING SCHEME FOR DIRECTIONAL SPATIAL PREDICTION OF AVS INTRA CODING Md. Salah Uddin Yusuf 1, Mohiuddin Ahmad 2 Assistant Professor, Dept. of EEE, Khulna University of Engineering & Technology

More information

Standard Codecs. Image compression to advanced video coding. Mohammed Ghanbari. 3rd Edition. The Institution of Engineering and Technology

Standard Codecs. Image compression to advanced video coding. Mohammed Ghanbari. 3rd Edition. The Institution of Engineering and Technology Standard Codecs Image compression to advanced video coding 3rd Edition Mohammed Ghanbari The Institution of Engineering and Technology Contents Preface to first edition Preface to second edition Preface

More information

H.264 / AVC (Advanced Video Coding)

H.264 / AVC (Advanced Video Coding) H.264 / AVC (Advanced Video Coding) 2014-2016 Josef Pelikán CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ H.264/AVC 2016 Josef Pelikán, http://cgg.mff.cuni.cz/~pepca 1 / 20 Context

More information

An Efficient Mode Selection Algorithm for H.264

An Efficient Mode Selection Algorithm for H.264 An Efficient Mode Selection Algorithm for H.64 Lu Lu 1, Wenhan Wu, and Zhou Wei 3 1 South China University of Technology, Institute of Computer Science, Guangzhou 510640, China lul@scut.edu.cn South China

More information

Emerging H.26L Standard:

Emerging H.26L Standard: Emerging H.26L Standard: Overview and TMS320C64x Digital Media Platform Implementation White Paper UB Video Inc. Suite 400, 1788 west 5 th Avenue Vancouver, British Columbia, Canada V6J 1P2 Tel: 604-737-2426;

More information

HEVC The Next Generation Video Coding. 1 ELEG5502 Video Coding Technology

HEVC The Next Generation Video Coding. 1 ELEG5502 Video Coding Technology HEVC The Next Generation Video Coding 1 ELEG5502 Video Coding Technology ELEG5502 Video Coding Technology Outline Introduction Technical Details Coding structures Intra prediction Inter prediction Transform

More information

Reduced 4x4 Block Intra Prediction Modes using Directional Similarity in H.264/AVC

Reduced 4x4 Block Intra Prediction Modes using Directional Similarity in H.264/AVC Proceedings of the 7th WSEAS International Conference on Multimedia, Internet & Video Technologies, Beijing, China, September 15-17, 2007 198 Reduced 4x4 Block Intra Prediction Modes using Directional

More information

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation 2009 Third International Conference on Multimedia and Ubiquitous Engineering A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation Yuan Li, Ning Han, Chen Chen Department of Automation,

More information

Image and Video Compression Fundamentals

Image and Video Compression Fundamentals Video Codec Design Iain E. G. Richardson Copyright q 2002 John Wiley & Sons, Ltd ISBNs: 0-471-48553-5 (Hardback); 0-470-84783-2 (Electronic) Image and Video Compression Fundamentals 3.1 INTRODUCTION Representing

More information

The Scope of Picture and Video Coding Standardization

The Scope of Picture and Video Coding Standardization H.120 H.261 Video Coding Standards MPEG-1 and MPEG-2/H.262 H.263 MPEG-4 H.264 / MPEG-4 AVC Thomas Wiegand: Digital Image Communication Video Coding Standards 1 The Scope of Picture and Video Coding Standardization

More information

ECE 417 Guest Lecture Video Compression in MPEG-1/2/4. Min-Hsuan Tsai Apr 02, 2013

ECE 417 Guest Lecture Video Compression in MPEG-1/2/4. Min-Hsuan Tsai Apr 02, 2013 ECE 417 Guest Lecture Video Compression in MPEG-1/2/4 Min-Hsuan Tsai Apr 2, 213 What is MPEG and its standards MPEG stands for Moving Picture Expert Group Develop standards for video/audio compression

More information

The VC-1 and H.264 Video Compression Standards for Broadband Video Services

The VC-1 and H.264 Video Compression Standards for Broadband Video Services The VC-1 and H.264 Video Compression Standards for Broadband Video Services by Jae-Beom Lee Sarnoff Corporation USA Hari Kalva Florida Atlantic University USA 4y Sprin ger Contents PREFACE ACKNOWLEDGEMENTS

More information

4G WIRELESS VIDEO COMMUNICATIONS

4G WIRELESS VIDEO COMMUNICATIONS 4G WIRELESS VIDEO COMMUNICATIONS Haohong Wang Marvell Semiconductors, USA Lisimachos P. Kondi University of Ioannina, Greece Ajay Luthra Motorola, USA Song Ci University of Nebraska-Lincoln, USA WILEY

More information

An Efficient Motion Estimation Method for H.264-Based Video Transcoding with Arbitrary Spatial Resolution Conversion

An Efficient Motion Estimation Method for H.264-Based Video Transcoding with Arbitrary Spatial Resolution Conversion An Efficient Motion Estimation Method for H.264-Based Video Transcoding with Arbitrary Spatial Resolution Conversion by Jiao Wang A thesis presented to the University of Waterloo in fulfillment of the

More information

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation Chapter 10 Basic Video Compression Techniques 10.1 Introduction to Video Compression 10.2 Video Compression with Motion Compensation 10.3 Search for Motion Vectors 10.4 H.261 10.5 H.263 10.6 Further Exploration

More information

CODING METHOD FOR EMBEDDING AUDIO IN VIDEO STREAM. Harri Sorokin, Jari Koivusaari, Moncef Gabbouj, and Jarmo Takala

CODING METHOD FOR EMBEDDING AUDIO IN VIDEO STREAM. Harri Sorokin, Jari Koivusaari, Moncef Gabbouj, and Jarmo Takala CODING METHOD FOR EMBEDDING AUDIO IN VIDEO STREAM Harri Sorokin, Jari Koivusaari, Moncef Gabbouj, and Jarmo Takala Tampere University of Technology Korkeakoulunkatu 1, 720 Tampere, Finland ABSTRACT In

More information

Selected coding methods in H.265/HEVC

Selected coding methods in H.265/HEVC Selected coding methods in H.265/HEVC Andreas Unterweger Salzburg University of Applied Sciences May 29, 2017 Andreas Unterweger (Salzburg UAS) Selected coding methods in H.265/HEVC May 29, 2017 1 / 22

More information

Compression of Stereo Images using a Huffman-Zip Scheme

Compression of Stereo Images using a Huffman-Zip Scheme Compression of Stereo Images using a Huffman-Zip Scheme John Hamann, Vickey Yeh Department of Electrical Engineering, Stanford University Stanford, CA 94304 jhamann@stanford.edu, vickey@stanford.edu Abstract

More information

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING Dieison Silveira, Guilherme Povala,

More information

Video Compression MPEG-4. Market s requirements for Video compression standard

Video Compression MPEG-4. Market s requirements for Video compression standard Video Compression MPEG-4 Catania 10/04/2008 Arcangelo Bruna Market s requirements for Video compression standard Application s dependent Set Top Boxes (High bit rate) Digital Still Cameras (High / mid

More information

Optimizing the Deblocking Algorithm for. H.264 Decoder Implementation

Optimizing the Deblocking Algorithm for. H.264 Decoder Implementation Optimizing the Deblocking Algorithm for H.264 Decoder Implementation Ken Kin-Hung Lam Abstract In the emerging H.264 video coding standard, a deblocking/loop filter is required for improving the visual

More information

Xin-Fu Wang et al.: Performance Comparison of AVS and H.264/AVC 311 prediction mode and four directional prediction modes are shown in Fig.1. Intra ch

Xin-Fu Wang et al.: Performance Comparison of AVS and H.264/AVC 311 prediction mode and four directional prediction modes are shown in Fig.1. Intra ch May 2006, Vol.21, No.3, pp.310 314 J. Comput. Sci. & Technol. Performance Comparison of AVS and H.264/AVC Video Coding Standards Xin-Fu Wang (ΞΠΛ) and De-Bin Zhao (± ) Department of Computer Science, Harbin

More information

H.264/AVC und MPEG-4 SVC - die nächsten Generationen der Videokompression

H.264/AVC und MPEG-4 SVC - die nächsten Generationen der Videokompression Fraunhofer Institut für Nachrichtentechnik Heinrich-Hertz-Institut Ralf Schäfer schaefer@hhi.de http://bs.hhi.de H.264/AVC und MPEG-4 SVC - die nächsten Generationen der Videokompression Introduction H.264/AVC:

More information

Lecture 5: Compression I. This Week s Schedule

Lecture 5: Compression I. This Week s Schedule Lecture 5: Compression I Reading: book chapter 6, section 3 &5 chapter 7, section 1, 2, 3, 4, 8 Today: This Week s Schedule The concept behind compression Rate distortion theory Image compression via DCT

More information

IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC

IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC 17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC Damian Karwowski, Marek Domański Poznań University

More information

An Improved H.26L Coder Using Lagrangian Coder Control. Summary

An Improved H.26L Coder Using Lagrangian Coder Control. Summary UIT - Secteur de la normalisation des télécommunications ITU - Telecommunication Standardization Sector UIT - Sector de Normalización de las Telecomunicaciones Study Period 2001-2004 Commission d' études

More information

FAST MOTION ESTIMATION WITH DUAL SEARCH WINDOW FOR STEREO 3D VIDEO ENCODING

FAST MOTION ESTIMATION WITH DUAL SEARCH WINDOW FOR STEREO 3D VIDEO ENCODING FAST MOTION ESTIMATION WITH DUAL SEARCH WINDOW FOR STEREO 3D VIDEO ENCODING 1 Michal Joachimiak, 2 Kemal Ugur 1 Dept. of Signal Processing, Tampere University of Technology, Tampere, Finland 2 Jani Lainema,

More information

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson zhuyongxin@sjtu.edu.cn Basic Video Compression Techniques Chapter 10 10.1 Introduction to Video Compression

More information

Automatic Video Caption Detection and Extraction in the DCT Compressed Domain

Automatic Video Caption Detection and Extraction in the DCT Compressed Domain Automatic Video Caption Detection and Extraction in the DCT Compressed Domain Chin-Fu Tsao 1, Yu-Hao Chen 1, Jin-Hau Kuo 1, Chia-wei Lin 1, and Ja-Ling Wu 1,2 1 Communication and Multimedia Laboratory,

More information

A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION

A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION Yi-Hau Chen, Tzu-Der Chuang, Chuan-Yung Tsai, Yu-Jen Chen, and Liang-Gee Chen DSP/IC Design Lab., Graduate Institute

More information

Rate Distortion Optimization in Video Compression

Rate Distortion Optimization in Video Compression Rate Distortion Optimization in Video Compression Xue Tu Dept. of Electrical and Computer Engineering State University of New York at Stony Brook 1. Introduction From Shannon s classic rate distortion

More information

High Efficiency Video Coding (HEVC) test model HM vs. HM- 16.6: objective and subjective performance analysis

High Efficiency Video Coding (HEVC) test model HM vs. HM- 16.6: objective and subjective performance analysis High Efficiency Video Coding (HEVC) test model HM-16.12 vs. HM- 16.6: objective and subjective performance analysis ZORAN MILICEVIC (1), ZORAN BOJKOVIC (2) 1 Department of Telecommunication and IT GS of

More information

ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS

ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS Ye-Kui Wang 1, Miska M. Hannuksela 2 and Moncef Gabbouj 3 1 Tampere International Center for Signal Processing (TICSP), Tampere,

More information

A real-time SNR scalable transcoder for MPEG-2 video streams

A real-time SNR scalable transcoder for MPEG-2 video streams EINDHOVEN UNIVERSITY OF TECHNOLOGY Department of Mathematics and Computer Science A real-time SNR scalable transcoder for MPEG-2 video streams by Mohammad Al-khrayshah Supervisors: Prof. J.J. Lukkien Eindhoven

More information

IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner

IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner Sandbridge Technologies, 1 North Lexington Avenue, White Plains, NY 10601 sjinturkar@sandbridgetech.com

More information

[30] Dong J., Lou j. and Yu L. (2003), Improved entropy coding method, Doc. AVS Working Group (M1214), Beijing, Chaina. CHAPTER 4

[30] Dong J., Lou j. and Yu L. (2003), Improved entropy coding method, Doc. AVS Working Group (M1214), Beijing, Chaina. CHAPTER 4 [30] Dong J., Lou j. and Yu L. (3), Improved entropy coding method, Doc. AVS Working Group (M1214), Beijing, Chaina. CHAPTER 4 Algorithm for Implementation of nine Intra Prediction Modes in MATLAB and

More information

Video Coding Standards. Yao Wang Polytechnic University, Brooklyn, NY11201 http: //eeweb.poly.edu/~yao

Video Coding Standards. Yao Wang Polytechnic University, Brooklyn, NY11201 http: //eeweb.poly.edu/~yao Video Coding Standards Yao Wang Polytechnic University, Brooklyn, NY11201 http: //eeweb.poly.edu/~yao Outline Overview of Standards and Their Applications ITU-T Standards for Audio-Visual Communications

More information

Complexity Estimation of the H.264 Coded Video Bitstreams

Complexity Estimation of the H.264 Coded Video Bitstreams The Author 25. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org Advance Access published

More information