Research on Fast Algorithms and Hardware Design for H.264/AVC Video Processing


Fast Algorithms and Hardware Architectures for H.264/AVC

May 2007
Graduate School of Information, Production and Systems, Waseda University
Information and Production Systems Engineering, System LSI Application Research
Li Lingfeng (李凌峰)

Contents

Chapter 1 Introduction
  Background
  History of Video Compression Algorithms
  History of Video Compression Standards
  Organization of This Thesis
Chapter 2 Survey on H.264/AVC
  Introduction
  Target Applications
  Features of H.264/AVC
Chapter 3 Multiple Reference Frames Motion Estimation Algorithms for H.264/AVC
  Introduction
  Rate-Distortion Performance Analysis of Multiple Reference Frames Motion Estimation of H.264/AVC
    Test Benches and Simulation Configurations
    Variable block size
    Fractional-pixel motion estimation
    Multiple reference frames
    Conclusion
  Fast Reference Search and Selection
    Overview
    Reference frame search
    Reference decision
    Simulation Results
    Conclusion
  Feature-Detection-Based Early-Termination Algorithm for Multiple Reference Frame Motion Estimation in H.264/AVC
    Introduction
    Proposed Feature-Detection-Based Fast Algorithm
    Simulation Results
    Conclusion
  Fast Multiple Reference Frames Motion Estimation with Transform-domain Analysis
    Introduction
    Aliasing-sampling
    Transform domain analysis
    Proposed algorithm
    Simulation Results
    Conclusion
  Conclusion

Chapter 4 Architecture Design of New Coding Tools
  Introduction
  Architecture Design of In-Loop Deblocking Filter Engine for H.264/AVC
    Introduction
    Deblocking Filter Algorithm in H.264/AVC
    Architecture Design of Deblocking Filter
    Implementation Results and Comparisons
    Conclusion
  Architecture Design of CABAC Codec Engine for H.264/AVC
    Introduction
    CABAC Encoding and Decoding Algorithms in H.264/AVC
    Architecture Design
    Implementation Results
    Conclusion
Chapter 5 Conclusion
References

Chapter 1 Introduction

1.1 Background

The history of civilization is also the history of how humans process and transmit information. From ropes with knots to Morse code, from the abacus to the computer, humanity has gained ever more capability and freedom to understand nature and to develop civilization. In today's age of information explosion, driven by extraordinary advances in information technology, far more information, represented in different media, is being transmitted far faster and by more diverse means. Among these media, visual information (pictures and video) is becoming increasingly important; it accounts for more than 75% of the information that people obtain from the outside world. The science and technology of processing and transmitting picture and video signals therefore play a critical role in today's research.

Like other forms of information, visual information is usually digitized before processing and transmission. The digital representation of a moving picture signal (also referred to as a video signal or video sequence in this thesis) is obtained by digitization, a regular sampling of the video signal in the spatial domain as well as in the temporal domain. The digitization of video signals provides tremendous convenience for storing, processing, and transmitting video, but at the same time, challenges

arise from the huge volume of data after digitization. Taking the next-generation HDTV format as an example, the bit-rate of a raw HDTV video signal exceeds 1 Gbit/s. Compression of video signals is therefore necessary to save storage space as well as communication channel bandwidth, and has consequently always been a crucial problem attracting attention from both academia and industry.

In 2003, H.264/AVC, the latest video compression standard, was developed; it has been shown that H.264/AVC achieves about a 50% bit-rate saving for equivalent perceptual quality relative to prior standards such as MPEG-2 or H.263 [1][2]. Not surprisingly, this improvement in compression efficiency comes at the cost of an inevitable increase in complexity. Software-based simulation results show that the complexity increase is more than one order of magnitude at the encoder and a factor of 2 at the decoder [5]. As a result, it is crucial to find fast algorithms for practical implementations of H.264/AVC.

On the other hand, advances in Very-Large-Scale Integration (VLSI) give state-of-the-art integrated circuits an increasing capacity to handle more complex video compression algorithms. It is now feasible to put sophisticated compression processes on a relatively low-cost single chip, which has spurred a great deal of activity in developing multimedia systems for the large consumer market. However, even for today's semiconductor technology, the implementation of H.264/AVC is not an easy task. First, the significant complexity increase requires more complicated architectures to be designed and implemented for the new algorithms. Second, a huge off-chip memory access bandwidth is required for the encoder. Taking real-time HDTV encoding as an example,

it is estimated that the off-chip bandwidth requirement is more than 1 Gbit/s. Third, data dependencies and control dependencies in the original algorithms make it hard to exploit temporal parallelism (pipelining) or spatial parallelism (multiple processing units) in the hardware architecture. Finally, some parts of the standard introduce operations that require intensive and irregular data accesses, which also make it difficult to accelerate the throughput of a codec system. Therefore, along with the development of video compression standards, the corresponding hardware architecture design has always been a crucial and challenging research area.

In summary, two critical problems have to be addressed to obtain practical solutions. The first is how to optimize the original algorithms: low-complexity algorithms should be proposed that speed up codec throughput while keeping the rate-distortion (R-D) performance loss as small as possible. The second is to explore efficient hardware architectures for different platforms. This thesis works on both aspects, proposing efficient algorithms and architectures for H.264/AVC.

This thesis mainly addresses two topics. The first is the study of multiple reference frame motion estimation in H.264/AVC, which is crucial in the H.264/AVC encoding algorithm and accounts for more than 90% of the computational complexity of a software encoder. Major efforts are made to simplify the original brute-force algorithm so as to reduce complexity while keeping the loss of compression efficiency as small as possible. The second topic is the hardware architecture design of new coding tools for H.264/AVC. Architectures for two coding tools are proposed,

namely the in-loop deblocking filter and the Context-Based Adaptive Binary Arithmetic Coding (CABAC) codec, two new coding tools introduced in H.264/AVC that both raise challenges for hardware architecture design due to the complexity of their algorithms.

1.1.1 History of Video Compression Algorithms

In 1948, Shannon, the father of information theory, published his work on how best to encode the information a sender wants to transmit, and how best to utilize the channel for that transmission [6]. He subsequently derived a series of coding theorems, which partitioned the study of communications into two areas: source coding and channel coding. In the former, the major research goal is to find the best way to represent the information source, in other words, compression coding. According to whether the decoder can reconstruct the original data with or without loss, compression methods are classified into lossy compression and lossless compression. Each has its own characteristics and is employed in different applications; in some cases they are also used together.

Compression is usually accomplished by a coding procedure. As one of the lossless compression methods, entropy coding compresses source data by exploiting the statistical redundancy among the data. Before the 1970s, the theories of entropy coding [7], predictive coding [8], and transform coding [9] had been developed in the area of source coding. Combination schemes of these coding methods were also proposed; Ref. [10] referred to such combination

schemes as hybrid coding. In the 1980s, hybrid coding systems were proposed for video compression, in which transform coding, predictive coding, adaptive quantization, and entropy coding were combined [11][12]. The major motivations were: to remove spatial correlations with transform coding; to impose different quantization on blocks at different positions by exploiting the characteristics of visual psychology; to remove temporal correlations with predictive coding; and to encode the quantized coefficients with variable-length coding/codes (VLC). In 1985, Musmann published a survey of the video coding studies of that time, covering the advances in transform coding, predictive coding, adaptive quantization, and motion estimation (ME) [13].

In that phase, two different hybrid coding schemes were proposed [14]. One is hybrid spatial and temporal compression, which first performs transform coding to remove the spatial correlations and then performs differential pulse code modulation (DPCM) in the temporal domain [11][12]; in this scheme, the transform is outside the DPCM loop. The other is hybrid temporal and spatial compression, which performs inter-frame prediction to remove the temporal correlations and then applies transform coding to the residual data to remove the spatial correlations [15][16]. The second scheme is the nascent form of the subsequent video compression standards.

As the final stage of a hybrid coding system, entropy coding is usually employed to encode the quantized coefficients. Two different entropy coding schemes are often adopted in video coding systems. The first is VLC, which assigns short codes to the symbols with high occurring probabilities to remove redundancy. The first VLC

scheme can be found in [6] and is called the Shannon code. In 1959, Gilbert and Moore proposed methods to construct VLCs. However, the VLC scheme widely used today is the one proposed by Huffman in 1952, called the Huffman code [7]. Another VLC scheme, the Golomb-Rice code, was proposed for symbol sets with geometric distributions [17][18]; based on it, the Exponential-Golomb (Exp-Golomb) code was proposed for symbol sets with exponential distributions. Both can be found in today's video compression applications [19].
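As a concrete illustration of these VLC schemes, the following is a minimal sketch of the unsigned Exp-Golomb code, which H.264/AVC uses for many syntax elements; the function name and the demonstration loop are illustrative only.

```python
def exp_golomb_encode(code_num: int) -> str:
    """Unsigned Exp-Golomb codeword for a non-negative integer.

    The codeword consists of M leading zeros followed by the (M+1)-bit
    binary representation of code_num + 1, where M = floor(log2(code_num + 1)).
    """
    info = bin(code_num + 1)[2:]      # binary string of code_num + 1
    return "0" * (len(info) - 1) + info

# First codewords: 0 -> "1", 1 -> "010", 2 -> "011", 3 -> "00100", ...
for n in range(5):
    print(n, exp_golomb_encode(n))
```

Shorter codewords go to smaller values, so the code is efficient exactly when the symbol probabilities decay roughly exponentially with magnitude.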

Another important entropy coding scheme is arithmetic coding. Proposed and developed in the 1970s and 1980s [61][62][64][54], arithmetic coding is well known as a lossless coding approach that provides optimal compression efficiency for a finite discrete source. By using arithmetic coding instead of VLC in hybrid coding systems, at least three shortcomings of VLC can be overcome: i) coding events with a probability greater than 0.5 cannot be efficiently represented by VLC, and hence a so-called alphabet extension of run symbols representing successive levels with value zero is used in the entropy coding schemes of MPEG-2, H.263, and MPEG-4; ii) the use of fixed VLC tables does not allow adaptation to the actual symbol statistics, which may vary over space and time as well as across different source material and coding conditions; iii) since there is a fixed assignment of VLC tables to syntax elements, existing inter-symbol redundancies cannot be exploited within these coding schemes. However, implementation complexity has always been a critical issue, which makes it difficult to employ arithmetic coding for applications requiring high throughput, such as real-time video compression. To reduce the algorithmic complexity and speed up processing, much work has been done and published in the literature [60][63][52]. Among those low-complexity coding schemes, the Q coder [60] and its derivatives, the QM and MQ coders, provided practical approaches for real-time applications [66]. In particular, the MQ coder was adopted by JPEG 2000 as its arithmetic coding tool. However, unlike the scenario in JPEG 2000, the MQ coder cannot provide high coding efficiency for video coding applications [57]. Therefore, in the latest video coding standard, H.264/AVC, a different arithmetic coding scheme was adopted [56].
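To make the principle behind these coders concrete, the sketch below shows the core interval-narrowing step of arithmetic coding in exact arithmetic; practical schemes such as the Q/QM/MQ coders replace the exact fractions with approximate finite-precision integer updates. The function and the toy alphabet are illustrative.

```python
from fractions import Fraction

def arithmetic_interval(symbols, probs):
    """Interval narrowing of classic arithmetic coding: each symbol scales
    the current interval by its probability, so the final interval width is
    the product of the symbol probabilities (its -log2 is the ideal code
    length in bits)."""
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        # cumulative probability mass below symbol s
        cum = sum(probs[t] for t in sorted(probs) if t < s)
        low += width * cum
        width *= probs[s]
    return low, low + width   # any number in [low, high) identifies the message

probs = {"a": Fraction(3, 4), "b": Fraction(1, 4)}
lo, hi = arithmetic_interval("aab", probs)
print(lo, hi)   # interval of width (3/4)*(3/4)*(1/4) = 9/64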

1.1.2 History of Video Compression Standards

ITU-T VCEG (ITU-T Video Coding Experts Group) and ISO MPEG (ISO Moving Picture Experts Group) are the two most important international standardization groups for video coding. ITU-T VCEG's standards include H.261, H.262, H.263, and H.26L (the early version of H.264/AVC). H.261 and H.263 target low bit-rate real-time video communications; H.262 is a collaborative work with MPEG-2. The standards developed by ISO MPEG include MPEG-1, MPEG-2, and MPEG-4. MPEG-1 targets digital video storage; MPEG-2 (H.262) targets digital television; MPEG-4 is a standard that includes many encoding tools to support a very wide range of applications, such as H.263-compatible low bit-rate video communications, encoding tools supporting 3-dimensional video, fine granularity scalability (FGS) tools for video streaming, and MPEG-4 Part 10 (H.264) for various applications with different bit-rates. The roadmap of these international video compression standards is shown in Figure 1.

Figure 1. Roadmap of video compression standards

In 1988, CCITT (today's ITU-T) standardized H.261, targeting 64 kbit/s video phone and video conference applications [20]. The major techniques adopted in H.261 include: inter-frame prediction based on motion estimation and compensation, which exploits the temporal correlations of video pictures; a 2-dimensional DCT on prediction residual blocks, which exploits the spatial correlations; adaptive quantization of the DCT transform coefficients; and run-level coding with Huffman VLC coding of the quantized DCT coefficients.

After the standardization of H.261, ISO MPEG completed the drafts of MPEG-1 [21] and MPEG-2 [22] in November 1991 and November 1993 respectively; they were finalized as international standards in November 1993 and November 1994 respectively. MPEG-1 has a framework similar to H.261, but some new coding tools

were introduced. Besides the intra-coded picture (I picture) and forward-predicted picture (P picture) of H.261, the bi-directionally predicted picture (B picture) was introduced, which uses motion-compensated inter-frame interpolation to reach higher compression efficiency. In addition to B pictures, motion estimation with half-pixel accuracy and a quantization matrix for intra-frame coding were introduced. Based on MPEG-1, MPEG-2 introduced the following new techniques: two prediction modes (frame prediction and field prediction) to fit the existing video capture and display equipment in television applications; different scan schemes and VLC tables for different residual blocks; and the principles of Profile and Level (Main Profile@Main Level, with resolutions of 720×576 or 720×480 and frame rates of 25 frames/s or 30 frames/s, is frequently used in today's applications).

In 1995, ITU-T finalized H.263 [23], which introduced the following techniques not included in H.261: half-pixel accurate motion vectors (MV); 3-dimensional VLC tables (with indices of Run, Level, and Last) instead of the previous 2-dimensional VLC tables (with indices of Run and Level); an unrestricted motion vector (UMV) option; an optional arithmetic coding scheme for higher compression efficiency; an advanced prediction mode, in which each 8×8 block has its own motion vector, reducing prediction errors at the cost of more bits for encoding motion vectors; and an optional PB-frames mode.

After the standardization of H.263, ITU-T VCEG conducted subsequent development under two different plans. Following the short-term plan, H.263+ [24] and H.263++ [25] were developed with extra options for functional extension, error resiliency, and

higher compression efficiency. The other, long-term plan was to develop a new standard for low bit-rate video communications, which was referred to as H.263L and later renamed H.26L.

MPEG-4 [26] absorbs many of the features of H.263, MPEG-1, and MPEG-2, adding new coding tools such as (extended) virtual reality modeling language (VRML) support for 3D rendering, video-object coding based on 2-dimensional models with arbitrary shapes and a 3-dimensional wire-frame model, coding tools for FGS, support for externally-specified digital rights management, and various types of interactivity.

In November 2001, ISO MPEG and ITU-T VCEG launched a new joint project and the corresponding working group, the Joint Video Team (JVT). These two leading standards bodies intended to develop a new video compression standard based on H.26L. The new standard was expected to bring substantial improvements in video coding efficiency and to be of use in all areas where bandwidth or storage capacity is limited. This standard is referred to as ISO MPEG-4 Part 10 AVC by ISO MPEG and as ITU-T H.264 by ITU-T VCEG (and as H.264/AVC in the rest of this thesis) [1]. H.264/AVC adopted many new coding tools, including: seven different inter-prediction block modes to fit the varying detail distributions of video signals; multiple reference frame motion estimation and quarter-pixel accurate motion vectors; an in-loop deblocking filter to eliminate block artifacts; and two different context-based adaptive entropy coding schemes.

H.264/AVC is the latest entry in the series of international video coding standards. It is currently the most powerful and state-of-the-art standard. As has been the case with

past standards, its design provides the most current balance between coding efficiency, implementation complexity, and cost, based on the state of VLSI design technology (CPUs, DSPs, ASICs, FPGAs, etc.). In the process, a standard was created that improves coding efficiency by a factor of at least about two (on average) over MPEG-2, the most widely used video coding standard today, while keeping the cost within an acceptable range. In July 2004, a new amendment was added to this standard, called the Fidelity Range Extensions (FRExt, Amendment 1), which demonstrates even further coding efficiency gains over MPEG-2, potentially by as much as 3:1 for some key applications.

1.2 Organization of This Thesis

As mentioned in Section 1.1, the emergence of the latest video compression standard raises the challenge of efficient implementation, due to the significant increase in complexity. In order to achieve a practical and efficient hardware implementation of H.264/AVC, efforts need to be made on at least two fronts. One is to reduce the complexity of the original algorithms while keeping the compression efficiency loss as small as possible; the other is the design of highly efficient hardware architectures for the new algorithms in a given application scenario. This thesis focuses on both aspects and is organized as follows.

Chapter 1 is the introduction of this thesis, covering the background of video compression algorithms and video compression standards. Chapter 2 is a survey of H.264/AVC, which introduces the major features and newly introduced coding tools of

H.264/AVC. Chapter 3 addresses fast multiple reference frame motion estimation algorithms. Section 3.2 presents a rate-distortion performance analysis of the motion estimation of H.264/AVC; the analysis investigates the impacts of important parameters, such as motion vector accuracy, the number of block modes, and the maximum number of reference frames, and also provides hints for the new proposals. Sections 3.3, 3.4, and 3.5 address three different approaches to simplifying the original multiple reference frame motion estimation algorithm. The conclusions of Chapter 3 are given in Section 3.6. Chapter 4 addresses the hardware architecture design of two important new coding tools in H.264/AVC: a highly efficient architecture for the in-loop deblocking filter is proposed in Section 4.2, and a multipurpose architecture for a CABAC codec engine is proposed in Section 4.3. Chapter 5 summarizes this thesis.

Chapter 2 Survey on H.264/AVC

2.1 Introduction

Since the early 1990s, when the technology was in its infancy, international video coding standards (H.261, MPEG-1, MPEG-2/H.262, H.263, and MPEG-4) have been the engines behind the commercial success of digital video compression. They have played pivotal roles in spreading the technology by providing the power of interoperability among products developed by different manufacturers, while at the same time allowing enough flexibility for ingenuity in optimizing and molding the technology to fit a given application and in making the cost-performance trade-offs best suited to particular requirements. They have provided much-needed assurance to content creators that their content will run everywhere, without their having to create and manage multiple copies of the same content to match the products of different manufacturers. They have allowed economies of scale to steeply reduce costs so that the masses can afford the technology. And they have nurtured open interactions among experts from different companies to promote innovation and to keep pace with implementation technology and the needs of applications [3].

These previous standards reflect the technological progress in video compression and the adaptation of video coding to different applications and networks. Applications range

from video telephony to consumer video and the broadcast of standard-definition or high-definition TV. Networks used for video communications include switched networks such as the PSTN (H.263, MPEG-4) and ISDN (H.261), and packet networks like ATM (MPEG-2, MPEG-4), the Internet (H.263, MPEG-4), and mobile networks (H.263, MPEG-4). The importance of new network access technologies like cable modem, xDSL, and UMTS created demand for the new video coding standard H.264/AVC, which provides enhanced video compression performance both for interactive applications like video telephony, which require low latency, and for non-interactive applications like storage, broadcast, and streaming of standard-definition TV, where the focus is on high coding efficiency [4]. Special consideration also had to be given to performance over error-prone networks. H.264/AVC was finalized in March 2003 and approved by the ITU-T in May 2003 [1].

Comparing H.264/AVC coding tools like multiple reference frames, quarter-pixel motion compensation, the deblocking filter, and the integer transform to the tools of previous video coding standards, H.264/AVC achieved a leap in coding performance. For efficient transmission in different environments, not only coding efficiency is relevant, but also the seamless and easy integration of the coded video into all current and future protocol and network architectures. This includes the public Internet as well as wireless networks, which are expected to be a major application for the new video coding standard. The adaptation of the coded video representation or bit-stream to different transport networks was typically defined in the systems specifications of previous MPEG standards or in separate standards like H.320 or H.324. However, only the close integration of network

adaptation and video coding can bring the best possible performance of a video communication system. Therefore, H.264/AVC consists of two conceptual layers. The video coding layer (VCL) defines the efficient representation of the video, and the network abstraction layer (NAL) converts the VCL representation into a format suitable for specific transport layers or storage media. For circuit-switched transports like H.320, H.324M, or MPEG-2, the NAL delivers the coded video as an ordered stream of bytes containing start codes, so that these transport layers and the decoder can robustly and simply identify the structure of the bit stream. For packet-switched networks like RTP/IP or TCP/IP, the NAL delivers the coded video in packets without these start codes [4].

2.2 Target Applications

The new standard is designed for technical solutions including at least the following application areas [2]:

- Broadcast over cable, satellite, cable modem, DSL, terrestrial, etc.
- Interactive or serial storage on optical and magnetic devices, DVD, etc.
- Conversational services over ISDN, Ethernet, LAN, DSL, wireless and mobile networks, modems, etc., or mixtures of these.
- Video-on-demand or multimedia streaming services over ISDN, cable modem, DSL, LAN, wireless networks, etc.
- Multimedia messaging services (MMS) over ISDN, DSL, Ethernet, LAN, wireless and mobile networks, etc.

Moreover, new applications may be deployed over existing and future networks. This raises the question of how to handle this variety of applications and networks. To address this need for flexibility and customizability, the H.264/AVC design comprises a VCL, which is designed to efficiently represent the video content, and a NAL, which formats the VCL representation of the video and provides header information in a manner appropriate for conveyance by a variety of transport layers or storage media.

Figure 2. Block diagram of H.264/AVC encoder

2.3 Features of H.264/AVC

At a basic overview level, the coding structure of this standard is similar to that of all prior major digital video standards (H.261, MPEG-1, MPEG-2/H.262, H.263, and MPEG-4). The architecture and the core building blocks of the encoder are shown in Figure 2, indicating that it, too, is based on motion-compensated DCT-like transform coding. Each picture is compressed by partitioning it into one or more slices; each slice

consists of macroblocks, which are blocks of 16×16 luma samples with the corresponding chroma samples. Each macroblock can also be divided into sub-macroblock partitions for motion-compensated prediction. The prediction partitions can have seven different sizes: 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4. In past standards, motion compensation used entire macroblocks or, in the case of newer designs, 8×8 partitions, so the larger variety of partition shapes provides enhanced prediction accuracy. The spatial transform for the residual data is then either 8×8 (a size supported only in FRExt) or 4×4. In past major standards, the transform block size was always 8×8, so the 4×4 block size provides enhanced specificity in locating residual difference signals. The block size used for the spatial transform is always either the same as or smaller than the block size used for prediction.

In addition, there may be additional structures such as packetization schemes, channel codes, etc., which relate to the delivery of the video data, not to mention other data streams such as audio. As the video compression tools primarily work at or below the slice layer, bits associated with the slice layer and below are identified as Video Coding Layer (VCL) data, and bits associated with higher layers are identified as Network Abstraction Layer (NAL) data. VCL data and the highest levels of NAL data can be sent together as part of one single bit-stream or can be sent separately. The NAL is designed to fit a variety of delivery frameworks (e.g., broadcast, wireless, storage media). Herein, we only discuss the VCL, which is the heart of the compression capability. While an encoder block diagram is shown in Figure 2, the decoder conceptually works in reverse,

comprising primarily an entropy decoder and the processing elements of the region shaded in Figure 2.

Relative to prior video coding methods such as MPEG-2 video, some highlighted features of the design that enable enhanced coding efficiency include the following enhancements of the ability to predict the values of the content of a picture to be encoded.

Variable block-size motion compensation with small block sizes: This standard supports more flexibility in the selection of motion compensation block sizes and shapes than any previous standard, with a minimum luma motion compensation block size as small as 4×4.

Quarter-sample-accurate motion compensation: Most prior standards enable half-sample motion vector accuracy at most. The new design improves upon this by adding quarter-sample motion vector accuracy, as first found in an advanced profile of MPEG-4, but further reduces the complexity of the interpolation processing compared to the prior design.

Motion vectors over picture boundaries: While motion vectors in MPEG-2 and its predecessors were required to point only to areas within the previously-decoded reference picture, the picture boundary extrapolation technique first found as an optional feature in H.263 is included in H.264/AVC.

Multiple reference picture motion compensation: Predictively coded pictures (P pictures) in MPEG-2 and its predecessors used only one previous picture to predict the values in an incoming picture. The new design extends upon the enhanced reference

picture selection technique found in H.263++ to enable efficient coding by allowing an encoder to select, for motion compensation purposes, among a larger number of pictures that have been decoded and stored in the decoder. The same extension of referencing capability is also applied to motion-compensated bi-prediction, which in MPEG-2 is restricted to using two specific pictures only (one being the previous I or P picture in display order and the other being the next I or P picture in display order).

Decoupling of referencing order from display order: In prior standards, there was a strict dependency between the ordering of pictures for motion compensation referencing purposes and the ordering of pictures for display purposes. In H.264/AVC, these restrictions are largely removed, allowing the encoder to choose the ordering of pictures for referencing and display purposes with a high degree of flexibility, constrained only by a total memory capacity bound imposed to ensure decoding ability. Removal of this restriction also enables removing the extra delay previously associated with bi-predictive coding.

Decoupling of picture representation methods from picture referencing capability: In prior standards, B pictures could not be used as references for the prediction of other pictures in the video sequence. By removing this restriction, the new standard provides the encoder more flexibility and, in many cases, the ability to use for referencing a picture that is a closer approximation to the picture being encoded.

Weighted prediction: A new innovation in H.264/AVC allows the motion-compensated prediction signal to be weighted and offset by amounts specified by

the encoder. This can dramatically improve coding efficiency for scenes containing fades, and can be used flexibly for other purposes as well.

Improved skipped and direct motion inference: In prior standards, a skipped area of a predictively-coded picture could not signal motion in the scene content. This had a detrimental effect when coding video containing global motion, so the new H.264/AVC design instead infers motion in skipped areas. For bi-predictively coded areas (called B slices), H.264/AVC also includes an enhanced motion inference method known as direct motion compensation, which improves further on the prior direct prediction designs found in H.263+ and MPEG-4 Visual.

Directional spatial prediction for intra coding: A new technique of extrapolating the edges of the previously-decoded parts of the current picture is applied in regions of pictures that are coded as intra (i.e., coded without reference to the content of some other picture). This improves the quality of the prediction signal, and also allows prediction from neighboring areas that were not coded using intra coding (something not enabled by the transform-domain prediction method found in H.263+ and MPEG-4 Visual).

In-loop deblocking filtering: Block-based video coding produces artifacts known as blocking artifacts. These can originate from both the prediction and the residual difference coding stages of the decoding process. Application of an adaptive deblocking filter is a well-known method of improving the resulting video quality, and when designed well, it can improve both objective and subjective video quality. Building further on a concept from an optional feature of H.263, the deblocking filter in the H.264/AVC design is

brought within the motion-compensated prediction loop, so that this improvement in quality can be used in inter-picture prediction to improve the ability to predict other pictures as well.

In addition to improved prediction methods, other parts of the design were also enhanced for improved coding efficiency, including the following.

Small block-size transform: All major prior video coding standards used a transform block size of 8×8, while the new H.264/AVC design is based primarily on a 4×4 transform. This allows the encoder to represent signals in a more locally-adaptive fashion, which reduces artifacts known colloquially as ringing. (The smaller block size is also justified partly by the advances in the ability to better predict the content of the video using the techniques noted above, and by the need to provide transform regions with boundaries that correspond to those of the smallest prediction regions.)

Hierarchical block transform: While in most cases using the small 4×4 transform block size is perceptually beneficial, there are some signals that contain sufficient correlation to call for some method of using a representation with longer basis functions. The H.264/AVC standard enables this in two ways: 1) by using a hierarchical transform to extend the effective block size used for low-frequency chroma information to an 8×8 array, and 2) by allowing the encoder to select a special coding type for intra coding, enabling extension of the length of the luma transform for low-frequency information to a 16×16 block size in a manner very similar to that applied to the chroma.

Short word-length transform: All prior standard designs have effectively required

encoders and decoders to use more complex processing for transform computation. While previous designs have generally required 32-bit processing, the H.264/AVC design requires only 16-bit arithmetic.

Exact-match inverse transform: In previous video coding standards, the transform used for representing the video was generally specified only within an error tolerance bound, due to the impracticality of obtaining an exact match to the ideal specified inverse transform. As a result, each decoder design would produce slightly different decoded video, causing a drift between the encoder's and decoder's representations of the video and reducing effective video quality. Building on a path laid out as an optional feature in the H.263++ effort, H.264/AVC is the first standard to achieve exact equality of decoded video content from all decoders.

Arithmetic entropy coding: An advanced entropy coding method known as arithmetic coding is included in H.264/AVC. While arithmetic coding was previously found as an optional feature of H.263, a more effective use of this technique is found in H.264/AVC, creating a very powerful entropy coding method known as CABAC.

Context-adaptive entropy coding: The two entropy coding methods applied in H.264/AVC, termed CAVLC and CABAC, both use context-based adaptivity to improve performance relative to prior standard designs.

Robustness to data errors/losses and flexibility for operation over a variety of network environments are enabled by a number of design aspects new to the H.264/AVC standard, including the following highlighted features.

Parameter set structure: The parameter set design provides for robust and efficient conveyance of header information. As the loss of a few key bits of information (such as sequence header or picture header information) could have a severe negative impact on the decoding process when using prior standards, this key information was separated out for handling in a more flexible and specialized manner in the H.264/AVC design.

NAL unit syntax structure: Each syntax structure in H.264/AVC is placed into a logical data packet called a NAL unit. Rather than forcing a specific bitstream interface to the system as in prior video coding standards, the NAL unit syntax structure allows greater customization of the method of carrying the video content in a manner appropriate for each specific network.

Flexible slice size: Unlike the rigid slice structure found in MPEG-2 (which reduces coding efficiency by increasing the quantity of header data and decreasing the effectiveness of prediction), slice sizes in H.264/AVC are highly flexible, as was the case earlier in MPEG-1.

Flexible macroblock ordering (FMO): A new ability to partition the picture into regions called slice groups has been developed, with each slice becoming an independently-decodable subset of a slice group. When used effectively, flexible macroblock ordering can significantly enhance robustness to data losses by managing the spatial relationship between the regions that are coded in each slice. (FMO can also be used for a variety of other purposes as well.)

Arbitrary slice ordering (ASO): Since each slice of a coded picture can be (approximately) decoded independently of the other slices of the picture, the H.264/AVC design enables sending and receiving the slices of a picture in any order relative to each other.
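To make the small block-size and short word-length transform features above concrete, the following is a minimal sketch of the 4×4 forward core transform of H.264/AVC; the scaling that the standard folds into the quantization stage is omitted here, and the function name and toy input are illustrative.

```python
import numpy as np

# 4x4 forward core transform matrix of H.264/AVC. Because its entries
# are small integers, the transform needs only adds, subtracts, and
# shifts, and fits comfortably in 16-bit arithmetic.
CF = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]], dtype=np.int32)

def forward_core_transform(block: np.ndarray) -> np.ndarray:
    """Y = Cf . X . Cf^T for one 4x4 residual block (integer exact)."""
    return CF @ block @ CF.T

residual = np.arange(16, dtype=np.int32).reshape(4, 4)   # toy residual block
print(forward_core_transform(residual))
```

Because the transform is defined exactly in integer arithmetic, every conforming decoder reproduces bit-identical results, which is what enables the exact-match inverse transform property described above.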

Chapter 3 Multiple Reference Frames Motion Estimation Algorithms for H.264/AVC

3.1 Introduction

Motion estimation (ME) has proven to be effective at exploiting the temporal redundancy of video sequences and is therefore a critical part of video compression algorithms with block-based hybrid coding structures. Motion estimation was first adopted in H.261. In H.261, motion estimation is of integer-pixel accuracy, for a fixed block size (16×16), with one reference frame. After that, more and more complicated motion estimation algorithms were adopted in the subsequent video compression standards. Half-pixel accurate motion estimation was adopted in MPEG-1 and MPEG-2; H.263 and MPEG-4 introduced variable block-size motion estimation (VBSME), in which each 8×8 block can have its own motion vector; MPEG-4 also introduced motion estimation with higher accuracy, using quarter-pixel accurate motion vectors. In H.26L and the subsequent H.264/AVC, a more complicated VBSME with seven different block modes, together with multiple reference frame motion estimation, was adopted.

Along with the advancement of motion estimation algorithms, it can be observed that the prediction error is reduced by increasing the accuracy of the block match (from

integer-pixel and half-pixel to quarter-pixel); by increasing the number of block sizes (from two different block sizes to seven); and by increasing the number of reference frames (from the one reference frame of a P picture to the two of a B picture, or to multiple reference frames). Both bi-directional prediction and multiple reference frame motion compensation are simple examples of multi-hypothesis motion-compensated prediction, which was first introduced in [27] and theoretically analyzed in [28]. As one extremely simple approach to multi-hypothesis motion-compensated prediction, multiple reference frame motion compensation was adopted in H.264/AVC.

Figure 3. Macroblock partitions and sub-macroblock partitions

In H.264/AVC, motion can be estimated at the macroblock level or by partitioning the macroblock into smaller regions of luma size 16×8, 8×16, 8×8, 8×4, 4×8, or 4×4, as shown in Figure 3. A distinction is made between a macroblock partition, which corresponds to a luma region of size 16×16, 16×8, 8×16, or 8×8, and a sub-macroblock partition, which is a region of size 8×8, 8×4, 4×8, or 4×4. When (and

only when) the macroblock partition size is 8×8, each macroblock partition can be divided into sub-macroblock partitions. For example, it is possible within a single macroblock to have both 8×8 and 4×8 partitionings, but not 16×8 and 4×8 partitionings. Thus the first row of Figure 3 shows the allowed macroblock partitions, and the sub-macroblock partitions shown in the second row can be selected independently for each 8×8 region, but only when the macroblock partition size is 8×8 (the last partitioning shown in the first row).

Figure 4. Multiple reference frames motion estimation

A distinct motion vector can be sent for each sub-macroblock partition. The motion can be estimated from multiple pictures that lie either in the past or in the future in display order, as shown in Figure 4. The selection of the reference picture is made at the macroblock partition level (so different sub-macroblock partitions within the same macroblock partition use the same reference picture). A limit on the number of pictures used for motion estimation is specified for each Level.
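The partitioning rules above can be made precise with a short enumeration; the sketch below (with illustrative names) counts every legal luma partitioning of one macroblock under the rule that sub-macroblock partitions are only available when the macroblock partition size is 8×8.

```python
from itertools import product

MB_PARTITIONS  = ["16x16", "16x8", "8x16", "8x8"]
SUB_PARTITIONS = ["8x8", "8x4", "4x8", "4x4"]

def enumerate_partitionings():
    """Yield every legal luma partitioning of one macroblock: either a
    single non-8x8 macroblock partition, or the 8x8 mode with one
    independent sub-partition choice per 8x8 region."""
    for mb_mode in MB_PARTITIONS:
        if mb_mode != "8x8":
            yield (mb_mode,)
        else:
            for subs in product(SUB_PARTITIONS, repeat=4):
                yield ("8x8",) + subs

print(sum(1 for _ in enumerate_partitionings()))   # 3 + 4**4 = 259
```

The combinatorial count hints at why a brute-force mode decision is expensive: each partitioning must in principle be evaluated, and each partition additionally carries its own motion vector and reference index search.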

To estimate the motion, pixel values are first interpolated to achieve quarter-pixel accuracy for luma and up to 1/8-pixel accuracy for chroma. Interpolation of luma is performed in two steps: half-pixel and then quarter-pixel interpolation. Half-pixel values are created by filtering with the six-tap kernel [1, -5, 20, 20, -5, 1]/32, horizontally and/or vertically. Quarter-pixel interpolation for luma is performed by averaging two nearby values (horizontally, vertically, or diagonally) of half-pixel accuracy. Chroma motion compensation uses bilinear interpolation with quarter-pixel or one-eighth-pixel accuracy (depending on the chroma format). After interpolation, block-based motion compensation is applied. As noted, however, a variety of block sizes can be considered, and a motion estimation scheme that optimizes the trade-off between the number of bits necessary to represent the video and the fidelity of the result is desirable.
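A minimal sketch of this two-step luma interpolation is given below; border handling and the two-dimensional (diagonal) half-pel case are omitted, and the function names are illustrative.

```python
def half_pel_horizontal(row, x):
    """Horizontal half-sample between row[x] and row[x+1], using the
    H.264/AVC six-tap kernel (1, -5, 20, 20, -5, 1)/32 with rounding.
    Assumes indices x-2 .. x+3 are valid (no border handling here)."""
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * row[x - 2 + i] for i, t in enumerate(taps))
    return min(255, max(0, (acc + 16) >> 5))   # round, then clip to 8 bits

def quarter_pel(a, b):
    """Quarter-sample as the rounded average of two neighbouring
    integer- or half-accuracy samples."""
    return (a + b + 1) >> 1

row = [10, 12, 30, 60, 62, 61, 40, 20]
h = half_pel_horizontal(row, 3)    # half-pel between row[3] and row[4]
print(h, quarter_pel(row[3], h))   # and the quarter-pel next to row[3]
```

Each half-pel sample costs six taps, and the quarter-pel samples are derived from the half-pel ones, which is why fractional-pel refinement figures prominently in the complexity discussion that follows.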

Three major new features of the motion estimation in H.264/AVC were introduced above. Each of them contributes to the improvement in compression efficiency, but introduces computational complexity overhead at the same time.

Variable block size affects the access frequency in a linear way: more than a 2.5% complexity increase for each additional mode. A typical bit-rate reduction of between 4% and 20% is achieved (for the same quality) using this tool; however, the complexity increases linearly with the number of modes used, while the corresponding compression gain saturates.

The encoder may choose to search for motion vectors only at half-pixel positions instead of quarter-pixel positions. This results in a decrease in access frequency and processing time of about 10%. However, the use of quarter-pixel motion vectors increases coding efficiency by up to 30%, except at very low bit rates.

Multiple reference frames increase the access frequency according to a linear model: a 25% complexity increase for each added frame. A negligible gain (less than 2%) in bit rate is observed at low and medium bit rates, but more significant savings can be achieved for high bit-rate sequences (up to 14%). Considering the large proportion (above 90% [29]) of the computational load accounted for by motion estimation, the complexity of an encoder increases significantly when employing more reference frames. Therefore, multiple reference frame motion estimation raises obstacles for the implementation of real-time applications.

Among the three major new features mentioned above, Chapter 3 focuses on the study of multiple reference frames. Three different approaches are proposed to dynamically control the number of reference frames, and thus to reduce the computational complexity of the original full-search algorithm.

Some simplification approaches have been presented in the literature [30]-[33]. Ref. [30] proposes a set of methods to reduce the number of reference frames, but several thresholds are introduced that need to be decided empirically; the optimal thresholds usually vary with different video sequences and contents, which makes it hard to obtain a practical solution. Ref. [31] and Ref. [32] employ different early-termination methods, but only the correlation between different reference frames is utilized. Ref. [33] utilizes the correlations between the current block and its neighboring blocks: the best reference frame is predicted from the best reference frames of the neighboring blocks. That prediction method achieves good time savings, but the inevitable prediction error propagation usually makes the best

reference frame converge to a fixed one, or to just a few of the possible reference frames, which results in a degradation of coding efficiency. Ref. [34] proposes a feature-detection-based approach to reduce redundant reference frames.

The rest of Chapter 3 is organized as follows. Section 3.2 presents simulation results showing the impact of different configurations of the three coding tools. Sections 3.3, 3.4, and 3.5 present three approaches to simplifying the original multiple reference frame motion estimation algorithm. Conclusions of Chapter 3 are drawn in Section 3.6.

3.2 Rate-Distortion Performance Analysis of Multiple Reference Frames Motion Estimation of H.264/AVC

Differently from previous standards, the motion estimation of H.264/AVC reaches a remarkable magnitude of computational complexity by extending three aspects: the accuracy of the motion vector, the number of optional block sizes, and the maximum number of reference frames. In order to understand the impacts of the different encoding parameters, a rate-distortion (R-D) performance analysis is conducted in Section 3.2. The analysis results also provide guidelines for the proposals of new algorithms.

Rate-distortion performance analysis is widely used to evaluate compression efficacy. R-D performance is represented with two elements: rate (R) and distortion (D). In the video compression field, the bit-rate of the coded bit-stream is used as R, and the peak signal-to-noise ratio (PSNR) is used to measure distortion. The PSNR is computed via the mean

squared error (MSE), which for two m×n monochrome images I and K, where one image is considered a noisy approximation of the other, is defined as:

$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2$$

The PSNR is defined as:

$$\mathrm{PSNR} = 10 \log_{10}\!\left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right)$$

Here, MAX_I is the maximum pixel value of the image. When the pixels are represented using 8 bits per sample, this is 255.
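These two definitions translate directly into code; the sketch below (with an illustrative toy frame) computes the quality metric used for all comparisons in this section.

```python
import numpy as np

def psnr(reference: np.ndarray, reconstructed: np.ndarray, max_i: float = 255.0) -> float:
    """PSNR between two same-sized monochrome images (8-bit by default)."""
    diff = reference.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")           # identical images
    return 10.0 * np.log10(max_i ** 2 / mse)

ref = np.random.randint(0, 256, (288, 352))                      # CIF-sized toy frame
rec = np.clip(ref + np.random.randint(-2, 3, ref.shape), 0, 255)  # mild distortion
print(round(psnr(ref, rec), 2))
```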

3.2.1 Test Benches and Simulation Configurations

Four test sequences at two different resolutions (QCIF and CIF) are used. Head with Glasses is a typical head-and-shoulders sequence. Bus is a test sequence with medium, regular motion. Mobile is a high-complexity sequence with medium movements, including repetitive motion and rotation. Football is a sequence with medium spatial detail and a lot of movement. The reference software (JM9.2) is employed to perform all the simulations. The encoding process is set to the baseline profile; only I and P frames are used. R-D optimization is enabled while rate control is disabled; therefore, a fixed quantization parameter (QP) is set to achieve a certain target bit-rate. For QCIF sequences, the search range is -16 ~ +15 and QP is set from 20 to 44; for CIF sequences, the search range is -32 ~ +31 and QP is set from 20 to 50. CAVLC is used as the entropy coding scheme.

3.2.2 Variable block size

The reference software adopts a brute-force approach that searches all seven block modes: different partition modes with different block sizes are evaluated to find the one that minimizes the R-D cost. The number of optional block sizes affects the complexity to a great extent, and it mainly affects the fractional-pixel ME part; smaller block sizes introduce larger computational complexity. Three tests, referred to as A, B, and C, are conducted. The optional block sizes for each of them are specified as follows:

Test A: 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4 (all 7 block modes);
Test B: 16×16, 16×8, 8×16, 8×8, 8×4, and 4×8;
Test C: 16×16, 16×8, 8×16, and 8×8.

Table 3.1 R-D performance comparisons with Test A

Sequence               | Test B SNR loss (dB) | Test B bit-rate overhead | Test C SNR loss (dB) | Test C bit-rate overhead
Head with Glasses QCIF |                      |                          |                      |
Bus QCIF               |                      |                          |                      |
Mobile QCIF            |                      |                          |                      |
Football QCIF          |                      |                          |                      |
Head with Glasses CIF  |                      |                          |                      |
Bus CIF                |                      |                          |                      |
Mobile CIF             | 0.03                 | <0.01%                   |                      |
Football CIF           |                      |                          |                      |

According to the simulations, small block sizes (less than 8×8) are important for low

resolution sequences, such as QCIF, but make a smaller contribution for CIF or higher resolutions. To make this clear, the PSNR loss after omitting 4×4, 8×4, and 4×8 (Test C) and the PSNR loss after omitting 4×4 (Test B) are shown in Table 3.1. The effect of small block sizes decreases as the sequence resolution increases. For QCIF and larger sequences, the 4×4 block size can be omitted, while the larger block modes are required. For CIF and larger sequences, 4×4, 8×4, and 4×8 can all be omitted with negligible quality loss, while the larger block modes are required.

3.2.3 Fractional-pixel motion estimation

Table 3.2 R-D performance comparisons with quarter-pixel ME

Sequence               | Half-pixel ME SNR loss (dB) | Half-pixel ME bit-rate overhead | Integer-pixel ME SNR loss (dB) | Integer-pixel ME bit-rate overhead
Head with Glasses QCIF |  |  |  |
Bus QCIF               |  |  |  |
Mobile QCIF            |  |  |  |
Football QCIF          |  |  |  |
Head with Glasses CIF  |  |  |  |
Bus CIF                |  |  |  |
Mobile CIF             |  |  |  |
Football CIF           |  |  |  |

Fractional-pixel ME accounts for more than 40% of the complexity of the whole encoding procedure. For QCIF, using only integer-pixel ME usually leads to more than 2 dB of PSNR loss, except for Football (about 1.0 dB). Omitting quarter-pixel ME usually leads to more than 1 dB of PSNR loss, except for Football (about 0.5 dB), as shown in Table 3.2. For CIF, the PSNR losses are slightly smaller than for

QCIF. According to the simulation results, it can easily be observed that fractional-pixel motion estimation is crucial for keeping the compression efficiency, and using only integer-pixel ME usually causes an obvious PSNR loss. For some noisy scenes, such as Football, it is possible to use half-pixel ME and omit quarter-pixel ME: in such cases much of the apparent noise is caused by other factors, such as fast motion, rather than by aliasing in sampling, so fractional-pixel ME cannot improve the prediction very much.

3.2.4 Multiple reference frames

Table 3.3 R-D performance comparisons when reducing reference frames

Sequence               | 5 Ref. to 2 Ref. SNR loss (dB) | 5 Ref. to 2 Ref. bit-rate overhead | 5 Ref. to 1 Ref. SNR loss (dB) | 5 Ref. to 1 Ref. bit-rate overhead
Head with Glasses QCIF |  |  |  |
Bus QCIF               |  |  |  |
Mobile QCIF            |  |  |  |
Football QCIF          |  |  |  |
Head with Glasses CIF  |  |  |  |
Bus CIF                |  |  |  |
Mobile CIF             |  |  |  |
Football CIF           |  |  |  |

For a given macroblock or macroblock partition, the reference picture can be selected among a large number of previously reconstructed or decoded pictures. The reasons why multiple reference frames (MRF) achieve better predictions include: repetitive motions; camera shaking or switching back and forth; covering and uncovering movements; lighting or shadow changes; and aliasing in sampling. The computational complexity of ME is linear in the

number of reference frames. The PSNR losses and bit-rate overheads after reducing the number of reference frames from 5 to 2 or 1 are shown in Table 3.3. Since Mobile is a high-complexity sequence with many complicated details as well as a lot of movement, the Mobile sequences (QCIF and CIF) suffer significant PSNR losses. Based on the experimental results, it can be concluded that two reference frames are enough for many video scenes; sometimes five reference frames provide only 0.05 dB of PSNR improvement. Moreover, the effect of MRF is very sensitive to the motion characteristics of the scene. An intuitive idea is therefore to reduce the encoding complexity by changing the number of reference frames according to the video content.

3.2.5 Conclusion

Section 3.2 has presented an analysis of the rate-distortion performance of three major encoding tools in the motion prediction part of H.264/AVC. The simulations were performed over various video contents and different video resolutions, and the results show the effects of using the different new encoding tools. In particular, some of them, such as multiple reference frames and variable block sizes, are important for the trade-off between R-D performance and complexity. Based on the experimental results and analysis, some initial conclusions can be drawn:

i) Small block sizes (less than 8×8) can be used selectively, especially for high-resolution sequences; an adaptive algorithm may be a promising scheme.

ii) Omitting fractional-pixel motion estimation usually leads to obvious R-D performance losses.

iii) One or two reference frames are enough for many video scenes; promising content-based algorithms can be designed to dynamically change the number of reference frames or to select reference frames before the motion search.

In the rest of Chapter 3, three different approaches are proposed to reduce the complexity of the original multiple reference frames motion estimation algorithm.

3.3 Fast Reference Search and Selection

3.3.1 Overview

Table 3.4 Distribution of the best reference frame for 8×8 blocks

Sequence          | Ref0  | Ref1  | Ref2  | Ref3 | Ref4 | Intra
Head with Glasses | 93.0% | 3.0%  | 2.0%  | 0.8% | 0.8% | 0.4%
Bus               | 70.2% | 14.1% | 9.4%  | 3.2% | 2.8% | 0.3%
Mobile            | 58.7% | 13.8% | 16.4% | 5.6% | 5.4% | <0.1%
Football          | 76.7% | 4.8%  | 2.9%  | 1.5% | 1.6% | 12.6%
Reed Field        | 96.1% | 2.1%  | 0.9%  | 0.2% | 0.3% | 0.4%

The simulation results in Table 3.4 show the distribution of the best reference frame for each 8×8 sub-block when the maximum number of reference frames equals 5. Ref_0~Ref_4 denote the five reference frames, Ref_0 being the one just before the current frame. The distribution indicates that: i) Ref_0 has the highest probability (59~96%) of being the best reference frame; and ii) basically, the closer a reference frame is to the current frame, the higher its probability of being the best frame.

Figure 5. Flow chart of proposed algorithm

Since Ref_0 tends to be the best reference frame in most cases, motion estimation referring to Ref_0 is performed unconditionally in the proposed algorithm, as shown in the flow chart in Figure 5. The unconditional motion estimation in Ref_0 keeps the R-D performance above a lower bound. In addition, the results of this motion estimation (motion vectors and motion cost) are utilized in the subsequent reference search and reference decision.
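A self-contained sketch of this flow is given below. It uses plain SAD as the matching cost, omits the λ·R rate term and sub-pixel refinement, and the acceptance factor alpha is an illustrative placeholder for the decision rule of Section 3.3.3, not a value from this thesis.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally-sized blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def search(block, frame, center, radius):
    """Exhaustive integer-pel search in a square window around center;
    returns the best (dy, dx) motion vector and its SAD cost."""
    h, w = block.shape
    cy, cx = center
    best_mv, best_cost = (0, 0), float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = cy + dy, cx + dx
            if 0 <= y <= frame.shape[0] - h and 0 <= x <= frame.shape[1] - w:
                cost = sad(block, frame[y:y + h, x:x + w])
                if cost < best_cost:
                    best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost

def select_reference(block, pos, refs, mcsr=16, rfsr=4, alpha=0.9):
    """Sketch of the proposed flow (cf. Figure 5)."""
    mv0, cost0 = search(block, refs[0], pos, mcsr)       # always search Ref_0
    rough = {i: search(block, r, pos, rfsr)[1]           # coarse reference search
             for i, r in enumerate(refs[1:], start=1)}
    if not rough:                                        # only Ref_0 available
        return 0, mv0
    pbr = min(rough, key=rough.get)                      # possible best reference
    if rough[pbr] < alpha * cost0:                       # reference decision
        mv, cost = search(block, refs[pbr], pos, mcsr)   # full ME only in the PBR
        if cost < cost0:
            return pbr, mv
    return 0, mv0
```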

In the JM9.2 reference model, motion estimation is performed over all the reference frame candidates in a brute-force manner. The best reference frame is selected by minimizing the following cost function:

$$J(\mathrm{REF}, \lambda_{\mathrm{MOTION}}) = \mathrm{SATD}(\mathrm{REF}) + \lambda_{\mathrm{MOTION}} \cdot \left( R\!\left(m(\mathrm{REF}) - p(\mathrm{REF})\right) + R(\mathrm{REF}) \right)$$

where SATD(REF) represents the SATD (sum of absolute transformed differences) value corresponding to the reference frame REF; λ_MOTION is a Lagrange multiplier; R(m(REF) − p(REF)) represents the rate cost of coding the difference between m (the motion vector) and p (the motion vector predictor) corresponding to REF; and R(REF) denotes the rate cost of representing the index of REF. In brief, the best reference frame is selected based on the result of the motion search. Therefore, in order to make the reference selection coincide as closely as possible with the reference model, a basic motion estimation approach is applied in the proposed algorithm to fulfill the reference selection.
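In code, this criterion takes the following shape; the SATD uses a 4×4 Hadamard transform as in the JM software (up to a constant scale factor), and rate_bits is a hypothetical rate model, e.g. Exp-Golomb code lengths.

```python
import numpy as np

H4 = np.array([[1,  1,  1,  1],
               [1,  1, -1, -1],
               [1, -1, -1,  1],
               [1, -1,  1, -1]], dtype=np.int32)

def satd_4x4(residual: np.ndarray) -> int:
    """Sum of absolute transformed differences of one 4x4 residual."""
    return int(np.abs(H4 @ residual @ H4.T).sum())

def motion_cost(residual, mv, mvp, ref_idx, lam, rate_bits):
    """J(REF, lambda_MOTION): distortion plus lambda-weighted rate of
    the motion vector difference and the reference index."""
    mvd_rate = rate_bits(mv[0] - mvp[0]) + rate_bits(mv[1] - mvp[1])
    return satd_4x4(residual) + lam * (mvd_rate + rate_bits(ref_idx))
```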

42 integer-pixel usually leads to a 1~3 db PSNR loss, there are still chances to select the best references from several of them by using only integer-pixel accurate motion estimation Search center In proposed search scheme, it is usually difficult to obtain an accurate motion vector predictor (MVP) from the neighboring blocks and then to utilize it as the search center of current block, since not all of the reference frames are precisely searched, some of them actually are only imprecisely searched to decide whether it is the best reference frame. Following the assumption that objects usually move continuously and smoothly, proposed algorithm utilizes the temporal correlation among the motion vectors in concatenated frames to obtain the MVP and the search center. Figure 6. Generation of search center As shown in Figure 6, F t represents current frame, while F t-1 and F t-2 represent the previous frames. MV t,t-1, MV t-1,t-2 and MV t-1,t-2 are obtained in the motion estimation 41

43 referring to Ref_0 for F t and F t-1 respectively. Apparently, F t-1 is the Ref_0 of F t, while F t-2 is the Ref_0 of F t-1. In addition, MV t,t-1 and MV t-1,t-2 belongs to the corresponding blocks in different frames, while MV t-1,t-2 belongs to the block, to which MV t,t-1 pints to. Considering the assumption of continuous moving of objects, MVP t,t-2 can be described as: MVPtt, 2 = MVtt, 1+ MV ' t 1, t 2 For each block, the MVP for a particular reference frame is used as the search center for the reference search in this frame Search range In this thesis, Motion Compensation Search Range (MCSR) refers to the search range used in the motion compensation in Ref_0 or in the other best reference frame, while Reference Frame Search Range (RFSR) refers to the search range used in the reference search stage. MCSR is usually fixed for a particular application, which affects the encoder performance to a large extent. For a given MCSR, a simulation is performed to decide the optimal RFSR. Table 3.5 shows the distribution of absolute error of MVP. According to the simulation results shown in Table 3.5, it can be found that the prediction error enlarges with the increase of reference index. Therefore, proposed algorithm increases the search range with the reference frame index increasing. 42
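To make the derivation above concrete, the following C sketch shows how the search center and the index-dependent search range could be computed. It is only a sketch under stated assumptions: integer-pixel MVs over 8×8 blocks, illustrative names (concat_mvp, rfsr_for_ref, mv_field_t1) that do not come from the JM source, and an RFSR mapping that mirrors the columns of Table 3.5 rather than the exact configuration used in the experiments.

    typedef struct { int x, y; } MV;

    /* Concatenated MV predictor used as the search center for F_(t-2):
     * MVP_(t,t-2) = MV_(t,t-1) + MV'_(t-1,t-2), where MV'_(t-1,t-2) belongs
     * to the block in F_(t-1) that MV_(t,t-1) points to.                  */
    MV concat_mvp(MV mv_t_t1, const MV *mv_field_t1, int blk_stride,
                  int blk_x, int blk_y)
    {
        /* locate the 8x8 block in F_(t-1) pointed to by MV_(t,t-1);
           integer-pixel MVs assumed, so /8 converts to block units;
           bounds clipping is omitted for brevity                      */
        int bx = blk_x + mv_t_t1.x / 8;
        int by = blk_y + mv_t_t1.y / 8;
        MV mv_t1_t2 = mv_field_t1[by * blk_stride + bx];

        MV mvp = { mv_t_t1.x + mv_t1_t2.x, mv_t_t1.y + mv_t1_t2.y };
        return mvp;
    }

    /* Reference-frame search range (RFSR) grows with the reference index,
     * e.g. MCSR/8, MCSR/4, MCSR/2 and MCSR for Ref_1..Ref_4.             */
    int rfsr_for_ref(int mcsr, int ref_idx)      /* ref_idx in 1..4 */
    {
        int shift = 4 - ref_idx;                 /* 3, 2, 1, 0      */
        return mcsr >> shift;
    }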

Table 3.5 Distribution of the absolute error of MVP

          1/8 MCSR   1/4 MCSR   1/2 MCSR   MCSR
Ref_1     85.9%      94.2%      95.6%      96.4%
Ref_2     75.9%      89.9%      93.8%      95.2%
Ref_3     66.4%      85.4%      91.3%      93.9%
Ref_4     57.9%      78.3%      88.4%      92.2%

Block size

Seven different block sizes/modes are adopted in the motion estimation of H.264/AVC, and different modes or different sub-blocks of a mode can have different best reference frames. However, for the modes with block sizes smaller than 8×8, all the sub-blocks in one 8×8 partition share the same reference frame. Thus, the reference search is performed for each 8×8 block.

3.3.3 Reference decision

Figure 7. Cost combination for reference decision

The reference decision is conducted in two steps. In the first step, a Possible Best Reference Frame (PBR) is selected from the reference frames other than Ref_0 for each sub-partition (one 16×16, two 16×8, two 8×16, or four 8×8). As mentioned in Section 3.3.2, the reference search is performed for each 8×8 block, and the motion costs corresponding to the different reference frames are obtained for each 8×8 block after the search. Thus, for partitions larger than 8×8, a combination of the 8×8 motion costs is required. As illustrated in Figure 7, C1, C2, C3, and C4 represent the motion costs of the four 8×8 blocks in a macroblock, and the different levels of C1~C4 represent the motion costs corresponding to different reference frames; three reference frames are shown in Figure 7. Generally, the PBR for each partition is selected by minimizing

J(REF, λ_MOTION) = P_Cost(REF) + λ_MOTION × R(REF)

where P_Cost(REF) is the motion cost (SATD) of the partition corresponding to REF. When selecting the best reference for block 0 of Mode 2, for example, P_Cost(REF) is the sum of C1(REF) and C2(REF).

In the second step, the motion cost of the prediction from Ref_0 (MCOST_Ref_0) and that from the PBR (MCOST_PBR) are compared to decide whether motion compensation in the PBR is required. Although the cost of the prediction from the PBR is just the result of a comparatively coarse motion search, it can still represent the prediction quality to some extent, since it is calculated with a similar and comparable method; nevertheless, these two motion costs should not be compared on equal terms. In the proposed algorithm, the following inequality is utilized to decide whether the motion compensation in the PBR is needed:

MCOST_PBR ≥ α × MCOST_Ref_0

where α is a multiplier larger than 1. When the above inequality holds, Ref_0 is regarded as the best reference frame, and the motion compensation in the PBR is omitted, as shown in Figure 5. The second step actually serves as an early termination decision, and only two of the reference frames are fully searched even in the worst case. The following sketch summarizes the two-step decision.
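The two-step decision can be summarized by the C sketch below. The layout and names (costs8x8, rate_ref) are hypothetical; the rough 8×8 costs are assumed to come from the integer-pixel reference search of Section 3.3.2, and the rate term is folded into both compared costs for brevity.

    #include <limits.h>

    #define MAX_REF 5

    /* Returns 0 when Ref_0 should be kept (PBR search skipped),
     * otherwise the index of the PBR to be searched as well.
     * costs8x8[ref][k]: rough motion cost of 8x8 block k for reference ref;
     * blk[]/nblk: which 8x8 blocks form the current partition.            */
    int decide_reference(const int costs8x8[MAX_REF][4],
                         const int *blk, int nblk,
                         int lambda, const int rate_ref[MAX_REF],
                         double alpha)                /* alpha > 1 */
    {
        int pbr = 1, pbr_cost = INT_MAX;
        int ref, k, cost_ref0;

        /* step 1: PBR = argmin J(REF) = P_Cost(REF) + lambda * R(REF) */
        for (ref = 1; ref < MAX_REF; ref++) {
            int c = lambda * rate_ref[ref];
            for (k = 0; k < nblk; k++)
                c += costs8x8[ref][blk[k]];
            if (c < pbr_cost) { pbr_cost = c; pbr = ref; }
        }

        cost_ref0 = lambda * rate_ref[0];
        for (k = 0; k < nblk; k++)
            cost_ref0 += costs8x8[0][blk[k]];

        /* step 2: early termination with the bias alpha, because the
           PBR cost comes from a coarser (integer-pixel) search        */
        if (pbr_cost >= alpha * cost_ref0)
            return 0;                                /* Ref_0 is best  */
        return pbr;                                  /* refine PBR too */
    }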

3.3.4 Simulation Results

The proposed algorithm is implemented in the JM 9.2 reference model. The maximum reference frame number is set to 5. Comparisons are made with the original JM 9.2 reference model, which is set to utilize the Fast Full Search within the maximum 5 reference frames. Various test sequences are used in the simulation to evaluate the R-D performance and the processing time saving. Table 3.6 and Table 3.7 show the average values of PSNR loss, bit-rate overhead, and processing time saving for sequences with different contents and picture sizes. Comparisons of R-D curves are presented in Figure 8.

Table 3.6 R-D performance and time saving for QCIF sequences

Sequence            SNR Loss (dB)   Bit-rate Overhead   Time Saving
Head with Glasses   –               –                   39.4%
Canoa               –               –                   34.7%
Husky               –               –                   33.9%
Football            –               –                   35.4%
Car Phone           –               –                   36.1%
* MCSR = 16, QP = 24, 28, 32, 36, 40

Table 3.7 R-D performance and time saving for CIF sequences

Sequence      SNR Loss (dB)   Bit-rate Overhead   Time Saving
Tempete       –               –                   35.5%
News          –               –                   47.7%
Mobile        –               –                   35.0%
Football      0.04            < 1%                37.5%
Coast Guard   0.05            < 1%                38.0%
* MCSR = 32, QP = 20, 25, 30, 35, 40

According to the simulation results, the proposed algorithm provides a 34%~47% processing time saving for the various test sequences. The maximum PSNR loss is about 0.20 dB (Mobile CIF), but most of the PSNR losses are less than 0.1 dB. Meanwhile, the average bit-rate overhead is less than 0.12%.

Figure 8. Comparisons of R-D curves (Tempete CIF, Football CIF, and Bus QCIF; PSNR vs. bit-rate for JM 9.2 with 5 reference frames and for the proposed algorithm)

3.3.5 Conclusion

Section 3.3 proposes a fast multiple reference frame motion estimation algorithm for H.264/AVC. A fast reference search and decision scheme is adopted to speed up the original multiple reference frame motion estimation algorithm. Taking the characteristics of H.264/AVC into consideration, several simplification approaches are utilized to provide a 38% processing time saving on average. In addition, the R-D performance loss is acceptable for most of the test sequences.

3.4 Feature-Detection-Based Early-Termination Algorithm for Multiple Reference Frame Motion Estimation in H.264/AVC

3.4.1 Introduction

Motion compensation in video coding algorithms is utilized to exploit the temporal correlation between frames. Intuitively, a closer reference frame usually has more correlation with the current frame. According to the simulation results shown in Table 3.4, the reference frame just before the current frame usually has the highest probability, sometimes more than 90%, of being the best reference frame. Therefore, it is quite inefficient that the reference model JM9.2 adopts uniform motion estimation processing for each possible reference frame.

In Section 3.3, the original multiple reference frames algorithm is sped up by a fast reference search and selection procedure. Differently, Section 3.4 analyzes the reasons why some particular video contents require more reference frames than others do, and then proposes three feature-based detectors to detect those contents. After the feature detection, motion estimation is conducted in further reference frames only when it is necessary, namely when the particular video contents occur in the scene. In addition, an adaptive SAD criterion TH_ET_SAD is set to guarantee that the motion prediction error is below an acceptable level. Given the target of simplification, only motion vector and SAD values are involved in the feature detection and early termination algorithms, since those values can be easily obtained from the original encoding process. Threshold selection is also critical for the algorithm performance: it results in strict or loose conditional decisions and trades the processing time off against the R-D performance.

3.4.2 Proposed Feature-Detection-Based Fast Algorithm

Overview

The simulation results in Table 3.4 show the distribution of the best reference frame for each 8×8 sub-block when the maximum reference frame number equals 5. F_t-1 ~ F_t-5 represent the five reference frames, and F_t-1 is the one just before the current frame. The distribution indicates that i) F_t-1 has the highest probability (59~96%) of being the best reference frame; and ii) basically, the closer a reference frame is to the current frame, the higher its probability of being the best frame. These tendencies suggest that the original brute-force search within all the reference frames is quite inefficient, and a possible simplification can be achieved by applying an early-termination scheme to omit unnecessary reference frames that contribute little to the coding efficiency.

On the other hand, the video contents influence the amount of the gain induced by MRF motion estimation. As shown in Table 3.3, the PSNR loss caused by reducing the reference frames varies within a large range for different sequences. For example, decreasing the reference frame number from 5 to 1 leads to a 1.2 dB PSNR loss for mobile but only 0.06 dB for football. This result indicates that MRF motion estimation provides an obvious gain for some special video contents, whereas it improves the coding efficiency only negligibly for the others. There are many reasons for MRF ME to achieve better prediction [35], which include fast repetitive motions, camera shaking, covering or uncovering of objects, fast luminance changes such as lighting, flash, or shadow changes, and noise or sampling alias. Considering their different impacts on the block-based encoding process adopted by H.264/AVC, most of them can be classified into the following four types:

Type 1. Fast repetitive motion;
Type 2. Boundary of covering or uncovering objects;
Type 3. Lighting, flash, fast shadow change, and covering or uncovering within a whole macroblock;
Type 4. Aliasing-sampling.

Among those four types, Types 1, 2, and 3 correspond to particular video contents, while Type 4 is the consequence of the finite sampling rate of digital images. The performance loss caused by Type 4 can be diminished to some extent by employing fractional-accuracy motion estimation [28], so Type 4 is not discussed in Section 3.4.

Covering or uncovering objects affect a macroblock in two different ways depending on whether the macroblock is located on the boundary of the covering object or not. When a macroblock is located on the boundary of a covering object, only parts of the block are being covered, which usually degrades the motion estimation result of one or several sub-macroblocks within the whole macroblock. In contrast, a macroblock may be totally covered by another object as the object motion proceeds; in this case, the motion estimation result of the whole macroblock is affected. Therefore, it is better to say that Type 2 content leads to the motion estimation failure of parts of a macroblock, while Type 3 content leads to the motion estimation failure of a whole macroblock.

Figure 9. Flow-chart of proposed feature-detection-based fast algorithm

The proposed algorithm employs feature-based detectors to detect the particular contents of Types 1 to 3, and it early-terminates the motion estimation in further reference frames unless one or several particular MRF-related contents are detected. In order to implement the feature-based detections with low complexity overhead, only motion vectors and SAD values are used in the algorithm. Figure 9 shows a simplified flow chart of the proposed algorithm. For each block in the current frame F_t, the motion estimation from the previous frame F_t-1 is always performed. After that, the detectors for Types 1, 2, and 3 check whether there is special content that requires motion estimation in further reference frames. In particular, a Possible Best Reference Frame (PBR) can be obtained when Type 1 content is detected. In order to make the algorithm robust, an adaptive SAD criterion TH_ET_SAD is set to ensure that the motion prediction error is below a certain level.

Intuitively, it is reasonable to utilize motion vectors to detect different motion contents. In many cases, however, a motion vector cannot represent the real motion, since it is actually just the result of a block match. Therefore, SAD-based constraints are imposed to strengthen the detection conditions, with the assumption that a smaller SAD usually results from more accurate motion estimation. On the other hand, several thresholds are introduced in the proposed algorithm, and they affect the algorithm performance to a large extent. According to the simulation results, SAD values and motion vectors vary over large ranges depending on the video contents; thus, it is difficult to obtain feature models of video contents by using fixed thresholds. The proposed algorithm adopts an adaptive approach to decide the thresholds for each detector. Basically, the thresholds are obtained by performing a prediction operation, in which the prediction values are derived from the SAD values and motion vectors of the previously encoded blocks. For the proposed detection algorithm, a missed detection leads to a loss of R-D performance, while a false alarm leads to an increase of processing time. Since TH_ET_SAD is utilized to ensure the overall R-D performance level, as illustrated in Figure 9, relatively strict decision conditions are applied in the feature detectors.

In the following subsections, the feature detection models are presented first, together with the methods to decide the corresponding thresholds; after that, the SAD-based early termination decision is presented.

Detector of Type 1

Figure 10. MVCP for detection of Type 1

As shown in Figure 10, F_t represents the current frame, while F_t-1, F_t-2, and F_t-3 represent the previous frames. If a repetitive motion occurs, the best match block may be obtained from reference frame F_t-2 rather than F_t-1. The proposed algorithm employs a prediction MV_CP by summing up the concatenated MVs of different frames, as shown below:

MV_CP,t,r-1 = MV_t,r + MV'_r,r-1    (3.4.1)

where r ∈ [t−MaxRefNum, t−1], and MaxRefNum represents the Maximum Reference Frame Number; MV_t,r represents the MV pointing from the current frame F_t to the current reference frame F_r; MV'_r,r-1 represents the MV pointing from F_r to F_r-1, which starts from the block that MV_t,r points to. It is obtained from the motion estimation that has already been performed on F_r, and it is stored for the calculation of MV_CP,t,r-1. Figure 10 illustrates an example of MV_CP,t,r-1 when r = t−1: when a fast repetitive motion occurs, for example, an object in F_t-2 within the two dotted lines moves away in F_t-1 and then comes back in F_t. With the assumption that the MVs are accurate enough, the modulus of MV_CP,t,r-1 should be smaller than that of MV_t,t-1. More generally, the detector for Type 1 content can be described as below:

|MV_CP,t,r-1| < |MV_t,r|
and SAD_t,r < TH_SAD1    (3.4.2)
and SAD'_r,r-1 < TH_SAD1

where SAD_t,r and SAD'_r,r-1 are the SAD values corresponding to MV_t,r and MV'_r,r-1, respectively. When inequalities (3.4.2) hold, the detector regards the current block as a block with Type 1 content. When Type 1 content occurs, inequalities (3.4.2) may hold for more than one reference frame; in that situation, a Possible Best Reference (PBR) is selected by minimizing the modulus of MV_CP,t,r-1. A simple reason is that a smaller MV usually leads to a smaller coding cost. When the PBR is not the nearest reference frame, one or more reference frames are skipped, and the motion estimation is performed in the PBR.

TH_SAD1 is utilized to ensure that the MVs used in the detector are reliable. A simple prediction from the neighboring blocks is adopted: as shown in Figure 11, the threshold is the median of the three SAD values of the neighboring blocks Block A, Block B, and Block C, which are the adjacent blocks on the left, top, and top-right of the current block. The threshold is obtained by applying the median prediction described in (3.4.3):

TH_SAD1 = Median(SAD_A, SAD_B, SAD_C)    (3.4.3)

A compact sketch of this detector follows.
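The C sketch below transcribes conditions (3.4.1)-(3.4.3). The vector modulus of (3.4.2) is approximated by an L1 norm here, and all names are illustrative; this is a sketch of the detector, not the JM-based implementation itself.

    #include <stdlib.h>   /* abs */

    typedef struct { int x, y; } MV;

    static int mv_norm(MV v) { return abs(v.x) + abs(v.y); }  /* |MV| proxy */

    static int median3(int a, int b, int c)
    {
        int lo = a < b ? a : b, hi = a < b ? b : a;
        return c < lo ? lo : (c > hi ? hi : c);
    }

    /* Type 1 (fast repetitive motion) detector: returns 1 when the
     * concatenated MV is shorter than the direct MV and both SADs are
     * reliable, i.e. below the median-predicted threshold (3.4.3).     */
    int detect_type1(MV mv_t_r, MV mv_r_r1, int sad_t_r, int sad_r_r1,
                     int sad_a, int sad_b, int sad_c)
    {
        int th_sad1 = median3(sad_a, sad_b, sad_c);              /* (3.4.3) */
        MV mv_cp = { mv_t_r.x + mv_r_r1.x, mv_t_r.y + mv_r_r1.y }; /* (3.4.1) */

        return mv_norm(mv_cp) < mv_norm(mv_t_r)                 /* (3.4.2) */
            && sad_t_r  < th_sad1
            && sad_r_r1 < th_sad1;
    }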

Figure 11. Neighboring blocks of current block

Figure 12. MV values and SAD values for detection of Type 2

Detector of Type 2

In order to detect Type 2 content, we adopt the MV dispersivity [30] and define the SAD difference, as in (3.4.4) and (3.4.5):

MV_disp = |MV* − MV_0| + |MV* − MV_1| + |MV* − MV_2| + |MV* − MV_3|    (3.4.4)

SAD_diff = SAD* − (SAD_0 + SAD_1 + SAD_2 + SAD_3)    (3.4.5)

where MV* and SAD* are the MV and SAD of the large block, such as a 16×16 block, while MV_i and SAD_i are those of the small blocks, such as the 8×8 blocks, as shown in Figure 12. When Type 2 content occurs, which means the motion estimation results are not good for parts of the macroblock, MV_disp tends to be larger than in the common situation (no Type 2 content). The detector for Type 2 content can be described as (3.4.6):

MV_disp > TH_MVdisp and SAD_diff > TH_SADdiff    (3.4.6)

When inequalities (3.4.6) hold, motion estimation is performed in the next further reference frame. Compared to [30], we use SAD criteria as well as the MV dispersivity to achieve more accurate detection. In addition, adaptive thresholds are utilized to make the proposed algorithm work well for different video sequences. TH_MVdisp and TH_SADdiff fulfill the detection of Type 2 and are derived from the corresponding MV_disp and SAD_diff values of the neighboring blocks, as shown in (3.4.7) and (3.4.8):

TH_MVdisp = (MV_disp_ref_r + MV_disp_ref_further) / 2    (3.4.7)

TH_SADdiff = (SAD_diff_ref_r + SAD_diff_ref_further) / 2    (3.4.8)

where MV_disp_ref_r is the mean of the MV_disp values of the neighboring blocks that use the current reference frame F_r as the best reference frame, while MV_disp_ref_further is the mean of the MV_disp values of the neighboring blocks that use further reference frames. Similarly, SAD_diff_ref_r is the mean of the SAD_diff values of the neighboring blocks that use the current reference frame F_r as the best reference frame, while SAD_diff_ref_further is the mean of the SAD_diff values of the neighboring blocks that use further reference frames. The neighboring blocks include Block A, Block B, Block C, and Block D, as shown in Figure 11. If all the neighboring blocks use F_r as the best reference frame, it is assumed that no Type 2 content occurs; if all the neighboring blocks use a further reference frame other than F_r as the best reference frame, it is assumed that Type 2 content occurs. A sketch of this detector is given below.
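The Type 2 detector of (3.4.4)-(3.4.6) can be sketched in C as below, again with an L1 approximation of the MV differences and illustrative names; the thresholds th_mv_disp and th_sad_diff are assumed to be precomputed from the neighboring blocks according to (3.4.7) and (3.4.8).

    #include <stdlib.h>

    typedef struct { int x, y; } MV;

    /* Type 2 (boundary of covering/uncovering) detector: mv16/sad16
     * belong to the 16x16 block, mv8[]/sad8[] to its four 8x8 blocks. */
    int detect_type2(MV mv16, const MV mv8[4], int sad16, const int sad8[4],
                     int th_mv_disp, int th_sad_diff)
    {
        int mv_disp = 0, sad_sum = 0, i;

        for (i = 0; i < 4; i++) {
            /* MV dispersivity, L1 version of (3.4.4) */
            mv_disp += abs(mv16.x - mv8[i].x) + abs(mv16.y - mv8[i].y);
            sad_sum += sad8[i];
        }
        return mv_disp > th_mv_disp
            && (sad16 - sad_sum) > th_sad_diff;    /* (3.4.5), (3.4.6) */
    }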

Detector for Type 3

Figure 13. Trends of inter/intra SAD with flash occurring

When Type 3 content occurs or ends, the fast luminance change or fast covering usually leads to bad block-match results, namely large SADs, from the nearest reference frame. A typical case is illustrated in Figure 13, in which a flash occurs in Frame 3 and lasts for only one frame, and there is no obvious motion of objects. It can be found that the inter SAD increases significantly when the flash occurs, while the intra SAD basically changes as usual. Therefore, we define the intra SAD increase and the inter SAD increase as (3.4.9) and (3.4.10):

Intra_SAD_inc_t = Intra_SAD_t − Intra_SAD_t-1    (3.4.9)

Inter_SAD_inc_t = Inter_SAD_t − Inter_SAD_t-1    (3.4.10)

where Intra_SAD_t is the intra prediction SAD value of the current block in the current frame F_t, and Intra_SAD_t-1 is that of the corresponding block in the previous frame F_t-1; Inter_SAD_t is the inter prediction SAD value of the current block in F_t, while Inter_SAD_t-1 is that of the corresponding block in F_t-1; both are obtained by performing motion estimation from the nearest reference frame. The detector can be described as (3.4.11) below:

Intra_SAD_inc_t < TH_Intra_SAD_inc and Inter_SAD_inc_t > TH_Inter_SAD_inc    (3.4.11)

When inequalities (3.4.11) hold, the detector assumes that Type 3 content occurs. Shadow changes or object covering can take place in a slow manner and last for many frames; if the duration exceeds the range of the maximum reference frame, the encoding efficiency does not benefit from MRF motion estimation. Besides, most lighting and flash effects last no more than three frames. Thus, the proposed algorithm only considers Type 3 contents occurring within three frames. Conditions (3.4.11) are checked for the macroblocks at the corresponding positions in consecutive frames. If (3.4.11) holds more than once within three frames, the first occurrence is regarded as the start frame of the Type 3 content, and MRF is enabled for the macroblock in the frames after the start frame. TH_Intra_SAD_inc and TH_Inter_SAD_inc are obtained as described in (3.4.12) and (3.4.13):

TH_Intra_SAD_inc = MAE(Intra_SAD_i)    (3.4.12)

TH_Inter_SAD_inc = MAE(Inter_SAD_i)    (3.4.13)

where i ∈ [t−MaxRefNum, t−1] and i ∉ I_T3; I_T3 is the set of indexes of the previous frames with Type 3 contents, and MAE represents the calculation of the Mean Absolute Error. A sketch of this detector is given below.
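A minimal C sketch of the Type 3 detector is given below. Two interpretations are assumed here: the MAE thresholds of (3.4.12) and (3.4.13) are computed as the mean absolute SAD increases over the recent frames without Type 3 content, and the inter-SAD test of (3.4.11) is taken in the direction suggested by the flash analysis of Figure 13 (inter SAD jumps while intra SAD stays flat). Names and history handling are illustrative.

    #include <stdlib.h>

    /* Type 3 (lighting/flash/whole-block covering) detector.
     * intra_inc_hist/inter_inc_hist: SAD increases of the co-located
     * macroblock in previous frames without Type 3 content.           */
    int detect_type3(int intra_sad_t, int intra_sad_t1,
                     int inter_sad_t, int inter_sad_t1,
                     const int *intra_inc_hist, const int *inter_inc_hist,
                     int hist_len)
    {
        int th_intra = 0, th_inter = 0, i;
        int intra_inc, inter_inc;

        /* MAE thresholds, (3.4.12)/(3.4.13) */
        for (i = 0; i < hist_len; i++) {
            th_intra += abs(intra_inc_hist[i]);
            th_inter += abs(inter_inc_hist[i]);
        }
        if (hist_len > 0) { th_intra /= hist_len; th_inter /= hist_len; }

        intra_inc = intra_sad_t - intra_sad_t1;      /* (3.4.9)  */
        inter_inc = inter_sad_t - inter_sad_t1;      /* (3.4.10) */

        /* flash-like event: inter SAD jumps while intra SAD stays flat */
        return intra_inc < th_intra && inter_inc > th_inter;  /* (3.4.11) */
    }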

SAD-Based Early Termination Decision

Although three different feature-based detectors are designed as presented above, it is still very difficult to make an accurate decision on the reference frame number for blocks with various contents and contexts. First, Type 1 ~ Type 3 contents cannot cover all the video features that require further reference frames. Second, being based on simple feature models involving only MVs and SADs, even the detection itself is not accurate enough, since MV and SAD are just the results of block matching and can only represent the actual motion to some extent. Therefore, in order to ensure the coding efficiency, an SAD-based early termination decision is employed in the proposed algorithm. When none of the special contents (Type 1 ~ Type 3) is detected, the SAD value of the current block is checked against an adaptive SAD criterion (TH_ET_SAD). If the current SAD value is larger than TH_ET_SAD, which means the inter prediction from the current reference frame is still not good enough, motion estimation needs to be performed in further reference frames. TH_ET_SAD is derived as (3.4.14):

TH_ET_SAD = β × SAD_pred    (3.4.14)

where SAD_pred represents the prediction value of the SAD for the current block, which is obtained by median prediction from the SAD values of the neighboring blocks, as shown in (3.4.15); β is an adjusting factor, which varies from 0.75 to 1.05 according to the QP (Quantization Parameter) and the reference frame index. The prediction of the SAD tends to be inaccurate as QP increases, so more reference frames are required to ensure the coding efficiency; therefore, β is designed to be negatively correlated with QP. On the other hand, the probability of terminating the motion estimation for the current block tends to be higher as the reference frame index increases (searching in further reference frames); thus, β is designed to be positively correlated with the reference frame index. The β values are obtained experimentally and stored in a look-up table.

SAD_pred = Median(SAD_A, SAD_B, SAD_C)    (3.4.15)

A sketch of this early-termination check follows.
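The early-termination check of (3.4.14) and (3.4.15) can be sketched as follows. The entries of beta_table[] are placeholders within the stated 0.75~1.05 range and are indexed by the reference index only; in the actual algorithm, β also depends on QP and is taken from an experimentally derived look-up table.

    /* illustrative beta values, positively correlated with ref index */
    static const double beta_table[5] = { 0.75, 0.85, 0.90, 1.00, 1.05 };

    static int median3(int a, int b, int c)
    {
        int lo = a < b ? a : b, hi = a < b ? b : a;
        return c < lo ? lo : (c > hi ? hi : c);
    }

    /* Returns 1 when motion estimation may stop at the current reference. */
    int early_terminate(int cur_sad, int ref_idx,
                        int sad_a, int sad_b, int sad_c)
    {
        int sad_pred = median3(sad_a, sad_b, sad_c);     /* (3.4.15) */
        double th = beta_table[ref_idx] * sad_pred;      /* (3.4.14) */
        return cur_sad <= th;
    }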

3.4.3 Simulation Results

The proposed algorithm is implemented based on the JM 9.2 reference model. The maximum reference frame number is set to 5. Comparisons are made with the original JM 9.2 reference model, which is set to utilize the Fast Full Search within the maximum 5 reference frames. Various test sequences are used in the simulation to evaluate the R-D performance and the time saving of the whole encoding process. Table 3.8 and Table 3.9 show the values of PSNR loss, bit-rate overhead, and processing time saving for QCIF sequences and CIF sequences, respectively. Comparisons of R-D curves are presented in Figure 14. For QCIF sequences, QP is set to 24, 28, 32, 36, and 40, with search range [-16, 15]. For CIF sequences, QP is set to 20, 25, 30, 35, and 40, with search range [-32, 31]. According to the simulation results, the proposed algorithm provides a 25%~40% (average 35%) processing time saving for the various test sequences. The maximum PSNR loss is about 0.20 dB (Mobile CIF), but most of the PSNR losses are less than 0.1 dB (average 0.08 dB). Meanwhile, the average bit-rate overhead is less than 2%.

Table 3.8 R-D performance and time saving for QCIF sequences

Sequence          SNR Loss (dB)   Bit-rate Overhead   Time Saving
Mother&Daughter   –               –                   36.7%
Bus               –               –                   38.0%
Paris             –               –                   39.4%
Mobile            –               –                   37.2%
Husky             –               –                   26.7%
Canoa             –               –                   32.2%
Container         –               –                   30.2%
* Search Range = 16, QP = 24, 28, 32, 36, 40

Table 3.9 R-D performance and time saving for CIF sequences

Sequence            SNR Loss (dB)   Bit-rate Overhead   Time Saving
Head with Glasses   –               –                   35.0%
Tempete             –               –                   36.5%
News                –               –                   34.4%
Akiyo               –               –                   33.5%
Football            –               –                   32.6%
Stefan              –               –                   41.5%
Foreman             –               –                   38.9%
* Search Range = 32, QP = 20, 25, 30, 35, 40

Figure 14. Comparisons of R-D curves (Mobile QCIF, Husky QCIF, Container QCIF, Akiyo CIF, Tempete CIF, and Stefan CIF; PSNR vs. bit-rate for JM 9.2 with 5 reference frames and for the proposed algorithm)

3.4.4 Conclusion

Section 3.4 presents a feature-detection-based early termination approach for the multiple reference frame motion estimation algorithm in H.264/AVC. Three feature-based detectors with adaptive thresholds are designed to detect the video contents that require more reference frames. To keep the complexity overhead as low as possible, only the MV and SAD values resulting from the previous coding process are involved. The adaptability of the thresholds makes the algorithm stable over a large range of video sequences. Moreover, an SAD criterion is set to ensure that the prediction error stays below a certain level. Simulation results show that the proposed algorithm provides a 35% processing time saving on average with a negligible coding efficiency loss.

3.5 Fast Multiple Reference Frames Motion Estimation with Transform-domain Analysis

3.5.1 Introduction

When the maximum reference frame number equals 5 (Ref_0~Ref_4 represent the five reference frames, and Ref_0 is the one just before the current frame), as mentioned in Sections 3.3 and 3.4, the simulation results indicate that i) Ref_0 has the highest probability (59~96%) of being the best reference frame; and ii) basically, the closer a reference frame is to the current frame, the higher its probability of being the best frame. These tendencies suggest that a simplification of the original motion estimation can be achieved by applying an early-termination scheme at the frame level rather than performing the brute-force search within all the reference frames. On the other hand, the video contents influence the amount of the gain induced by MRF motion estimation. There are many reasons for MRF ME to achieve better prediction, and most of them can be classified into the following four types, which have been discussed in Section 3.4. We re-list them below:

Type 1. Repetitive motions or camera shaking;
Type 2. Covering or uncovering objects;
Type 3. Lighting, flash, or fast shadow change;
Type 4. Aliasing-sampling.

Among those four types, Type 4 is the result of the finite sampling rate of digital images, while the first three correspond to particular video contents. Section 3.4 proposes different feature-based detectors for these particular contents. In Section 3.5, the correlation between aliasing-sampling and MRF motion estimation is discussed. Furthermore, a Hadamard-transform-based method is proposed to detect whether strong aliasing occurs in the current block, and thus to decide whether motion estimation in further reference frames is required. In the meantime, a QP-related threshold is derived for each transform residual in a 4×4 block to make the early termination more accurate.

3.5.2 Aliasing-sampling

Figure 15. Sampling and aliasing in frequency domain

The digitalization of a real image is a sampling procedure. Considered in the frequency domain, sampling corresponds to a periodic repetition of the Fourier transform. According to Nyquist's sampling theorem, if the sampling rate is less than the Nyquist rate (two times the bandwidth of the original signal), the original signal cannot be recovered from the sampled signal, as shown in Figure 15; in other words, aliasing takes place in this case. For real-world images, the bandwidth is very large, but the sampling rate is always limited to a finite value. Therefore, aliased sampling is very common in digital images, and it can easily be seen that the high-frequency components dominate the aliasing. For an original signal with a bandwidth of less than half the sampling rate, no aliasing occurs.

Figure 16. An example of the impact of aliasing-sampling on motion estimation

In the spatial domain of the picture, aliased sampling can be illustrated with the example shown in Figure 16. A point on an integer-pixel position of the current frame may not be on an integer- or sub-pixel position of the previous frame, but it could be exactly on an integer-pixel position in some other reference frame. That is the reason why MRF motion estimation can diminish the prediction performance loss caused by aliasing-sampling. Two efficient methods have been adopted to reduce the influence of aliasing-sampling: one is fractional-pixel accurate motion compensation with interpolation filtering; the other is MRF motion estimation [28].

Intuitively, MRF motion estimation is necessary only when the aliasing is heavy. Thus, motion estimation in further reference frames can be enabled only when it is necessary, in order to reduce the computational complexity of the original algorithms. However, it is very difficult to measure the impact of aliasing directly. Therefore, other signal features need to be explored to find the correlation between them and the aliasing impact.

As mentioned before, the high-frequency components dominate the aliasing impact. In other words, when high-frequency components account for more of the energy, the aliasing is usually stronger, so more reference frames are required. Taking the test sequence mobile as an example, the video contents in mobile have a lot of complex details, which are represented by more high-frequency component energy; correspondingly, the mobile sequence requires more reference frames to achieve sufficient R-D performance. The same holds when considering the high-frequency components of the prediction error signal. With the assumptions of i) no displacement estimation errors and ii) no quantization errors, Ref. [41] gives an analysis of the correlation between aliasing and the prediction error signal of motion compensation, as shown in (3.5.1) and (3.5.2):

E_t(jΩ) = 2 A_t-1(jΩ) sin((d_x/2 + π/Ω_s) Ω) e^(−j (d_x/2 + π/Ω_s) Ω)    (3.5.1)

|E_t(jΩ)| = 2 |A_t-1(jΩ)| |sin((d_x/2 + π/Ω_s) Ω)|    (3.5.2)

where A_t-1(jΩ) indicates the aliasing components, which usually are the high-frequency components of the underlying original signal, Ω_s is the sampling frequency, and d_x is the displacement. Equations (3.5.1) and (3.5.2) show that the high-frequency components dominate the prediction errors. Thus, the next problem is how to obtain the high-frequency components and the percentage they account for in the whole signal spectrum.

3.5.3 Transform domain analysis

It is easy to obtain spectrum information in the transform domain. The well-known Discrete Cosine Transform (DCT) is a very good example of this. For an n×n 2-D DCT, all the transformed coefficients have an explicit meaning in the frequency domain. The coefficient in the top-left corner of the transformed coefficient matrix represents the DC component of the signal, while the coefficient in the bottom-right corner represents the highest frequency component. For the other coefficients, the longer the distance from the DC coefficient, the higher the frequency component represented. However, it is not very convenient to utilize the DCT to obtain spectrum information in an H.264 encoder, since the DCT is conducted after the motion compensation stage.

Like the 2-D DCT, the 2-D Hadamard transform is also an orthogonal separable transform, and it is adopted in H.264 to calculate the SATD. A Hadamard transform can be represented as

Tr = H R H

where Tr is the transformed residue matrix, H is the Hadamard matrix, and R is the residue matrix. For a 4×4 Hadamard transform, the Hadamard matrix (in sequency order) can be represented as

        [ 1   1   1   1 ]
    H = [ 1   1  -1  -1 ]    (3.5.3)
        [ 1  -1  -1   1 ]
        [ 1  -1   1  -1 ]

Compared with the DCT, the Hadamard transform has a similar energy concentration effect (though not as good as the DCT) and a similar frequency interpretation in the transform domain. For the transformed coefficient matrix shown in the following:

         [ tr00  tr01  tr02  tr03 ]
    Tr = [ tr10  tr11  tr12  tr13 ]    (3.5.4)
         [ tr20  tr21  tr22  tr23 ]
         [ tr30  tr31  tr32  tr33 ]

tr00 represents the DC component, while tr33 represents the highest frequency component. Furthermore, the Hadamard transform has very low computational complexity, and a fast algorithm (FHT) is readily available. More importantly, the Hadamard transformed coefficients can be obtained directly from previous encoding processes, since the transform has already been adopted in H.264 to calculate the SATD.

3.5.4 Proposed algorithm

As shown in Figure 17, the proposed algorithm is implemented within the framework of JM9.2. The Hadamard transform results in JM9.2 are utilized to obtain the high-frequency components, as shown in (3.5.5)-(3.5.7):

SATD = Σ_{i=0..3} Σ_{j=0..3} |tr_i,j|    (3.5.5)

SAHFC = Σ_{i+j≥3} |tr_i,j|    (3.5.6)

HFCR = SAHFC / SATD    (3.5.7)

where SATD denotes the sum of absolute transformed differences, SAHFC denotes the sum of absolute high-frequency components, and HFCR denotes the high-frequency component rate, which is a normalization by SATD. The sketch below illustrates this computation.
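The following C sketch computes SATD, SAHFC, and HFCR for one 4×4 residue block according to (3.5.5)-(3.5.7). The sequency-ordered Hadamard matrix and the i + j ≥ 3 high-frequency region are assumptions consistent with the frequency interpretation given above.

    #include <stdlib.h>

    static const int H[4][4] = {
        { 1,  1,  1,  1 },
        { 1,  1, -1, -1 },
        { 1, -1, -1,  1 },
        { 1, -1,  1, -1 },
    };

    double hfcr_4x4(const int r[4][4])     /* r: motion-compensated residue */
    {
        int tmp[4][4], tr[4][4];
        int i, j, k, satd = 0, sahfc = 0;

        /* Tr = H * R * H (H is symmetric, so H equals its transpose) */
        for (i = 0; i < 4; i++)
            for (j = 0; j < 4; j++) {
                tmp[i][j] = 0;
                for (k = 0; k < 4; k++)
                    tmp[i][j] += H[i][k] * r[k][j];
            }
        for (i = 0; i < 4; i++)
            for (j = 0; j < 4; j++) {
                tr[i][j] = 0;
                for (k = 0; k < 4; k++)
                    tr[i][j] += tmp[i][k] * H[k][j];
            }

        for (i = 0; i < 4; i++)
            for (j = 0; j < 4; j++) {
                satd += abs(tr[i][j]);             /* (3.5.5) */
                if (i + j >= 3)
                    sahfc += abs(tr[i][j]);        /* (3.5.6) */
            }
        return satd ? (double)sahfc / satd : 0.0;  /* (3.5.7) */
    }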

Figure 17. Flow chart of proposed algorithm

For the current block, if HFCR > TH_hfcr, motion estimation is performed in further reference frames; otherwise, the motion estimation is terminated after the ME for the current reference frame. In cases where the tr_i,j and SATD are very small, the HFCR criterion cannot work well. Fortunately, when the tr_i,j and SATD are very small, the DCT transformed coefficients tend to become zero after quantization, which indicates that the prediction errors are already small enough. Thus, no more reference frames are required, and the motion estimation can be terminated after the current reference frame.

Therefore, we introduce a QP-related threshold TH_tr for each tr_i,j. If a tr_i,j is smaller than the threshold, it is set to zero before calculating SAHFC, as shown in (3.5.8):

If (|tr_i,j| < TH_tr) then tr_i,j = 0    (3.5.8)

A modified SATD can then be obtained based on the modified tr_i,j. By doing so, small tr_i,j values usually yield an SATD equal to zero; thus, the motion estimation can be terminated when the SATD of the current block equals zero.

Two thresholds are introduced in the proposed algorithm: TH_hfcr and TH_tr. TH_hfcr is decided experimentally. TH_tr is utilized to decide whether a transform coefficient is small enough that it tends to become zero after quantization. The quantization process can be described as:

QM[i][j] = (TR[i][j] × quant_coef[i][j] + qp_const) >> qp_bits

where QM[i][j] is the 4×4 matrix of quantized coefficients, TR[i][j] is the 4×4 matrix of transform residuals, quant_coef[i][j] is the quantization coefficient matrix, qp_bits = QP/6 + 15, and qp_const = (1 << qp_bits)/6. It can be observed that QM[i][j] equals zero only when the following condition holds:

(TR[i][j] × quant_coef[i][j] + qp_const) < 2^qp_bits    (3.5.9)

Therefore, the threshold for each transform residual can be obtained as (3.5.10):

TH_tr[i][j] = (2^qp_bits − qp_const) / quant_coef[i][j]    (3.5.10)

The sketch below shows this threshold computation.
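A sketch of the threshold computation (3.5.10) in C is shown below, assuming the inter-block constant qp_const = (1 << qp_bits)/6 quoted above; quant_coef stands in for the 4×4 quantization coefficient matrix associated with the given QP.

    /* QP-dependent thresholds below which a transform residual is
     * treated as zero before computing SAHFC, following (3.5.10).  */
    void compute_th_tr(int qp, const int quant_coef[4][4], int th_tr[4][4])
    {
        int qp_bits  = qp / 6 + 15;
        int qp_const = (1 << qp_bits) / 6;    /* inter case, as quoted above */
        int i, j;

        for (i = 0; i < 4; i++)
            for (j = 0; j < 4; j++)
                th_tr[i][j] = ((1 << qp_bits) - qp_const) / quant_coef[i][j];
    }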

3.5.5 Simulation Results

The proposed algorithm is implemented in the JM 9.2 reference model. The maximum reference frame number is set to 5. Comparisons are made with the original JM 9.2 reference model, which is set to utilize the Fast Full Search within the maximum 5 reference frames. Test sequences with different contents are utilized in the simulation to evaluate the R-D performance and the processing time saving. Table 3.10 and Table 3.11 show the average values of PSNR loss, bit-rate overhead, and processing time saving for the sequences with different contents. According to the simulation results, the proposed algorithm provides a 30%~35% processing time saving with an acceptable coding efficiency loss.

Table 3.10 R-D performance and time saving for QCIF sequences

Sequence          SNR Loss (dB)   Bit-rate Overhead   Time Saving
Mother&Daughter   –               –                   31.5%
Bus               –               –                   28.0%
Mobile            –               –                   25.2%
Husky             –               –                   29.7%
Container         –               –                   30.4%
* Search Range = 16, QP = 24, 28, 32, 36, 40

Table 3.11 R-D performance and time saving for CIF sequences

Sequence            SNR Loss (dB)   Bit-rate Overhead   Time Saving
Head with Glasses   –               –                   31.2%
News                –               –                   30.4%
Akiyo               –               –                   29.5%
Football            –               –                   32.6%
Foreman             –               –                   31.9%
* Search Range = 32, QP = 20, 25, 30, 35, 40

3.5.6 Conclusion

Section 3.5 presents a fast multiple reference frame motion estimation algorithm with transform-domain analysis. The relation between the high-frequency components of the transform residual and the aliasing phenomenon is analyzed and utilized to implement an efficient early termination of the multiple reference frames search. Meanwhile, a QP-related threshold is derived for each transform residual in a 4×4 block to make the early termination more accurate. Since the transform-domain analysis is based on the original inter-prediction algorithm framework, the complexity overhead is very small. Simulation results show that the proposed algorithm achieves a 30% processing time saving on average for various test sequences, with no obvious R-D performance loss.

3.6 Conclusion

The motion estimation in H.264/AVC introduces three major new coding tools: variable block size, quarter-pixel accurate motion estimation, and multiple reference frames motion estimation. Chapter 3 focuses on the study of the third one, multiple reference frames motion estimation. According to the results of the simulation and analysis presented in Section 3.2, different approaches are proposed to reduce the computational complexity of the original algorithms.

Section 3.3 proposes a fast algorithm with fast reference search and decision (referred to as Algorithm I). With consideration of the characteristics of H.264/AVC, several simplification techniques are utilized to provide on average a 38% processing-time saving for various sequences. Section 3.4 presents a feature-detection-based early termination approach for multiple reference frame motion estimation (referred to as Algorithm II). Three feature-based detectors with adaptive thresholds are designed to detect the video contents requiring more reference frames. To keep the complexity overhead as low as possible, only the MV and SAD values resulting from the previous coding process are involved. The adaptability of the thresholds makes the algorithm stable over a large range of video sequences. Moreover, an SAD criterion is set to ensure that the prediction error stays below a certain level. Simulation results show that a 35% processing time saving can be achieved on average. Section 3.5 presents a fast multiple reference frame motion estimation algorithm with transform-domain analysis (referred to as Algorithm III). The relation between the high-frequency components of the transform residual and the aliasing phenomenon is analyzed and utilized to implement an efficient early termination of the multiple reference frames search. Meanwhile, a QP-related threshold is derived for each transform residual in a 4×4 block to make the early termination more accurate. Since the transform-domain analysis is based on the original inter-prediction algorithm framework, the complexity overhead is very small. Simulation results show that the proposed algorithm achieves a 30% processing time saving on average for various test sequences, with no obvious R-D performance loss.

Although all three algorithms introduce very small computational complexity overhead, Algorithm I and Algorithm II are not suitable for hardware implementation, since the former introduces irregular access to the reference frame buffer, while the latter requires a noticeable memory capacity overhead to store the motion information of several previous frames. In contrast, Algorithm III is hardware-friendly and suits current semiconductor technology.

In order to keep the problem simple and clear for discussion and exploration, Chapter 3 focuses only on the impact of the multiple reference frames feature. The simulation results indicate that Algorithms I, II, and III provide on average 38%, 35%, and 30% processing-time savings, respectively. The results also imply that it is very hard to achieve a processing-time saving higher than 50% by utilizing only optimizations regarding multiple reference frames. Therefore, combined approaches should be taken into consideration for further speed improvement, which has been confirmed by some newer results of our work. One example of such combined approaches is to adjust the reference frame number and the search range together, which provides more than a 60% processing-time saving.

Chapter 4 Architecture Design of New Coding Tools

4.1 Introduction

As mentioned in Chapter 2, H.264/AVC introduces many new coding tools. Among those coding tools, the in-loop deblocking filter and CABAC entropy coding are two very important parts. The in-loop deblocking filter reduces the bit rate typically by 5%-10% while producing the same objective quality as the non-filtered video [2], but it is compute intensive and easily accounts for one-third of the computational complexity of a decoder [44][45]. In particular, the hardware implementation of the deblocking filter is challenging due to its high adaptability and also due to its complex and intensive data access patterns. Compared to the other optional entropy coding scheme, CAVLC, CABAC typically provides a bit-rate reduction of 9%-14% [56], at the cost of a complexity increase of 25%-30% for encoding and 12% for decoding [5]. Since CABAC involves bit-wise operations and complicated data dependencies between concatenated operations, it is very hard to increase the hardware throughput by utilizing general parallelization schemes. Therefore, a highly efficient hardware implementation of CABAC is also a challenge.

The rest of Chapter 4 is organized as follows. Section 4.2 presents a cost-efficient deblocking filter architecture for H.264/AVC; parallelism is explored with various approaches for the memory sub-system and the datapath, respectively. Section 4.3 presents a multipurpose CABAC codec architecture for H.264/AVC, which supports both encoding and decoding of CABAC by using a highly efficient combined architecture.

4.2 Architecture Design of In-Loop Deblocking Filter Engine for H.264/AVC

4.2.1 Introduction

Although several new features are introduced, H.264/AVC follows a block-based hybrid coding approach similar to previous video coding standards, in which each picture is represented and processed in block-shaped units. It is well known that block-based processing, such as block-based prediction, transformation, and quantization, induces a lot of distortion or noise at the boundaries of blocks, which degrades both the objective and subjective quality of video streams. In order to eliminate or diminish this kind of block artifact, two different deblocking filtering schemes have been proposed: the post filter and the in-loop filter. The latter is employed in H.264/AVC, as shown in Figure 2. The advantages of the in-loop filter over the post filter are discussed in [43]. Experimental results show that the in-loop deblocking filter reduces the bit rate typically by 5%-10% while producing the same objective quality as the non-filtered video [2].

However, deblocking filtering in H.264/AVC is compute intensive due to its high adaptability and also due to its complex and intensive data access patterns. It easily accounts for one-third of the computational complexity of a decoder [44][45]. In order to remove the block artifacts efficiently, H.264/AVC adopts a highly adaptive deblocking filter scheme, which heavily increases the complexity of the filtering operations. Several parameters and thresholds, and also the local characteristics of the picture itself, control the strength of the filtering process. Those parameters include some syntax elements in the bitstream (such as the quantization parameter QP), the boundary strength (which represents the difference between the two blocks on either side of the edge), and the gradient of samples across the edge. All the filter thresholds are quantizer dependent, since blocking artifacts always become more severe when coarse quantization is performed. For hardware implementation, the adaptive filter can be realized by multiple independent filters combined with a selection circuit. This selection circuit selects the proper filter and parameters based on content-dependent checks.

Figure 18. Edges that need to be filtered in a macroblock

Data access is another main reason for the high complexity of deblocking filtering. As shown in Figure 18, deblocking filter operations are applied to the horizontal and vertical edges of every 4×4 block in a picture. Moreover, almost every sample in a picture needs to be accessed. For LSI implementation, an efficient memory system is required to support parallel data access in both the horizontal and vertical directions. Refs. [46] and [47] proposed different deblocking filter architectures for H.264/AVC, but both of them employed straightforward memory arrangements, which cannot fulfill parallel access in two directions. Consequently, the performance and implementation efficiency of these designs are limited. Different parallel memory systems have been proposed to achieve efficient access to rows, columns, diagonals, and subarrays without memory conflicts [48][49]. However, these schemes suffer from low memory utilization, difficulty in address generation, or non-constant access times for different access modes. Another efficient parallel memory scheme, known as skewed memory, can support access in both the horizontal and vertical directions very well [50]. Based on the concept of skewed memory, the proposed architecture adopts a 2-dimensional parallel memory to increase the throughput of deblocking filtering.

Figure 19. Pixels involved in a filtering operation

In Section 4.2, an architecture design for the deblocking filter in H.264/AVC is proposed. A parallel memory scheme with linear shifting/rotating addressing circuits is adopted to support parallel access in both the horizontal and vertical directions. In the datapath, hardware reuse is achieved by optimizing the original filtering algorithms of the H.264/AVC specification. A 4-stage pipeline scheme is also employed to improve the throughput. The rest of Section 4.2 is organized as follows. Section 4.2.2 gives an introduction to the deblocking algorithm in H.264/AVC. In Section 4.2.3, the architecture design of the deblocking filter is presented. Implementation results and comparisons are shown in Section 4.2.4. Finally, we draw a conclusion in Section 4.2.5.

4.2.2 Deblocking Filter Algorithm in H.264/AVC

In H.264/AVC, adaptive 1-dimensional filters are adopted, and deblocking filtering is applied to the edges of the 4×4 blocks in each macroblock. These edges are indicated by the dotted lines shown in Figure 18. The inputs of each filtering operation include four pixels on either side of an edge, as shown in Figure 19. The result of each filtering operation affects up to three pixels on either side of the edge (i.e., p2, p1, p0, q0, q1, q2). As stated in Section 4.2.1, the parameters and the number of filter taps can vary adaptively according to the coding contents. First, a boundary strength (Bs) is assigned depending on the difference between the encoding characteristics of the two blocks on either side of the edge. Here, the encoding characteristics involve the prediction modes, reference pictures, number of reference pictures, values of motion vectors, and so on. For example, when the two blocks are intra coded and the edge is a macroblock boundary, Bs is assigned the value 4, which means the strongest filtering will be or may be applied to this edge. Second, for a set of eight samples across the edge, a gradient-like analysis is performed to decide whether the filtering should be switched off to preserve sharp features caused by the actual picture source rather than by blocking artifacts. The filtering is switched off if any of the following conditions is not true:

Bs > 0,  |p0 − q0| < α,  |p1 − p0| < β,  |q1 − q0| < β

A sketch of this per-edge decision is given below.
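In C, the switch-off check amounts to the following predicate; the QP-dependent thresholds α and β of the standard are assumed to be given.

    #include <stdlib.h>

    /* Per-edge filter switch-off check of H.264/AVC deblocking:
     * the edge is filtered only when Bs > 0 and the local gradients
     * stay below the QP-dependent thresholds alpha and beta. p1, p0
     * and q0, q1 are the samples nearest the edge on either side.    */
    int edge_is_filtered(int bs, int p1, int p0, int q0, int q1,
                         int alpha, int beta)
    {
        return bs > 0
            && abs(p0 - q0) < alpha
            && abs(p1 - p0) < beta
            && abs(q1 - q0) < beta;
    }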

4.2.3 Architecture Design of Deblocking Filter

Two important problems need to be handled in the hardware design of the deblocking filter in H.264/AVC: i) efficient memory access and ii) a high-performance architecture design for the adaptive filter. The proposed architecture employs a skewed memory arrangement scheme to achieve efficient parallel memory access in two directions. Due to the low dependency between the input data of two filtering operations, our design also utilizes pipelining to exploit the parallelism of the filtering operations.

Figure 20. Block diagram of the proposed deblocking filter

A finite impulse response (FIR) filter is usually implemented as a single-input single-output (SISO) system, in which signal samples are input and output serially. However, since the input sequence of each deblocking filtering operation is 8 samples and the filter adapts for every 8 samples, an ordinary serial filter suffers from low performance. On the other hand, it is easy to realize parallel input and output of multiple pixel samples, since a block or macroblock of the picture is usually stored in on-chip memory before and after processing. Therefore, it is advantageous to employ parallel filters combined with a parallel memory sub-system, which achieves higher throughput as well as higher implementation efficiency.

The block diagram of the proposed deblocking filter is shown in Figure 20. The control unit sends filter parameters, such as the boundary strength, to the adaptive filter unit. It also generates the address and read/write control signals for the parallel memory unit. The parallel memory unit stores the samples of a macroblock (16 luma 4×4 blocks and 8 chroma 4×4 blocks) and the samples of the adjacent blocks (8 luma 4×4 blocks and 8 chroma 4×4 blocks) before and after deblocking filtering. Dual-port SRAMs are used to make it possible to perform one read operation and one write operation within one clock cycle. The adaptive filter fulfills the 1-dimensional filter operation for the vertical and horizontal edges. A parallel filter structure is used to obtain high throughput. Memory accesses and filtering operations are executed in a pipelined manner. In each cycle, 8 pixels are sent from the parallel memory unit to the adaptive filter unit and 8 filtered pixels are written back.

Parallel Memory Unit

The 2-dimensional memory unit is comprised of 8 dual-port SRAM modules, which ensures the parallel access of 8 pixels. The 4×4 blocks in a macroblock and the corresponding adjacent 4×4 blocks are mapped and stored sequentially in the memory unit as shown in Figure 21; a sketch of one possible skewed mapping follows.

Figure 21. Memory mapping of 4×4 blocks
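As a sketch of the idea behind the skewed arrangement, the mapping below assigns pixel (x, y) to bank (x + y) mod 8. Under this assumption, any 8 consecutive pixels along a row or along a column fall into 8 distinct banks, so both access directions are conflict-free; the actual mapping of the proposed design may differ in detail.

    typedef struct { int bank; int addr; } MemLoc;

    /* Skewed mapping for 8-bank parallel access: the bank index is
     * skewed by the row index, so a horizontal run and a vertical run
     * of 8 pixels each hit all 8 banks exactly once.                  */
    MemLoc skewed_map(int x, int y, int width)   /* pixel (x, y) */
    {
        MemLoc m;
        m.bank = (x + y) % 8;           /* skew by row index            */
        m.addr = (y * width + x) / 8;   /* word address within the bank */
        return m;
    }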


More information

MPEG-4: Simple Profile (SP)

MPEG-4: Simple Profile (SP) MPEG-4: Simple Profile (SP) I-VOP (Intra-coded rectangular VOP, progressive video format) P-VOP (Inter-coded rectangular VOP, progressive video format) Short Header mode (compatibility with H.263 codec)

More information

Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE Gaurav Hansda

Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE Gaurav Hansda Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE 5359 Gaurav Hansda 1000721849 gaurav.hansda@mavs.uta.edu Outline Introduction to H.264 Current algorithms for

More information

MPEG-4 Part 10 AVC (H.264) Video Encoding

MPEG-4 Part 10 AVC (H.264) Video Encoding June 2005 MPEG-4 Part 10 AVC (H.264) Video Encoding Abstract H.264 has the potential to revolutionize the industry as it eases the bandwidth burden of service delivery and opens the service provider market

More information

Interframe coding A video scene captured as a sequence of frames can be efficiently coded by estimating and compensating for motion between frames pri

Interframe coding A video scene captured as a sequence of frames can be efficiently coded by estimating and compensating for motion between frames pri MPEG MPEG video is broken up into a hierarchy of layer From the top level, the first layer is known as the video sequence layer, and is any self contained bitstream, for example a coded movie. The second

More information

Outline Introduction MPEG-2 MPEG-4. Video Compression. Introduction to MPEG. Prof. Pratikgiri Goswami

Outline Introduction MPEG-2 MPEG-4. Video Compression. Introduction to MPEG. Prof. Pratikgiri Goswami to MPEG Prof. Pratikgiri Goswami Electronics & Communication Department, Shree Swami Atmanand Saraswati Institute of Technology, Surat. Outline of Topics 1 2 Coding 3 Video Object Representation Outline

More information

Week 14. Video Compression. Ref: Fundamentals of Multimedia

Week 14. Video Compression. Ref: Fundamentals of Multimedia Week 14 Video Compression Ref: Fundamentals of Multimedia Last lecture review Prediction from the previous frame is called forward prediction Prediction from the next frame is called forward prediction

More information

10.2 Video Compression with Motion Compensation 10.4 H H.263

10.2 Video Compression with Motion Compensation 10.4 H H.263 Chapter 10 Basic Video Compression Techniques 10.11 Introduction to Video Compression 10.2 Video Compression with Motion Compensation 10.3 Search for Motion Vectors 10.4 H.261 10.5 H.263 10.6 Further Exploration

More information

NEW CAVLC ENCODING ALGORITHM FOR LOSSLESS INTRA CODING IN H.264/AVC. Jin Heo, Seung-Hwan Kim, and Yo-Sung Ho

NEW CAVLC ENCODING ALGORITHM FOR LOSSLESS INTRA CODING IN H.264/AVC. Jin Heo, Seung-Hwan Kim, and Yo-Sung Ho NEW CAVLC ENCODING ALGORITHM FOR LOSSLESS INTRA CODING IN H.264/AVC Jin Heo, Seung-Hwan Kim, and Yo-Sung Ho Gwangju Institute of Science and Technology (GIST) 261 Cheomdan-gwagiro, Buk-gu, Gwangju, 500-712,

More information

DIGITAL TELEVISION 1. DIGITAL VIDEO FUNDAMENTALS

DIGITAL TELEVISION 1. DIGITAL VIDEO FUNDAMENTALS DIGITAL TELEVISION 1. DIGITAL VIDEO FUNDAMENTALS Television services in Europe currently broadcast video at a frame rate of 25 Hz. Each frame consists of two interlaced fields, giving a field rate of 50

More information

IMAGE COMPRESSION. Image Compression. Why? Reducing transportation times Reducing file size. A two way event - compression and decompression

IMAGE COMPRESSION. Image Compression. Why? Reducing transportation times Reducing file size. A two way event - compression and decompression IMAGE COMPRESSION Image Compression Why? Reducing transportation times Reducing file size A two way event - compression and decompression 1 Compression categories Compression = Image coding Still-image

More information

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING Dieison Silveira, Guilherme Povala,

More information

Introduction to Video Encoding

Introduction to Video Encoding Introduction to Video Encoding INF5063 23. September 2011 History of MPEG Motion Picture Experts Group MPEG1 work started in 1988, published by ISO in 1993 Part 1 Systems, Part 2 Video, Part 3 Audio, Part

More information

An Efficient Motion Estimation Method for H.264-Based Video Transcoding with Arbitrary Spatial Resolution Conversion

An Efficient Motion Estimation Method for H.264-Based Video Transcoding with Arbitrary Spatial Resolution Conversion An Efficient Motion Estimation Method for H.264-Based Video Transcoding with Arbitrary Spatial Resolution Conversion by Jiao Wang A thesis presented to the University of Waterloo in fulfillment of the

More information

Standard Codecs. Image compression to advanced video coding. Mohammed Ghanbari. 3rd Edition. The Institution of Engineering and Technology

Standard Codecs. Image compression to advanced video coding. Mohammed Ghanbari. 3rd Edition. The Institution of Engineering and Technology Standard Codecs Image compression to advanced video coding 3rd Edition Mohammed Ghanbari The Institution of Engineering and Technology Contents Preface to first edition Preface to second edition Preface

More information

VIDEO COMPRESSION STANDARDS

VIDEO COMPRESSION STANDARDS VIDEO COMPRESSION STANDARDS Family of standards: the evolution of the coding model state of the art (and implementation technology support): H.261: videoconference x64 (1988) MPEG-1: CD storage (up to

More information

High Efficiency Video Coding: The Next Gen Codec. Matthew Goldman Senior Vice President TV Compression Technology Ericsson

High Efficiency Video Coding: The Next Gen Codec. Matthew Goldman Senior Vice President TV Compression Technology Ericsson High Efficiency Video Coding: The Next Gen Codec Matthew Goldman Senior Vice President TV Compression Technology Ericsson High Efficiency Video Coding Compression Bitrate Targets Bitrate MPEG-2 VIDEO 1994

More information

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Video Transcoding Architectures and Techniques: An Overview IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Outline Background & Introduction Bit-rate Reduction Spatial Resolution

More information

Video Quality Analysis for H.264 Based on Human Visual System

Video Quality Analysis for H.264 Based on Human Visual System IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021 ISSN (p): 2278-8719 Vol. 04 Issue 08 (August. 2014) V4 PP 01-07 www.iosrjen.org Subrahmanyam.Ch 1 Dr.D.Venkata Rao 2 Dr.N.Usha Rani 3 1 (Research

More information

Objective: Introduction: To: Dr. K. R. Rao. From: Kaustubh V. Dhonsale (UTA id: ) Date: 04/24/2012

Objective: Introduction: To: Dr. K. R. Rao. From: Kaustubh V. Dhonsale (UTA id: ) Date: 04/24/2012 To: Dr. K. R. Rao From: Kaustubh V. Dhonsale (UTA id: - 1000699333) Date: 04/24/2012 Subject: EE-5359: Class project interim report Proposed project topic: Overview, implementation and comparison of Audio

More information

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson zhuyongxin@sjtu.edu.cn Basic Video Compression Techniques Chapter 10 10.1 Introduction to Video Compression

More information

Overview, implementation and comparison of Audio Video Standard (AVS) China and H.264/MPEG -4 part 10 or Advanced Video Coding Standard

Overview, implementation and comparison of Audio Video Standard (AVS) China and H.264/MPEG -4 part 10 or Advanced Video Coding Standard Multimedia Processing Term project Overview, implementation and comparison of Audio Video Standard (AVS) China and H.264/MPEG -4 part 10 or Advanced Video Coding Standard EE-5359 Class project Spring 2012

More information

Professor, CSE Department, Nirma University, Ahmedabad, India

Professor, CSE Department, Nirma University, Ahmedabad, India Bandwidth Optimization for Real Time Video Streaming Sarthak Trivedi 1, Priyanka Sharma 2 1 M.Tech Scholar, CSE Department, Nirma University, Ahmedabad, India 2 Professor, CSE Department, Nirma University,

More information

Lecture 13 Video Coding H.264 / MPEG4 AVC

Lecture 13 Video Coding H.264 / MPEG4 AVC Lecture 13 Video Coding H.264 / MPEG4 AVC Last time we saw the macro block partition of H.264, the integer DCT transform, and the cascade using the DC coefficients with the WHT. H.264 has more interesting

More information

Georgios Tziritas Computer Science Department

Georgios Tziritas Computer Science Department New Video Coding standards MPEG-4, HEVC Georgios Tziritas Computer Science Department http://www.csd.uoc.gr/~tziritas 1 MPEG-4 : introduction Motion Picture Expert Group Publication 1998 (Intern. Standardization

More information

EE 5359 Low Complexity H.264 encoder for mobile applications. Thejaswini Purushotham Student I.D.: Date: February 18,2010

EE 5359 Low Complexity H.264 encoder for mobile applications. Thejaswini Purushotham Student I.D.: Date: February 18,2010 EE 5359 Low Complexity H.264 encoder for mobile applications Thejaswini Purushotham Student I.D.: 1000-616 811 Date: February 18,2010 Fig 1: Basic coding structure for H.264 /AVC for a macroblock [1] .The

More information

Video Codecs. National Chiao Tung University Chun-Jen Tsai 1/5/2015

Video Codecs. National Chiao Tung University Chun-Jen Tsai 1/5/2015 Video Codecs National Chiao Tung University Chun-Jen Tsai 1/5/2015 Video Systems A complete end-to-end video system: A/D color conversion encoder decoder color conversion D/A bitstream YC B C R format

More information

Module 6 STILL IMAGE COMPRESSION STANDARDS

Module 6 STILL IMAGE COMPRESSION STANDARDS Module 6 STILL IMAGE COMPRESSION STANDARDS Lesson 19 JPEG-2000 Error Resiliency Instructional Objectives At the end of this lesson, the students should be able to: 1. Name two different types of lossy

More information

Introduction to Video Coding

Introduction to Video Coding Introduction to Video Coding o Motivation & Fundamentals o Principles of Video Coding o Coding Standards Special Thanks to Hans L. Cycon from FHTW Berlin for providing first-hand knowledge and much of

More information

Rate Distortion Optimization in Video Compression

Rate Distortion Optimization in Video Compression Rate Distortion Optimization in Video Compression Xue Tu Dept. of Electrical and Computer Engineering State University of New York at Stony Brook 1. Introduction From Shannon s classic rate distortion

More information

Introduction to Video Compression

Introduction to Video Compression Insight, Analysis, and Advice on Signal Processing Technology Introduction to Video Compression Jeff Bier Berkeley Design Technology, Inc. info@bdti.com http://www.bdti.com Outline Motivation and scope

More information

(Invited Paper) /$ IEEE

(Invited Paper) /$ IEEE IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 17, NO. 9, SEPTEMBER 2007 1103 Overview of the Scalable Video Coding Extension of the H.264/AVC Standard Heiko Schwarz, Detlev Marpe,

More information

Emerging H.26L Standard:

Emerging H.26L Standard: Emerging H.26L Standard: Overview and TMS320C64x Digital Media Platform Implementation White Paper UB Video Inc. Suite 400, 1788 west 5 th Avenue Vancouver, British Columbia, Canada V6J 1P2 Tel: 604-737-2426;

More information

Complexity Estimation of the H.264 Coded Video Bitstreams

Complexity Estimation of the H.264 Coded Video Bitstreams The Author 25. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org Advance Access published

More information

Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV

Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV Jeffrey S. McVeigh 1 and Siu-Wai Wu 2 1 Carnegie Mellon University Department of Electrical and Computer Engineering

More information

Module 7 VIDEO CODING AND MOTION ESTIMATION

Module 7 VIDEO CODING AND MOTION ESTIMATION Module 7 VIDEO CODING AND MOTION ESTIMATION Lesson 20 Basic Building Blocks & Temporal Redundancy Instructional Objectives At the end of this lesson, the students should be able to: 1. Name at least five

More information

COMPARATIVE ANALYSIS OF DIRAC PRO-VC-2, H.264 AVC AND AVS CHINA-P7

COMPARATIVE ANALYSIS OF DIRAC PRO-VC-2, H.264 AVC AND AVS CHINA-P7 COMPARATIVE ANALYSIS OF DIRAC PRO-VC-2, H.264 AVC AND AVS CHINA-P7 A Thesis Submitted to the College of Graduate Studies and Research In Partial Fulfillment of the Requirements For the Degree of Master

More information

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing ĐẠI HỌC QUỐC GIA TP.HỒ CHÍ MINH TRƯỜNG ĐẠI HỌC BÁCH KHOA KHOA ĐIỆN-ĐIỆN TỬ BỘ MÔN KỸ THUẬT ĐIỆN TỬ VIDEO AND IMAGE PROCESSING USING DSP AND PFGA Chapter 3: Video Processing 3.1 Video Formats 3.2 Video

More information

Transcoding from H.264/AVC to High Efficiency Video Coding (HEVC)

Transcoding from H.264/AVC to High Efficiency Video Coding (HEVC) EE5359 PROJECT PROPOSAL Transcoding from H.264/AVC to High Efficiency Video Coding (HEVC) Shantanu Kulkarni UTA ID: 1000789943 Transcoding from H.264/AVC to HEVC Objective: To discuss and implement H.265

More information

Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter

Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter Coding of Coefficients of two-dimensional non-separable Adaptive Wiener Interpolation Filter Y. Vatis, B. Edler, I. Wassermann, D. T. Nguyen and J. Ostermann ABSTRACT Standard video compression techniques

More information

ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS

ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS Ye-Kui Wang 1, Miska M. Hannuksela 2 and Moncef Gabbouj 3 1 Tampere International Center for Signal Processing (TICSP), Tampere,

More information

Audio and video compression

Audio and video compression Audio and video compression 4.1 introduction Unlike text and images, both audio and most video signals are continuously varying analog signals. Compression algorithms associated with digitized audio and

More information

Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding.

Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding. Project Title: Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding. Midterm Report CS 584 Multimedia Communications Submitted by: Syed Jawwad Bukhari 2004-03-0028 About

More information

Lecture 5: Error Resilience & Scalability

Lecture 5: Error Resilience & Scalability Lecture 5: Error Resilience & Scalability Dr Reji Mathew A/Prof. Jian Zhang NICTA & CSE UNSW COMP9519 Multimedia Systems S 010 jzhang@cse.unsw.edu.au Outline Error Resilience Scalability Including slides

More information

White paper: Video Coding A Timeline

White paper: Video Coding A Timeline White paper: Video Coding A Timeline Abharana Bhat and Iain Richardson June 2014 Iain Richardson / Vcodex.com 2007-2014 About Vcodex Vcodex are world experts in video compression. We provide essential

More information

IMPLEMENTATION AND ANALYSIS OF DIRECTIONAL DISCRETE COSINE TRANSFORM IN H.264 FOR BASELINE PROFILE SHREYANKA SUBBARAYAPPA

IMPLEMENTATION AND ANALYSIS OF DIRECTIONAL DISCRETE COSINE TRANSFORM IN H.264 FOR BASELINE PROFILE SHREYANKA SUBBARAYAPPA IMPLEMENTATION AND ANALYSIS OF DIRECTIONAL DISCRETE COSINE TRANSFORM IN H.264 FOR BASELINE PROFILE by SHREYANKA SUBBARAYAPPA Presented to the Faculty of the Graduate School of The University of Texas at

More information

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation 2009 Third International Conference on Multimedia and Ubiquitous Engineering A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation Yuan Li, Ning Han, Chen Chen Department of Automation,

More information

Mark Kogan CTO Video Delivery Technologies Bluebird TV

Mark Kogan CTO Video Delivery Technologies Bluebird TV Mark Kogan CTO Video Delivery Technologies Bluebird TV Bluebird TV Is at the front line of the video industry s transition to the cloud. Our multiscreen video solutions and services, which are available

More information

RECOMMENDATION ITU-R BT

RECOMMENDATION ITU-R BT Rec. ITU-R BT.1687-1 1 RECOMMENDATION ITU-R BT.1687-1 Video bit-rate reduction for real-time distribution* of large-screen digital imagery applications for presentation in a theatrical environment (Question

More information

A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION

A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION Yi-Hau Chen, Tzu-Der Chuang, Chuan-Yung Tsai, Yu-Jen Chen, and Liang-Gee Chen DSP/IC Design Lab., Graduate Institute

More information

Part 1 of 4. MARCH

Part 1 of 4. MARCH Presented by Brought to You by Part 1 of 4 MARCH 2004 www.securitysales.com A1 Part1of 4 Essentials of DIGITAL VIDEO COMPRESSION By Bob Wimmer Video Security Consultants cctvbob@aol.com AT A GLANCE Compression

More information

Video Compression Standards (II) A/Prof. Jian Zhang

Video Compression Standards (II) A/Prof. Jian Zhang Video Compression Standards (II) A/Prof. Jian Zhang NICTA & CSE UNSW COMP9519 Multimedia Systems S2 2009 jzhang@cse.unsw.edu.au Tutorial 2 : Image/video Coding Techniques Basic Transform coding Tutorial

More information

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased platforms Damian Karwowski, Marek Domański Poznan University of Technology, Chair of Multimedia Telecommunications and Microelectronics

More information

Video Encoding with. Multicore Processors. March 29, 2007 REAL TIME HD

Video Encoding with. Multicore Processors. March 29, 2007 REAL TIME HD Video Encoding with Multicore Processors March 29, 2007 Video is Ubiquitous... Demand for Any Content Any Time Any Where Resolution ranges from 128x96 pixels for mobile to 1920x1080 pixels for full HD

More information

Wireless Communication

Wireless Communication Wireless Communication Systems @CS.NCTU Lecture 6: Image Instructor: Kate Ching-Ju Lin ( 林靖茹 ) Chap. 9 of Fundamentals of Multimedia Some reference from http://media.ee.ntu.edu.tw/courses/dvt/15f/ 1 Outline

More information

Lecture 6: Compression II. This Week s Schedule

Lecture 6: Compression II. This Week s Schedule Lecture 6: Compression II Reading: book chapter 8, Section 1, 2, 3, 4 Monday This Week s Schedule The concept behind compression Rate distortion theory Image compression via DCT Today Speech compression

More information

Video Coding Using Spatially Varying Transform

Video Coding Using Spatially Varying Transform Video Coding Using Spatially Varying Transform Cixun Zhang 1, Kemal Ugur 2, Jani Lainema 2, and Moncef Gabbouj 1 1 Tampere University of Technology, Tampere, Finland {cixun.zhang,moncef.gabbouj}@tut.fi

More information

STUDY AND IMPLEMENTATION OF VIDEO COMPRESSION STANDARDS (H.264/AVC, DIRAC)

STUDY AND IMPLEMENTATION OF VIDEO COMPRESSION STANDARDS (H.264/AVC, DIRAC) STUDY AND IMPLEMENTATION OF VIDEO COMPRESSION STANDARDS (H.264/AVC, DIRAC) EE 5359-Multimedia Processing Spring 2012 Dr. K.R Rao By: Sumedha Phatak(1000731131) OBJECTIVE A study, implementation and comparison

More information

A real-time SNR scalable transcoder for MPEG-2 video streams

A real-time SNR scalable transcoder for MPEG-2 video streams EINDHOVEN UNIVERSITY OF TECHNOLOGY Department of Mathematics and Computer Science A real-time SNR scalable transcoder for MPEG-2 video streams by Mohammad Al-khrayshah Supervisors: Prof. J.J. Lukkien Eindhoven

More information

Introduction of Video Codec

Introduction of Video Codec Introduction of Video Codec Min-Chun Hu anita_hu@mail.ncku.edu.tw MISLab, R65601, CSIE New Building 3D Augmented Reality and Interactive Sensor Technology, 2015 Fall The Need for Video Compression High-Definition

More information

H.264/AVC BASED NEAR LOSSLESS INTRA CODEC USING LINE-BASED PREDICTION AND MODIFIED CABAC. Jung-Ah Choi, Jin Heo, and Yo-Sung Ho

H.264/AVC BASED NEAR LOSSLESS INTRA CODEC USING LINE-BASED PREDICTION AND MODIFIED CABAC. Jung-Ah Choi, Jin Heo, and Yo-Sung Ho H.264/AVC BASED NEAR LOSSLESS INTRA CODEC USING LINE-BASED PREDICTION AND MODIFIED CABAC Jung-Ah Choi, Jin Heo, and Yo-Sung Ho Gwangju Institute of Science and Technology {jachoi, jinheo, hoyo}@gist.ac.kr

More information

Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000

Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000 Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000 EE5359 Multimedia Processing Project Proposal Spring 2013 The University of Texas at Arlington Department of Electrical

More information

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework System Modeling and Implementation of MPEG-4 Encoder under Fine-Granular-Scalability Framework Literature Survey Embedded Software Systems Prof. B. L. Evans by Wei Li and Zhenxun Xiao March 25, 2002 Abstract

More information

STACK ROBUST FINE GRANULARITY SCALABLE VIDEO CODING

STACK ROBUST FINE GRANULARITY SCALABLE VIDEO CODING Journal of the Chinese Institute of Engineers, Vol. 29, No. 7, pp. 1203-1214 (2006) 1203 STACK ROBUST FINE GRANULARITY SCALABLE VIDEO CODING Hsiang-Chun Huang and Tihao Chiang* ABSTRACT A novel scalable

More information

H.264 STANDARD BASED SIDE INFORMATION GENERATION IN WYNER-ZIV CODING

H.264 STANDARD BASED SIDE INFORMATION GENERATION IN WYNER-ZIV CODING H.264 STANDARD BASED SIDE INFORMATION GENERATION IN WYNER-ZIV CODING SUBRAHMANYA MAIRA VENKATRAV Supervising Professor: Dr. K. R. Rao 1 TABLE OF CONTENTS 1. Introduction 1.1. Wyner-Ziv video coding 1.2.

More information

Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding

Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding Jung-Ah Choi and Yo-Sung Ho Gwangju Institute of Science and Technology (GIST) 261 Cheomdan-gwagiro, Buk-gu, Gwangju, 500-712, Korea

More information

Video Coding in H.26L

Video Coding in H.26L Royal Institute of Technology MASTER OF SCIENCE THESIS Video Coding in H.26L by Kristofer Dovstam April 2000 Work done at Ericsson Radio Systems AB, Kista, Sweden, Ericsson Research, Department of Audio

More information

Scalable Extension of HEVC 한종기

Scalable Extension of HEVC 한종기 Scalable Extension of HEVC 한종기 Contents 0. Overview for Scalable Extension of HEVC 1. Requirements and Test Points 2. Coding Gain/Efficiency 3. Complexity 4. System Level Considerations 5. Related Contributions

More information

JPEG 2000 vs. JPEG in MPEG Encoding

JPEG 2000 vs. JPEG in MPEG Encoding JPEG 2000 vs. JPEG in MPEG Encoding V.G. Ruiz, M.F. López, I. García and E.M.T. Hendrix Dept. Computer Architecture and Electronics University of Almería. 04120 Almería. Spain. E-mail: vruiz@ual.es, mflopez@ace.ual.es,

More information

Smoooth Streaming over wireless Networks Sreya Chakraborty Final Report EE-5359 under the guidance of Dr. K.R.Rao

Smoooth Streaming over wireless Networks Sreya Chakraborty Final Report EE-5359 under the guidance of Dr. K.R.Rao Smoooth Streaming over wireless Networks Sreya Chakraborty Final Report EE-5359 under the guidance of Dr. K.R.Rao 28th April 2011 LIST OF ACRONYMS AND ABBREVIATIONS AVC: Advanced Video Coding DVD: Digital

More information

Digital Image Processing

Digital Image Processing Digital Image Processing Fundamentals of Image Compression DR TANIA STATHAKI READER (ASSOCIATE PROFFESOR) IN SIGNAL PROCESSING IMPERIAL COLLEGE LONDON Compression New techniques have led to the development

More information

Ch. 4: Video Compression Multimedia Systems

Ch. 4: Video Compression Multimedia Systems Ch. 4: Video Compression Multimedia Systems Prof. Ben Lee (modified by Prof. Nguyen) Oregon State University School of Electrical Engineering and Computer Science 1 Outline Introduction MPEG Overview MPEG

More information

Performance Analysis of DIRAC PRO with H.264 Intra frame coding

Performance Analysis of DIRAC PRO with H.264 Intra frame coding Performance Analysis of DIRAC PRO with H.264 Intra frame coding Presented by Poonam Kharwandikar Guided by Prof. K. R. Rao What is Dirac? Hybrid motion-compensated video codec developed by BBC. Uses modern

More information

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Course Presentation Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Video Coding Correlation in Video Sequence Spatial correlation Similar pixels seem

More information

TRANSCODING OF H264 BITSTREAM TO MPEG 2 BITSTREAM. Dr. K.R.Rao Supervising Professor. Dr. Zhou Wang. Dr. Soontorn Oraintara

TRANSCODING OF H264 BITSTREAM TO MPEG 2 BITSTREAM. Dr. K.R.Rao Supervising Professor. Dr. Zhou Wang. Dr. Soontorn Oraintara TRANSCODING OF H264 BITSTREAM TO MPEG 2 BITSTREAM The members of the Committee approve the master s thesis of Sreejana Sharma Dr. K.R.Rao Supervising Professor Dr. Zhou Wang Dr. Soontorn Oraintara Copyright

More information

Lossless Compression of Speech and Audio Signals, and Its Application ( 音声 音響信号の可逆符号化とその応用に関する研究 )

Lossless Compression of Speech and Audio Signals, and Its Application ( 音声 音響信号の可逆符号化とその応用に関する研究 ) Lossless Compression of Speech and Audio Signals, and Its Application ( 音声 音響信号の可逆符号化とその応用に関する研究 ) システム情報工学研究科コンピュータサイエンス専攻 原田登 2017 年 3 月 Abstract Speech and audio coding technologies have broad application,

More information