Advanced Video Coding: The new H.264 video compression standard

August 2003

1. Introduction

Video compression (video coding), the process of compressing moving images to save storage space and transmission bandwidth, is essential to a wide range of multimedia applications including DVD-Video, digital television, videoconferencing, mobile video and video streaming. Video compression technology has developed hand-in-hand with a series of key international standards, including MPEG-1 (video compression for CD-based playback), MPEG-2 (the heart of digital TV and DVD-Video), H.263 (video compression for conferencing applications) and MPEG-4 Visual (general-purpose video compression). The latest international standard for video compression is Advanced Video Coding (AVC), jointly published by the ITU-T as Recommendation H.264 and by ISO/IEC as MPEG-4 Part 10. The H.264/AVC standard builds on the legacy of earlier standards and supports highly efficient and robust compression of digital video information. This article gives a brief introduction to the structure, features and performance of H.264.

2. The Video CODEC

In common with earlier video coding standards, H.264/AVC does not specify how to compress (encode) video. Instead, the standard specifies the syntax of a bitstream containing coded video data and a method of decoding the data. The actual encoder design is left to the developer's discretion. In practice, a standard-compliant video encoder and decoder (CODEC) is likely to include the functions shown in Figure 1: prediction, transform and quantization, and entropy encoding at the encoder, mirrored by entropy decoding, rescaling and inverse transform, prediction and reconstruction at the decoder, with both sides maintaining a store of previously coded frames.

Figure 1 Basic structure of H.264 CODEC
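The lossy heart of the Figure 1 loop is the transform/quantize forward path and its rescale/inverse-transform mirror. The following is a minimal sketch of the forward side, using H.264's 4x4 integer core transform matrix; the per-coefficient scaling that the standard folds into quantization is omitted for clarity, and the fixed step size is an illustrative stand-in for H.264's quantization parameter, not the standard's actual quantiser design.

```python
# H.264's 4x4 core transform matrix: an integer approximation of the
# DCT (the per-coefficient scaling that the standard folds into the
# quantizer is omitted here).
CF = [[1,  1,  1,  1],
      [2,  1, -1, -2],
      [1, -1, -1,  1],
      [1, -2,  2, -1]]

def matmul(a, b):
    """4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_transform(x):
    """Core transform W = CF . X . CF^T applied to a 4x4 residual block."""
    ct = [list(row) for row in zip(*CF)]            # CF transposed
    return matmul(matmul(CF, x), ct)

def quantize(w, step):
    """Divide each coefficient by a step size and round: the lossy step
    that discards less-significant information."""
    return [[round(c / step) for c in row] for row in w]

def rescale(z, step):
    """Decoder-side rescaling (inverse quantization)."""
    return [[q * step for q in row] for row in z]

# A flat 4x4 residual block concentrates all its energy in the DC
# coefficient after the transform.
x = [[1] * 4 for _ in range(4)]
w = forward_transform(x)        # w[0][0] == 16, all other entries 0
```

Larger step sizes discard more information, which is how an encoder trades bitrate against decoded quality.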
A video frame is processed in units of a macroblock, each corresponding to a 16x16 pixel region of the displayed frame. H.264 supports coding of YCbCr video data, in which a macroblock consists of 16x16 luminance (greyscale) samples, 8x8 blue chroma (colour difference) samples and 8x8 red chroma samples. A prediction is formed for each macroblock (or part of a macroblock) based on data that has been previously coded in the same frame (intra prediction) or in other video frames (inter prediction using motion estimation and compensation), and this prediction is subtracted from the current macroblock. The residual data produced by this subtraction are transformed into a spatial frequency domain and quantized to remove less-significant information. The quantized values (together with header data) are compressed using an entropy encoder to form a coded bitstream.

At the decoder, the process is reversed: the bitstream is decompressed, the values are rescaled (inverse quantized) and inverse transformed, a prediction is formed and added to the decoded residual, and a decoded video frame is produced. The coded bitstream occupies a significantly smaller number of bits than the original, uncompressed video sequence. This reduction in bitrate (compression) comes at the cost of decoded image quality: typically, higher compression ratios lead to a reduction in image quality at the output of the decoder.

The basic structure described above is found in all of the main standards for video coding. What makes H.264 different from previous standards is that the coding process is optimised to give better compression performance (better image quality for the same compression ratio, or higher compression for the same image quality). The H.264 compressed syntax is designed to support effective, robust transmission over a range of network channels.

3. H.264 Features

H.264 incorporates a number of features that help to provide efficient compression and effective transport.
These features are grouped together in Profiles, each defining a set of coding features that should be supported by an encoder or decoder. Three Profiles have been defined: Baseline, Main and Extended.

Intra prediction: In an intra-coded video frame (compressed without any prediction from other frames), each 16x16 macroblock or 4x4 block of image samples is predicted from neighbouring, previously-coded samples. The encoder chooses one of a number of spatial prediction modes and attempts to find a good match for the current 16x16 or 4x4 block. Intra prediction significantly improves the compression of intra-coded frames.

Tree-structured motion compensation: Macroblocks in an inter-coded frame are predicted from previously coded frames using motion compensation. In H.264, each 16x16 macroblock may be subdivided into smaller blocks for motion compensation (16x8, 8x16, 8x8, 4x8, 8x4 or 4x4 luma samples together with corresponding chroma samples). Each of these smaller blocks (macroblock partitions or sub-macroblock partitions) is predicted separately using motion compensated prediction. A motion vector for each block is sent to the decoder along with the coded residual data. Small motion compensation block sizes are particularly effective for motion compensated prediction of complex areas of a video scene. Figure 2 shows a residual (difference) frame, created by subtracting a previous frame from the current video frame, with the choice of motion compensation block sizes superimposed. A large 16x16 block size is chosen for unchanging background areas of the scene whilst smaller block sizes are applied to parts of the scene where there is significant movement and/or detail.
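Conceptually, the encoder obtains each partition's motion vector by searching a previously coded frame for the region that best matches the current block. The sketch below uses an exhaustive full search with the sum of absolute differences (SAD) as the matching cost; the search range and cost metric are illustrative choices, since the standard leaves motion estimation entirely to the encoder designer.

```python
def sad(block, ref, bx, by):
    """Sum of absolute differences between `block` and the region of
    `ref` whose top-left corner is (bx, by)."""
    n = len(block)
    return sum(abs(block[y][x] - ref[by + y][bx + x])
               for y in range(n) for x in range(n))

def motion_search(block, ref, x0, y0, search_range=2):
    """Full search: try every candidate offset within +/-search_range
    of the block's own position (x0, y0) in the reference frame and
    keep the lowest-SAD match. Returns the motion vector (dx, dy)."""
    n = len(block)
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            bx, by = x0 + dx, y0 + dy
            if 0 <= bx <= len(ref[0]) - n and 0 <= by <= len(ref) - n:
                cost = sad(block, ref, bx, by)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best[1], best[2]
```

Real encoders use faster search patterns and sub-pixel refinement, but the cost-minimising principle is the same; smaller partitions simply run this search on smaller blocks.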
Figure 2 Residual showing motion compensation block sizes

Multiple prediction references: H.264 enables an encoder to choose a prediction reference frame for each macroblock (or each 16x8, 8x16 or 8x8 partition of a macroblock) from one of a number of previously coded frames, before or after the current frame in display order. The encoder can pick the best prediction reference for each individual macroblock or partition (i.e. a reference frame that minimises the compressed data size). Using so-called P prediction, each macroblock or block is predicted from one previously coded frame. Using B prediction (available in the Main and Extended Profiles), each macroblock or block is predicted from one or two reference frames.

Transform and quantization: Residual data are transformed using a 4x4 DCT-like transform that can be implemented using integer arithmetic. Quantization is integrated with the transform stage to minimise the number of multiplications.

Entropy coding: Quantized transform coefficients and side information (including motion vectors, prediction mode choices and headers) are entropy coded using variable-length codes (all Profiles) or arithmetic coding (Main Profile only). If variable-length coding is used, quantized transform coefficients are coded using a context-adaptive scheme (CAVLC) and other syntax elements are coded with universal variable-length codes. The Main Profile includes the option of Context Adaptive Binary Arithmetic Coding (CABAC), which offers better compression performance at the expense of a potential increase in complexity.

De-blocking filter: At high compression ratios, decoded frames tend to exhibit obvious blocking distortion. The standard describes a filter that reduces blocking distortion whilst preserving genuine image features.
The filter is part of the encoding and decoding loop and so can increase compression performance (by improving the quality of prediction reference frames) as well as enhancing the subjective quality of decoded video. An example of an uncoded video frame is shown in Figure 3; Figure 4 shows the frame after coding and decoding with H.264 (without any filtering) and Figure 5 shows the result of applying the deblocking filter.
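The idea behind the filter can be illustrated with a much-simplified one-dimensional example; this is not the standard's adaptive filter, which varies its strength with coding mode and quantizer, but it shows the principle: samples p1, p0 on one side of a block boundary and q0, q1 on the other are blended only when the step across the boundary is small enough to be a likely quantization artefact.

```python
def smooth_edge(p1, p0, q0, q1, threshold=8):
    """Simplified, illustrative edge filter (NOT the standard's
    adaptive filter). If the step across the block boundary is small
    (likely a quantization artefact rather than a real edge), blend
    the boundary samples p0 and q0; large steps are left untouched so
    genuine image features are preserved."""
    if abs(p0 - q0) < threshold:
        p0f = (2 * p1 + p0 + q0 + 2) // 4   # integer blend, rounded
        q0f = (2 * q1 + q0 + p0 + 2) // 4
        return p0f, q0f
    return p0, q0
```

A small discontinuity (100 vs 104) is softened, while a genuine edge (100 vs 150) passes through unchanged.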
Figure 3 Original frame
Figure 4 Decoded (no filter)
Figure 5 Decoded (filtered)

Transport: Practical video coding applications involve transport or storage of compressed video, and H.264 includes a number of features that improve transmission effectiveness and robustness. Coded video data and headers are grouped into Network Abstraction Layer (NAL) units, each containing a set of headers or compressed video data. NAL units are well-suited to packetised transport schemes such as the Internet Protocol (IP). Features such as Data Partitioning (sending different components of coded video data in separate NAL units) and slice groups (flexible ordering of macroblocks within a coded frame) can help a decoder to recover from transmission errors. Switching I and P slices (SI/SP slices) provide support for efficient switching between multiple coded video sequences, useful for video streaming applications.

4. Performance

Figure 6 compares the performance of H.264 with the well-known video coding standards MPEG-2 (Video) and MPEG-4 (Visual). The four images each show part of a video frame from a sequence captured at a resolution of 352x288 pixels (Common Intermediate Format, CIF) and a frame rate of 25 frames per second. The original is at the top-left of the Figure and each of the other images shows the same frame after compression and decompression. In each case, the compressed bitrate is 150kbits per second. After compression and decompression using MPEG-2 (top-right), the frame is heavily distorted. With MPEG-4 Visual compression (Simple Profile; lower-left), the frame is clearer but there are still some visible artefacts. H.264 compression (Baseline Profile; lower-right) produces clearly better quality, with no
obvious artefacts and only a slight loss of detail (for example, the texture of the table is not as clear as in the original).

Figure 6 Quality comparison (CIF, 25 frames per second): Original (top-left), MPEG-2 at 150kbps (top-right), MPEG-4 Visual at 150kbps (lower-left), H.264 at 150kbps (lower-right)

Figure 7 plots compressed bitrate (x-axis) against Peak Signal to Noise Ratio, PSNR (y-axis). PSNR is a widely-used measure of visual quality; a larger value of PSNR indicates higher decoded quality. The source video sequence is the same as the example shown in Figure 6. At every bitrate, the quality of the sequence compressed using H.264 is higher than that of the MPEG-4 sequence.
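PSNR is derived from the mean squared error (MSE) between original and decoded samples. A minimal sketch for 8-bit samples, treating each frame as a flat sequence of luminance values:

```python
import math

def psnr(original, decoded, max_value=255):
    """Peak Signal to Noise Ratio in dB between two equal-length
    sequences of 8-bit samples: 10 * log10(MAX^2 / MSE). A higher
    PSNR means the decoded frame is closer to the original."""
    mse = sum((o - d) ** 2 for o, d in zip(original, decoded)) / len(original)
    if mse == 0:
        return float('inf')   # identical frames: no distortion
    return 10 * math.log10(max_value ** 2 / mse)
```

Plotting PSNR against bitrate, as in Figure 7, gives the rate-distortion curve commonly used to compare codecs.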
Figure 7 Rate-distortion comparison (CIF, 25 frames per second): luminance PSNR (dB) versus bitrate (bps) for H.264 and MPEG-4 Visual

5. Conclusions

H.264/AVC, finalised and standardised in 2003, can clearly out-perform older video coding standards such as MPEG-2 and MPEG-4 Visual. The new standard uses the well-tried structure of prediction, transform coding and entropy coding but optimises each stage to achieve excellent compression performance. The H.264 syntax is designed to support flexible and robust transport over a range of networks. The benefits of H.264 come at the price of increased computational complexity (compared with previous coding standards). Despite the increased processing power required to implement H.264, the standard is a strong contender to provide the video compression technology for the next generation of multimedia applications.

Further reading

1. H.264 and MPEG-4 Video Compression, John Wiley & Sons, 2003.
2. www.itu.int: International Telecommunication Union (ITU), publishers of the H.264 standard.
3. http://bs.hhi.de/~wiegand/publications.html: several papers co-authored by Thomas Wiegand, including an overview of H.264/AVC.