Lecture 10 Video Coding Cascade Transforms H264, Wavelets

H.264 features different block sizes, including a so-called macro block, which can be seen in the following picture (from: Al Bovik, Ed., "The Essential Guide to Video Processing", 2009). Macro blocks have a size of 16x16 samples and can be subdivided, as can be seen in the picture. H.264 also offers different transform block sizes, starting with 8x8 transforms, which can be divided into smaller blocks, down to the 4x4 transforms for which we saw the integer transform last time. Macro blocks are used for motion estimation and common coding.

For the common coding, assume we have a 16x16 macro block and 16 4x4 transforms in the macro block. The 16 DC coefficients of these transforms are collected into a new 4x4 block, which is then transformed again, but this time with a Walsh-Hadamard transform (WHT) instead of the integer DCT. The integer DCT was also tried, but it was found to have no advantage over the WHT for these DC coefficients, whereas the WHT is simpler to implement and leads to smaller subband coefficients, which need fewer bits (see: H. Malvar et al., "Low Complexity Transform and Quantization in H.264/AVC", IEEE Trans. on Circuits and Systems for Video Technology, July 2003). This structure can be seen in the following picture (from: Richardson, "H.264 / MPEG-4 Part 10 White Paper", www.vcodex.com, 2003).
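The cascade described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the matrix `H` is assumed to be the 4x4 integer core transform from last time, `W` is a 4x4 Walsh-Hadamard matrix, the helper names (`transform2d`, `dc_cascade`) are made up for this sketch, and the scaling and quantization steps of the real codec are left out.

```python
# Assumed 4x4 integer core transform of H.264 (scaling/quantization omitted).
H = [[1, 1, 1, 1],
     [2, 1, -1, -2],
     [1, -1, -1, 1],
     [1, -2, 2, -1]]
# 4x4 Walsh-Hadamard matrix for the second stage.
W = [[1, 1, 1, 1],
     [1, 1, -1, -1],
     [1, -1, -1, 1],
     [1, -1, 1, -1]]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def transform2d(t, x):
    """Separable 2D transform T * X * T^T (columns, then rows)."""
    return matmul(matmul(t, x), transpose(t))

def dc_cascade(macro_block):
    """Transform the 16 4x4 blocks of a 16x16 macro block, collect their
    DC coefficients into a 4x4 block, and transform that block with the WHT."""
    dc_block = [[0] * 4 for _ in range(4)]
    for bi in range(4):
        for bj in range(4):
            block = [row[4 * bj:4 * bj + 4]
                     for row in macro_block[4 * bi:4 * bi + 4]]
            dc_block[bi][bj] = transform2d(H, block)[0][0]
    return dc_block, transform2d(W, dc_block)

# Hypothetical test input: a flat macro block with all samples equal to 10.
macro_block = [[10] * 16 for _ in range(16)]
dc_block, second_stage = dc_cascade(macro_block)
print(dc_block[0])       # DC coefficients of the first row of 4x4 blocks
print(second_stage[0])   # first row after the WHT stage
```

For a flat macro block all 16 DC coefficients are equal, so the second stage concentrates them into a single nonzero coefficient, which is the redundancy the cascade is meant to remove.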

Dynamic range of values after the transform (important for the fixed word length integer arithmetic):

Assume we have an input signal with a maximum magnitude of A, for instance an image with brightness levels up to A (for the worst case this is the maximum value). Then we have a signal vector x containing the values +-A (for instance for the chrominance values, which can also be negative), which is multiplied onto the transform matrix from the right-hand side. If we take the transform matrix H from last time, and if one column of x has for instance the values [A, A, -A, -A] (again as a worst case), then the multiplication with the second row of H, [2, 1, -1, -2], results in 2A + A + A + 2A = 6A. This is also the maximum value for this matrix. If we then also transform the rows of our image, we get a maximum value of 6*6A = 36A. This means that the dynamic range of our subband coefficients increases by log2(36) = 5.17 bits compared to the dynamic range of the original image (we need 5.17 bits, or in practice 6 bits, more for our fixed word length integer arithmetic). This is an overhead which we need to provide in our coding signal processor. It is also the reason why we wanted to have this factor as small as possible.

For the inverse matrix, in the decoder, the factor is somewhat smaller. Here we get a factor of 4, leading to a factor of 16 for rows and columns, and hence 4 additional bits for the dynamic range of the decoded subband samples. Observe that this also means a reduced (maximum) information content in the subband signals, which is the result of the quantization in the encoder. These effects become important if we want to implement our algorithm in integer arithmetic with limited word length. H.264 is designed such that it can be implemented with 16-bit arithmetic (in encoder and decoder), which enables implementation in cheap hardware.
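The worst-case argument for the forward transform can be checked numerically. The following sketch assumes that H is the 4x4 integer core transform from last time, and uses A = 255 only as an example value; the function names are made up for this illustration.

```python
import math

# Assumed 4x4 integer core transform of H.264.
H = [[1, 1, 1, 1],
     [2, 1, -1, -2],
     [1, -1, -1, 1],
     [1, -2, 2, -1]]

def worst_case_gain(t):
    """Largest sum of absolute values over the rows of t: the largest output
    magnitude for inputs of magnitude 1 with the worst sign pattern."""
    return max(sum(abs(c) for c in row) for row in t)

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

gain_1d = worst_case_gain(H)        # one dimension (columns only)
gain_2d = gain_1d * gain_1d         # rows and columns
extra_bits = math.log2(gain_2d)     # growth of the dynamic range in bits
print(gain_1d, gain_2d, round(extra_bits, 2), math.ceil(extra_bits))

# Worst-case 4x4 block: sign pattern [1, 1, -1, -1] in both directions.
A = 255                             # example value for the maximum magnitude
s = [1, 1, -1, -1]
x = [[A * s[i] * s[j] for j in range(4)] for i in range(4)]
h_t = [list(col) for col in zip(*H)]
y = matmul(matmul(H, x), h_t)       # 2D transform H * X * H^T
peak = max(abs(v) for row in y for v in row)
print(peak, peak // A)              # largest coefficient and its ratio to A
```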

Wavelet Approaches

Back to the cascaded transform. The collection of DC coefficients into a new block, followed by a further transform, can also be seen as a tree-structured subband decomposition:

(Figure: tree-structured analysis filter bank. Left, DCT stage: the input x is filtered by h1 to h4, giving the DC coefficients. Right, WHT stage: the DC coefficients are filtered again by h1 to h4, giving the split DC coefficients.)

This is the analysis filter bank structure for the encoder. For the decoder we need the synthesis filter bank, which is the reverse structure with upsamplers instead of downsamplers. This particular structure is used in H.264, but similar structures can be found in other coders. This cascaded, tree-structured subband decomposition is also called a Discrete Wavelet Transform (DWT).

Another type of wavelet is used in JPEG 2000, which is an image coder, but whose algorithm is also used for video in the manner of Motion JPEG. Motion JPEG is a video coder which does not use motion compensation, but encodes each frame individually as a JPEG image (see also http://en.wikipedia.org/wiki/motion_jpeg). This is used e.g. in digital cameras. A sequence of JPEG 2000 encoded pictures is used, for instance, for digital cinema (http://en.wikipedia.org/wiki/digital_cinema).

The equivalent DCT and WHT filters are not particularly good filters, because they are only as long as the number of subbands we use. To solve this problem, longer wavelet filters were designed, most often for the 2-band case, where we only have 2 subbands, which are then cascaded. The so-called Daubechies (9,7) filter pair uses an analysis low-pass filter with an impulse response of length 9:

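(Figure: impulse response of the analysis low-pass filter.)

The corresponding frequency response is:

(Figure: frequency response of the analysis low-pass filter.)

The analysis high-pass filter impulse response has length 7:

(Figure: impulse response of the analysis high-pass filter.)

The corresponding frequency response is:

(Figure: frequency response of the analysis high-pass filter.)

What is interesting here is that the high pass has a very high attenuation around DC. This is important for images because most of their energy is concentrated there, and in this way we avoid "crosstalk" of this energy into the higher subband. During filter design this is obtained by placing as many zeros as possible at frequency zero. This can also be seen in the pole-zero plot of the analysis high pass.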
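The attenuation at DC can be checked numerically. The sketch below is an illustration under assumptions: since the plots are not reproduced here, it uses the coefficients commonly tabulated for the (9,7) analysis pair in the JPEG 2000 context, normalized so that the low pass has gain 1 at DC and the high pass has gain 2 at the Nyquist frequency. The filters in the lecture figures may be scaled differently.

```python
import cmath
import math

# Assumed (9,7) analysis filter coefficients (common JPEG 2000 normalization).
LOW_PASS = [0.026748757411, -0.016864118443, -0.078223266529,
            0.266864118443, 0.602949018236, 0.266864118443,
            -0.078223266529, -0.016864118443, 0.026748757411]
HIGH_PASS = [0.091271763114, -0.057543526229, -0.591271763114,
             1.115087052457, -0.591271763114, -0.057543526229,
             0.091271763114]

def freq_response(h, omega):
    """Frequency response H(e^{j*omega}) of the FIR filter with taps h."""
    return sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(h))

lp_dc = abs(freq_response(LOW_PASS, 0.0))
lp_nyquist = abs(freq_response(LOW_PASS, math.pi))
hp_dc = abs(freq_response(HIGH_PASS, 0.0))
hp_nyquist = abs(freq_response(HIGH_PASS, math.pi))

print("low pass :  |H(0)| = %.6f, |H(pi)| = %.6f" % (lp_dc, lp_nyquist))
print("high pass:  |H(0)| = %.6f, |H(pi)| = %.6f" % (hp_dc, hp_nyquist))

# Magnitude of the high pass at a few low frequencies, to see how slowly
# it rises away from DC.
for frac in (0.05, 0.1, 0.2):
    print(frac, abs(freq_response(HIGH_PASS, frac * math.pi)))
```

With these coefficients the high-pass response vanishes at DC and the low-pass response vanishes at the Nyquist frequency, up to the rounding of the tabulated values.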

(Figure: pole-zero plot (z-domain, see also lecture "Advanced Digital Signal Processing" in Moodle2, slides 7, no password required) of the transfer function of the Daubechies analysis high pass.)

This type of wavelet filter is, for this reason, also called "maximally flat" (which refers to the low-pass filter: good attenuation of high frequencies means a relatively smooth impulse response). We see that the impulse responses are symmetric around their center. This is connected to the zeros in the pole-zero diagram being conjugate reciprocal with respect to the unit circle, one at 1/3, the other at 3 (see also lecture Advanced Digital Signal Processing, slides 8 and 9). This leads to linear-phase filters, which means that their group delay (the negative derivative of the phase with respect to frequency, see ADSP) is the same for all frequencies. This is important for edges in the image, which contain many frequencies that need to stay together to obtain sharp edges after filtering.

Using this 2-band filter bank, we can build a tree structure to obtain higher frequency resolution at low frequencies, as can be seen in the following picture.

(Figure: 2x2-band analysis for images, filtering along rows and along columns; a marked filter symbol denotes a low pass.)

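Synthesis:

(Figure: synthesis filter bank. Upsampling is done by insertion of a zero after each sample; the filtered branches are summed at the adders marked "+".)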
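The tree structure with analysis, downsampling, zero insertion, and synthesis can be sketched in one dimension. To keep the example short, the sketch swaps in the 2-tap Haar filters for the (9,7) pair, so that no boundary handling is needed; for images the same steps would be applied along rows and then along columns. All function names are made up for this illustration.

```python
def analysis(x):
    """2-band Haar analysis: filtering and downsampling by 2 in one step."""
    half = len(x) // 2
    low = [(x[2 * k] + x[2 * k + 1]) / 2 for k in range(half)]
    high = [(x[2 * k] - x[2 * k + 1]) / 2 for k in range(half)]
    return low, high

def upsample(v):
    """Insert a zero after each sample."""
    out = []
    for s in v:
        out.extend([s, 0.0])
    return out

def fir(u, taps):
    """Causal FIR filtering, output as long as the input."""
    return [sum(t * u[n - k] for k, t in enumerate(taps) if n - k >= 0)
            for n in range(len(u))]

def synthesis(low, high):
    """2-band Haar synthesis: upsample, filter, and add the two branches."""
    y_low = fir(upsample(low), [1.0, 1.0])
    y_high = fir(upsample(high), [1.0, -1.0])
    return [a + b for a, b in zip(y_low, y_high)]

def tree_analysis(x, levels):
    """Cascade the 2-band split on the low band only."""
    bands = []
    low = x
    for _ in range(levels):
        low, high = analysis(low)
        bands.append(high)
    return low, bands

def tree_synthesis(low, bands):
    """Undo the cascade, starting from the deepest level."""
    for high in reversed(bands):
        low = synthesis(low, high)
    return low

# Hypothetical test signal of length 8, split over 2 levels.
x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 3.0]
low, bands = tree_analysis(x, 2)
x_rec = tree_synthesis(low, bands)
print(low, bands)
print(x_rec == x)
```

Only the low band is split again at each level, which gives the finer frequency resolution at low frequencies mentioned above.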