STUDY AND IMPLEMENTATION OF THE MATCHING PURSUIT ALGORITHM AND QUALITY COMPARISON WITH DISCRETE COSINE TRANSFORM IN AN MPEG2 ENCODER OPERATING AT LOW BITRATES

Vidhya.N.S. Murthy
Student I.D. 1000602564
Project report for Multimedia Processing course (EE5359) under Dr. K.R. Rao

Introduction

Existing video coding standards produce a number of unacceptable artifacts, such as blockiness and unnatural object motion, when operated at very low bit rates. Because these techniques exploit only the statistical dependencies in the signal at the block level and do not consider the semantic content of the video, artifacts appear at the block boundaries at very low bit rates (high quantization factors). These block boundaries usually do not correspond to the physical boundaries of the moving objects, so the artifacts are visually annoying. Unnatural motion arises when the limited bandwidth forces the frame rate below that required for smooth motion. Hence there is a need for newer techniques that improve coding efficiency. Although standards like H.264 have pushed compression ratios higher at the cost of increased computational complexity, considerable scope remains for improving compression performance in error-prone, low-bitrate environments. One such approach is an algorithm called matching pursuits.

Matching Pursuits

The DCT is part of nearly all video coding standards. This popularity can be attributed to the fact that the DCT performs well in a wide variety of coding situations. Unfortunately, block-based DCT systems have trouble coding sequences at very low bit rates. At rates below 20 kb/s, the number of coded DCT coefficients becomes very small, and each coefficient must be represented at a very coarse level of quantization. The resulting coded images have noticeable distortion, and block edge artifacts can be seen in the reconstruction. In the matching pursuit approach, the block-DCT residual coder is replaced with a coding method that behaves better at low rates. Instead of expanding the motion residual signal on a complete basis such as the DCT, the signal is expanded on a larger, more flexible basis set.
Such an overcomplete basis contains a wider variety of structures than the DCT basis and is therefore better able to represent the residual signal using fewer coefficients. The expansion is done using a multistage technique called matching pursuits [5]. This technique was developed for signal analysis by Mallat and Zhang [1] and is related to earlier work in statistics [2].

The matching pursuit algorithm, as proposed by Mallat and Zhang [1], expands a signal using an overcomplete dictionary of functions. An overcomplete (or redundant) dictionary is a redundant set of basis functions. For example, consider an N-dimensional vector in the space $\mathbb{R}^N$. Any N mutually orthogonal vectors form a complete basis for all the vectors in $\mathbb{R}^N$. If further functions are added to this set, an overcomplete basis, or redundant dictionary, is produced. This redundancy allows a vector to be represented in different ways. Algorithms like matching pursuits try to find the sparsest representation of a signal or vector using an overcomplete dictionary.

The procedure can be illustrated with the decomposition of a one-dimensional (1-D) time signal. Suppose a signal $h(t)$ is to be represented using basis functions from a dictionary set $G$, whose individual functions are denoted

$g_k(t) \in G$    (1)

Here $k$ is an indexing parameter associated with a particular dictionary element. The decomposition begins by choosing $k$ to maximize the absolute value of the inner product

$p = \langle h(t), g_k(t) \rangle$    (2)

where $p$ is the expansion coefficient of the signal onto the dictionary function. A residual signal is then computed as

$R(t) = h(t) - p \, g_k(t)$    (3)

This residual signal is then expanded in the same way as the original signal. The procedure continues iteratively until either a set number of expansion coefficients has been generated or some energy threshold for the residual is reached.
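
The iteration in equations (2) and (3) can be sketched in a few lines of Python. This is only an illustrative sketch and not the code used in this project: the function names, the toy dictionary in $\mathbb{R}^2$ and the stopping threshold are all assumptions made here, and the dictionary elements are assumed to have unit norm.

```python
import math

def inner(u, v):
    """Inner product of two equal-length real sequences."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(inner(v, v))
    return [a / n for a in v]

def matching_pursuit(h, dictionary, max_atoms, energy_threshold=0.0):
    """Greedy expansion of h on unit-norm dictionary elements.

    Returns a list of (k, p) pairs and the final residual.
    """
    residual = list(h)
    atoms = []
    for _ in range(max_atoms):
        # Stop once the residual energy falls to the threshold.
        if inner(residual, residual) <= energy_threshold:
            break
        # Equation (2): choose k maximizing |<R, g_k>|.
        k = max(range(len(dictionary)),
                key=lambda j: abs(inner(residual, dictionary[j])))
        p = inner(residual, dictionary[k])
        # Equation (3): subtract the projection onto g_k.
        residual = [r - p * g for r, g in zip(residual, dictionary[k])]
        atoms.append((k, p))
    return atoms, residual

# Toy overcomplete dictionary in R^2 (hypothetical): the standard basis
# plus one extra diagonal element.
dictionary = [
    [1.0, 0.0],
    [0.0, 1.0],
    normalize([1.0, 1.0]),
]
h = [3.0, 3.0]
atoms, residual = matching_pursuit(h, dictionary, max_atoms=4,
                                   energy_threshold=1e-12)
print(atoms, residual)
```

In this toy case the diagonal element alone represents the signal, whereas the two standard basis vectors by themselves would need two coefficients. This is the sense in which a redundant dictionary can give a sparser representation.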
Each stage $n$ yields a dictionary structure specified by $k_n$, an expansion coefficient $p_n$, and a residual $R_n$ which is passed on to the next stage. After a total of $M$ stages, the signal can be approximated by a linear combination of the dictionary elements

$\hat{h}(t) = \sum_{n=1}^{M} p_n \, g_{k_n}(t)$    (4)

This technique has some very useful signal representation properties. For example, the dictionary element chosen at each stage is the element which provides the greatest reduction in mean square error between the true signal $h(t)$ and the coded signal $\hat{h}(t)$. In this sense, the signal structures are coded in order of importance, which is desirable in situations where the bit budget is very limited. For image and video coding applications, this means that the most visible features tend to be coded first. Weaker image features are coded later, if at all. It is even possible to control which types of image features are coded well by choosing dictionary functions to match the shape, scale, or frequency of the desired features.

An interesting feature of the matching pursuit technique is that it places very few restrictions on the dictionary set. The original Mallat and Zhang paper [1] considers both Gabor and wavepacket function dictionaries, but such structure is not required by the algorithm itself. Mallat and Zhang showed that if the dictionary set is at least complete, then $\hat{h}(t)$ will eventually converge to $h(t)$, though the rate of convergence is not guaranteed [1]. Convergence speed, and thus coding efficiency, is strongly related to the choice of dictionary set. However, true dictionary optimization can be difficult since there are so few restrictions.
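
The mean square error property stated above can be checked in a short derivation. It assumes that every dictionary element has unit norm, which this report does not state explicitly, and it writes $R_0 = h$ for the input to the first stage (notation introduced here for illustration).

```latex
\begin{aligned}
R_n &= R_{n-1} - p_n\, g_{k_n}, \qquad
p_n = \langle R_{n-1}, g_{k_n} \rangle, \qquad
\| g_{k_n} \| = 1 \\
\| R_n \|^2 &= \| R_{n-1} \|^2
  - 2\, p_n \langle R_{n-1}, g_{k_n} \rangle
  + p_n^2 \, \| g_{k_n} \|^2 \\
&= \| R_{n-1} \|^2 - p_n^2
\end{aligned}
```

Under this assumption the energy removed at stage $n$ is exactly $p_n^2$, so choosing the element with the largest $|p_n|$ gives the largest reduction in error. Summing over $M$ stages gives $\| h - \hat{h} \|^2 = \| h \|^2 - \sum_{n=1}^{M} p_n^2$.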

This method is next extended to the two-dimensional case of images. Neff and Zakhor use an overcomplete collection of 2-D Gabor functions [5], [13], [14]. The 1-D Gabor functions are defined as a set of scaled and modulated Gaussian windows:

$g_{\alpha}(i) = K_{\alpha} \cdot g\left(\frac{i - N/2 + 1}{s}\right) \cdot \cos\left(\frac{2\pi\xi\,(i - N/2 + 1)}{16} + \phi\right), \quad i \in \{0, 1, \ldots, N-1\}$    (5)

$g(t) = \sqrt[4]{2}\, e^{-\pi t^2}$    (6)

In (5) and (6), $\alpha$ is a triple $(s, \xi, \phi)$, where $s$ is the positive scale, $\xi$ is the modulation frequency and $\phi$ is the phase shift. The 2-D separable Gabor functions can therefore be specified as

$G_{\alpha,\beta}(i,j) = g_{\alpha}(i)\, g_{\beta}(j), \quad i, j \in \{0, 1, \ldots, N-1\}$    (7)

These functions form the dictionary set, which is pictured in Fig 1.

Fig 1: The 2-D separable Gabor dictionary with variable basis image sizes [5]

The separability property plays an important role in optimizing the performance of this technique. As an extension of the 1-D matching pursuit technique, the 2-D dictionary structures are examined at every integer pixel location of the image and the resulting inner products are computed. Henceforth only the 2-D case is discussed.

Implementing matching pursuits for video compression

Algorithm Breakdown

The algorithm consists of two major components:

1. Dictionary Design
Dictionary design is an important issue, since dictionaries can be designed to improve coding efficiency or to reduce complexity. For the current implementation an overcomplete 2-D Gabor dictionary was used.

2. Find atoms
When applied to a video codec, matching pursuit decomposes the motion residual into a weighted combination of basis functions over multiple stages. The basis function is searched for such that its inner product with the signal is a maximum
or above a particular threshold. An atom comprises the following parameters:

a. The parameters defining the basis function (scale factors, modulation frequencies and phases). These are defined by the triples that generate the 1-D basis functions from which the 2-D basis functions are built.
b. The coordinates of the position where the inner product was maximum. This is determined using the position coding method developed by Neff and Zakhor [13].
c. The value of the inner product.

The atoms are coded into the bit stream. The decoder reconstructs the residual error using the parameters of the atoms. In the current project, component 2 of the algorithm was implemented. The set of triples used for generating the basis images is taken from [5]. These are tabulated below.

Table 1: Dictionary triples and associated sizes

Table 1 shows varying basis image sizes. This makes it possible for the basis to adapt effectively to various kinds of discontinuities in the picture. However, in the current implementation, for the sake of simplicity, all the basis images are of size 16x16. Using all of the above combinations, 400 basis images were obtained.

A DCT-based MPEG2 encoder block diagram is shown in Figure 2. In the current implementation, the matching pursuits module replaces the DCT and IDCT modules in an MPEG2 encoder. The encoder source code is from [8]. The matching pursuits algorithm is applied to the motion residual alone, i.e. the residue generated for P and B frames. Motion residual errors are smaller in MPEG2, whereas the energy content of the I-frame residues is larger, which means that the algorithm would take more iterations to converge and would therefore produce a larger number of atoms.
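
As an illustration of how the fixed-size 16x16 dictionary could be generated from equations (5) to (7), the sketch below samples the 1-D Gabor functions and forms their separable products. It is a sketch under stated assumptions, not the project code: the sample offset in equation (5) is taken to be i - N/2 + 1, the constant $K_{\alpha}$ is assumed to normalize each 1-D function to unit norm, and the triples listed are placeholders, not the entries of Table 1.

```python
import math

N = 16  # basis image size used in this implementation

def gabor_1d(s, xi, phi, n=N):
    """Sample equation (5); the final division by the norm plays the role
    of K_alpha (unit-norm normalization is an assumption made here)."""
    def window(t):
        # Equation (6): g(t) = 2^(1/4) * exp(-pi * t^2)
        return 2.0 ** 0.25 * math.exp(-math.pi * t * t)

    raw = []
    for i in range(n):
        u = i - n / 2 + 1  # sample offset used in equation (5)
        raw.append(window(u / s) * math.cos(2.0 * math.pi * xi * u / 16.0 + phi))
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

def gabor_2d(alpha, beta):
    """Equation (7): separable product G[i][j] = g_alpha(i) * g_beta(j)."""
    ga = gabor_1d(*alpha)
    gb = gabor_1d(*beta)
    return [[a * b for b in gb] for a in ga]

# Placeholder triples (s, xi, phi) for illustration only.
# These are NOT the triples of Table 1.
triples = [
    (1.0, 0.0, 0.0),
    (5.0, 0.0, 0.0),
    (5.0, 1.0, math.pi / 2),
    (4.0, 2.0, 0.0),
]

# Every ordered pair (alpha, beta) gives one 2-D basis image.
dictionary = [gabor_2d(a, b) for a in triples for b in triples]
print(len(dictionary))
```

With T triples this construction yields T x T basis images, so a table of 20 triples would give the 400 basis images mentioned above. Because each image is a product of two 1-D functions, only the T 1-D functions need to be stored.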

Fig 2: DCT-based MPEG-2 encoder (blocks: Video In, Motion Estimation, DCT, Quantization, VLC, Bitstream, Inverse Quantization, IDCT, Frame Predictor, Reference frames). The modules boxed by dotted lines will be replaced by matching pursuits in an MPEG-2 encoder

Fig 3: Matching pursuits incorporated into a video encoder [5]

The atom search, or find atoms, stage is explained next with the help of the flowchart shown in Figure 4. The atoms are found in num_iter stages or iterations, so there are num_iter atoms at the end of the procedure. The motion residue is generated in the conventional manner. This residue is the input signal to the matching pursuits module. The residue signal is divided into blocks of size 8x8 and the energy of each block is calculated. The block with the highest energy is found and a search window of size 16x16 is defined around the center of the block. Each basis image is then centered
around each location in the search window and the corresponding inner products are found. Once the search has been completed around each location, the resulting inner products are compared, and the basis image yielding the maximum inner product at a location (x,y) in the residue signal is designated as an atom. The atom is reconstructed and subtracted from the residue, which yields the signal for the next stage. The process is repeated until the number of iterations equals num_iter. The process is the same for luma and chroma samples.

Fig 4: Flowchart for position coding method for atom search (generated using Edraw Mind Map tool)

Results

The results of this experiment are as follows. First, the effect of increasing the number of stages, i.e. the number of atoms, is shown for two QCIF sequences, Hall monitor and Foreman, in Figure 5. The reconstructed pictures are shown in Figure 6 and are compared with MPEG2 encoded pictures at 20 kbps. The comparison is carried out on luma components alone. The coding method captures the features of the image in a hierarchical order of importance. This property imparts inherent scalability to the coding. Figure 5 shows how the reconstructed motion residue is refined as the number of coded atoms increases. Figure 6 shows reconstructed pictures with an increasing number of atoms. The position coding method uses approximately 24 bits on average to code an atom [5]. Figure 7 depicts the degradation in picture quality due to blocking artifacts in MPEG2 encoded pictures. The Hall and Foreman pictures were encoded as MPEG2 P frames at 20 kbps. Because a larger number of structures is available in the basis images for comparison with the residue signal, the signal is better approximated using the matching pursuits method.
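
The atom search procedure described before the results (8x8 block energies, a 16x16 search window around the highest-energy block, selection of the best inner product and subtraction of the atom) can be sketched as follows. This is a simplified illustration and not the project source: the function names, the centering convention, the treatment of pixels outside the picture as zero, and the use of the absolute value of the inner product for selection are assumptions made here. The demo uses two hypothetical 4x4 basis images rather than the Gabor dictionary.

```python
def highest_energy_block(residual, block=8):
    """Top-left corner (row, col) of the block with the largest energy."""
    rows, cols = len(residual), len(residual[0])
    best_energy, best_corner = -1.0, (0, 0)
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            energy = sum(residual[r][c] ** 2
                         for r in range(r0, min(r0 + block, rows))
                         for c in range(c0, min(c0 + block, cols)))
            if energy > best_energy:
                best_energy, best_corner = energy, (r0, c0)
    return best_corner


def overlap(residual, basis, cy, cx):
    """Yield (r, c, basis_value) where the centered basis overlaps the picture."""
    n = len(basis)
    half = n // 2
    rows, cols = len(residual), len(residual[0])
    for i in range(n):
        r = cy - half + i
        if 0 <= r < rows:
            for j in range(n):
                c = cx - half + j
                if 0 <= c < cols:
                    yield r, c, basis[i][j]


def find_atoms(residual, dictionary, num_iter, block=8, window=16):
    """Return (atoms, final_residual); each atom is (k, y, x, p)."""
    residual = [list(row) for row in residual]  # work on a copy
    rows, cols = len(residual), len(residual[0])
    atoms = []
    for _ in range(num_iter):
        r0, c0 = highest_energy_block(residual, block)
        cy, cx = r0 + block // 2, c0 + block // 2
        best = None  # (|p|, p, k, y, x)
        # Search every location of the window that lies inside the picture.
        for y in range(max(0, cy - window // 2), min(rows, cy + window // 2)):
            for x in range(max(0, cx - window // 2), min(cols, cx + window // 2)):
                for k, basis in enumerate(dictionary):
                    p = sum(residual[r][c] * g
                            for r, c, g in overlap(residual, basis, y, x))
                    if best is None or abs(p) > best[0]:
                        best = (abs(p), p, k, y, x)
        _, p, k, y, x = best
        # Reconstruct the atom and subtract it to form the next-stage residue.
        for r, c, g in overlap(residual, dictionary[k], y, x):
            residual[r][c] -= p * g
        atoms.append((k, y, x, p))
    return atoms, residual


# Tiny demo with two hypothetical 4x4 unit-norm basis images.
flat = [[0.25] * 4 for _ in range(4)]
checker = [[0.25 if (i + j) % 2 == 0 else -0.25 for j in range(4)]
           for i in range(4)]
demo_dictionary = [flat, checker]

demo_residual = [[0.0] * 16 for _ in range(16)]
for r in range(2, 6):
    for c in range(2, 6):
        demo_residual[r][c] = 5.0 * 0.25  # flat basis scaled by 5, centered at (4, 4)

atoms, final_residual = find_atoms(demo_residual, demo_dictionary, num_iter=1)
print(atoms)
```

Each returned tuple holds the three quantities that make up an atom in this report: the dictionary index k, the position (y, x) and the inner product value p. The three nested loops over window locations and basis images are where the cost of the exhaustive search comes from.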

Figure 5: Atom decomposition of Hall and Foreman. (a) Motion residue generated for Hall. (b) First 5 coded atoms of Hall. (c) First 32 coded atoms of Hall. (d) First 64 coded atoms of Hall. (e) Foreman motion residue. (f) First 5 coded atoms of Foreman. (g) First 32 coded atoms of Foreman. (h) First 64 coded atoms of Foreman.

Figure 6: Reconstruction of the Foreman and Hall sequences using 5 atoms in (a) and (d), 32 atoms in (b) and (e), and 64 atoms in (c) and (f).

Figure 7: The same Hall and Foreman frames encoded using MPEG2 at 20 kbps. The blocking artifacts due to the DCT at low bitrates are clearly visible.

Algorithm Complexity and Implementation Issues

This implementation assumed ample processing power and off-line encoding. The matching pursuit algorithm in this particular implementation examines all 2-D structures of the dictionary set at a large number of integer pixel locations in the picture to find the closest matching atom. This makes the search very expensive, and the cost grows further with increasing picture dimensions. To give an idea of the number of calculations involved: the implementation uses 400 basis images of size 16x16, whose entries are floating point values (IEEE 754). A QCIF image (dimensions 176x144) has 25344 pixels. If the image is coded using 64 atoms, then 64*16*16 = 16384 locations (search window size 16x16) have to be searched using 400 basis images of size 16x16 at each location. With 256 multiplications per inner product, this corresponds to 16384*400*256, or about 1.68 billion, floating point multiplications. By comparison, an 8x8 DCT involves 64 basis vectors of 64 entries each for 396 blocks, so even without a fast DCT the total is 396*64*64, or about 1.6 million, floating point multiplications. Hence the number of operations increases by a factor of about 1000. This is one of the crucial factors to be considered if matching pursuits were to be incorporated into existing video compression standards.

Proposed Fast Methods

Several approaches have been proposed to speed up the algorithm and make it useful for real-time encoding and decoding. One approach is described in [5]. This method exploits the separability of the 2-D Gabor basis functions. A more recent
approach [9] splits the residue signal in a picture into 4 sub-bands, constructs a dictionary for each sub-band and then performs the atom search. This method reduces complexity because the inner products become lighter: both the resolution of the sub-band image and the length of the basis functions in the dictionary are reduced. Yet another approach [10] converts the matching pursuits problem into a vector quantization problem and makes use of available fast vector quantization algorithms to achieve speed. The method in [11] organizes the dictionary into a tree structure so that the atom search is directed, proceeding through groups of similar basis functions. Reference [12] proposes integer matching pursuits, which helps to eliminate floating point operations.

Conclusions

This implementation demonstrated the effectiveness of a matching pursuit video encoder. Although this coding paradigm is very effective at low bitrates, it is computationally very complex. Future enhancements will therefore focus on reducing the number of searches and on finding better dictionaries, which in turn will also help to reduce the number of searches.

Software

The software can be downloaded from [15].

References

[1] Z. Zhang and S. Mallat, Matching pursuit with time-frequency dictionaries, IEEE Transactions on Signal Processing, Vol. 41, No. 12, pp. 3397-3415, Dec. 1993.
[2] J. H. Friedman and W. Stuetzle, Projection pursuit regression, J. Amer. Stat. Assoc., vol. 76, no. 376, pp. 817-823, Dec. 1981.
[3] F. Bergeaud and S. Mallat, Matching pursuit of images, Image Processing, 1995. ICIP 1995. IEEE International Conference on, pp. 53-56, Sept. 1995.
[4] M. Vetterli and T. Kalker, Matching pursuit for compression and application to motion compensated video coding, Image Processing, 1994. ICIP 1994. IEEE International Conference on, pp. 724-729, Nov. 1994.
[5] R. Neff and A.
Zakhor, Very-Low Bit-Rate Video Coding Based on Matching Pursuits, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No. 1, pp. 158-171, Feb. 1997.
[6] J. Pearl, H. C. Andrews, and W. K. Pratt, Performance measures for transform data coding, IEEE Trans. Commun., vol. COM-20, pp. 411-415, June 1972.
[7] P. Yip and K. R. Rao, Energy packing efficiency for the generalized discrete transforms, IEEE Trans. Commun., vol. COM-26, pp. 1257-1261, Aug. 1978.
[8] Open software on MPEG2, http://www.mpeg.org/mpeg/video/mssg-free-mpeg-software.html.
[9] K. Imammura et al., A fast matching pursuits algorithm based on sub-band decomposition of video signals, IEEE ICME 2006, pp. 729-732, July 2006.
[10] K. Cheung and Y. Chan, An efficient algorithm for realizing matching pursuits and its applications in MPEG4 coding system, Image Processing, 2000. ICIP 2000. IEEE International Conference on, Vol. 2, pp. 863-866, Sept. 2000.
[11] A. Shoa and S. Shirani, Tree structure search for matching pursuit, Image Processing, 2005. ICIP 2005. IEEE International Conference on, Vol. 3, pp. 908-911, Sept. 2005.
[12] R. Neff et al., Decoder complexity and performance comparison of matching pursuit and DCT based MPEG 4 video codecs, Image Processing, 1998. ICIP 98. Proceedings. 1998 International Conference on, Vol. 1, pp. 783-787, Oct. 1998.
[13] R. Neff, A. Zakhor, and M. Vetterli, Very low bit rate video coding using matching pursuits, in Proc. SPIE VCIP, vol. 2308, no. 1, pp. 47-60, Sept. 1994.
[14] R. Neff and A. Zakhor, Matching pursuit video coding at very low bit rates, in IEEE Data Compression Conf., Snowbird, UT, pp. 411-420, Mar. 1995.
[15] http://cnx.org/content/expanded_browse_authors?letter=m&author=vmurthy.