A Novel Approach for Image Compression using Matching Pursuit Signal Approximation and Simulated Annealing

Ahmed M. Eid Amin (ahmedamin@ieee.org)

Supervised by: Prof. Dr. Samir Shaheen and Prof. Dr. Amir Atiya

Computer Engineering Department, Faculty of Engineering, Cairo University, Giza, Egypt

August 12, 2005

Abstract

Signal approximation using a linear combination of basis vectors from an overcomplete dictionary has been proven to be an NP-complete problem. By selecting fewer basis vectors than the dimension of the signal, we achieve lossy compression in exchange for a small reconstruction error. Several algorithms have been proposed that reduce the complexity of the selection problem at the cost of the optimality of the solution. The Matching Pursuit (MP) algorithm has been used for signal approximation for over a decade. Many variations have been proposed and implemented to enhance the performance of the algorithm; however, its greedy nature renders it sub-optimal. In this thesis, a survey of the different variations is provided. An enhancement of the MP algorithm is proposed that uses concepts from simulated annealing to improve performance in terms of compression ratio and reconstructed quality. The algorithm is then applied to image signals. Results show superior compression to image compression standards for the same quality.

Contents

1 Introduction
  1.1 Problem Definition and Motivation
    1.1.1 Signal Representation
    1.1.2 Dictionary Based Representation
    1.1.3 Signal compression using sparse dictionaries
  1.2 Thesis Outline

2 Survey
  2.1 Introduction
  2.2 Compression Methods
  2.3 Image Compression Standards
    2.3.1 JPEG
    2.3.2 JPEG2000
  2.4 Basis Dictionaries
    2.4.1 Generating the basis dictionary
    2.4.2 Properties of the dictionary
  2.5 Selection algorithms
    2.5.1 Method of Frames
    2.5.2 Matching Pursuit
    2.5.3 Orthogonal Matching Pursuit
    2.5.4 Basis Pursuit
    2.5.5 Natarajan's Order Recursive Matching Pursuit
    2.5.6 Backward Elimination
    2.5.7 Forward-backward selection
  2.6 Comparisons

3 Augmenting Dictionaries for Image Compression
  3.1 Introduction
  3.2 Dictionary partitions
  3.3 Results
  3.4 Conclusion

4 Matching Pursuit with Simulated Annealing
  4.1 Simulated Annealing
  4.2 Subset selection from a large search space
  4.3 Matching Pursuit with Simulated Annealing
  4.4 Algorithm requirements
    4.4.1 Inputs
    4.4.2 E
    4.4.3 Annealing schedule T
    4.4.4 Initial Pursuit
  4.5 Parameter Simulation
  4.6 Modified Matching Pursuit with Simulated Annealing
  4.7 Results

5 Results with Quantization and Comparing to the DCT

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work

A Algorithms

List of Figures

2.1 Image signal
2.2 Image Signal after Fourier Transform
2.3 JPEG encoder
2.4 DC component encoding
2.5 Zigzag ordering of AC coefficients
2.6 JPEG decoder
2.7 Example of a JPEG encoding/decoding process
2.8 JPEG2000 encoder
2.9 Dyadic decomposition
2.10 Example of a dyadic decomposition
2.11 Dictionary not covering the full signal space
2.12 Dictionary covering the full signal space
2.13 Time-frequency localization
2.14 Haar basis with different parameters
2.15 2D Gabor function
2.16 Original Image "Nature"
2.17 Matching Pursuit applied on the image "Nature"
2.18 ORMP applied on the image "Nature"
2.19 Comparison of Forward Selection Algorithms
3.1 Cosine basis
3.2 Mexican hat wavelet
3.3 Symmlet 4 wavelet
3.4 Daubechies 4 wavelet
3.5 Gauss pulse
3.6 2D Sinc basis
3.7 Geometrical basis
4.1 Graph of results in Table 4.3
4.2 MPSA applied to the image "Nature" with update equation ∆E^(r) = ||R^(r)||^2 - ||R^(r-1)||^2
4.3 MPSA applied to the image "Nature" with update equation ∆E^(r) = ||x^(r)||^2 - ||x^(r-1)||^2
4.4 M-MPSA applied to the image "Nature" with update equation ∆E^(r) = ||R^(r)||^2 - ||R^(r-1)||^2
4.5 M-MPSA applied to the image "Nature" with update equation ∆E^(r) = ||x^(r)||^2 - ||x^(r-1)||^2
4.6 Comparison of MP with the MPSA algorithms
5.1 Standard test images
5.2 Lena, PSNR versus bit rate
5.3 Lena, Number of dot product calculations versus compression ratio
5.4 Peppers, PSNR versus bit rate
5.5 Peppers, Number of dot product calculations versus compression ratio
5.6 Boat, PSNR versus bit rate
5.7 Boat, Number of dot product calculations versus compression ratio
5.8 Test Pattern, PSNR versus bit rate
5.9 Test Pattern, Number of dot product calculations versus compression ratio

List of Tables

2.1 Example of Huffman encoding
2.2 Comparison of Forward Selection Algorithms
3.1 PSNR values obtained for different dictionary combinations
4.1 Comparison of ∆E update methods
4.2 Number of runs exceeding the PSNR of MP for different parameters
4.3 Number of k, InitialCount combinations exceeding the PSNR of the MP algorithm
4.4 Comparison of forward selection algorithms with the proposed algorithms

List of Algorithms

4.1 Subset selection using simulated annealing
4.2 Matching Pursuit with Simulated Annealing
A.1 Matching Pursuit algorithm
A.2 Natarajan's ORMP algorithm
A.3 Simple Greedy Algorithm

List of Symbols

b : signal
m : signal length
n : dictionary size
a_k : basis vector
A : dictionary
x : coefficient vector
ε : error tolerance
Γ : currently selected basis from the dictionary
r : iteration number
l : desired number of basis vectors (compression)
s : number of selected basis vectors
⟨a, b⟩ : dot product of vectors a and b

Chapter 1

Introduction

Image compression has been a widely studied field for a long time. Several methods are available, and standards exist that provide high compression ratios for a given subjective quality. Although storage media are becoming cheaper, the need for compressed images still holds for communication over limited-capacity networks, or for storage on pervasive computing devices with limited storage such as hand-held devices. Several applications also suffer from large image data sets. Properties such as high resolution and high pixel depth lead to large images. Examples of such applications are space images and medical images. Take, for example, a high resolution X-ray image. Most of the image is a black background, with large areas of white bone. It makes sense to try to encode most of the background as one entity rather than encode every pixel in the image. The main objective of any data compression algorithm is to exploit any correlation or similarities within the data so that they are represented once rather than for every occurrence. Image compression also exploits the fact that the human visual system is less sensitive to high frequency changes (e.g. edges) than to low frequency changes.

1.1 Problem Definition and Motivation

1.1.1 Signal Representation

Any image may be regarded as a discrete 2D signal comprised of p × q pixels or samples. In a binary black-and-white image, each pixel value is either 0 or 1 and thus takes one bit of storage. For grayscale images, each pixel value lies between 0 (black) and 255 (white), and requires 8 bits of storage. Color images come in different pixel depths defining the number of colors available for representing the pixel values. In a 16-color image system, each pixel requires 4 bits, giving a total of 16 colors. Higher resolution images are usually encoded as {R,G,B} byte records, where each byte encodes the intensity of red, green and blue colors. This is the simplest form of representation. Other forms exist that do not necessarily represent the image in the spatial domain. The Fourier transform (2.1) is one example, which encodes the image in the frequency domain. We will limit our discussion for the rest of this thesis to grayscale images.

It is also common to treat each block of the image separately, where each block is (usually) a square subset of the image. The JPEG standard described in section 2.3.1 partitions the image into 8×8 blocks. This partitioning is done to alleviate the complexity of image processing algorithms, which usually increases with the number of pixels/samples being considered. A main disadvantage of such partitioning is the appearance of blocking artifacts on the boundaries of neighboring blocks. This is due to the fact that discontinuities occur in the signal being considered between neighboring blocks. In image compression, larger blocks give higher compression, but require much more processing time. The actual choice of block size depends on the complexity of the algorithm, the desired fidelity, and the properties of the processing being applied. Blocks are usually square, and common block sizes are powers of 2 (e.g. 8×8, 16×16, 64×64, ...).
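For illustration, a minimal numpy sketch of the block partitioning step described above; the function name, the 8×8 default, and the assumption that the image dimensions are exact multiples of the block size are illustrative choices, not part of the thesis.

```python
import numpy as np

def to_blocks(image, block=8):
    """Split a grayscale image into non-overlapping block x block tiles.

    Assumes the image dimensions are exact multiples of the block size;
    real encoders (e.g. JPEG) pad the right and bottom edges instead.
    """
    h, w = image.shape
    return (image.reshape(h // block, block, w // block, block)
                 .swapaxes(1, 2)              # -> (rows, cols, block, block)
                 .reshape(-1, block, block))

# Example: a random 256 x 256 8-bit image gives 1024 blocks of 8 x 8 pixels.
img = np.random.randint(0, 256, (256, 256))
print(to_blocks(img).shape)   # (1024, 8, 8)
```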

1.1.2 Dictionary Based Representation

Any signal b can be represented as

b = \sum_{k=1}^{n} x_k a_k \qquad (1.1)

or in matrix form

b = Ax \qquad (1.2)

A \in R^{m \times n} is called a dictionary. The signal b \in R^m is represented as a weighted sum of the basis vectors a_k, k = {1, ..., n}, where x holds the weighting coefficients. Every column a_k \in R^m of A is called a basis or kernel, and we have a total of n basis vectors. In general, we want to obtain the coefficient vector x that minimizes the error. For the rest of the discussion we will try to minimize the least squares error

\min_x \|b - Ax\|_2 \le \epsilon \qquad (1.3)

or

b = Ax + R \qquad (1.4)

where R is the residual. In this case x may be obtained using the pseudo-inverse of A:

Ax = b \qquad (1.5)

x = A^{\dagger} b \qquad (1.6)

or

x = (A^T A)^{-1} A^T b \qquad (1.7)

After that, the signal b may be reconstructed using equation (1.2). If m = n (and assuming that A is full rank), then we get a perfect reconstruction of b. If n > m, the system is underdetermined and many solutions exist; among them we may seek a sparse vector x. If n < m then we will not get a perfect reconstruction of the signal b. If the dictionary elements cover the signal space and have similar properties to the signal, then we may indeed require fewer basis elements than signal samples to represent the signal with acceptable distortion.
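As a concrete illustration of equations (1.2) and (1.7), the following sketch computes the least-squares coefficient vector x for a small random dictionary using numpy's pseudo-inverse; the dictionary and signal here are arbitrary stand-ins, not the basis functions used later in the thesis.

```python
import numpy as np

m, n = 16, 16                     # signal length and dictionary size (m = n here)
rng = np.random.default_rng(0)

A = rng.standard_normal((m, n))   # dictionary: one basis vector per column
b = rng.standard_normal(m)        # signal to be represented

# x = (A^T A)^{-1} A^T b, computed via the pseudo-inverse, equation (1.7)
x = np.linalg.pinv(A) @ b

b_hat = A @ x                     # reconstruction b = A x, equation (1.2)
print(np.linalg.norm(b - b_hat))  # ~0 for full-rank A with m = n; with n < m a
                                  # non-zero residual R = b - A x remains
```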

1.1.3 Signal compression using sparse dictionaries

As discussed in section 1.1.2, if the number of basis elements n is much greater than the length of the signal b (n >> m), we face the problem of selecting the best basis vectors from the dictionary A. By applying equations (1.6) or (1.7), the resulting vector x will contain n coefficients. This is much greater than the m signal values that we already have! Therefore we need to select basis elements from the dictionary that:

- reduce the error of the reconstructed signal b below a certain tolerance ε, and
- have a maximum of l coefficients that achieve a desired compression ratio.

Furthermore, if we desire to compress the signal (or image) using a dictionary based representation, then we need to represent it using fewer values than the actual signal length. If the signal b \in R^m has m coefficients, and we represent it using l coefficients, l < m, then we have reached the desired result. The problem is in selecting the best l basis vectors, and calculating the resultant coefficient vector x. Several selection algorithms are discussed in the literature which perform basis selection based upon certain criteria [36], [38], [44], [46], [11], [8], [34], [30], [41], [13], [7]. Some authors also extend the discussion to complete image compression systems [29], [3], [35], [19]. Several issues arise when comparing such algorithms:

- sparseness of the solution
- complexity of the algorithm
- optimality of the solution
- selection criterion

- constraints on the generating dictionary

One approach to such a selection problem would be to perform an exhaustive search and find the basis elements which best represent the signal. This method is guaranteed to give the optimal result. However, it is certainly prohibitive for a large number of basis elements, since selecting l basis vectors from a dictionary of n elements requires examining \binom{n}{l} subsets. Suboptimal solutions exist and are discussed in section 2.5. Of these algorithms, the Matching Pursuit algorithm (section 2.5.2) is widely used and provides agreeable results, yet the solution it provides is suboptimal. It would be advantageous to build upon this algorithm using heuristic techniques to improve the selection process, and hence the reconstructed image quality, at only a small increase in the complexity of the algorithm.

1.2 Thesis Outline

Chapter 2 is a survey of image compression. Background information on dictionary based approaches is presented, as well as a survey of the existing methods for signal representation/compression. A comparison is done on some of the more significant methods, and results are presented. Chapter 3 discusses different dictionaries used in dictionary based image compression. In chapter 4 we propose a new algorithm based on simulated annealing, and discuss the different parameters for the algorithm. Results are given and compared to existing methods. The compression procedure is extended to quantize the resulting coefficients, and results are presented in chapter 5, including a comparison to the well-known DCT approach. A conclusion and areas for future work are given in chapter 6.

Chapter 2

Survey

2.1 Introduction

In this chapter we provide the necessary background information on image and signal compression, a literature survey of existing methods, and a comparison of these methods.

2.2 Compression Methods

Image compression is usually performed by the following steps:

- Preprocessing
- Transform
- Quantization
- Entropy Encoding

Preprocessing

The preprocessing phase may include image smoothing, noise elimination, or detection of regions of interest (ROI) for special handling.

Transform

The transform module performs the most effective compression. It is the core of many image compression systems. Several transforms pack the coefficients into a smaller subspace. Take for example the Fourier transform. Assume we have a 1-D image signal as shown in figure 2.1. If the signal is transformed using the well-known Fourier transform equation

X(\omega) = \int x(t) \, e^{-j 2\pi \omega t} \, dt \qquad (2.1)

we obtain the plot shown in figure 2.2. Most of the coefficients are concentrated around the zero frequency, while a few are concentrated around the higher frequencies. This coefficient packing effect reduces the number of bytes required to encode the image signal.

Figure 2.1: Image signal

A very similar transform is the Discrete Cosine Transform (DCT), which is described in section 2.3.1. This is the transform used in the industry standard JPEG compression. An increasingly used transform is the wavelet transform, currently used in the JPEG2000 standard.

Figure 2.2: Image Signal after Fourier Transform

Quantization

Quantization is the process by which the signal values are represented by a set of predefined intervals or cells. If these intervals are uniformly distributed then we get a uniform quantizer:

\hat{b}_i = S_r \quad \text{where} \quad S_r \le b_i < S_{r+1} \qquad (2.2)

and \forall r: S_{r+1} - S_r = I, where b_i is the signal value to be quantized, S is the set of cells, S_r is the value of cell r, and I is the uniform interval between the cells. Quantization, in its own right, is a form of compression, since signal values are represented by a smaller subset of values. If, for example, the signal range is the set of all integers between 1 and 256, then we require a minimum of 8 bits to encode each value. With a uniform quantizer with interval I = 2, we have 128 cells, each requiring 7 bits to encode. The example gives higher compression if we have real valued samples rather than integers. This is a form of lossy compression since we cannot get an exact reconstruction of the signal. The error in the representation is b_i - S_r. Using larger intervals will lead to higher compression ratios, but will also lead to higher distortion in the approximated signal.
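A minimal sketch of the uniform quantizer in equation (2.2), assuming the cell values S_r are the integer multiples of the interval I and that each sample is mapped to the lower edge of its cell (rounding to the nearest cell value is an equally common convention):

```python
import numpy as np

def uniform_quantize(b, interval):
    """Map each sample to the lower edge of its cell, S_r <= b_i < S_{r+1}."""
    return np.floor(b / interval) * interval

b = np.array([3.2, 7.9, 100.4, 201.7])
b_hat = uniform_quantize(b, interval=2.0)
print(b_hat)                 # [  2.   6. 100. 200.]
print(np.abs(b - b_hat))     # per-sample quantization error, always < interval
```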

To extend our discussion of quantization, we define a codebook as a lookup table that provides the actual codes of the quantized elements. If we have 4 cells, a simple codebook would be {00, 01, 10, 11} to encode the cells {0, 1, 2, 3}. If we know that, statistically, most of the quantized elements fall under cell 2, we may opt to give this cell a shorter code. Several methods exist that optimize codebooks for a set of training data by observing the characteristics of such sets. Arithmetic encoding and Huffman encoding are two examples. These methods are examples of entropy encoders.

In addition to uniform quantization, a non-uniform quantizer is one that has partitions of different sizes. For some data representation problems, we may allow higher distortion rates (hence, larger cell intervals) for parts of the signal, thus reducing the codebook required to encode such a signal. For example, in image compression using the JPEG standard, the distortion of low frequency coefficients should be kept as small as possible, while we may tolerate higher distortion for high frequency components due to the insensitivity of the human eye to higher frequency changes. The quantization table and codebook may be designed to optimally represent a certain class of signals. Most of the available quantizers use Lloyd's algorithm to optimize quantization tables and codebooks by iteratively reducing the distortion error on a set of training data. Vector quantization is another technique that approximates complete vectors/signals with a single code, rather than quantizing each sample of the signal independently. This technique leads to better compression results in image processing [6]. Gray et al [22] provide an excellent review of quantization techniques.

The above discussion quantizes each value in the signal separately. Usually, we can achieve higher compression if we can predict the next value of a signal x based on previous values of the signal. This technique is called differential pulse code modulation (DPCM).

y(k) = p(1) x(k-1) + p(2) x(k-2) + \ldots + p(m) x(k-m) \qquad (2.3)

Here, y(k) aims to predict the value of x(k), p(j) is a predictor coefficient (p is an m-value vector), and m is called the predictive order. DPCM encodes the prediction error y - x. A good predictor that always gives an exact prediction of the signal will require encoding only a run of zeros (or small coefficients). This leads to a reduction in the number of bits required for coding. For the special case where m = 1, we get a delta predictor. The predictor coefficients are usually obtained by finding the values that minimize the prediction error y - x.
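The following sketch illustrates DPCM for the special case m = 1 with p(1) = 1 (the delta predictor): only the first sample and the successive prediction errors are stored, and the decoder rebuilds the signal by a running sum. The signal values are arbitrary.

```python
import numpy as np

x = np.array([100, 102, 103, 103, 104, 110, 111])   # slowly varying signal

# Encoder: delta predictor y(k) = x(k-1); transmit the prediction errors.
errors = np.diff(x)                  # [2, 1, 0, 1, 6, 1] -- small values, cheap to code
stream = np.concatenate(([x[0]], errors))

# Decoder: accumulate the errors to recover the original samples exactly.
x_rec = np.cumsum(stream)
print(np.array_equal(x, x_rec))      # True
```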

Entropy encoding

Entropy coding is a method by which different symbols are assigned codes that reflect the probability of each symbol occurring. Rather than assigning fixed length codes to each symbol, those symbols that have a high probability of occurrence are assigned shorter codes than less frequently appearing symbols. According to Shannon's theorem, the optimal code length for a symbol i is -\log_m P_i, where m is the size of the coding alphabet and P_i is the probability of occurrence of the input symbol i. Non-adaptive schemes perform statistical collection on the input stream before coding to extract the frequency of occurrence of symbols, and the encoder and decoder are assumed to have the same probability distributions. Sending this data is an overhead for small data streams. Adaptive schemes observe and modify the input symbol probabilities, hence there must be a separate means of communicating the new values. The values of the probabilities are updated at both the encoder and decoder when required. Two of the most common entropy encoding techniques are Huffman encoding and arithmetic encoding [43].

Huffman encoders first arrange the symbols in order of decreasing probability. Then the two least frequent symbols (assuming we are using binary output) are aggregated and their probabilities summed up to give a new entry in the table. This process is repeated until we have two remaining entries in the table, and each is assigned either a binary 0 or 1. An example is shown in table 2.1.

Table 2.1: Example of Huffman encoding

Input symbol    Reduction 1      Reduction 2        Reduction 3                Huffman code
x1 0.40         x1 0.40          x1 0.40            x2,x3,x4,x5 (0) 0.60       1
x2 0.20         x4,x5 0.25       x2,x3 (0) 0.35     x1 (1) 0.40                000
x3 0.15         x2 (0) 0.20      x4,x5 (1) 0.25                                001
x4 (0) 0.15     x3 (1) 0.15                                                    010
x5 (1) 0.10                                                                    011

If we used fixed length encoding for the 5 symbols, we would require 3 bits per symbol. After Huffman coding, the average number of bits per symbol is 1 bit × 0.40 + 3 bits × 0.60 = 2.2 bits.

Arithmetic encoders output a single real number for a stream of input symbols. The output is always in the range [0, 1). First, the symbol probabilities are calculated and listed, and the cumulative probability is calculated. Then, the first symbol is input, and the space [0, 1) is subdivided according to the input symbol's cumulative probability, resulting in a subspace [x, y). The process is repeated until we reach the end of the input stream. For example, assume we would like to encode the stream HELLO. The initial table is

Symbol   Probability   Range
E        0.20          0.00 - 0.20
H        0.20          0.20 - 0.40
L        0.40          0.40 - 0.80
O        0.20          0.80 - 1.00

Low_new  = Low + P_Low(symbol)  × (High - Low)
High_new = Low + P_High(symbol) × (High - Low)

Input   Low       High
        0.0       1.0
H       0.2       0.4
E       0.2       0.24
L       0.216     0.232
L       0.2224    0.2288
O       0.22752   0.2288

Hence, HELLO is encoded as 0.22752. The decoding is done in an inverse manner as shown below:

Code_new = (Code - Low) / Range

Code      Symbol   Low   High   Range
0.22752   H        0.2   0.4    0.2
0.1376    E        0.0   0.2    0.2
0.688     L        0.4   0.8    0.4
0.72      L        0.4   0.8    0.4
0.8       O        0.8   1.0    0

This floating point representation requires fewer bits than the standard fixed-length method. It is stated in [43] that arithmetic coding is more efficient than other encoding techniques; however, it suffers from a higher computational complexity. Several methods exist that use integer values and binary codes rather than floating point arithmetic.
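The interval-narrowing steps of the HELLO example can be reproduced with a few lines of code; this sketch covers only the encoding loop, with the symbol ranges taken directly from the table above, and is not a practical arithmetic coder (which would use integer arithmetic and incremental output).

```python
# Cumulative ranges [P_Low, P_High) for each symbol, as in the table above.
ranges = {'E': (0.00, 0.20), 'H': (0.20, 0.40), 'L': (0.40, 0.80), 'O': (0.80, 1.00)}

low, high = 0.0, 1.0
for s in "HELLO":
    p_low, p_high = ranges[s]
    width = high - low
    low, high = low + p_low * width, low + p_high * width
    print(s, round(low, 6), round(high, 6))   # reproduces the encoding table

print(round(low, 5))   # 0.22752 -- the code emitted for "HELLO"
```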

2.3 Image Compression Standards

2.3.1 JPEG

JPEG [45], named after the Joint Photographic Experts Group committee that designed it, is a widely used image compression standard. It provides both lossy and lossless compression. We will limit our discussion to lossy compression of grayscale images, where pixel values are in the range 0-255. Figure 2.3 shows the main blocks of the JPEG encoder (the figures in this section are extracted from [45]).

Figure 2.3: JPEG encoder

The image is partitioned into 8×8 blocks. The first step is the block FDCT (Forward Discrete Cosine Transform). The equation for the 8×8 2D DCT is

F(u, v) = \frac{1}{4} C(u) C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x, y) \cos\frac{(2x+1)\pi u}{16} \cos\frac{(2y+1)\pi v}{16} \qquad (2.4)

C(u), C(v) = \frac{1}{\sqrt{2}} \text{ for } u, v = 0; \quad 1 \text{ otherwise.}

The transform decomposes the block into frequency components. The top-left value at u, v = 0 can be regarded as the DC component, or the average of the block values. The rest of the 63 coefficients are the AC values. Coefficients close to the top-left component are lower frequency components, while those towards the bottom-right are higher frequency. Since the human eye is less sensitive to higher frequency changes, we may lose some of the data present in these coefficients. The DC coefficient contains most of the data and should be preserved, while most of the other values are very small or close to zero.

The next step is quantization. Each coefficient is quantized based on a quantization table whose entries are in the range 1-255, and

quantization is a simple division followed by rounding to the nearest integer:

F^Q(u, v) = \mathrm{Round}\left(\frac{F(u, v)}{Q(u, v)}\right) \qquad (2.5)

Less visually significant coefficients (higher frequency components) are divided by larger values to set them to zero. The quantization table is provided by the user or by the application, and the values are set to achieve the desired compression ratio.

The quantized coefficients then undergo an entropy encoding step. The DC coefficients are treated differently from the rest of the coefficients. Since the DC coefficient is the average of the pixel values in the block, there exists a significant correlation between the DC values of neighboring blocks. After encoding the DC coefficient of the first block in the image, subsequent DC values of the other blocks are encoded as the difference between the current term and the previous block's term. This is illustrated in figure 2.4. By encoding the smaller value differences, a significant reduction in storage bits is achieved.

Figure 2.4: DC component encoding

The rest of the coefficients (AC terms) in the block are ordered in a zigzag fashion as shown in figure 2.5. This places the lower frequency coefficients, which contain a higher fraction of the total image energy (they have higher values), before the higher frequency coefficients, which are very close to zero. The coefficients are then transformed into symbols.

Figure 2.5: Zigzag ordering of AC coefficients

The final step is entropy encoding these symbols using either Huffman coding or arithmetic coding [43]. Huffman encoding requires that the Huffman tables in the encoder and decoder be identical. The application may provide

default tables, or calculate the table by statistical gathering during the encoding process. Arithmetic encoding provides 5-10% higher compression than Huffman coding, but is more computationally intensive.

The decoding process is shown in figure 2.6 and is the reverse of the encoding process.

Figure 2.6: JPEG decoder

The first step involves extracting the coefficients using an entropy decoder, followed by a dequantization step:

F^{Q'}(u, v) = F^Q(u, v) \cdot Q(u, v) \qquad (2.6)

These values are input to the IDCT (Inverse Discrete Cosine Transform) block, where the decoded image pixel

values are obtained:

f(x, y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u) C(v) F(u, v) \cos\frac{(2x+1)\pi u}{16} \cos\frac{(2y+1)\pi v}{16} \qquad (2.7)

C(u), C(v) = \frac{1}{\sqrt{2}} \text{ for } u, v = 0; \quad 1 \text{ otherwise.}

Figure 2.7 shows a simple example of the different values obtained for the encoding and decoding of an image block (the example is extracted from [45]).

Figure 2.7: Example of a JPEG encoding/decoding process
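To make the pipeline concrete, the sketch below applies equations (2.4)-(2.7) to a single 8×8 block: forward DCT, quantization, dequantization, and inverse DCT. The flat quantization table of 16s is an arbitrary stand-in for the luminance table of the standard, and the block is random rather than taken from a real image.

```python
import numpy as np

N = 8
C = np.array([1 / np.sqrt(2)] + [1.0] * (N - 1))
idx = np.arange(N)
# T[u, i] = (1/2) C(u) cos((2i+1) u pi / 16); a 2D block f transforms as T f T^T.
T = 0.5 * C[:, None] * np.cos((2 * idx[None, :] + 1) * idx[:, None] * np.pi / (2 * N))

def fdct(block):
    return T @ block @ T.T          # forward 2D DCT, equation (2.4)

def idct(coeffs):
    return T.T @ coeffs @ T         # inverse 2D DCT, equation (2.7)

rng = np.random.default_rng(1)
block = rng.integers(0, 256, (N, N)).astype(float)   # one 8x8 block of pixel values
Q = np.full((N, N), 16.0)                             # toy flat quantization table

Fq = np.round(fdct(block) / Q)      # quantization, equation (2.5)
rec = idct(Fq * Q)                  # dequantization (2.6) followed by the IDCT (2.7)
print(np.max(np.abs(block - rec)))  # distortion introduced by the quantization step
```

Larger entries in Q zero out more coefficients and increase the reconstruction error, which is the rate-distortion trade-off described above.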

2.3.2 JPEG2000

JPEG2000 [23] provides superior compression and subjective quality to the baseline JPEG method, while also introducing other features for the more recent multimedia requirements [28]. Some of these features are:

- Superior low bit-rate quality
- Lossless and lossy compression
- Progressive transmission
- Region of interest coding
- Robustness to bit-errors

The encoder engine is shown in figure 2.8.

Figure 2.8: JPEG2000 encoder

The first part is the (optional) tiling of the image. This is exactly the same as partitioning the image into non-overlapping, equally sized blocks (the sizes may differ for the rightmost and lower tiles of the image). As in the JPEG standard, each tile is treated independently, and each undergoes the encoding process separately. Each tile undergoes a DC level shifting operation, where the DC value (or average) of the tile is subtracted from the component values. The next step is the transformation. JPEG2000 has two wavelet transformations, the irreversible Daubechies 9/7 tap filter and the Daubechies 5/3 tap filter, corresponding to lossy and lossless compression respectively. Each tile undergoes L levels of dyadic decomposition (figure 2.9), where, at each level, the tile is decomposed into downsampled low frequency and high frequency components in the horizontal and vertical directions.

Figure 2.9: Dyadic decomposition

An example is shown in figure 2.10. At the first level, the tile is decomposed (using the wavelet transform) into LL, LH, HL and HH sub-band coefficients, corresponding to (L)ow frequency horizontal-(L)ow frequency vertical (LL), (L)ow frequency horizontal-(H)igh frequency vertical (LH), and similarly for the HL and HH bands (figure 2.10(b)). At the next level, the LL band is again decomposed into four sub-bands (figure 2.10(c)), and the procedure is iterated for L levels. As in the DCT transform, the coefficients that preserve most of the energy are concentrated in the lower frequency sub-bands.

Quantization is the next step in the process; it results in zeroing out the smaller coefficients. Scalar quantization is used, with the quantization step depending on the dynamic range of the tile (the number of bits used to represent the original image tile) and on the choice of wavelet used in the transform. Lossless JPEG2000 uses a step size of 1 to ensure no information is lost in this step. The resulting coefficients are ordered according to their importance, and finally undergo binary arithmetic encoding. The probability estimation is adaptive, and since the resultant code is either a 0 or 1, the binary decisions can often be coded in much less than one bit per decision.

Figure 2.10: Example of a dyadic decomposition. (a) Original image; (b) L = 1 level decomposition; (c) L = 2 level decomposition.
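A minimal sketch of one level of the dyadic decomposition, using simple Haar averaging/differencing filters in place of the Daubechies 9/7 and 5/3 filters of the standard; it produces the LL, LH, HL and HH sub-bands of figure 2.10(b). The tile here is random and the filter choice is illustrative only.

```python
import numpy as np

def haar_step(a, axis):
    """One level of Haar analysis along the given axis: (low, high) halves."""
    a = np.moveaxis(a, axis, 0)
    low  = (a[0::2] + a[1::2]) / 2.0       # local averages (low-pass, downsampled)
    high = (a[0::2] - a[1::2]) / 2.0       # local differences (high-pass, downsampled)
    return np.moveaxis(low, 0, axis), np.moveaxis(high, 0, axis)

def dwt2_haar(tile):
    """Single-level 2D decomposition into LL, LH, HL, HH sub-bands."""
    lo, hi = haar_step(tile, axis=1)       # filter the columns (horizontal direction)
    ll, lh = haar_step(lo, axis=0)         # then the rows of the low band
    hl, hh = haar_step(hi, axis=0)         # and the rows of the high band
    return ll, lh, hl, hh

tile = np.random.rand(256, 256)
ll, lh, hl, hh = dwt2_haar(tile)
print(ll.shape)    # (128, 128): most of the energy concentrates in this band
```

Iterating dwt2_haar on the LL band gives the L-level decomposition of figure 2.10(c).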

The decoding process is essentially the reverse of the encoding. JPEG2000, however, provides special file/stream formats that allow progressive transmission and decoding, as well as decoding parts of the stream independently.

2.4 Basis Dictionaries

2.4.1 Generating the basis dictionary

Assume we would like to create a basis dictionary that has the same performance as a weighted sum of cosine waves:

F(u) = C(u) \sum_{i=0}^{m-1} f(i) \cos\frac{(2i+1)\pi u}{2m} \qquad (2.8)

The actual basis function is

\cos\frac{(2i+1)\pi u}{2m} \qquad (2.9)

Each column a_k of the matrix A will contain the samples of the basis function (2.9) for a fixed frequency u = k.
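A short sketch of building such a dictionary: each column of A holds the samples of the basis function (2.9) for one frequency index k. The column normalization at the end is an added assumption, included because the selection algorithms of section 2.5 expect unit-norm basis vectors.

```python
import numpy as np

def cosine_dictionary(m):
    """m x m dictionary whose k-th column samples cos((2i+1) pi k / (2m))."""
    i = np.arange(m)[:, None]          # sample index (rows)
    k = np.arange(m)[None, :]          # frequency index (columns)
    A = np.cos((2 * i + 1) * np.pi * k / (2 * m))
    return A / np.linalg.norm(A, axis=0)   # normalize each basis vector

A = cosine_dictionary(64)
print(A.shape)                                        # (64, 64)
print(np.allclose(np.linalg.norm(A, axis=0), 1.0))    # True: unit-norm columns
```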

2.4.2 Properties of the dictionary

Several properties should be considered when selecting the basis functions for the dictionary. The dictionary properties may be:

1. Full space coverage
2. Time-frequency localization
3. Orthogonality
4. Orthonormality
5. Scale invariance
6. Shift/rotation invariance

Full space coverage

It is essential that the union of all dictionary basis functions cover the space occupied by the signal. Take, for example, the two-element dictionary shown in figure 2.11. No representation could be found to approximate a signal spanning the whole region, due to the lack of dictionary support in the ranges 1-3 and 8-10. A better dictionary with a larger number of elements is shown in figure 2.12.

Figure 2.11: Dictionary not covering the full signal space

Time-Frequency Localization

Extending the discussion of the previous section, the signal in figure 2.1 has several sharp transitions. In the time (or space) domain, this is called a local transition, while in the frequency domain, this is regarded as a high frequency component. The cosine basis dictionary shown in figure 2.13(a) may be suitable to express such transitions in the frequency domain. Low frequency waves will better represent background or smooth areas of an image block, while higher frequency waves will better represent sharp edges or transitions.

Figure 2.12: Dictionary covering the full signal space

However, the cosine basis functions fail to localize the actual location of the edges. The basis in figure 2.13(b) is better suited to describe a signal transition. We say that this basis provides compact support in the range 0-1.

Figure 2.13: Time-frequency localization. (a) Cosine basis; (b) Haar basis.

To obtain a good representation of transitions at different temporal/spatial locations of the signal, we need different shifts of the same basis

at multiple time/spatial locations (hence, full space coverage). Different functions give different responses, and hence different representations. The short time Fourier transform (STFT) and sinc functions give different time-frequency localizations of sinusoidal waves. Perhaps the most famous family of time-frequency basis functions are the wavelet basis. Any signal may be represented by a linear combination of a wavelet basis:

X(a, b) = \frac{1}{\sqrt{a}} \int x(t) \, \psi\!\left(\frac{t - b}{a}\right) dt \qquad (2.10)

This is the continuous wavelet transform (CWT) over the continuous domain variables (a, b). If a and b take discrete values a = c^k and b = c^k n, where k and n vary over the set of integers, we get the discrete wavelet transform (DWT). A further special case is the dyadic DWT (also called the wavelet series expansion), where c = 2 (i.e. a = 2^k, b = 2^k n):

x(t) = \sum_{k=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} c_{kn} \, \underbrace{2^{k/2} \psi(2^k t - n)}_{\psi_{kn}(t)} \qquad (2.11)

where c_{kn} is a coefficient. Different values of n give different shifts of the basis function, providing support for time localization. The parameter k provides different dilations of the basis function, hence multiple frequency responses. In the CWT case (equation (2.10)), a may be regarded as a scale factor, which varies the dilation (and, through the 1/\sqrt{a} factor, the amplitude) of the basis function. To illustrate the discussion, figure 2.14 shows different shifts and dilations of a Haar basis function. Increasing k squeezes the basis, while decreasing k dilates it. Changing n translates the basis. If we allow the basis function to vary over the complete signal space [-N, N] rather than [0, 1], we obtain the Haar wavelet basis.

Figure 2.14: Haar basis with different parameters. (a) ψ00; (b) ψ10; (c) ψ11; (d) ψ20; (e) ψ21; (f) ψ22; (g) ψ23.
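A sketch that samples the dyadic Haar atoms ψ_kn(t) = 2^(k/2) ψ(2^k t - n) of equation (2.11) on a uniform grid over [0, 1); the grid resolution and the particular (k, n) pairs generated are illustrative only.

```python
import numpy as np

def haar_mother(t):
    """Mother Haar wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    return (np.where((t >= 0.0) & (t < 0.5), 1.0, 0.0)
            - np.where((t >= 0.5) & (t < 1.0), 1.0, 0.0))

def haar_atom(k, n, num_samples=256):
    """Sampled psi_kn(t) = 2^(k/2) psi(2^k t - n) on a uniform grid over [0, 1)."""
    t = np.arange(num_samples) / num_samples
    return 2.0 ** (k / 2) * haar_mother(2.0 ** k * t - n)

# The atoms of figure 2.14: increasing k squeezes the basis, n shifts it.
atoms = {(k, n): haar_atom(k, n) for k in range(3) for n in range(2 ** k)}
print(sorted(atoms))    # [(0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2), (2, 3)]
```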

Orthonormal basis, orthogonal basis, and the L^2 space

The basis functions shown in figure 2.14 are all in L^2, the set of square integrable functions. The L^2 norm (or simply norm) is defined as

\|x(t)\|_2 = \left( \int |x(t)|^2 \, dt \right)^{1/2} \qquad (2.12)

Functions belonging to L^2[a, b] are zero outside the interval a \le t \le b. L^2(R), or simply L^2, contains functions that have support over all t. A sequence of linearly independent functions g_n(t) exists in L^2 such that any L^2 function x(t) can be expressed as

x(t) = \sum_n \alpha_n g_n(t) \qquad (2.13)

for a unique set of coefficients \alpha_n. This is exactly the equation we have in (1.1). We say that L^2 has an orthonormal basis. An interesting property of an orthonormal basis is that the coefficients \alpha_n can be obtained by a simple inner (dot) product:

\alpha_n = \langle x(t), g_n(t) \rangle \qquad (2.14)

The dot product \langle g_n(t), g_m(t) \rangle exists for any g_n(t) and g_m(t) in L^2. If

\langle g_n(t), g_m(t) \rangle = 0 \qquad (2.15)

then we say that the functions are orthogonal. If all pairs of functions in a set of basis functions are orthogonal, and \|g_n(t)\|_2 = 1 for all n, the set is called an orthonormal sequence.

Theorem. Let \{g_n(t)\}, 1 \le n \le \infty, be an orthonormal sequence in L^2. Define c_n = \langle x(t), g_n(t) \rangle for some x(t) \in L^2. Then the sum \sum_n |c_n|^2 converges, and \sum_n |c_n|^2 \le \|x(t)\|^2.

The above discussion gives rise to the Riesz-Fischer Theorem [42].

Riesz-Fischer Theorem. Let \{g_n(t)\}, 1 \le n \le \infty, be an orthonormal

sequence in L^2 and let \{c_n\} be a sequence of complex numbers such that \sum_n |c_n|^2 converges. Then there exists x(t) \in L^2 such that c_n = \langle x(t), g_n(t) \rangle.

This means that we can get an exact representation of the signal given a complete or overcomplete orthonormal dictionary. The L^2 space is more desirable than other L^p spaces: the concepts of orthonormality and inner products are undefined in L^1, and the Fourier transform has more time-frequency symmetry in L^2 than in L^1.

Shift, rotation and scale invariance

In image signals, parts of the signal may possess certain characteristics that make them easily represented with a small number of basis functions, thus achieving compression of the data [44]. These segments may be local to a certain subspace (in general signal processing terms, they are localized in time). Therefore, it is desirable to have multiple copies of the same basis functions, shifted in space (time), that describe this signal. This has already been demonstrated in the Haar basis dictionary (figures 2.14(a) to 2.14(g)): the basis functions ψ2n are the same basis but shifted in time. This means that even if the signal is shifted in space, there will be a matching basis that provides compact support for it. A similar situation occurs when considering rotation in 2D space. If a certain 2D image signal may be perfectly matched by a single basis, a minor rotation of this signal will require a linear combination of more than one basis to approximate the image. It would be advantageous to have different rotations of the same basis function in one dictionary. Sample rotations of the basis function shown in figure 2.15(a) are shown in figures 2.15(b) and 2.15(c).

Figure 2.15: 2D Gabor function

2.5 Selection algorithms

Basis selection has been widely discussed since 1992 [5]. It is mainly a subset selection problem where we require to select the optimal, or suboptimal, basis vectors from a large dictionary to best represent a signal. Several efforts in the literature compare such methods [46], [36], [11], [8], [31], [41], [7]. All

of these algorithms are greedy, and lead to sub-optimal solutions. It has been shown in [32] that finding the optimal solution to the subset selection problem is NP-complete. The algorithms differ in optimality, convergence, complexity, and basis selection criterion.

2.5.1 Method of Frames

The method of frames [46], [4], sometimes called the minimum length solution, selects the coefficients that minimize the l^2 norm:

\min \|x\|_2 \quad \text{subject to} \quad Ax = b \qquad (2.16)

It provides a unique solution x that solves the linear system of equations:

x = A^T (A A^T)^{-1} b \qquad (2.17)

However, this method suffers from the fact that it is not sparsity preserving. In brief, any basis in the dictionary that has a non-zero inner product with the signal b will contribute to the solution.

2.5.2 Matching Pursuit

Matching Pursuit was introduced by Mallat and Zhang in 1993 [27]. It aims to decompose the signal into a linear expansion of waveforms (basis

functions). It is an iterative greedy algorithm that selects the basis that best matches the current signal. A residual signal is calculated, and another iteration is performed to select the basis that best approximates the residual. The algorithm continues until the residual is below an acceptable error tolerance. Initially, the dictionary A should be normalized, and the residual R is set to b. The signal b can be represented as

b = \sum_{k=0}^{m-1} \langle R^k, a_k \rangle a_k + R^m, \quad \|R^m\| < \epsilon \qquad (2.18)

At the first iteration, r = 0, the dictionary columns satisfy \|a_k\| = 1, and R^0 = b. The algorithm selects the basis that maximizes the inner (dot) product of the dictionary and the residual,

\max_k \langle R^0, a_k \rangle \qquad (2.19)

The residual is then calculated as

R^1 = R^0 - \langle R^0, a_k \rangle a_k \qquad (2.20)

After r iterations, the selected index k is

\max_k \langle R^r, a_k \rangle \qquad (2.21)

and the residual is updated as

R^r = R^{r-1} - \langle R^{r-1}, a_k \rangle a_k \qquad (2.22)

so that

\hat{b} = \sum_{k=0}^{r-1} \langle R^k, a_k \rangle a_k + R^r \qquad (2.23)

Since the calculated residual is orthogonal to the selected dictionary element,

\|R^r\|^2 = \|R^{r-1}\|^2 - \langle R^{r-1}, a_k \rangle^2 \qquad (2.24)

therefore, to minimize the next iteration's residual, we need to select the basis function that maximizes the inner product with the current residual. Refer to algorithm A.1 for the actual algorithm.
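A compact sketch of the matching pursuit iteration described above (algorithm A.1 in the appendix is the authoritative version); it assumes a dictionary with unit-norm columns and stops either when the residual norm falls below the tolerance or after a fixed number of iterations.

```python
import numpy as np

def matching_pursuit(A, b, eps=1e-6, max_iter=100):
    """Greedy MP: repeatedly pick the atom with the largest |<R, a_k>|."""
    n = A.shape[1]
    x = np.zeros(n)
    R = b.copy()
    for _ in range(max_iter):
        if np.linalg.norm(R) <= eps:
            break
        corr = A.T @ R                 # inner products <R, a_k> for all atoms
        k = np.argmax(np.abs(corr))    # selection, equation (2.21)
        x[k] += corr[k]                # accumulate the coefficient of atom k
        R = R - corr[k] * A[:, k]      # residual update, equation (2.22)
    return x, R

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
A /= np.linalg.norm(A, axis=0)                          # unit-norm dictionary columns
b = A[:, [3, 40, 100]] @ np.array([2.0, -1.0, 0.5])     # signal built from 3 atoms
x, R = matching_pursuit(A, b)
print(np.count_nonzero(x), np.linalg.norm(R))
```

On such a test signal the residual typically shrinks quickly, although, as discussed next, the greedy choices are not guaranteed to be the sparsest ones.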

Despite its simplicity, the algorithm is sub-optimal. Several examples can be given that demonstrate that the greedy nature of the algorithm is not sparsity preserving. One such example is approximating a signal that is the superposition of two sinusoidal waves with a small frequency separation. Even using a 4-fold overcomplete sinusoidal dictionary, the algorithm fails to select the two basis functions that resemble the sinusoids: its first choice is the basis with the intermediate frequency of the two generating frequencies. The algorithm then iterates to correct this error, resulting in unnecessary non-zero coefficients. Two other examples are shown in [4].

Because of its greedy nature, the matching pursuit algorithm is not optimal. Several extensions to the algorithm have been made that enhance its performance (sparsity, reconstruction error), reduce its complexity, or lower its memory requirements. Mallat [3] discusses using the matching pursuit algorithm in an image compression system.

Quantized matching pursuit

The algorithms discussed so far, given an appropriate complete or overcomplete dictionary, result in a perfect reconstruction of the signal. The coefficients are usually sparse, resulting in signal compression. When we consider lossy compression of complete images, a significant further reduction is obtained by quantizing the resulting coefficients. However, quantization will result in distortion of the reconstructed signal. In matching pursuit, quantizing a coefficient results in a significant error, since subsequent coefficients are obtained from residuals calculated using the unquantized values. Quantized matching pursuit [21] prevents this error from propagating to all basis functions by quantizing the coefficient at every step of the matching pursuit algorithm, prior to calculating the residual. The results in [21] show improved performance over quantizing the coefficients after the matching pursuit has

terminated.

Evolutionary methods for selection

Figueras et al [14] have shown how to incorporate heuristic models to improve the performance of the matching pursuit algorithm. Genetic algorithms (GA) are used in the selection process. GAs converge to local minima of the search space (the combination of selected basis functions) through operators such as crossover and mutation on genes. Here, the genes are the basis functions. An example would be to use a Gabor dictionary, with parameters such as rotation, scale and shift as alleles in the genes. The disadvantage of the algorithm is that the convergence is merely statistical. However, it is of lower complexity than a brute-force search of the complete basis dictionary. Further discussions on evolutionary pursuit may be found in [16], [12].

Full search (brute force) matching pursuit

Figueras et al [14] also give results of applying a full search variant of the matching pursuit algorithm, where all possible combinations of basis functions are considered in the selection process. In such a case, the number of subsets that need to be examined is

N = \binom{n}{k} \qquad (2.25)

for selecting the best k vectors out of a dictionary of n basis vectors.

M:L Matching Pursuit

In the matching pursuit algorithm, we greedily select the basis that maximizes (or minimizes) a certain criterion. At each step, we calculate the criterion function for all dictionary elements, and once the selection has been made, all this information is discarded. The MP:K algorithm [37] makes use of this information by maintaining the best K basis functions at each iteration. This leads to a tree of selected subsets. To enable the algorithm to backtrack to another path in the tree, the residuals and norms need to be

stored at each node. This approach requires a large number of nodes, which grows exponentially as we increase the number of iterations r:

\#\text{Nodes} = \frac{K^r - 1}{K - 1}, \quad K \ge 2 \qquad (2.26)

Another approach, also described in [37], maintains the best K paths until a level L. At each level, after all nodes have been expanded, we keep only the best K nodes over the whole tree. Both of these approaches are still suboptimal, but lead to significantly better results. A similar approach called Partial Search is presented in [41].

2.5.3 Orthogonal Matching Pursuit

Although the basic matching pursuit guarantees asymptotic convergence, as shown in equation (2.24), it does not necessarily provide the optimal approximation with respect to the currently selected subset. This optimality is only achieved if the r-th residual is orthogonal to all the selected basis vectors (R^{(r)} \perp V^{(r)}, where V^{(r)} is the span of the selected subset at iteration r); note that even then the overall solution may still be suboptimal. Orthogonal matching pursuit (OMP) [26] is an attempt to improve the performance of the basic matching pursuit. At each step r, the algorithm solves the least squares problem [4]

\min \left\| b - \sum_{i=0}^{r} x_i a_i \right\| \qquad (2.27)

At each iteration, the coefficients are updated to ensure backward orthogonality of the current residual:

R^{(r)} \perp V^{(r)} \qquad (2.28)

Assume we have the current representation

b = \sum_{i=1}^{r} x_i^r a_i + R^{(r)} \qquad (2.29)

The superscript r of the coefficients shows the dependence of the coefficients on the current order. Suppose we would like to advance to the (r+1)-th model, with the representation

b = \sum_{i=1}^{r+1} x_i^{r+1} a_i + R^{(r+1)}, \quad \text{with } \langle R^{(r+1)}, a_i \rangle = 0, \; i \in \{1, 2, \ldots, r+1\} \qquad (2.30)

Since the basis functions are not required to be orthogonal, to perform such an update Pati et al [26] define an auxiliary model for the dependence of a_{r+1} on the previously selected a_i, i \in \{1, 2, \ldots, r\}:

a_{r+1} = \sum_{i=1}^{r} b_i^r a_i + \gamma_r, \quad \text{with } \langle \gamma_r, a_i \rangle = 0, \; i \in \{1, 2, \ldots, r\} \qquad (2.31)

where the b_i^r are intermediate variables. The update equations stated in [26] for the new coefficient and the older coefficients when upgrading to the (r+1)-th model are

x_{r+1}^{(r+1)} = \frac{\langle R^{(r)}, a_{r+1} \rangle}{\langle \gamma_r, a_{r+1} \rangle} = \frac{\langle R^{(r)}, a_{r+1} \rangle}{\|\gamma_r\|^2} = \frac{\langle R^{(r)}, a_{r+1} \rangle}{\|a_{r+1}\|^2 - \sum_{i=1}^{r} b_i^r \langle a_i, a_{r+1} \rangle}

x_i^{(r+1)} = x_i^{(r)} - x_{r+1}^{(r+1)} b_i^r, \quad i = \{1, 2, \ldots, r\} \qquad (2.32)

The b_i^r may be calculated as follows:

v_r = A_r b_r \qquad (2.33)

where

v_r = [\langle a_{r+1}, a_1 \rangle, \langle a_{r+1}, a_2 \rangle, \ldots, \langle a_{r+1}, a_r \rangle]^T, \quad b_r = [b_1^r, b_2^r, \ldots, b_r^r]^T

and

A_r = \begin{bmatrix} \langle a_1, a_1 \rangle & \langle a_2, a_1 \rangle & \ldots & \langle a_r, a_1 \rangle \\ \langle a_1, a_2 \rangle & \langle a_2, a_2 \rangle & \ldots & \langle a_r, a_2 \rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle a_1, a_r \rangle & \langle a_2, a_r \rangle & \ldots & \langle a_r, a_r \rangle \end{bmatrix}

Hence, the vector b_r may be obtained as

b_r = A_r^{-1} v_r \qquad (2.34)

Pati et al [26] further derive equations that reformulate the algorithm in a recursive manner that makes it more efficient.

2.5.4 Basis Pursuit

The basis pursuit algorithm assumes that the dictionary is overcomplete, so that many solutions exist. The algorithm aims to select the coefficients with the minimum l^1 norm:

\min \|x\|_1 \quad \text{subject to} \quad Ax = b \qquad (2.35)

Basis pursuit is similar to the Method of Frames algorithm described in section 2.5.1. The difference is that the selection criterion aims to minimize the l^1 norm rather than the l^2 norm, which leads to a sparser solution than the Method of Frames. Basis pursuit amounts to solving a convex linear programming problem. Reference [4] states that BP is an optimization principle rather than an algorithm: any linear programming solver may be used. One example is solving the standard form

\min c^T x \quad \text{subject to} \quad \Phi x = s, \; x \ge 0 \qquad (2.36)

In such a case, the following substitutions may be made:

c \Leftrightarrow (1, 1), \quad \Phi \Leftrightarrow (A, -A), \quad s \Leftrightarrow b \qquad (2.37)

where the coefficient vector is split into its positive and negative parts, x = u - v with u, v \ge 0. Here, the simplex method or the interior point method may be used.
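The standard-form substitution above can be handed to any linear programming solver. The sketch below uses scipy.optimize.linprog on a small toy problem, splitting x = u - v so that the non-negativity constraint of the standard form is satisfied; it illustrates the principle only and is not the solver used in the thesis.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 8, 32
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)
b = A[:, [2, 17]] @ np.array([1.5, -2.0])      # signal with a 2-sparse representation

# min ||x||_1  s.t.  Ax = b, written as  min 1^T [u; v]  s.t.  [A, -A][u; v] = b, u, v >= 0
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None))
x = res.x[:n] - res.x[n:]
print(np.flatnonzero(np.abs(x) > 1e-8))        # typically the two generating atoms
```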

2.5.5 Natarajan's Order Recursive Matching Pursuit

Natarajan [32] provides a novel algorithm for the solution of the problem stated in equation (1.2). It is very similar to the matching pursuit algorithm described in section 2.5.2 in that it greedily selects the basis function that best approximates the current residual signal. Initially, each column a_k in the dictionary A is normalized, the residual R^0 is set to b, and the set of selected indices Γ is empty: Γ = {}. At each iteration r of the algorithm, the index k is chosen as

\max_k \langle R^r, a_k^r \rangle, \quad k = 1, 2, \ldots, n, \; k \notin \Gamma \qquad (2.38)

\Gamma = \Gamma \cup \{k\}

where a_k^r is column k of the dictionary A at iteration r. The residual is then projected onto the subspace orthogonal to a_k^r:

R^{r+1} = R^r - \langle R^r, a_k^r \rangle a_k^r \qquad (2.39)

Then, each unselected dictionary element is projected onto the space orthogonal to a_k^r and normalized:

a_j^{r+1} = a_j^r - \langle a_k^r, a_j^r \rangle a_k^r, \quad j = 1, 2, \ldots, n, \; j \notin \Gamma \qquad (2.40)

a_j^{r+1} = a_j^{r+1} / \|a_j^{r+1}\|_2, \quad j = 1, 2, \ldots, n, \; j \notin \Gamma

The algorithm iterates until the norm of the residual R^r is less than a predefined error ε. The final step in the algorithm is the solution step, which solves for x:

Ax = b - R^{(r)} \qquad (2.41)

Natarajan [32] states that the maximum number of selected indices,

and hence the number of non-zero elements of the solution, is at most

18 \, \mathrm{Opt}(\epsilon/2) \, \|A\|_2^2 \, \ln\!\left(\frac{2\|b\|_2}{\epsilon}\right) \qquad (2.42)

where Opt(ε/2) denotes the fewest number of nonzero entries over all solutions that satisfy \|Ax - b\|_2 \le \epsilon/2. Details of the algorithm are presented in algorithm A.2.

2.5.6 Backward Elimination

The methods discussed so far are all forward selection algorithms. The backward elimination method is an example of a greedy backward selection method. It was first introduced by Harikumar et al and later improved by Reeves [39]; we describe the latter approach. The goal of the algorithm is to minimize the error in the least squares sense:

\|b - Ax\|_2^2 \qquad (2.43)

Starting with a non-sparse solution, the algorithm iteratively (and greedily) sets one of the coefficients to zero. Setting a coefficient to zero will clearly increase the least squared error; therefore, the criterion is to zero out the coefficient that increases the least squared error the least. Hence we need to minimize the least squares error

\|b - Ax\|^2 = \|b - A(A^T A)^{-1} A^T b\|^2 = b^T b - b^T A (A^T A)^{-1} A^T b \qquad (2.44)

Therefore, maximizing the term b^T A (A^T A)^{-1} A^T b will lead to minimizing (2.43). Another discussion of backward elimination using overcomplete dictionaries is given in [10], [9].

2.5.7 Forward-backward selection

Similar to an exhaustive search, this algorithm starts by selecting a random initial set of basis functions. At each iteration, the algorithm greedily adds to the selected set the dictionary basis function that most decreases the representation error, and then removes from that set the basis function whose removal results in the minimum error increase. It is essentially a marriage between a greedy forward selection algorithm and a greedy backward selection algorithm. An implementation of such an algorithm is presented in [2].

2.6 Comparisons

In this section we compare some of the selection algorithms from the previous sections. Figure 2.16 shows the standard image Nature used for the comparison (from the University of Southern California SIPI standard database, http://sipi.usc.edu/database/). The image was originally grayscale (512 x 512 pixels, 256 gray levels) and was resized to half its size (256 x 256 pixels) using bicubic interpolation. The following algorithms were used to compress the image:

- Greedy Algorithm (Greedy) - Algorithm A.3
- Matching Pursuit (MP) - Algorithm A.1
- Natarajan's Algorithm (ORMP) - Algorithm A.2
- Basis Pursuit (BP)
- Orthogonal Matching Pursuit (OMP)

The compression is due to the reduced number of basis functions required to represent the image; no quantization or entropy encoding was performed. The x-axis represents the percentage of basis functions selected, while the y-axis is the PSNR of the reconstructed image,

$$ \mathrm{PSNR} = 20 \log_{10}\!\left( \frac{255}{\sqrt{\tfrac{1}{N} \|R - I\|_2^2}} \right) \tag{2.45} $$

where $R$ is the reconstructed image, $I$ is the original image, and $N$ is the number of pixels.
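For reference, a small helper consistent with equation (2.45) for 8-bit images is sketched below; the mean-squared-error normalization over all pixels and the function name are assumptions.

    # Sketch: PSNR between an original image I and a reconstruction R (8-bit range).
    import numpy as np

    def psnr(I, R):
        mse = np.mean((np.asarray(R, dtype=float) - np.asarray(I, dtype=float)) ** 2)
        return float("inf") if mse == 0 else 20.0 * np.log10(255.0 / np.sqrt(mse))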

Figure 2.16: Original image Nature

Figures 2.17 and 2.18 show the reconstructed images after applying the matching pursuit and ORMP algorithms respectively. It is clear that increasing the number of selected basis functions per block increases the subjective quality of the images. Results for all algorithms are given in table 2.2, and a graph of the results is shown in figure 2.19. From the table, the greedy and ORMP algorithms clearly outperform the rest in terms of reconstructed PSNR. OMP performs slightly better than the basic matching pursuit, while the basis pursuit algorithm gives the poorest results.

Figure 2.17: Matching pursuit applied to the image Nature. (a) 5% of basis, (b) 10% of basis, (c) 50% of basis

Figure 2.18: ORMP applied to the image Nature. (a) 5% of basis, (b) 10% of basis, (c) 50% of basis

Figure 2.19: Comparison of Forward Selection Algorithms

% basis   Greedy     MP     ORMP     BP      OMP
   5       24.47    24.24   24.47   19.28   24.41
  10       28.09    27.44   28.09   21.69   27.93
  15       30.77    29.79   30.77   23.19   30.52
  20       33.23    31.83   33.23   24.58   32.87
  25       36.45    34.32   36.45   26.25   35.92
  30       38.96    36.14   38.96   27.45   38.25
  35       41.65    37.90   41.65   28.65   40.71
  40       44.57    39.64   44.57   29.79   43.34
  45       47.74    41.37   47.74   30.95   46.16
  50       52.54    43.65   52.54   32.56   50.26
  55       56.61    45.35   56.61   33.85   53.63
  60       61.21    47.05   61.21   35.21   57.39
  65       66.47    48.76   66.47   36.67   61.62
  70       72.63    50.44   72.63   38.24   66.34
  75       82.80    52.72   82.80   40.60   73.63
  80       92.57    54.41   92.57   42.51   80.17
  85      105.45    56.10  105.46   44.60   88.15
  90      124.11    57.78  124.11   47.03   98.46
  95      155.91    59.46  155.91   49.70  112.37

Table 2.2: Comparison of Forward Selection Algorithms (PSNR in dB for a given percentage of selected basis functions)

Chapter 3
Augmenting Dictionaries for Image Compression

3.1 Introduction

In this chapter we aim to test different combinations of dictionaries. Previous work tries different dictionaries when evaluating a new algorithm or test case, but no effort has been made to examine the effect of different dictionaries on the subset selection problem. Creating a dictionary for a certain application is still an open issue [34], [35], [29]; newer directions involve learning the dictionary coefficients [31]. A dictionary may provide an infinite number of basis functions. In this chapter we limit our dictionaries to a subset of well-known basis functions and try to find a correlation between the dictionary and the compression performance for a sample test signal. It should be noted, however, that the selection of a dictionary (or dictionaries) should incorporate knowledge of the domain of the signal to be represented. Natural images differ in character from line art and from medical images, and in the general signal processing setting, signals may range from well-behaved to random. It would be desirable for the dictionaries used to exhibit some of the properties described in section 2.4.2.

3.2 Dictionary partitions

Cosine packets. The cosine packet dictionary is an m x m dictionary of cosines of different frequencies, as represented in the DCT equations (2.4).

Figure 3.1: Cosine basis

Mexican hat wavelets (Mexihat). This dictionary is made up of 2D translations of the basis shown in figure 3.2.

Symmlets. This dictionary contains 2D translations and rotations of the Symmlet-4 wavelet.

Daubechies. This dictionary contains 2D translations and rotations of the Daubechies-4 wavelet.

Figure 3.2: Mexican hat wavelet
Figure 3.3: Symmlet-4 wavelet

Gaussian pulse. This dictionary is made up of pulses with different widths and different localizations of the pulse shown in figure 3.5.

Sinc function. This dictionary is made up of different dilations of the sinc function with different 2D translations.

Figure 3.4: Daubechies-4 wavelet
Figure 3.5: Gauss pulse

Geometrical. This dictionary consists of simple 2D planes at different positions and angles.

Figure 3.6: 2D Sinc basis
Figure 3.7: Geometrical basis
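As an illustration of how the partitions above can be combined into a single overcomplete dictionary, the sketch below builds two of the sub-dictionaries for an 8 x 8 block (a separable cosine basis and a set of Gaussian pulses) and stacks them column-wise with unit-norm atoms. The generator parameters (pulse widths, block size) and function names are assumptions for the sketch, not the exact settings used in the experiments.

    # Sketch: assembling an augmented dictionary from vectorized 2D atoms.
    import numpy as np

    def gaussian_atoms(n=8, widths=(0.5, 1.0, 2.0)):
        """2D Gaussian pulses at every pixel location and several widths."""
        atoms = []
        ys, xs = np.mgrid[0:n, 0:n]
        for w in widths:
            for cy in range(n):
                for cx in range(n):
                    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * w ** 2))
                    atoms.append(g.ravel())
        return np.array(atoms).T                    # one atom per column

    def dct_atoms(n=8):
        """Separable cosine (DCT-II) basis images for an n x n block."""
        k = np.arange(n)
        C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
        return np.kron(C, C).T                      # n*n atoms, one per column

    def augment(*subdicts):
        D = np.hstack(subdicts)
        return D / np.linalg.norm(D, axis=0)        # unit-norm columns

    # e.g. a Cosine + Gauss dictionary for 8 x 8 blocks:
    # D = augment(dct_atoms(8), gaussian_atoms(8))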

3.3 Results

A random signal was used to evaluate the performance of the different dictionaries. Natarajan's algorithm (Algorithm A.2) was applied. Results are shown in the table below.

Dictionary                                                     PSNR
Mexihat                                                         5.06
Symmlets                                                       14.81
Mexihat, Symmlets                                              14.87
Mexihat, db4                                                   15.12
db4                                                            15.23
Symmlets, Gauss                                                18.37
Mexihat, Symmlets, Gauss                                       18.37
Gauss                                                          18.38
Mexihat, Gauss                                                 18.41
db4, Symmlets                                                  18.99
Mexihat, db4, Symmlets                                         19.00
Mexihat, db4, Symmlets, Gauss                                  19.04
db4, Symmlets, Gauss                                           19.04
Mexihat, db4, Gauss                                            19.42
db4, Gauss                                                     19.42
Mexihat, Geometrical                                           19.80
Geometrical                                                    20.94
Gauss, Geometrical                                             25.34
Mexihat, Gauss, Geometrical                                    25.43
Symmlets, Geometrical                                          25.64
Mexihat, Symmlets, Geometrical                                 25.94
Sinc                                                           25.99
Mexihat, Sinc                                                  26.06
Mexihat, db4, Geometrical                                      26.39
Symmlets, Gauss, Geometrical                                   26.43
db4, Geometrical                                               26.46
Mexihat, Symmlets, Gauss, Geometrical                          26.67
db4, Symmlets, Geometrical                                     26.77
db4, Gauss, Geometrical                                        26.85
Mexihat, db4, Symmlets, Geometrical                            26.92
Mexihat, db4, Gauss, Geometrical                               27.26
Cosine                                                         27.35
db4, Symmlets, Gauss, Geometrical                              27.38
Cosine, Mexihat                                                27.39
Mexihat, db4, Symmlets, Gauss, Geometrical                     27.40
db4, Sinc                                                      28.11
Symmlets, Gauss, Sinc                                          28.11
Cosine, Geometrical                                            28.15
Sinc, Geometrical                                              28.16
Mexihat, Sinc, Geometrical                                     28.18
Mexihat, db4, Sinc                                             28.18
Symmlets, Sinc                                                 28.22
Mexihat, Symmlets, Gauss, Sinc                                 28.25
Cosine, Mexihat, Geometrical                                   28.28
Mexihat, Gauss, Sinc                                           28.37
Gauss, Sinc                                                    28.39
Mexihat, Symmlets, Sinc                                        28.47
db4, Symmlets, Sinc                                            28.49
Mexihat, db4, Symmlets, Sinc                                   28.61
db4, Gauss, Sinc                                               28.80
Cosine, Sinc                                                   28.81
Mexihat, db4, Gauss, Sinc                                      28.82
db4, Symmlets, Gauss, Sinc                                     28.87
Cosine, Symmlets                                               28.93
Cosine, Mexihat, Sinc                                          28.94
Mexihat, db4, Symmlets, Gauss, Sinc                            28.95
Cosine, Mexihat, Symmlets                                      29.09
db4, Sinc, Geometrical                                         29.31
Mexihat, db4, Sinc, Geometrical                                29.42
Cosine, db4, Symmlets                                          29.48
db4, Gauss, Sinc, Geometrical                                  29.52
Cosine, Gauss                                                  29.53
Symmlets, Gauss, Sinc, Geometrical                             29.54
Symmlets, Sinc, Geometrical                                    29.54
Mexihat, Symmlets, Sinc, Geometrical                           29.54
Mexihat, db4, Gauss, Sinc, Geometrical                         29.55
db4, Symmlets, Sinc, Geometrical                               29.56
Cosine, Mexihat, Gauss                                         29.57
Mexihat, Gauss, Sinc, Geometrical                              29.59
db4, Symmlets, Gauss, Sinc, Geometrical                        29.59
Gauss, Sinc, Geometrical                                       29.60
Mexihat, Symmlets, Gauss, Sinc, Geometrical                    29.61
Mexihat, db4, Symmlets, Gauss, Sinc, Geometrical               29.61
Cosine, db4, Sinc                                              29.62
Cosine, Mexihat, db4, Symmlets                                 29.65
Cosine, db4                                                    29.67
Mexihat, db4, Symmlets, Sinc, Geometrical                      29.68
Cosine, Mexihat, db4, Gauss, Geometrical                       29.71
Cosine, Mexihat, db4, Sinc                                     29.71
Cosine, Mexihat, Gauss, Geometrical                            29.71
Cosine, Sinc, Geometrical                                      29.71
Cosine, Mexihat, db4                                           29.72
Cosine, Mexihat, Sinc, Geometrical                             29.73
Cosine, Gauss, Geometrical                                     29.78
Cosine, Symmlets, Gauss                                        29.80
Cosine, Mexihat, Symmlets, Gauss                               29.81
Cosine, db4, Gauss, Geometrical                                29.82
Cosine, Mexihat, db4, Geometrical                              29.87
Cosine, db4, Symmlets, Gauss                                   29.88
Cosine, Mexihat, db4, Symmlets, Gauss                          29.88
Cosine, db4, Symmlets, Geometrical                             29.88
Cosine, db4, Geometrical                                       29.89
Cosine, Symmlets, Geometrical                                  29.90
Cosine, Mexihat, Symmlets, Geometrical                         29.92
Cosine, Mexihat, db4, Symmlets, Geometrical                    29.95
Cosine, Symmlets, Sinc                                         30.00
Cosine, Mexihat, Symmlets, Sinc                                30.16
Cosine, Mexihat, db4, Sinc, Geometrical                        30.16
Cosine, db4, Sinc, Geometrical                                 30.19
Cosine, db4, Symmlets, Sinc                                    30.25
Cosine, db4, Symmlets, Gauss, Geometrical                      30.25
Cosine, Mexihat, db4, Symmlets, Gauss, Geometrical             30.27
Cosine, Gauss, Sinc                                            30.27
Cosine, Mexihat, db4, Symmlets, Sinc                           30.28
Cosine, Symmlets, Gauss, Sinc                                  30.29
Cosine, db4, Gauss                                             30.31
Cosine, Mexihat, db4, Gauss, Sinc                              30.33
Cosine, db4, Gauss, Sinc                                       30.34
Cosine, Symmlets, Gauss, Geometrical                           30.36
Cosine, Mexihat, Symmlets, Gauss, Geometrical                  30.41
Cosine, db4, Gauss, Sinc, Geometrical                          30.42
Cosine, Mexihat, db4, Gauss                                    30.43
Cosine, Mexihat, Gauss, Sinc                                   30.45
Cosine, Mexihat, db4, Gauss, Sinc, Geometrical                 30.46
Cosine, Mexihat, Symmlets, Gauss, Sinc                         30.48
Cosine, Mexihat, db4, Symmlets, Gauss, Sinc                    30.54
Cosine, db4, Symmlets, Gauss, Sinc                             30.60
Cosine, Mexihat, db4, Symmlets, Sinc, Geometrical              30.72
Cosine, Gauss, Sinc, Geometrical                               30.74
Cosine, Mexihat, Gauss, Sinc, Geometrical                      30.75
Cosine, db4, Symmlets, Sinc, Geometrical                       30.76
Cosine, db4, Symmlets, Gauss, Sinc, Geometrical                30.89
Cosine, Mexihat, db4, Symmlets, Gauss, Sinc, Geometrical       30.91
Cosine, Mexihat, Symmlets, Sinc, Geometrical                   30.93
Cosine, Symmlets, Sinc, Geometrical                            30.93
Cosine, Symmlets, Gauss, Sinc, Geometrical                     30.99
Cosine, Mexihat, Symmlets, Gauss, Sinc, Geometrical            31.02

Table 3.1: PSNR values obtained for different dictionary combinations

3.4 Conclusion

From the results we see that, in general, increasing the number of basis functions in the dictionary increases the obtained PSNR. The DCT cosine basis gives comparatively high performance both alone and when augmented with other dictionaries, while the Mexican hat wavelets have the lowest performance. Some augmented dictionaries, even though they contain fewer sub-dictionaries than others, give an improved PSNR over them. It must be stated, though, that this may be due to the nature of the random test signal used. This analysis could be strengthened by applying the same test to more than one signal; signals of different natures should be examined, such as natural images, line art, and speech.

Chapter 4
Matching Pursuit with Simulated Annealing

4.1 Simulated Annealing

Simulated annealing [25], [24], [40] is a technique that has been used in search problems. It was originally adapted from the physical process of annealing, in which physical substances (e.g. metals) are melted, i.e. brought to a state of higher energy, and then gradually cooled into a solid, lower-energy state. It is desirable to reach a state of minimal energy; however, there is a probability that a transition to a higher energy state is made, given by the equation

$$ \rho = e^{-\Delta E / kT} \tag{4.1} $$

where $\Delta E$ is the (positive) change in the energy level, $T$ is the temperature, and $k$ is Boltzmann's constant. Therefore, the probability of a large energy increase is lower than that of a smaller increase, and the probability also decreases as the temperature declines. The annealing process is very sensitive to the cooling rate, called the annealing schedule. A rapidly cooled substance will exhibit large solid stable regions (but not necessarily the lowest energy content, hence a local minimum), while a slower schedule will lead to a uniform crystalline structure, corresponding to the minimum energy content (or a global minimum). However, once we start obtaining the desired crystalline

structure, we do not want to waste time, so we can increase the rate of the annealing schedule. There are no fixed rules for reaching an optimal annealing schedule; it is done purely empirically.

4.2 Subset selection from a large search space

For large search spaces, where performing an exhaustive search is infeasible, we may revert to greedy algorithms. Greedy algorithms, such as best-first or steepest descent, choose the best solution element that minimizes the cost function from the current state. This element is added to the selected elements, thus forming a new state. The algorithm iterates until we reach a goal state or the cost function cannot be decreased any more. In this sense, greedy algorithms are sub-optimal. A major flaw of such algorithms is their tendency to reach a local minimum rather than the global minimum (the optimal solution): once the algorithm reaches a local minimum, any newly added selection will increase the cost function, and the algorithm terminates.

Several enhancements are available that improve the performance of greedy algorithms. Simulated annealing is one of them, following the process of physical annealing. Simulated annealing allows the greedy algorithm, depending on certain conditions, to select a solution element that is not the best, or one that worsens the overall solution. This allows the algorithm to explore different areas of the search space and reduces the chances of falling into a local minimum. It is desirable that the probability of selecting a non-best element be higher at the early stages of the algorithm and lower when the solution starts to converge. It is also desirable that this probability be adaptive, in the sense that the current state plays a role in calculating it. The main equation for simulated annealing is given by

$$ \rho = e^{-\Delta E / T} \tag{4.2} $$

where $\Delta E$ is the change in energy and $T$ is the current annealing coefficient. After calculating $\rho$, we generate a random variable $p \in [0, 1]$. If the value of $p$ is less than $\rho$, we perform an annealing step, i.e. we allow the algorithm to make a non-optimal selection; otherwise, the algorithm proceeds in a normal greedy fashion.
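A minimal sketch of this acceptance test is shown below, assuming the energy change delta_e is taken as a nonnegative quantity and T is the current annealing coefficient; the function name is illustrative.

    # Sketch: simulated annealing acceptance test, equation (4.2).
    import math, random

    def take_annealing_step(delta_e, T):
        """True when a non-greedy (worsening) move should be taken."""
        rho = math.exp(-delta_e / T)   # small for large energy increases or low T
        return random.random() < rho   # accept the worse move with probability rho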

4.3 Matching Pursuit with Simulated Annealing

Image compression using overcomplete dictionaries is usually performed using greedy algorithms as described in section 2.5, which means that the resulting representations are sub-optimal. For a signal $b \in \mathbb{R}^m$ and an overcomplete dictionary $A \in \mathbb{R}^{m \times n}$, $m \ll n$, determining the optimal $p$ basis functions that represent $b$ is an NP-hard problem: an exhaustive search would require $\binom{n}{p}$ iterations, which is prohibitive for this application. By using the algorithms described in section 2.5 we reduce this complexity, but we may fall into a non-optimal solution, i.e. a local minimum. By using a technique based on simulated annealing, we allow the selection algorithm to explore a larger space. Following the discussion in the previous section, we are essentially designing an algorithm that is a marriage between a greedy forward selection algorithm, a greedy backward selection algorithm, and the concept of simulated annealing. An outline of the algorithm is shown in Algorithm 4.1 below.

Algorithm 4.1 Subset selection using simulated annealing
    Initialization: Γ ← {}, ξ ← ∞
    Subset selection phase:
    while ξ > ε do
        ρ ← exp(−ΔE / T)
        Generate random number α ∈ [0, 1]
        if α < ρ then
            Perform backward elimination algorithm
        else
            Perform forward selection algorithm
        end if
        Calculate new ΔE
        Update T
    end while

The forward selection step may be performed by any of the forward selection algorithms (matching pursuit, orthogonal matching pursuit, order recursive matching pursuit, ...). The backward selection may be performed by any of the backward selection algorithms. We are left with defining the energy change criterion ΔE and the annealing schedule T.

4.4 Algorithm requirements

In order to obtain a proper algorithm for the solution of the subset selection problem, the encoding algorithm should possess several properties:

1. The algorithm should terminate (converge) in a finite number of steps.

2. The algorithm should provide a solution that is better than (or in the worst case the same as) current selection algorithms in terms of the number of selected basis functions.

3. The algorithm should provide a solution that is better than (or in the worst case the same as) current selection algorithms in terms of the reconstructed signal quality.

4. The output of the algorithm should be produced in a way that minimizes the complexity of the decoder.

All of the forward selection algorithms discussed in section 2.6 are candidates for the selection method to which simulated annealing may be applied. Since the purpose of this research is to investigate the effect of combining the simulated annealing heuristic with a selection algorithm, the basic matching pursuit is of primary interest; it is also advantageous due to its simplicity and its low complexity. The other methods are modifications of the basic matching pursuit, so the effect of adding simulated annealing may be less evident than with matching pursuit, and they are much more complex in terms of computation and implementation. For the rest of this chapter we will use the basic matching pursuit

algorithm as the forward selection algorithm, and we will use the primary concept of backward elimination algorithms for the backward step.

4.4.1 Inputs

The inputs to the algorithm should be:

- The signal $b \in \mathbb{R}^m$ to be approximated.
- A dictionary $A \in \mathbb{R}^{m \times n}$, $m \ll n$, that is full rank. The dictionary elements are desired, but not required, to be affine in the sense of providing scale, rotation and translation invariance.
- An error tolerance $\epsilon$ for the reconstructed signal, or a compression factor $c$, which is the maximum percentage of coefficients used relative to the signal size $m$, i.e.

$$ \#\text{coefficients} = \frac{c}{100}\, m \tag{4.3} $$

4.4.2 ΔE

As described in section 4.1, we need to define a criterion for $\Delta E$. From equation (4.2), as $\Delta E$ increases, the probability $\rho$ decreases. We define $F_E$ as a function that calculates $\Delta E$. One choice for $F_E$ is

$$ F_E^{(r)} = \|R^{(r)}\|_2 \tag{4.4} $$

where $F_E^{(r)}$ is the value of $F_E$ after iteration $r$, and $\|R^{(r)}\|_2$ is the $\ell_2$ norm of the residual. Since the forward selection algorithm reduces the residual at each iteration, $F_E$ is a decreasing function in the number of iterations (note that $\|y\|_2 > 0$), thus the probability of performing a backward selection increases. This may be regarded as a correction phase towards the end of the algorithm, where several

backward removal steps are made to eliminate bad choices and improve the results. However, we need an annealing schedule that guarantees convergence of the algorithm. Another choice would be

$$ F_E^{(r)} = \|x^{(r)}\|_2 \tag{4.5} $$

where $\|x^{(r)}\|_2$ is the $\ell_2$ norm of the coefficient vector $x$. At each iteration we add a coefficient to the coefficient vector, so $F_E$ is an increasing function of the number of iterations. Since we need to explore the search space early in the selection process, and less often at the end, every addition to the coefficient vector increases the value of $\Delta E$ and hence decreases the probability of making a backward selection. This choice guarantees convergence of the algorithm.

Similar to the above functions, we may define two more that take the difference of the generating functions at the current and previous iterations:

$$ F_E^{(r)} = \left|\, \|R^{(r)}\|_2 - \|R^{(r-1)}\|_2 \,\right| \tag{4.6} $$

$$ F_E^{(r)} = \left|\, \|x^{(r)}\|_2 - \|x^{(r-1)}\|_2 \,\right| \tag{4.7} $$

The functions in equations (4.6) and (4.7) better represent the simulated annealing algorithm, since they represent a change in the energy. The absolute value is taken because backward elimination steps result in a negative change in energy.

4.4.3 Annealing schedule T

From equation (4.2), an increase in $T$ increases the probability $\rho$ until it saturates at a certain value. One possibility is

$$ T = \frac{k\, s}{r} \tag{4.8} $$

where $k$ is a constant ($> 0$) input to the algorithm. Also, since we may be performing backward elimination, the number of currently selected basis functions $s$ is

not necessarily equal to $r$. The more backward elimination steps we make, the smaller $T$ becomes, so the probability $\rho$ decreases and the algorithm will eventually terminate.

4.4.4 Initial Pursuit

Rather than activating the simulated annealing procedure immediately, we may opt to wait for a few forward iterations before allowing any backward elimination. The number of initial forward selections should not be so large that the algorithm terminates before the simulated annealing schedule is activated, nor so small that we risk losing the higher-energy coefficients. Waiting also reduces the execution time, since we do not iterate forward and backward for as long. This parameter, which we will call InitialCount, should be a percentage of the required number of basis functions.

4.5 Parameter Simulation

Several runs were made to select the $\Delta E$ function, as well as to experiment with the different parameters $k$ and InitialCount. Sixteen blocks along the diagonal of the standard Lena image (see figure 5.1(a)) were chosen, and four main runs were executed:

- $\Delta E$ given by equation (4.6)
- $\Delta E$ given by equation (4.6), saving the best result so far
- $\Delta E$ given by equation (4.7)
- $\Delta E$ given by equation (4.7), saving the best result so far

Each of these functions was tested at different compression ratios. For each compression ratio, different InitialCount values were tested, along with different values of $k$: $k \in \{0.1, 0.2, \ldots, 0.9\}$ and $k \in \{1, 2, \ldots, 10\}$. For each of these combinations, two runs were performed with different random number

generator seeds, and the maximum and average PSNR for each combination were calculated. When an update equation is used with saving of the best result so far, the algorithm keeps track of the error after each iteration; if a backward elimination step followed by a forward correction results in degraded performance, the algorithm discards this change and reverts to the basis selected before this step. This means that the basis functions selected after a backward/forward iteration are only changed if an improvement is made in terms of decreasing the error.

Table 4.1 shows the results of using the different $\Delta E$ update methods. The values show the number of runs in which the update method gives the maximum PSNR over all parameters ($k$, InitialCount) for each block; these values are summed in the last row. It is evident from the simulations that not saving the best values gives better performance than saving them. This may be because the saving variant reverts to the best saved basis so far whenever a slightly worse move is made, even though this extra exploration of the search space may lead to a more desirable minimum in the long run; letting the simulated annealing algorithm run naturally achieved better results. The table also shows that using update equation (4.6) gives better performance.

Fixing the $\Delta E$ function to that of equation (4.6), we examined the effect of the parameters of the simulation. A detailed description of the algorithm is given in Algorithm 4.2. For each combination pair {$k$, InitialCount}, two runs were executed at different compression ratios. The number of runs where the parameter $k$ achieved a higher PSNR than the MP algorithm is shown in table 4.2(a), and the number of runs where the InitialCount parameter (as a percentage of the desired compression ratio) exceeded the PSNR of the MP algorithm is shown in table 4.2(b). Finally, a table and graph of the combined $k$ and InitialCount pairs are given in table 4.3 and figure 4.1 respectively. From the results, we see that the best value for $k$ is $k = 0.9$. However, since the results are very close, it is best to set $k = 1$, which eliminates the extra multiplication required. The InitialCount variable gives the best results between 70% and 80%.

Block No.    I    II   III   IV
    1        5     1    1     1
    2        9     0    1     0
    3        9     1    1     2
    4        8     1    3     1
    5        4     6    0     0
    6        6     0    3     4
    7        9     2    1     2
    8        2     3    0     5
    9        0     7    0     3
   10        6     4    0     1
   11        3     6    0     1
   12        7     3    0     0
   13        4     4    0     2
   14        4     5    0     1
   15        3     7    1     2
   16        1     5    0     4
  Sum       80    55   11    29

Table 4.1: Comparison of ΔE update methods. I: equation (4.6); II: equation (4.6) with saving the best result; III: equation (4.7); IV: equation (4.7) with saving the best result

(a) k
  k     No.
 0.1    812
 0.2   1122
 0.3   1248
 0.4   1327
 0.5   1395
 0.6   1447
 0.7   1487
 0.8   1506
 0.9   1545
  1    1512
  2    1498
  3    1451
  4    1421
  5    1413
  6    1395
  7    1383
  8    1351
  9    1348
 10    1350

(b) InitialCount
 InitialCount    No.
     10         1922
     20         2858
     30         2357
     40         3439
     50         2201
     60         3051
     70         2537
     80         3558
     90         2054
    100         2034

Table 4.2: Number of runs exceeding the PSNR of MP for different parameters

k \ InitialCount   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
0.1                 69    82    61   108    79   110    83   105    66    49
0.2                100   110   103   150    96   119   110   164    93    77
0.3                 96   137    99   161   105   143   122   187   106    92
0.4                104   147   112   165   109   152   130   187   112   109
0.5                105   157   115   179   119   156   146   195   113   110
0.6                110   170   116   181   126   165   146   195   120   118
0.7                114   173   138   188   124   169   146   197   121   117
0.8                108   158   143   194   133   173   151   203   121   122
0.9                107   167   149   200   131   178   154   209   124   126
1                  107   166   150   192   125   179   145   201   124   123
2                  119   161   133   201   129   182   139   194   121   119
3                   92   161   135   193   117   182   145   199   116   111
4                  106   169   126   185   114   162   141   189   112   117
5                  108   157   135   195   107   168   132   194   105   112
6                  100   163   132   187   117   159   125   193   110   109
7                   98   150   129   187   122   164   133   189   100   111
8                   93   150   124   193   114   157   127   192    95   106
9                   93   136   130   185   118   167   132   189    99    99
10                  93   144   127   195   116   166   130   176    96   107

Table 4.3: Number of {k, InitialCount} combinations exceeding the PSNR of the MP algorithm

Figure 4.1: Graph of results in Table 4.3

Algorithm 4.2 Matching Pursuit with Simulated Annealing

Inputs:
    Φ : dictionary
    b ∈ R^{m×1} : signal
    ε : tolerance
    InitialCount : number of iterations before simulated annealing
    MaxBasis : maximum number of basis functions to select

Initialization:
    Γ ← {}, ξ ← ∞
    r ← 0, dropped ← {}, s ← 0

Initial matching pursuit phase:
    while r < InitialCount and ξ > ε do
        select k = arg max_k ⟨a_k, R^(r)⟩, k ∈ {1, 2, ..., n}
        x_k ← ⟨a_k, R^(r)⟩
        Γ ← Γ ∪ {k}
        R^(r+1) ← R^(r) − ⟨a_k, R^(r)⟩ a_k
        r ← r + 1
    end while

MPSA loop:
    while ξ > ε and s < MaxBasis do
        ρ ← exp(−ΔE / T)
        Generate random number α ∈ [0, 1]
        if α < ρ then                      {Backward elimination}
            minerr ← ∞
            for all i ∈ Γ do
                φ ← Φ(Γ − {i})
                temp ← φ† b
                err ← ‖b − φ temp‖ / m
                if err < minerr then
                    index ← i
                    minerr ← err
                end if
            end for
            if index ∉ dropped then
                dropped ← dropped ∪ {index}
                x_index ← 0
                s ← s − 1
                Γ ← Γ − {index}
                R^(r+1) ← b − Φ(Γ) x
            end if
        else                               {Forward selection}
            select k = arg max_k ⟨a_k, R^(r)⟩, k ∈ {1, 2, ..., n}
            x_k ← ⟨a_k, R^(r)⟩
            Γ ← Γ ∪ {k}
            R^(r+1) ← R^(r) − ⟨a_k, R^(r)⟩ a_k
            s ← s + 1
        end if
        ξ ← (1/m) Σ_{i=1}^{m} (R_i^(r+1))²
        ΔE ← | ‖R^(r+1)‖ − ‖R^(r)‖ |
        T ← k s / r
        r ← r + 1
    end while
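For readers who prefer working code to pseudocode, the following NumPy sketch mirrors Algorithm 4.2. It is illustrative only: the forward step uses absolute correlations, the backward criterion is computed with naive least-squares refits, s is taken to be the number of distinct selected basis functions, and small guards against division by zero are added. None of these details should be read as the exact implementation used for the experiments.

    # Sketch of Algorithm 4.2 (matching pursuit with simulated annealing).
    import numpy as np

    def mpsa(A, b, eps, initial_count, max_basis, k_const=1.0, seed=0):
        rng = np.random.default_rng(seed)
        A = A / np.linalg.norm(A, axis=0)          # unit-norm dictionary columns
        n = A.shape[1]
        x = np.zeros(n)
        selected, dropped = [], set()
        R = b.astype(float).copy()
        norms = [np.linalg.norm(R)]                # residual norms, for Delta-E of (4.6)
        r, s = 0, 0

        def forward_step():                        # one basic matching pursuit selection
            nonlocal R, s
            k = int(np.argmax(np.abs(A.T @ R)))
            proj = A[:, k] @ R
            x[k] += proj
            if k not in selected:
                selected.append(k)
                s += 1                             # s = number of currently selected basis
            R = R - proj * A[:, k]

        while r < initial_count and np.mean(R ** 2) > eps:     # initial pursuit phase
            forward_step()
            norms.append(np.linalg.norm(R))
            r += 1

        while np.mean(R ** 2) > eps and s < max_basis:         # MPSA loop
            delta_e = abs(norms[-1] - norms[-2]) if len(norms) > 1 else norms[-1]
            T = k_const * max(s, 1) / max(r, 1)                # annealing schedule (4.8)
            if rng.random() < np.exp(-delta_e / T):            # backward elimination
                best_err, drop = None, None
                for i in selected:                             # cheapest coefficient to zero
                    trial = [j for j in selected if j != i]
                    c, *_ = np.linalg.lstsq(A[:, trial], b, rcond=None)
                    err = np.mean((b - A[:, trial] @ c) ** 2)
                    if best_err is None or err < best_err:
                        best_err, drop = err, i
                if drop is not None and drop not in dropped:
                    dropped.add(drop)
                    selected.remove(drop)
                    x[drop] = 0.0
                    s -= 1
                    R = b - A[:, selected] @ x[selected]
            else:                                              # forward selection
                forward_step()
            norms.append(np.linalg.norm(R))
            r += 1
        return x, selected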

4.6 Modified Matching Pursuit with Simulated Annealing

In this section we propose a modification to the MPSA algorithm. A major disadvantage of the matching pursuit algorithm (section 2.5.2) is that the residual is orthogonal to the most recently added basis function, but not necessarily to all of the selected basis functions. This flaw was addressed by orthogonal matching pursuit (section 2.5.3). However, due to the added complexity of the orthogonalization step at each iteration, we only perform this step after the algorithm terminates:

$$ x = (\Gamma^T \Gamma)^{-1} \Gamma^T b \tag{4.9} $$

This ensures that the calculated coefficients correspond to the orthogonal projection of the signal onto all selected basis functions, so that no component of one coefficient overlaps with that of another. Quick test runs showed that this modification does indeed increase the PSNR of the reconstructed signal.

Following up on this modification, and because the matching pursuit algorithm is allowed to select the same basis function more than once, the step in equation (4.9) may produce several zero coefficients due to rank deficiency in Γ. This means that, for the same compression ratio, we can actually fit in more basis functions. To achieve this, in the forward selection phase (during the initial pursuit or after simulated annealing activation), the algorithm was not allowed to choose a basis function that had already been selected. If rank deficiency is still present (i.e. there are still some zero coefficients), more basis functions are selected until we reach the desired tolerance or the desired number of basis functions.
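The re-fit of equation (4.9) amounts to a least-squares projection onto the selected columns, as sketched below; np.linalg.lstsq is used here instead of forming the inverse of Γ^T Γ explicitly because it copes with the rank deficiency discussed above. The function and variable names are illustrative.

    # Sketch: post-termination coefficient re-fit, equation (4.9).
    import numpy as np

    def refit_coefficients(A, b, selected):
        G = A[:, selected]                              # matrix of selected basis functions
        coeffs, *_ = np.linalg.lstsq(G, b, rcond=None)  # projection of b onto span(G)
        x = np.zeros(A.shape[1])
        x[selected] = coeffs
        return x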

4.7 Results

Figure 4.2 shows the image Nature used in section 2.6 after compression with the proposed MPSA algorithm using the update equation $F_E^{(r)} = |\,\|R^{(r)}\|_2 - \|R^{(r-1)}\|_2\,|$ (4.6), and figure 4.3 shows the same algorithm using the update equation $F_E^{(r)} = |\,\|x^{(r)}\|_2 - \|x^{(r-1)}\|_2\,|$ (4.7). The parameters were set to the best values from the previous section, i.e. $k = 1$ and InitialCount = 0.7, meaning that InitialCount was set to 70% of the number of basis functions to be selected. The modified algorithm (M-MPSA) was applied to the same images: figure 4.4 uses the first update equation, while figure 4.5 uses the second. A table similar to that presented in section 2.6, now including the proposed algorithms, is given in table 4.4, and a graph comparing the proposed methods with the MP algorithm is shown in figure 4.6.

From the results, the MPSA algorithm with update equation (4.7) gives better results than with update equation (4.6). This contradicts the results in table 4.1, which may be because those earlier results were better suited to the particular blocks investigated in the parameter simulation. The M-MPSA algorithm provides superior results to the MPSA algorithm, and is comparable to ORMP and OMP at higher compression ratios.

%Basis   Greedy     MP    ORMP     BP      OMP      I      II     III     IV
   5      24.47   24.24   24.47   19.28   24.41   24.24   24.40   24.24   24.40
  10      28.09   27.44   28.09   21.69   27.93   27.44   27.75   27.44   27.75
  15      30.78   29.79   30.78   23.19   30.52   29.79   30.21   29.82   30.24
  20      33.23   31.83   33.23   24.58   32.87   31.83   32.39   31.92   32.44
  25      36.45   34.32   36.45   26.25   35.92   34.32   35.10   34.54   35.25
  30      38.96   36.14   38.96   27.45   38.25   36.14   37.11   36.53   37.37
  35      41.65   37.90   41.65   28.65   40.71   37.94   39.14   38.48   39.53
  40      44.57   39.64   44.57   29.79   43.34   39.69   41.14   40.36   41.64
  45      47.74   41.37   47.74   30.95   46.16   41.43   43.24   42.22   43.84
  50      52.54   43.65   52.54   32.56   50.26   43.74   46.01   44.60   46.78
  55      56.61   45.35   56.61   33.85   53.63   45.46   48.12   46.29   48.85
  60      61.21   47.05   61.21   35.21   57.40   47.24   50.36   48.10   51.21
  65      66.47   48.76   66.47   36.67   61.62   49.01   52.61   49.72   53.36
  70      72.63   50.45   72.63   38.24   66.34   50.85   54.92   51.42   55.68
  75      82.80   52.72   82.80   40.60   73.63   53.33   58.40   53.70   58.97
  80      92.57   54.41   92.57   42.51   80.17   55.22   61.11   55.34   61.55
  85     105.46   56.10  105.46   44.61   88.16   57.03   64.21   57.07   64.46
  90     124.11   57.78  124.11   47.03   98.46   58.84   67.38   58.82   67.68
  95     155.91   59.46  155.91   49.70  112.37   60.69   71.76   60.57   71.80

Table 4.4: Comparison of forward selection algorithms with the proposed algorithms. I: MPSA with update equation (4.6); II: M-MPSA with update equation (4.6); III: MPSA with update equation (4.7); IV: M-MPSA with update equation (4.7)

Figure 4.2: MPSA applied to the image Nature with update equation $F_E^{(r)} = |\,\|R^{(r)}\|_2 - \|R^{(r-1)}\|_2\,|$ (4.6). (a) 5% of basis, (b) 10% of basis, (c) 50% of basis

Figure 4.3: MPSA applied to the image Nature with update equation $F_E^{(r)} = |\,\|x^{(r)}\|_2 - \|x^{(r-1)}\|_2\,|$ (4.7). (a) 5% of basis, (b) 10% of basis, (c) 50% of basis

Figure 4.4: M-MPSA applied to the image Nature with update equation $F_E^{(r)} = |\,\|R^{(r)}\|_2 - \|R^{(r-1)}\|_2\,|$ (4.6). (a) 5% of basis, (b) 10% of basis, (c) 50% of basis

Figure 4.5: M-MPSA applied to the image Nature with update equation $F_E^{(r)} = |\,\|x^{(r)}\|_2 - \|x^{(r-1)}\|_2\,|$ (4.7). (a) 5% of basis, (b) 10% of basis, (c) 50% of basis

Figure 4.6: Comparison of MP with the MPSA algorithms: straight line (MP), X (MPSA with update equation (4.6)), circle (M-MPSA with update equation (4.6)), triangle (MPSA with update equation (4.7)), square (M-MPSA with update equation (4.7))

Chapter 5
Results with Quantization and Comparing to the DCT

In this chapter we compare matching pursuit with simulated annealing, modified matching pursuit with simulated annealing, basic matching pursuit, and the DCT. The proposed algorithms were run using the $\Delta E$ equations (4.6) and (4.7), without saving the best value, and with $k = 1$ and InitialCount = 0.7. Four standard images (figure 5.1) were used: color images were converted to grayscale, and all images were resized to 128 x 128 pixels using bicubic interpolation. The images were processed in 8 x 8 blocks. For each method, the DC component (or the first coefficient in the matching pursuit expansion) is differentially encoded, since it contains most of the information and minimum loss in these coefficients is desired. The first AC coefficient (AC1) is rounded to the nearest integer, and the rest of the coefficients undergo uniform scalar quantization with $2^l$ levels, $l \in \{4, 5, 6, 7\}$, within the range $[0, \mathrm{AC}_1]$. This approach to quantization was mainly inspired by work in [18], [33], [20].
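For one block of coefficients, the scheme just described can be sketched as follows; the handling of signs, the exact step size, and the variable names are assumptions for illustration, and no entropy coding or bit packing is shown.

    # Sketch: per-block quantization - differential DC, integer AC1, uniform 2**l levels.
    import numpy as np

    def quantize_block(coeffs, prev_dc, l):
        dc_code = int(round(coeffs[0] - prev_dc))          # differential DC
        ac1 = int(round(coeffs[1]))                        # first AC coefficient
        step = max(abs(ac1), 1) / (2 ** l)                 # uniform step over [0, |AC1|]
        indices = np.round(np.abs(coeffs[2:]) / step).astype(int)
        signs = np.sign(coeffs[2:]).astype(int)
        return dc_code, ac1, indices, signs, coeffs[0]     # last value: predictor for next block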

The first observation from the bit rate vs. PSNR graphs is that increasing the number of bits per coefficient, $l$, increases the bit rate; however, for the same bit rate, the obtained PSNR also increases. This is because more quantization levels reduce the distortion after reconstruction, and the result applies to all test images. We also notice that the proposed algorithms and the MP algorithm outperform the widely used DCT. The performance of the MP algorithm and the MPSA algorithm with both update equations is relatively similar: the MPSA algorithm either matches the PSNR of MP or gives a slight gain in performance. The M-MPSA algorithms achieve the best quality, with both update equations comparable.

In the analysis we provide a closer view of performance around the 1 bpp bit rate. The MPSA algorithm with update equation (4.6) behaves almost exactly like the MP algorithm, while using update equation (4.7) gives slightly higher quality, which is inconsistent with the results obtained in section 4.7. The similar performance is due to the fact that at such low bit rates the simulated annealing heuristics are not applied often enough, and almost the same basis functions as the greedy MP algorithm are used. For the Test Pattern image, the MPSA algorithm actually gives degraded performance compared to the MP algorithm, possibly because the uniform blocks in the image require a very small number of basis functions, so an averaging basis function may be eliminated in a backward elimination step; at higher bit rates the added number of basis functions overcomes this problem. As for the M-MPSA algorithm, it clearly outperforms the other methods for all images, especially at the higher bit rates. At 1 bpp, the M-MPSA update equation (4.6) gives higher performance for the image Lena, while for the rest of the images update equation (4.7) is better. The smallest improvement over the MP algorithm around the 1 bpp bit rate is for the Test Pattern image, which gains approximately 0.3 dB; for the other images we get an improvement of up to approximately 0.7 dB. The improvement of M-MPSA over the DCT ranges from 1.5 dB up to over 3 dB around 1 bpp, and is even greater at higher bit rates.

The improvement in PSNR quality is achieved at the price of higher complexity and an increased number of computations. Figures 5.3, 5.5, 5.7 and 5.9 give the number of dot products versus the bit rate; a closer view is also presented for the number of basis functions corresponding to approximately 1 bpp after quantization, $l = 4$. The number of dot products increases linearly with the required compression ratio for the DCT and MP algorithms. The

MP algorithm requires approximately 5% to 10% more computation than the DCT. The MPSA and M-MPSA algorithms using update equation (4.6) require slightly more computation than the MP algorithm at low bit rates; the number of computations increases at higher bit rates, but the quality also improves. Update equation (4.6) requires fewer computations than update equation (4.7), which shows that the second method performs more backward elimination steps than the first. The backward elimination steps described in section 2.5.6 are indeed computationally intensive, since candidate basis functions are eliminated one at a time and a least-squares error calculation is performed for each. However, these computations may be greatly reduced by using incremental updates and enhanced numerical methods, as described in [39] and [2].

Figure 5.1: Standard test images. (a) Lena, (b) Peppers, (c) Boat, (d) Test Pattern