Line Based, Reduced Memory, Wavelet Image Compression

Christos Chrysafis and Antonio Ortega
Integrated Media Systems Center, University of Southern California, Los Angeles, CA 90089-2564
chrysafi,ortega@sipi.usc.edu, Tel: 213-740-2320, Fax: 213-740-4651

(This work was supported in part by the National Science Foundation under grant MIP-9502227 (CAREER), the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, the Annenberg Center for Communication at the University of Southern California, the California Trade and Commerce Agency, and by Texas Instruments.)

Abstract

In this work we propose a novel algorithm for wavelet-based image compression with very low memory requirements. The wavelet transform is performed progressively, and we only require that a reduced number of lines from the original image be stored at any given time. The result of the wavelet transform is the same as if we were operating on the whole image; the only difference is that the coefficients of the different subbands are generated in an interleaved fashion. We begin encoding the (interleaved) wavelet coefficients as soon as they become available. We classify each new coefficient into one of several classes, each corresponding to a different probability model, with the models adapted on the fly for each image. Our scheme is fully backward adaptive and relies only on coefficients that have already been transmitted. Our experiments demonstrate that our coder remains very competitive with similar state-of-the-art coders such as [1, 2]. Note that schemes based on zerotrees or bitplane encoding basically require the whole image to be transformed (or else have to be implemented using tiling). These features make the algorithm well suited for a low-memory coding mode within the emerging JPEG 2000 standard.

1 Introduction

One of the main reasons to use linear transforms (wavelets, DCT, etc.) in image coding is the removal of the existing correlation between neighboring pixels. While wavelet transforms, as demonstrated by recent tests within the JPEG 2000 standardization process, seem to have somewhat better performance (due both to the transform and to the better data structures it enables), traditionally the Discrete Cosine Transform (DCT) has been the most widely used transform in image and video coding applications. A number of reasons explain the continued popularity of the DCT. In particular, very efficient implementations have been studied for a number of years [3], and its block-based operation makes the computation easy to parallelize. Another important reason for the DCT's dominance is that DCT-based schemes do not have very high memory requirements: effectively, a JPEG coder can operate with a single (or a few) 8x8-pixel blocks in memory at any given time. By comparison, efficient implementations of the wavelet transform have received less attention. The coding results of wavelet-based coders have recently outperformed those of DCT-based coders, but none of the proposed algorithms can operate in a reduced-memory mode, except by tiling the input image and running the encoder separately on each tile. While tiling is attractive in its simplicity, it has the drawback of introducing blocking artifacts, as well as reducing the coding efficiency if the tiles are small.

Our motivation in this paper is to study wavelet-based coders that enable a low-memory implementation, and to demonstrate that this mode of operation can be supported without significant loss in coding performance. For example, DCT implementations may require that only a stripe of the image be buffered, where the size of the stripe is typically eight lines. If the image data is received by the encoder line by line, the buffer needed is only 8 x X, where X is the width of the image, so memory requirements grow only with the width of the image rather than with its total size. This is very attractive for the compression of very large images acquired from scanners and for the printout of images, as needed for example in the facsimile industry [4]. Our goal is to design wavelet coders with similar memory characteristics.

The memory requirements of all state-of-the-art wavelet-based algorithms [1, 2, 5] are typically of the order of the size of the input image (or the input tile, if tiling is performed). The whole image is buffered in memory, a wavelet transform is computed, and the wavelet coefficients are then manipulated under the assumption that the encoder has access at any given time to every pixel of the original or the transformed image. (Bitplane-based approaches [5, 6] obviously require the encoder to access all the data, but even methods that do not involve bitplane coding may utilize global information from individual bands or from the whole image [1, 2].) Current algorithms can operate with reduced memory through tiling, but this obviously degrades the coding performance. Our proposed algorithm operates in one pass and requires only a small portion of the image to be available before coding starts. Its compression efficiency is comparable to that of other state-of-the-art algorithms such as [1, 2, 5], with differences in performance of around 0.5 dB. The price to be paid at this point is the loss of the embedding property, but that should not be a factor when memory at the encoder or decoder is at a premium.

The algorithm is based on a progressive implementation of a generic wavelet transform. Instead of performing, as is usually done, a wavelet transform on all rows and then on all columns, we propose to compute the transform one line at a time. When a new line is received, the wavelet transform in the horizontal direction can be computed right away.
Then, when a sufficient number of lines (given the filter size and the desired number of decompositions in the vertical direction) has been received, we can start computing the wavelet transform in the vertical direction. At this point we start generating coefficients from each of the wavelet subbands in an interleaved fashion. This not only allows limited-memory operation, it can also enable other features. For example, the filters can be changed for each line in the image. It may also be possible to detect that certain lines have characteristics different from those of natural images (for example, in a compound document some regions will contain scanned text), and the filtering itself may be turned off, so that grayscale regions and bilevel regions are treated differently as in [7]. As soon as wavelet coefficients are computed they are quantized with a simple uniform quantizer and transmitted. Context-based adaptive entropy coding analogous to that of [1] is implemented, the only major difference being that the coder does not have access to any global information about the image or the subbands, so several adjustments have to be made.

The paper is organized as follows. In Section 2 we describe the details of our low-memory implementation of the wavelet transform and analyze the memory requirements of the algorithm. The encoding process is described in Section 3, where we also discuss the adjustments made to accommodate low-memory operation. Coding results are given in Section 4, along with conclusions and ideas for future extensions.

2 Line-by-Line Wavelet Transform Implementation

Consider an image of size Y x X, where X is the size in pixels of each line and Y is the number of lines. We assume that the image is received by the encoder line by line. In what follows we give all buffer sizes in numbers of lines (i.e., a buffer of size B can store B lines). The basic idea is to perform part of the transformation after each line is received. In this paper we consider a dyadic wavelet decomposition with 5 levels along the horizontal direction. The filters used are of length 7 and 9 for the high pass and the low pass analysis filters respectively. We denote the maximum filter length by L and consider only odd-length filters, L = 2S + 1. Symmetric extension is used at the boundaries of each line, so no extra high frequencies are introduced. The one-dimensional transform can be implemented efficiently without significant memory overhead.

The novelty of our work comes from the introduction of a progressive process to perform the vertical filtering in the wavelet transform. Refer to Figure 1 and let l_i be the i-th input line. When each line is received, filtering along the horizontal direction is performed first (in our case with five levels of decomposition) and the resulting wavelet coefficients are stored in memory. At the beginning of the coding process, we receive input lines, perform the horizontal decomposition, and store the results in a buffer of size L = 9. When the (S + 1) = 5th line is received we can perform a symmetric extension inside the buffer, so that the buffer is full, i.e., it contains L lines, and we can start filtering in the vertical direction, since the length of the filters is nine. (This symmetric extension is used only at the beginning and at the end of the image, as in the usual implementation of the wavelet transform.)
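To make the line-by-line vertical filtering concrete, here is a minimal Python sketch of a single vertical decomposition level driven by a stream of (horizontally transformed) rows. It is a sketch under stated assumptions, not the authors' implementation: the paper fixes the filter lengths (9-tap low pass, 7-tap high pass), the L = 9 line rolling buffer, and the symmetric extension, while the specific taps (the usual Daubechies 9/7 analysis filters), the even/odd output phase, and all names are assumptions of the sketch; the image is assumed to have at least L rows.

```python
import numpy as np
from collections import deque

# Assumed Daubechies 9/7 analysis taps: the paper only fixes the filter
# lengths (9-tap low pass, 7-tap high pass), not the coefficients.
LP = np.array([0.026749, -0.016864, -0.078223, 0.266864, 0.602949,
               0.266864, -0.078223, -0.016864, 0.026749])
HP = np.array([0.091272, -0.057544, -0.591272, 1.115087, -0.591272,
               -0.057544, 0.091272])
S = 4                                   # half length, so L = 2S + 1 = 9

def vertical_level(row_stream):
    """One vertical decomposition level over a stream of image rows.

    At most L = 9 rows are resident at any time.  Once S + 1 = 5 rows
    have arrived, the buffer is filled by symmetric extension and the
    vertical filtering starts, as described in Section 2.  Yields
    ('low', line) for even output rows and ('high', line) for odd
    ones, already interleaved.
    """
    buf = deque(maxlen=2 * S + 1)       # rolling window of 9 rows
    out = 0                             # row index at the window centre

    def emit():
        nonlocal out
        win = np.stack(buf)             # shape (9, X)
        if out % 2 == 0:                # even rows carry the low pass
            yield 'low', LP @ win
        else:                           # odd rows carry the high pass;
            yield 'high', HP @ win[1:-1]  # centre the 7-tap filter
        out += 1

    for i, row in enumerate(row_stream):
        buf.append(np.asarray(row, dtype=float))
        if i == S:                      # mirror rows 1..S above row 0
            for r in list(buf)[1:]:
                buf.appendleft(r)
            yield from emit()
        elif i > S:
            yield from emit()
    tail = list(buf)                    # bottom boundary: the mirror of
    for k in range(1, S + 1):           # row Y-1+k is row Y-1-k
        buf.append(tail[-1 - k])
        yield from emit()
```

Cascading several vertical levels then amounts to feeding the 'low' lines of one instance into the next, while queuing the 'high' lines as in Figure 1.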

For one level of decomposition, the first output line will have low pass information and the second will have high pass information. When using more than one level of decomposition we will have more than two subbands. In general, if we use N levels of decomposition, the lines that are multiples of 2^N carry low pass information in the vertical direction; for the remaining lines, line i belongs to level n if it is a multiple of 2^n but not a multiple of 2^{n+1}, where we count the levels from zero: 0, 1, 2, ..., N-1 (refer to Figure 1 for the details of a three-level decomposition).

[Figure 1 appears here: a timing diagram of a three-level vertical decomposition, showing input lines 0-39 arriving in segments, the three filtering buffers with delays S, 3S and 7S, Queues 1 and 2, and the interleaved output lines of each subband.]

Figure 1: Wavelet decomposition with three levels. Each l_i represents an input line, and l_0 is the first line in the image. The total memory needed is 43 lines; the height of the image does not affect the memory needed for the implementation of the forward wavelet transform. The filter sizes are 7 and 9 taps for the high pass and low pass channels. The horizontal position of each block (line) represents the time at which it becomes available; thus the first output line is produced at time 28. We do not consider the memory needed for context modeling in this figure. The three buffers of size nine are used for filtering, and the two queues of sizes 4 and 12 buffer the high pass lines of levels 1 and 0 respectively, so that they can be interleaved at the output and memory at the decoder can be reduced.
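The interleaving rule can be made concrete with a small helper (hypothetical, not from the paper) that returns the vertical subband of an output row index:

```python
def vertical_band(i, N):
    """Subband of output row i after N vertical decomposition levels.

    Rows whose index is a multiple of 2**N carry the vertical low
    pass; otherwise row i belongs to high-pass level n when it is a
    multiple of 2**n but not of 2**(n + 1), levels counted 0..N-1.
    """
    if i % (1 << N) == 0:
        return ('low', N)
    n = (i & -i).bit_length() - 1   # largest n with 2**n dividing i
    return ('high', n)

# For N = 3 this reproduces the interleaving of Figure 1: rows 0, 8,
# 16, ... are low pass; odd rows are level-0 high pass; rows 2, 6,
# 10, ... are level 1; and rows 4, 12, 20, ... are level 2.
```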

Assume we have performed a one-level decomposition of an image: half of the output lines will have low pass information and the other half high pass information in the vertical direction. If we want to continue with a second level of decomposition, the low pass lines from the first level need to go into another buffer of size L, so we will need to wait until this new buffer is full before we start filtering. There is a clear trade-off between the number of levels of decomposition in the vertical direction and the memory requirements. Adding a second level of decomposition requires an output buffer (queue) to store the high pass lines from the first level. The high pass lines from the first level become available as soon as we start filtering at the first level, but this is not the case for lines from higher levels: we have to wait until filtering starts at the higher levels, so if we transmitted every line as soon as it became available we would end up with segments of consecutive lines from the same band. There is therefore a need for an output queue at the encoder side as well as the decoder side, for the sole purpose of synchronizing the lines. We could skip the output queue for the high pass lines and transmit every line the first time it becomes available, but this would give rise to further buffering at the decoder side. Our strategy tries to keep encoder and decoder symmetric in terms of the wavelet transform and the buffer sizes. From this point on we keep reading new lines, perform the horizontal and vertical transformations on the input lines, and send output lines to the quantizer side of the encoder.

It is important to note that, for a given choice of transform (filters and levels of decomposition), our approach yields the same transform coefficients as the usual method based on separable filtering of the complete image. The only differences are the reduced memory requirements and the fact that the wavelet coefficients are interleaved rather than grouped by subband. This conditions the way the encoder performs quantization and entropy coding, as discussed in Section 3.

The memory requirements of the algorithm depend on the length L of the filters, the width X of the image, and the number of levels N in the decomposition. Suppose the filter length is L = 9; then we need to buffer nine lines of the image in order to perform a one-level decomposition in the vertical direction. If we want to go further, to a second level of decomposition, we need two buffers of size nine, one for each level. The buffer size for filtering, T_F, is thus proportional to the number of levels N in the decomposition: T_F = N(2S + 1) = NL. If we take into account both encoder and decoder buffers, then, because of the delay between the acquisition of the high pass and low pass data, we also need another buffer (queue) of size S(2^n - 1) for each level after the first, at both encoder and decoder, where n is the level number. For N levels the total buffer (queue) size is T_D = sum_{n=0}^{N-1} S(2^n - 1) = S(2^N - N - 1).

[Figure 2 appears here: an X-by-Y image in which the past image data, the pixels currently held in main memory, and the data still to be encoded are marked.]

Levels   T_F   T_D   T_C   T
0        0     0     2     2
1        L     0     4     5+3S
2        2L    S     6     7+5S
3        3L    4S    8     9+10S
4        4L    11S   10    11+19S
5        5L    26S   12    13+36S
6        6L    57S   14    15+69S

Figure 2: Buffering strategy in the proposed scheme. The table gives the exact number of lines needed for each stage of the algorithm, namely the wavelet transform, T_F, the context modeling, T_C, and the buffering needed for synchronization, T_D, for odd-length filters, L = 2S + 1. T = T_F + T_D + T_C is the total buffer size. Note that the case of zero levels of decomposition would be equivalent to a context-adaptive coder such as JPEG-LS.
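The line budget is easy to tabulate. The helper below is a hypothetical illustration that follows the closed forms quoted in this section and in the caption of Figure 2:

```python
def buffer_lines(N, S):
    """Line budget for N vertical levels and half filter length S.

    Uses the closed forms from Section 2: T_F = N(2S + 1) for the
    filtering buffers, T_D = S(2**N - N - 1) for the synchronization
    queues, and T_C = 2N + 2 for context modeling (two lines for each
    of the N + 1 vertical subbands).
    """
    T_F = N * (2 * S + 1)
    T_D = S * (2 ** N - N - 1)
    T_C = 2 * N + 2
    return T_F, T_D, T_C, T_F + T_D + T_C

# For the 7/9-tap filters (S = 4) and N = 3 levels this gives
# T_F = 27, T_D = 16, T_C = 8.  Note that T_F + T_D = 43 lines, the
# total quoted in the caption of Figure 1 (which excludes the
# context-modeling lines).  None of these quantities depends on the
# image height Y, only on the width X through the line length.
```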

[Figure 3 appears here: (a) PSNR versus the number of vertical decomposition levels for Goldhill at 0.25 b/p, showing this work with and without DPCM on the low pass band against the best algorithm; (b) PSNR versus bit rate for Bike, showing this work against reference [1] with tiling.]

Figure 3: (a) PSNR for the image Goldhill versus the number of levels of decomposition in the vertical direction, at a bit rate of 0.25 b/p. We present results both with and without DPCM on the low pass subband. Four levels of decomposition give the best performance, and DPCM does help, especially when few levels of decomposition are used. We are about 0.5 dB off the best algorithm without memory constraints (top line). (b) Comparison between this algorithm and reference [1] using tiling, for the image Bike. The memory size is 87 lines for our algorithm, while the tile size for [1] is 128.

Additional buffers are needed for context selection, as described in Section 3: we need two lines of buffering for each subband, so for an N-level decomposition we need T_C = sum_{i=0}^{N} 2 = 2N + 2 lines. T_F, T_D and T_C together give the total memory size T needed for both wavelet decomposition and coding: T = T_F + T_D + T_C = S(2^N + N - 1) + 3N + 2. Figure 2 shows how the required memory grows with the number of levels in the decomposition.

Generally speaking, more levels of decomposition tend to result in better performance, up to some limit, as seen in Figure 3(a). The wavelet transform in the horizontal direction does not require any extra buffering, so several levels of horizontal decomposition can be accommodated without additional memory requirements. Filtering in the vertical direction, however, does require additional buffering when the data arrive line by line. Thus, if memory is a very strict constraint, it might be necessary to limit the number of vertical decomposition levels to, say, two or three. The performance with only two levels of vertical decomposition may still be reasonable if some form of decorrelation is applied to the low pass subband (for example, a simple DPCM algorithm). Alternatively, if we want to use many levels while avoiding the increased memory requirements, we can change the length of the filters at each level.

3 Coding

From the coding point of view, a drawback of our approach is that we do not have access to global information, e.g. the whole transformed image or even a complete subband, since all subbands are generated progressively. Thus we cannot make use of global statistics as is done, for example, in [1], and can only resort to online adaptation.

We tackle this problem by using a single uniform quantizer in all the subbands, for all the coefficients of the decomposition; our quantization is thus extremely simple. To achieve some coding gain we resort to the powerful tool of context modeling. Based on a few past coefficients within the same subband, we classify each new coefficient into one of several classes. (Coefficients in each subband are generated in raster scan order, so in any given band all coefficients prior to the current one are known, but only the neighboring ones need to be stored.) The number of classes is predefined and the classes are the same in all subbands, which helps the model of each class adapt very quickly. Even though using the same classes throughout the image may seem simplistic, the results are very encouraging. Intuitively, we characterize the distributions conditioned on a particular type of neighborhood (e.g., high energy vs. low energy), and the resulting models turn out to be very similar from subband to subband; that is, the same probability distribution can be expected whenever the neighborhood has high energy, regardless of the subband.

For each subband we keep the past two lines in memory for use in context modeling, as shown in Figure 4. Thus, when we encode a certain coefficient we also have access to past neighbors from the same band, and these neighbors are used for context modeling as follows. Let y = sum_i a_i |x_i|, where the x_i are the past quantized coefficients in the same band and the weights a_i are fixed, inversely proportional to the distance between x_i and the current coefficient. Based on the statistic y we classify the coefficient x to be transmitted into one of several classes. To each class corresponds a probability model that is used by an arithmetic coder. All models are kept up to date at both encoder and decoder based on the transmitted data, so no side information needs to be transmitted.

The classification based on the values of the statistic y is also very important to the algorithm's performance. We use the technique introduced in [1], which quantizes y with quantization intervals that grow exponentially away from zero. The value y = 0 leads to a special class. This is very important because, after quantization, it is highly likely that there will be many zeros in smooth areas, especially if the wavelet transform did a good job of decorrelating the input data. The number of different classes was fixed to 12.

[Figure 4 appears here: the causal neighborhood around the current coefficient, spanning positions n, n + d and n + 2d, with past pixels used for modeling, past pixels not in use, the current pixel, and the next pixels marked.]

Figure 4: Context modeling. For each subband in the vertical direction we keep a memory of two lines. Based on the previous two lines and the past data in the same line we compute a number, y, which determines the class to which the current coefficient belongs. Note that the spacing d depends on the subband, and all the coefficients in the figure belong to the same subband.
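The classification step can also be sketched compactly. The weighted-sum statistic, the dedicated zero class, the exponentially growing bins, and the class count of 12 come from the description above; the particular weights, bin edges, and names below are assumptions of this sketch:

```python
def context_class(neighbors, weights, num_classes=12):
    """Pick a context class from the causal neighborhood.

    y is the weighted sum of the magnitudes of already-coded neighbors
    in the same subband.  y == 0 gets a dedicated class (typical of
    smooth areas); positive values fall into bins whose width doubles
    away from zero, as in [1].
    """
    y = sum(w * abs(x) for w, x in zip(weights, neighbors))
    if y == 0:
        return 0                    # the special all-zero class
    k, edge = 1, 1.0
    while y >= edge and k < num_classes - 1:
        edge *= 2.0                 # assumed bin edges at 1, 2, 4, 8, ...
        k += 1
    return k

# Each class index selects one adaptive probability model for the
# arithmetic coder; encoder and decoder update the same model from the
# transmitted data, so no side information is ever sent.
```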

Due to the lack of global information, it is no longer possible (as would be required in [1]) to know the maximum and minimum values of the wavelet coefficients in each class or in each subband. One approach is to assume the maximum possible dynamic range and use large enough models, i.e. models with a sufficiently large number of bins. In this way the dynamic range covers all the values that may need to be encoded, but adaptation is likely to be slow. The other approach is to use escape symbols: a special symbol indicates that the maximum value representable in the current codebook has been exceeded, and the transmission of the escape symbol is followed by the transmission of the value in excess of that maximum. In other words, we use a small alphabet that includes an escape symbol; if a coefficient belongs to this small alphabet we transmit it directly, and otherwise we transmit the escape symbol followed by the coefficient, coded with a different codebook. For example, say we want to transmit values in the range 0 to 1023, i.e. a codebook of size 1024. We can instead use a codebook of size 9 = 8 + 1, with eight symbols corresponding to values and one representing the escape character. If a wavelet coefficient falls in the range 0-7 we transmit it directly; if not, we transmit 8 and follow with the transmission of the coefficient. Since wavelet coefficients tend to be very small, it is unlikely that we will have to transmit the escape character very often. Adapting the probabilities of the small codebook is important and can be done fast. The codebook of larger size will be used much less frequently, since the small one represents most of the data to be transmitted; in fact its symbols could even be sent without entropy coding, without much compromise in performance.

There is a question as to how much we should shrink the codebook, i.e. what portion of the initial codebook we should use. We have two conflicting requirements: we do not want to transmit the escape character too often, but we also do not want a large codebook for the first transmission. We can associate the boundary with the class into which each coefficient falls according to the context modeling: classes with high energy need the boundary shifted to the right, while classes with low energy (lots of zeros) need it shifted to the left. The boundary should be selected so that the probability of a coefficient falling above it is very small, while at the same time the range to its left is small enough to yield a coding gain. Our scheme of two codebooks plus an escape symbol has certain similarities to Golomb/Rice encoding, so the threshold selection can be formalized in a similar manner [8]. The structure of the coder also makes it suitable for Golomb/Rice codes, which would result in significant complexity reductions; further study in this direction is under way.

The majority of the wavelet coefficients will be zero, depending on the bit rate at which we are working. Whenever we run across a zero we enter run-length mode and encode the length of the run of zeros. This speeds up the coder: instead of encoding, say, 512 individual zeros, we send the length of the run of zeros encountered. For the entropy coding of the run lengths we do not use any context modeling, only an adaptive arithmetic coder; we have found experimentally that context modeling does not help in entropy coding the lengths of runs of zeros. At very low rates the introduction of run-length coding gives a significant speed advantage.
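The escape mechanism maps to code just as directly. The sketch below fixes the boundary at 8, as in the 0 to 1023 example above, and omits the per-class adaptation of the boundary; the names are illustrative:

```python
ESCAPE = 8              # small alphabet: the values 0..7 plus an escape

def split_symbol(value):
    """Encoder side: map a magnitude onto the small 9-symbol alphabet.

    Values 0..7 are coded directly with the fast-adapting small model;
    anything larger becomes (ESCAPE, excess), with the excess coded by
    the rarely used large codebook.
    """
    return (value, None) if value < ESCAPE else (ESCAPE, value - ESCAPE)

def join_symbol(symbol, excess):
    """Decoder side: invert split_symbol."""
    return symbol if symbol != ESCAPE else ESCAPE + excess

# Zero runs bypass this path entirely: a run of, say, 512 zeros is sent
# as a single run length, coded by a plain adaptive arithmetic model.
```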

4 Experimental results

Image       Rate   EZW [6]  SPIHT [5]  SFQ [9]  C/B [1]  EQ [2]  Line Based
Barbara     0.20   -        26.64      26.26    27.09    -       26.67
(512x512)   0.25   26.77    27.57      27.20    28.38    -       27.69
            0.50   30.53    31.39      31.33    32.22    -       31.45
            1.00   35.14    36.41      36.96    37.48    -       36.26
Lena        0.20   -        33.16      33.32    33.24    33.57   32.09
(512x512)   0.25   33.17    34.13      34.33    34.45    34.57   33.15
            0.50   36.28    37.24      37.36    37.59    37.68   36.52
            1.00   39.55    40.45      40.52    40.91    40.88   39.89
Goldhill    0.20   -        29.84      29.86    29.90    30.04   29.48
(512x512)   0.25   -        30.55      30.71    30.77    30.76   30.24
            0.50   -        33.12      33.37    33.43    33.42   32.86
            1.00   -        36.54      36.70    36.98    36.96   36.28
Bike        0.20   -        28.04      -        28.21    -       27.70
(2560x2048) 0.25   -        29.12      -        29.39    -       28.76
            0.50   -        33.00      -        33.40    -       32.54
            1.00   -        37.69      -        38.26    -       37.08

Table 1: Comparison (PSNR in dB) between our method and [6, 5, 9, 1, 2] for the images Barbara, Lena, Goldhill and Bike; the last image is part of the test set for JPEG 2000. We used five levels of vertical decomposition with the 7/9-tap filters; the results for reference [2] correspond to 10/18-tap filters. Our results are always better than baseline JPEG and in some cases outperform the zerotree algorithms in their basic form, but they do not outperform schemes such as [1].

Table 1 is given for comparison. In Figure 3(b) we compare the algorithm in this paper with the one in [1] using tiling, for the image Bike. Our algorithm requires 87 lines of buffering, while the tile size for the algorithm in [1] is 128 lines, so both algorithms use almost the same amount of main memory; but our algorithm is much faster, since it does not require any rate-distortion selection (the filters are again the 7/9-tap biorthogonal filters in both cases). Our algorithm outperforms the one in [1] at low rates; at higher rates the results are not as good but remain competitive. The reason for this rate dependence is that in this work we do not consider any kind of optimization in quantization or entropy coding: our objectives are speed and simplicity. Moreover, our algorithm introduces no blocking artifacts, since no tiling is used. Note that we use no sophisticated quantizer, perform no rate-distortion optimization, and need no training at any stage of the design; the only parameter of the algorithm is the quantization step size.

By varying the quantization step size Δ we have direct control over the PSNR: for a high-resolution uniform quantizer with step size Δ the MSE is Δ^2/12, so PSNR ~ 10 log_10 (12 MAX^2 / Δ^2), where MAX is the maximum possible value in the image data. The algorithm demonstrates that serial processing of the image data might be an interesting alternative to existing algorithms that require buffering of the whole image. The coder does not outperform the state-of-the-art coders, but it is competitive at a fraction of the required memory. It can be proposed as a low-memory mode for the compression of large images within the new JPEG 2000 standard.

References

[1] C. Chrysafis and A. Ortega, "Efficient Context-Based Entropy Coding for Lossy Wavelet Image Compression," in Proc. IEEE Data Compression Conference, Snowbird, Utah, pp. 241-250, 1997.

[2] S. M. LoPresto, K. Ramchandran, and M. T. Orchard, "Image Coding Based on Mixture Modeling of Wavelet Coefficients and a Fast Estimation-Quantization Framework," in Proc. IEEE Data Compression Conference, Snowbird, Utah, pp. 221-230, 1997.

[3] W. Pennebaker and J. Mitchell, JPEG Still Image Data Compression Standard. Van Nostrand Reinhold, 1994.

[4] ITU-T Recommendation T.4, Standardization of Group 3 Facsimile Apparatus for Document Transmission. ITU, 1993.

[5] A. Said and W. Pearlman, "A New, Fast, and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees," IEEE Trans. Circuits and Systems for Video Technology, vol. 6, pp. 243-250, June 1996.

[6] J. M. Shapiro, "Embedded Image Coding Using Zerotrees of Wavelet Coefficients," IEEE Trans. Signal Processing, vol. 41, pp. 3445-3462, December 1993.

[7] J. Liang, C. Chrysafis, A. Ortega, Y. Yoo, K. Ramchandran, and X. Yang, "The Predictive Embedded Zerotree Wavelet (PEZW) Coder, a Highly Scalable Image Coder for Multimedia Applications, Proposal for JPEG 2000," ISO/IEC JTC1/SC29/WG1 N680 Document, Sydney, November 1997.

[8] M. Weinberger, G. Seroussi, and G. Sapiro, "LOCO-I: A Low Complexity, Context-Based, Lossless Image Compression Algorithm," in Proc. IEEE Data Compression Conference, Snowbird, Utah, pp. 140-149, 1996.

[9] Z. Xiong, K. Ramchandran, and M. T. Orchard, "Space-Frequency Quantization for Wavelet Image Coding," IEEE Trans. Image Processing, vol. 6, pp. 677-693, May 1997.