946 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 19, NO. 4, APRIL /$ IEEE


946 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 19, NO. 4, APRIL 2010

Compress Compound Images in H.264/MPGE-4 AVC by Exploiting Spatial Correlation

Cuiling Lan, Guangming Shi, Member, IEEE, and Feng Wu, Senior Member, IEEE

Abstract: Compound images are a combination of text, graphics, and natural images. They present strong anisotropic features, especially in the text and graphics parts. These anisotropic features often render conventional compression inefficient. Thus, this paper proposes a novel coding scheme derived from H.264 intraframe coding. In the scheme, two new intramodes are developed to better exploit spatial correlation in compound images. The first is the residual scalar quantization (RSQ) mode, where intrapredicted residues are directly quantized and coded without transform. The second is the base colors and index map (BCIM) mode, which can be viewed as an adaptive color quantization. In this mode, an image block is represented by several representative colors, referred to as base colors, and an index map, which are then compressed. Every block selects its coding mode from the two new modes and the existing intramodes of H.264 by rate-distortion optimization (RDO). Experimental results show that the proposed scheme improves the coding efficiency by more than 10 dB at most bit rates for compound images and keeps performance comparable to H.264 for natural images.

Index Terms: Base colors and the index map, compound image compression, dynamic programming, residual scalar quantization.

I. INTRODUCTION

Besides natural images, millions of artificial visual contents are generated by computers every day, such as Web pages, PDF files, slides, online games, and captured screens. They are usually a combination of text, graphics, and natural images, which is why they are generally called compound images.
In particular, with cloud computing becoming more and more popular [1], compound images often need to be displayed on remote clients, wireless projectors, and thin clients. Some of these clients are unable to render them directly from files. Compressing and transmitting compound images provides a generic solution for these clients. However, with this solution, how to compress compound images efficiently has become a prevalent and critical problem.

Manuscript received January 20, 2009; revised October 13. First published December 15, 2009; current version published March 17. This work was done while C. Lan was with Microsoft Research Asia, Beijing, China. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Oscar C. Au. C. Lan and G. Shi are with the Department of Electrical Engineering, Xidian University, Xi'an, China (e-mail: lancuiling@see.xidian.edu.cn; gmshi@xidian.edu.cn). F. Wu is with Microsoft Research Asia, Beijing, China (e-mail: fengwu@microsoft.com). Color versions of one or more of the figures in this paper are available online. Digital Object Identifier /TIP

The state-of-the-art image compression standards (e.g., JPEG [2], JPEG2000 [3], and the intraframe coding of H.264 [4]) are all designed for natural images. The correlation among samples is mainly exploited by transforms (e.g., 2-D DCT or wavelet). Because of complexity, a 2-D transform is implemented as two 1-D transforms. Such a separable implementation cannot handle anisotropic correlation other than horizontal and vertical. Thus, approaches have been proposed to exploit the anisotropic correlation among samples efficiently. One category of these approaches performs directional prediction before the transform, as H.264 intraframe coding does [5]. After the directional correlation is removed, the prediction residues can be assumed to be isotropic again.
Another category applies directional transforms, such as directional wavelet [6], [7] and directional DCT [8], [9], which incorporate directional information into the transform to reduce the high-frequency energy in the transform domain. They significantly improve the coding performance on natural images with rich anisotropic correlations. However, none of these approaches is efficient enough when compressing compound images.

Taking the extreme anisotropic features of text and graphics into account, some schemes have been proposed for compound image coding. The general ideas can be categorized into three groups.

Image-coding-based approaches: They adopt conventional image coding schemes but improve the bit allocation between text/graphics and natural image areas, because the text/graphics areas are often blurred after compression [10], [11]. Thus, the quantization steps in text/graphics areas are decreased and more bits are allocated to them. For a fixed bit budget, this correspondingly decreases the bits available for coding natural image areas. Consequently, the overall quality after compression is still not good.

Layer-based approaches: They adopt the mixed raster content (MRC) image model for compression [12]-[17], where one compound image is decomposed into a foreground layer, a background layer, and a binary mask plane at block or image level. The mask plane indicates which layer each pixel belongs to and can be compressed by mature binary coding schemes, such as JBIG and JBIG2. The foreground and background layers are smoothed by data filling algorithms and then compressed by conventional image coding schemes. A detailed survey on the MRC schemes has been reported in [15]. It demonstrates significant gains over conventional image coding schemes. However, these approaches have several drawbacks. First, the performance is greatly influenced by segmentation.
Block-threshold and rate-distortion optimized methods are proposed to optimize the segmentation in [16] and [17]. Second, without special processing, the holes resulting from segmentation will deteriorate the coding performance. For that, iterative block filling in the spatial or transform domain is developed in [18]-[20]. Third, separately coding text colors in the foreground layer and text shapes in the mask plane will also hurt the coding performance. A shape primitive extraction and coding (SPEC) scheme is proposed in [21], where shape primitives containing both colors and shapes are losslessly compressed by a combined shape-based and palette-based coding algorithm.

Block-based approaches: They first classify blocks in compound images into different types according to their spatial properties [22]. Image features, such as histogram, gradient, and the number of colors, are often used for classification [23], [24]. Then blocks of different types are compressed by different coding schemes to better adapt to their statistical properties [23]-[25]. Considering the sparse histogram distribution of colors in text/graphics blocks, a novel method is proposed to represent a text/graphics block by several base colors and an index map. Furthermore, Ding et al. develop this method as an intracoding mode and incorporate it into H.264 [26]. Thus, the coding performance of that scheme on compound images is significantly improved.

The scheme proposed in this paper adopts the block-based architecture as a basis. However, our focus is how to better exploit the spatial correlation in compound images without transform. H.264 intraframe coding is taken as the benchmark to develop our scheme, where the mode-based design and the rate-distortion optimized mode selection provide an easy way to combine spatial-domain and transform-domain methods. Our main contribution is to develop a comprehensive and systematic coding scheme by fully taking the properties of compound images into account.
In our scheme, two new intramodes (RSQ and BCIM) are proposed to exploit the spatial correlation in compound images. In the RSQ mode, intraprediction residues are directly scalar quantized and coded by CABAC without transform. The idea of directly compressing intraprediction residues has been reported in [27] for lossless coding of natural images. In this paper, we study the lossy case, which is suitable for compound images, and how to select the quantization in the spatial domain. In the BCIM mode, the conversion from an image block to several base colors and an index map is extended to blocks of different sizes to adapt to the nonstationary properties of compound images. Moreover, the rate cost is considered in the proposed BCIM mode, and the number of base colors is determined in the sense of rate-distortion optimization. Following the coding of base colors, each index in the map is coded by context-based entropy coding. With the two new modes and the RDO mode selection method, the proposed scheme significantly outperforms all existing schemes on compressing compound images; meanwhile, it keeps performance comparable to H.264 on natural images.

The rest of this paper is organized as follows. Section II gives a brief overview of the proposed coding scheme. The two new intracoding modes are discussed in detail in Section III. The mode structure and the mode selection are also explained in that section. The typical experimental results presented in Section IV show the excellent performance of our scheme. Finally, we draw conclusions in Section V.

II. PROPOSED COMPRESSION SCHEME

We first analyze the properties of compound images before introducing our scheme. Unlike natural images, compound images have their own characteristics, especially in the text and graphics parts. To explain this clearly, Fig. 1 shows an exemplified block with the two letters P and o.

Fig. 1. Exemplified text block: (a) amplified block; (b) luminance sample values.
First, edges in compound images between letters and background are much sharper than those in natural images. Some edges have a transition of several pixels because of shadow effects, whereas the others do not have any transition. Second, the geometries of edges are usually complicated and irregular. They are difficult to predict along a certain direction. Third, the block has only a limited number of different sample values. For such text blocks, the intuitive expectation is that a traditional transform will fail to give a compact representation in the transform domain.

To verify this, we analyze the properties of text and graphics blocks quantitatively by introducing two features: the spectral activity measure (SAM) and the spatial frequency measure (SFM). Both are proposed in [28]. Let us denote an image block as $x(i,j)$, $1 \le i \le M$ and $1 \le j \le N$. $M$ and $N$ are the total numbers of samples in one column and one row, respectively. $F(u,v)$ is the signal corresponding to $x(i,j)$ in the frequency domain. SAM is a measurement of image predictability, and it is defined in the frequency domain as

$$\mathrm{SAM} = \frac{\dfrac{1}{MN}\sum_{u=1}^{M}\sum_{v=1}^{N}\left|F(u,v)\right|^{2}}{\left[\prod_{u=1}^{M}\prod_{v=1}^{N}\left|F(u,v)\right|^{2}\right]^{\frac{1}{MN}}} \tag{1}$$

In this paper, we use DCT instead of DFT to get $F(u,v)$ because it is the transform for compression in our scheme. SAM has a dynamic range of $[1, \infty)$. Lower values of SAM imply lower predictability. SFM indicates the activity level of an image. It is defined as

$$\mathrm{SFM} = \sqrt{R^{2} + C^{2}},\quad R = \sqrt{\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=2}^{N}\bigl[x(i,j)-x(i,j-1)\bigr]^{2}},\quad C = \sqrt{\frac{1}{MN}\sum_{j=1}^{N}\sum_{i=2}^{M}\bigl[x(i,j)-x(i-1,j)\bigr]^{2}} \tag{2}$$

$R$ and $C$ are defined as the row and column frequency, respectively. They indicate the variations between two rows and between two columns. An image with high activity in the spatial domain will have a large value of SFM. Images with small SAM and large SFM are usually difficult to compress by transform because they are not predictable.

With the two measurements, we analyze 3150 text/graphics blocks and 4552 natural image blocks of equal size. The SAM and SFM are calculated on each block. The histograms of SAM and SFM on those two types of blocks are depicted in Fig. 2. The SAM of the text and graphics blocks is concentrated in the low-value areas, which indicates a low predictability of such blocks; the SAM of the natural image blocks is scattered over a much wider range. In contrast, the SFM of the text/graphics blocks is scattered over high-value areas, which means high activity in the spatial domain, whereas the natural image blocks are much smoother and have a compact distribution of SFM in low-value areas. All these indicate that text and graphics blocks are hard to compress efficiently by transform coding compared with natural images.

Fig. 2. Histograms of SAM and SFM on text/graphics and natural image blocks.

If the transform is removed from the compression of text and graphics blocks, what techniques can be used to exploit the spatial correlations among samples? One way is to directly compress the prediction residues by entropy coding without transform. It has been successfully adopted in H.264 interframe coding in [29] and lossless intraframe coding in [27]. It has not yet been applied to lossy intraframe coding of natural images because the residues still have some correlations that can be further exploited by a transform. However, considering the extremely sharp edges and the surrounding shadows in text and graphics blocks as depicted in Fig. 1, it should be more efficient than transform coding. That motivates us to propose the spatial-domain residual scalar quantization method to compress them.

Moreover, we know that the number of colors in text and graphics parts is usually very limited compared with natural images, as shown in Fig. 1. In other words, text and graphics parts have much sparser histogram distributions. This property is utilized in our paper by converting a text/graphics block to several base colors and an index map. This leads to a reduced symbol set representing a block. The causal neighboring indexes are treated as contexts for index coding. This is the proposed base colors and index map method.

How can the two proposed methods be integrated into existing image coding techniques to efficiently compress various images? Fortunately, H.264 intraframe coding provides a flexible and efficient framework to accommodate them. The block diagram of the proposed scheme for coding compound images as well as natural images is depicted in Fig. 3. Similar to H.264, an input image is partitioned into regular and nonoverlapped blocks. There are three possible paths for coding each block. The first path is the existing method in H.264 intraframe coding. Intradirectional prediction and transform coding contribute much to the high compression efficiency on natural image parts. The second path is the proposed RSQ method. After the directional prediction, the residual is directly quantized and entropy coded. The third path is the proposed BCIM method. It does not use intraprediction. The input block is converted to a compact representation of several base colors and an index map. All three methods can be applied to block sizes of 16×16, 8×8, and 4×4.

Fig. 3. Block diagram of the proposed scheme.
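The two block measures used in the analysis of this section, SAM and SFM, can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the standard definitions from [28] with the DCT substituted for the DFT, the helper names (`dct2`, `sam`, `sfm`) and the two toy blocks are hypothetical, and the small constant `eps` is an assumption added here to keep the geometric mean finite when a coefficient is exactly zero.

```python
import math

def dct2(block):
    """Orthonormal 2-D DCT-II of a block given as a list of rows."""
    m, n = len(block), len(block[0])

    def scale(k, size):
        return math.sqrt((1.0 if k == 0 else 2.0) / size)

    coeffs = [[0.0] * n for _ in range(m)]
    for u in range(m):
        for v in range(n):
            acc = 0.0
            for i in range(m):
                for j in range(n):
                    acc += (block[i][j]
                            * math.cos((2 * i + 1) * u * math.pi / (2 * m))
                            * math.cos((2 * j + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = scale(u, m) * scale(v, n) * acc
    return coeffs

def sam(block, eps=1e-9):
    """Arithmetic mean over geometric mean of the squared DCT coefficients.
    eps is an assumption of this sketch, not part of the definition."""
    energy = [c * c + eps for row in dct2(block) for c in row]
    arithmetic = sum(energy) / len(energy)
    geometric = math.exp(sum(math.log(e) for e in energy) / len(energy))
    return arithmetic / geometric

def sfm(block):
    """Square root of the mean squared horizontal plus vertical differences."""
    m, n = len(block), len(block[0])
    row_sq = sum((block[i][j] - block[i][j - 1]) ** 2
                 for i in range(m) for j in range(1, n))
    col_sq = sum((block[i][j] - block[i - 1][j]) ** 2
                 for i in range(1, m) for j in range(n))
    return math.sqrt((row_sq + col_sq) / (m * n))

# Hypothetical 8x8 test blocks: a smooth ramp and a two-level text-like pattern.
smooth = [[100 + i + j for j in range(8)] for i in range(8)]
text_like = [[255 if (3 * i + 5 * j) % 7 < 3 else 0 for j in range(8)]
             for i in range(8)]

print("SAM:", sam(smooth), sam(text_like))
print("SFM:", sfm(smooth), sfm(text_like))
```

On these toy blocks the text-like pattern yields the smaller SAM and the larger SFM, which is the qualitative behavior reported for text/graphics blocks in Fig. 2.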
The mode selection is completed by an RDO algorithm similar to that in H.264. In addition, no matter which mode is selected, the reconstructed samples will update the reference buffer as the reference for subsequent samples.

III. DETAILS OF THE COMPRESSION SCHEME

To adapt to H.264 intraframe coding, the two proposed methods are developed as two intracoding modes: RSQ and BCIM. They are discussed in detail in this section.

A. RSQ Mode

For text and graphics blocks containing edges of many directions, as shown in Fig. 1, intraprediction along a single direction cannot completely remove the directional correlation among samples. After intraprediction, residues still preserve strong anisotropic correlation. In this case, it is not efficient to perform a transform on them. One method is to skip the transform and directly code the prediction residues, which is similar to traditional pulse-code modulation (PCM). However, the question is whether the performance of PCM is better than that of a transform for text and graphics residual blocks. To answer it, we introduce the method proposed in [30], [31] to analyze the coding gain of PCM over a transform. Given the same rate $R$, the coding gain is defined as the ratio of the distortions on transform coefficients and residual samples, respectively, as

$$G = \frac{D_{T}}{D_{PCM}} \tag{3}$$

$D_{PCM}$ is the distortion of PCM and $D_{T}$ is the distortion on transform coefficients. We assume the distortions result from the optimal quantization. Considering a stationary source, each sample will have the same variance $\sigma_x^2$. $D_{PCM}$ is then equal to

$$D_{PCM} = \varepsilon_x^{2}\,\sigma_x^{2}\,2^{-2R} \tag{4}$$

$\varepsilon_x^2$ is a factor that depends on the probability distribution function $p_x$ of the (unit-variance) signal, and it is defined as

$$\varepsilon_x^{2} = \frac{1}{12}\left[\int_{-\infty}^{\infty} p_x(u)^{1/3}\,du\right]^{3} \tag{5}$$

$D_{T}$ is equal to

$$D_{T} = \left[\prod_{k=1}^{N}\varepsilon_k^{2}\,\sigma_k^{2}\right]^{\frac{1}{N}} 2^{-2R} \tag{6}$$

$\varepsilon_k^2$ is a factor that has a similar form to (5) and depends on the probability distribution of the $k$th transform coefficient, while $\sigma_k^2$ is its variance. $N$ is the number of transform coefficients in a block, and it equals 64 for 8×8 blocks.
Substituting (4)-(6) into (3), we have

$$G = \frac{\left[\prod_{k=1}^{N}\varepsilon_k^{2}\,\sigma_k^{2}\right]^{\frac{1}{N}}}{\varepsilon_x^{2}\,\sigma_x^{2}} \tag{7}$$

Thus, the coding gain can be obtained numerically in a statistical sense. If $G$ is larger than 1, nontransform coding is more efficient than transform coding. To statistically investigate the coding gain of PCM over a transform on text and graphics residual blocks, we collect 8297 residual blocks of size 8×8 from text and graphics parts. Taking DCT as the transform, we get the coding gain numerically by (7); it is 1.96 (larger than 1). That indicates nontransform coding is more efficient than DCT transform coding on text and graphics residual blocks. Thus, the residual scalar quantization method is developed to directly compress them.

In the RSQ mode, eight prediction directions, as illustrated in Fig. 4(a), are designed to get better residues in the rate-distortion sense. Different from H.264, the residual quantization and sample reconstruction are done pixel by pixel within a block in our scheme. The reconstructed pixel can be used for the prediction of the next pixel along the same direction. The use of intrablock reconstructed pixels for prediction is based on the assumption that the shorter the prediction distance is, the stronger the correlation will be. Taking direction 4 as an example, as shown in Fig. 4(b), the pixel marked by b that is to be predicted will use the nearest reconstructed integer pixel a for prediction instead of the filtered boundary reconstructed sample M. For a given prediction direction, only the nearest reconstructed integer pixel along that direction, without filtering, is used for prediction. The reason is that filtering the reconstructed pixels would blur sharp edges in text and graphics blocks and decrease the prediction accuracy. Consequently, in directions 5, 6, 7, and 8, the nearest reconstructed integer pixel chosen for prediction lies farther from the current pixel than a combination of reconstructed pixels would. After directional intraprediction, quantization is performed on the residues.

Fig. 4. Intraprediction in RSQ mode. (a) Eight prediction directions with direction 2 (DC mode) as an exception. (b) Example of intraprediction along direction 4.
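The pixel-by-pixel prediction loop of the RSQ mode can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the paper's design: only a single horizontal direction over one row is shown, the quantizer is a plain dead-zone quantizer with a fixed rounding offset of 1/3 standing in for the adaptive quantizer of the scheme, reconstruction is a simple multiplication by the step, and the function names are hypothetical.

```python
import math

def deadzone_quantize(residual, step, offset):
    """Dead-zone plus uniform threshold quantizer with a fixed offset."""
    sign = -1 if residual < 0 else 1
    return sign * int(math.floor(abs(residual) / step + offset))

def rsq_code_row(samples, left_boundary, step, offset=1.0 / 3.0):
    """Code one row along the horizontal direction, pixel by pixel.

    Each sample is predicted from the nearest reconstructed pixel to its
    left, so the encoder tracks exactly what a decoder would reconstruct.
    """
    levels, recon = [], []
    reference = left_boundary
    for sample in samples:
        level = deadzone_quantize(sample - reference, step, offset)
        value = max(0, min(255, reference + level * step))  # clip to 8 bits
        levels.append(level)
        recon.append(value)
        reference = value  # the reconstructed pixel predicts the next one
    return levels, recon

# Hypothetical text-like row: sharp two-level edges, no transition pixels.
row = [255, 255, 0, 0, 255, 255, 0, 0]
levels, recon = rsq_code_row(row, left_boundary=255, step=8)
print(levels)
print(recon)
```

Because the prediction uses reconstructed rather than original neighbors, quantization error does not accumulate along the row, and in this toy row the only nonzero levels sit at the edges.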
Similar to the method proposed in [29] and [32], which has been integrated into the KTA software [33], the quantization in our coding scheme is a form of dead-zone plus uniform threshold quantization. It is described by a function of the residual signal $r$ as

$$q = \operatorname{sign}(r)\left\lfloor \frac{|r|}{\Delta} + \delta \right\rfloor \tag{8}$$

$\Delta$ is the quantization step and $\delta$ is the adaptive rounding offset value. Parameter $\delta$ controls the width of the dead zone, which is equal to $2(1-\delta)\Delta$. $\lfloor z \rfloor$ is used to get the largest integer that is not greater than $z$. In order to maintain an equal expected value before quantization and after inverse quantization, the rounding offset is adaptively updated within a block as (9), using a positive weighting factor and the pixel number in that block; the offset is initialized to zero at the beginning of the quantization of each block. The reconstruction of $r$ is given by the inverse quantization function (10), which rounds its result to the nearest integer and includes a parameter related to the nonzero region offset. Note that if $q$ is zero, the reconstructed residue is zero too. To balance the rate with distortion, these parameters all change with the quantization parameter in H.264.

In the entropy coding part, quantized residues are compressed in a manner similar to the coding of DCT transform coefficients. However, they use separate context models because their statistical distributions differ. Since the energy is not compacted into the upper-left corner, the quantized samples are scanned line by line rather than in zigzag order. In addition, we find that nonzero quantized residues usually have the same amplitude within a text/graphics block. They are also bit-consuming because of their large amplitudes. Therefore, we use a flag to indicate whether the current quantized residual is the same as the former one. If it is the same, only the flag needs to be coded for that quantized residual.

B. BCIM Mode

Having limited colors but complicated shapes is another property of the text and graphics parts of compound images. Such text/graphics blocks can be expressed concisely by several base colors together with an index map.
This is somewhat like color quantization, which is the process of choosing a representative set of colors to approximate all the colors of an image [34]. In the BCIM mode, we first get the base colors of a block by using a clustering algorithm. All the base colors constitute a base color table. Then, each sample in the block is quantized to its nearest base color. The index map indicates which base color is used by each sample. Different from color quantization, each text/graphics block, rather than an entire image, has its own base colors and index map in our scheme. Thus, the representation is content adaptive for each block. In addition, since the number of base colors of a block is small, fewer bits are required to represent each mapped index. Let us take the luminance plane of a text/graphics block as an example, with each sample expressed by 8 bits. If four base colors are selected to approximate the colors of that block, only two bits are required to represent each sample's index without compression. Thus, the total number of bits required to represent the block is reduced to roughly a quarter, saving almost 3/4 of the bits.

Taking the chrominance components of a block into account, the base colors and index map representation can also be applied to a vector block consisting of three color components. In this case, each base color is a vector, but only one index map is required for that vector block. For convenience, we refer to the representation on a vector block as Dim3 and to the representation on luminance only, with chrominance components coded by the conventional method, as Dim1. The Dim3 case is efficient for blocks having high correlation among the three color components, and the Dim1 case is efficient for those with low correlation among them.
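As a small illustration of the Dim1 representation, the following Python sketch maps each luminance sample to its nearest base color and counts the uncompressed bits before and after. The block values, the base color table, and the function names are hypothetical; in the scheme itself the base colors come from clustering and the indexes are further entropy coded, neither of which is shown here.

```python
def build_index_map(block, base_colors):
    """Replace every sample by the index of its nearest base color."""
    def nearest(sample):
        return min(range(len(base_colors)),
                   key=lambda k: abs(sample - base_colors[k]))
    return [[nearest(s) for s in row] for row in block]

def reconstruct(index_map, base_colors):
    """Decoder side: look each index up in the base color table."""
    return [[base_colors[k] for k in row] for row in index_map]

def raw_bit_counts(num_samples, num_colors, sample_bits=8):
    """Uncompressed sizes: original samples vs. color table plus index map."""
    index_bits = (num_colors - 1).bit_length()
    original = num_samples * sample_bits
    represented = num_samples * index_bits + num_colors * sample_bits
    return original, represented

# Hypothetical 4x4 luminance block and a four-entry base color table.
block = [[20, 22, 200, 201],
         [21, 90, 199, 255],
         [20, 91, 200, 254],
         [22, 21, 201, 200]]
base_colors = [21, 90, 200, 255]

index_map = build_index_map(block, base_colors)
print(index_map)
print(reconstruct(index_map, base_colors))
print(raw_bit_counts(16 * 16, 4))  # e.g. for an assumed 16x16 block
```

For an assumed 16×16 block with four base colors, the count is 2048 bits for the raw samples against 544 bits for the table plus the two-bit indexes, which is consistent with the "almost 3/4" saving quoted above.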

In this mode, there are several key problems to solve. First, base colors should be chosen to represent a block well. Second, after the base colors are selected, there should be a well designed entropy coding method to code the base colors and the index map. If the coding method has been decided and is good enough, the problem is then simplified to searching for the best base colors for that block.

Base colors involve two aspects: the base color number $K$ and their values $c_k$, $1 \le k \le K$. $c_k$ is a scalar in the Dim1 case and a vector in Dim3. In general, the larger the base color number is, the smaller the distortion of the approximation will be; but more bits will be spent on coding the base colors and the index map because both the symbol set and the number of base colors grow. Thus, a trade-off between distortion and rate should be made. Taking the minimization of the rate-distortion cost (Lagrangian cost) as the optimization objective, we can formulate it as the problem of getting the optimal base color number $K^{*}$ as

$$K^{*} = \arg\min_{1 \le K \le K_{max}} \bigl[ D(K) + \lambda R(K) \bigr] \tag{11}$$

An upper limit $K_{max}$ on the base color number is set to restrict the symbol set size of indexes globally. It is four in our scheme, considering the characteristics of text/graphics blocks. Thus, $K$ has a dynamic range of $[1, 4]$. $\lambda$ is the Lagrangian multiplier, the same as that in H.264. It is related to the quantization parameter in H.264. $D(K)$ is the distortion of the representation using $K$ base colors. Correspondingly, $R(K)$ represents its rate cost.

To achieve the optimization in (11), $c_k$ should first be optimal at the given base color number. An image block of $n$ pixels corresponds to a set of $n$ points, $X = \{x_i \mid 1 \le i \le n\}$. $x_i$ is a scalar in Dim1 and a vector, indicated by $\mathbf{x}_i$, in the Dim3 case. Given the target base color number $K$, the set $X$ is partitioned into $K$ subsets $S_k$, $1 \le k \le K$ ($\bigcup_k S_k = X$, and $S_k \cap S_l = \emptyset$ for $k \ne l$). The colors in subset $S_k$ are then approximated by the base color $c_k$.
Aiming to minimize the total distortion of the set, the determination of base color values for a block can be formulated as the search for the optimal partition and its representative centers as

$$\min_{\{S_k\},\{c_k\}} \sum_{k=1}^{K} \sum_{x_i \in S_k} \left\| x_i - c_k \right\|^{2} \tag{12}$$

Clustering methods such as K-means, LBG-VQ, and TSVQ (tree-structured vector quantization) can be adopted to solve (12). However, the K-means and LBG-VQ algorithms are sensitive to the initial cluster centers and easily fall into local optima. Because of the iterative updating, they are also time-consuming. In addition, to evaluate (11) with $K$ ranging from 1 to $K_{max}$, clustering should be done $K_{max}$ times, which greatly increases the computational burden. One way to solve that is to set an adaptive termination condition and get the base color number adaptively. Ding et al. incorporate a distortion threshold into TSVQ to get the number of base colors by clustering only once [26], but they do not consider the rate cost. Leaving the rate out of consideration, or using base colors that are not representative enough, hurts the performance.

In this paper, we adopt dynamic programming, which not only achieves the globally optimal partition but also avoids the intensive computation caused by repeated clustering and iterative updating. Detailed discussions on dynamic programming for quantization are given in [34], [35]. On one hand, as a clustering method, it is close to global optimization under the distortion criterion, especially for the Dim1 case. On the other hand, by calculating the best partition with the base color number set to $K_{max}$, the partition results with base color numbers ranging from 1 to $K_{max}-1$ are also available, facilitating the evaluation of rate costs for different base color numbers. This is because they are just subproblems of that partition. For the Dim1 case, the complexity given in [34] depends only on the number of different colors in a block, which is usually small for a text/graphics block.
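A minimal sketch of such a dynamic-programming partition for the Dim1 case is given below. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the recurrence is the textbook optimal 1-D clustering over sorted distinct values, and the rate model in `choose_base_colors` (8 bits per base color plus fixed-length indexes) is a crude stand-in for the actual entropy-coded rate used in (11).

```python
def optimal_partitions(values, max_colors):
    """Optimal 1-D partitions into k = 1..max_colors clusters.

    Returns {k: (total squared error, sorted list of cluster means)}.
    """
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    points = sorted(counts)          # distinct values, each with a count
    n = len(points)

    # Prefix sums of count, count*value, count*value^2 give O(1) segment cost.
    w, wx, wxx = [0], [0], [0]
    for p in points:
        c = counts[p]
        w.append(w[-1] + c)
        wx.append(wx[-1] + c * p)
        wxx.append(wxx[-1] + c * p * p)

    def cost(i, j):
        """Squared error of points[i:j] around their weighted mean."""
        s = wx[j] - wx[i]
        return (wxx[j] - wxx[i]) - s * s / (w[j] - w[i])

    inf = float("inf")
    kmax = min(max_colors, n)
    best = [[inf] * (n + 1) for _ in range(kmax + 1)]
    cut = [[0] * (n + 1) for _ in range(kmax + 1)]
    best[0][0] = 0.0
    for k in range(1, kmax + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):        # linear search for the last cut
                d = best[k - 1][i] + cost(i, j)
                if d < best[k][j]:
                    best[k][j], cut[k][j] = d, i

    results = {}
    for k in range(1, kmax + 1):             # every k is a solved subproblem
        colors, j = [], n
        for kk in range(k, 0, -1):
            i = cut[kk][j]
            colors.append((wx[j] - wx[i]) / (w[j] - w[i]))
            j = i
        results[k] = (best[k][n], colors[::-1])
    return results

def choose_base_colors(values, max_colors, lam):
    """Pick k minimizing D + lam * R under an assumed fixed-length rate model."""
    def rate_bits(k):
        return 8 * k + len(values) * (k - 1).bit_length()
    results = optimal_partitions(values, max_colors)
    return min(results.items(), key=lambda kv: kv[1][0] + lam * rate_bits(kv[0]))

samples = [10] * 6 + [12] * 2 + [200] * 5 + [203] * 3
print(optimal_partitions(samples, 4)[2])
print(choose_base_colors(samples, 4, lam=1.0)[0])
```

A single run fills the table for every base color number up to the limit, so the rate-distortion comparison over the candidate numbers needs no repeated clustering; a larger multiplier pushes the choice toward fewer base colors.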
If dynamic programming is adopted for partition, the original data of a block should be sorted first. In the Dim1 case, it is easy to compare the sample values. In the Dim3 case, the comparison is done on the projected scalar values, which are obtained by projecting the input vectors $\mathbf{x}_i$, $1 \le i \le n$, onto the principal axis of the data set as

$$y_i = \mathbf{e}^{T}\mathbf{x}_i \tag{13}$$

$\mathbf{e}$ is the principal eigenvector of the covariance matrix of the data set. In the projection, multiple vectors may be projected to the same scalar value. Vectors having the same projected scalar value are then replaced by their average. With points of the same value recorded by their value and count, the $n$ points of set $X$ reduce to $m$ points of the set $X' = \{x'_j \mid 1 \le j \le m\}$. The count of $x'_j$ is indicated by $w_j$. After sorting $X'$, we have $x'_1 < x'_2 < \cdots < x'_m$ for the Dim1 case and $y'_1 < y'_2 < \cdots < y'_m$ for the Dim3 case. Then, dynamic partition is performed on the sorted data of the set $X'$. Here, $m \le n$.

In our scheme, the partitioning of data set $X'$ into $K_{max}$ clusters, denoted by $P(m, K_{max})$, is set as the target problem. Correspondingly, $P(m, k)$, $1 \le k < K_{max}$, as its subproblems, are solved once $P(m, K_{max})$ is figured out. This is explained by the fact that the partition of the first points of set $X'$ must itself be optimal within the partition $P(m, K_{max})$. This property ensures high computational efficiency. In general, the partition of the subset $\{x'_1, \ldots, x'_j\}$ into $k$ clusters can be denoted by $P(j,k)$, where $1 \le j \le m$, $1 \le k \le K_{max}$. Let $t$ denote the $(k-1)$th cutting point (the last cutting point) in $P(j,k)$. Then the principle of optimality can be written as

$$E(j,k) = \min_{k-1 \le t < j} \bigl[ E(t,k-1) + e(t+1,j) \bigr] \tag{14}$$

$E(j,k)$ is the distortion of the partition $P(j,k)$ for the data set $\{x'_1, \ldots, x'_j\}$. $e(a,b)$ is the distortion of the set $\{x'_a, \ldots, x'_b\}$ and is equal to

$$e(a,b) = \sum_{i=a}^{b} w_i \left\| x'_i - \mu_{a,b} \right\|^{2} \tag{15}$$

$\mu_{a,b}$ is the representative point of the set $\{x'_a, \ldots, x'_b\}$ and is given by its mean here. The cutting point $t$ can then be determined by a linear search using (14), provided that $E(t,k-1)$, $k-1 \le t < j$, is known. Once the optimal cutting point $t^{*}$ is determined, $E(j,k)$ becomes known as

$$E(j,k) = E(t^{*},k-1) + e(t^{*}+1,j) \tag{16}$$

That is how the optimal partition is constructed by bottom-up dynamic programming, and the same holds for $P(m, K_{max})$ and its subproblems.

Once the partition and representative base colors are available, the rate in (11) can be obtained by entropy coding. The base color number, base color values, and the index map are encoded in turn. To better explain the coding procedure, we take the Dim1 case on a block as an example and illustrate it in Fig. 5. The text block to be coded is depicted in (a), with (b) indicating the values of the luminance samples. After the dynamic programming clustering, the partitions and the corresponding representative base colors are obtained. One case with $K$ equal to four is shown in (c). According to that base color table, we choose the nearest base color for each sample of the block. This can be considered as the quantization of original colors to base colors. With quantized values replaced by their indexes in the color table, an index map is obtained and shown in (d). Afterwards, each index can be compressed by context-based entropy coding with the four neighboring indexes illustrated in (e) as its contexts. However, if $K$ is 4, which means four possible symbols for each neighboring index, there are $4^4 = 256$ kinds of contexts in total for the coding index, considering its four neighboring positions. To reduce the memory and calculation requirements for so many kinds of contexts, we adopt the technology proposed in [26] to decrease the number of contexts.

Fig. 5. Illustration of base colors and the index map coding: (a) the text block; (b) luminance sample values; (c) base color table with $K$ equal to four; (d) the index map; (e) the defined neighboring indexes as contexts of the current index.

Fig. 6. Structure of modes in the proposed scheme.
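The idea behind this context reduction, grouping neighbor configurations that share the same equality structure, can be checked with a short Python sketch. The letter assignment below is an assumption of this sketch: letters are handed out by first occurrence in the hypothetical scan order up, left, left-up, right-up, chosen only because it reproduces the worked example with neighbors {3111}; the exact pattern naming of [26] is not recoverable here. The function names are likewise hypothetical.

```python
from itertools import product

SCAN = (2, 0, 1, 3)  # assumed priority of positions: up, left, left-up, right-up

def classify(neighbors, num_colors=4):
    """neighbors = (left, left_up, up, right_up) index values.

    Returns (pattern, order): the pattern string in position order, and the
    index values listed in remapping order (pattern letters first, then the
    unused values in ascending order).
    """
    order = []
    for pos in SCAN:
        if neighbors[pos] not in order:
            order.append(neighbors[pos])
    pattern = "".join("abcd"[order.index(v)] for v in neighbors)
    order += [v for v in range(num_colors) if v not in order]
    return pattern, order

def remap(index, neighbors, num_colors=4):
    """Remapped value of the current index under its context pattern."""
    return classify(neighbors, num_colors)[1].index(index)

# All 4^4 raw contexts collapse onto a small set of structural patterns.
patterns = {classify(ctx)[0] for ctx in product(range(4), repeat=4)}
print(len(patterns), sorted(patterns))

print(classify((3, 1, 1, 1))[0], classify((2, 0, 0, 0))[0])
print([remap(x, (3, 1, 1, 1)) for x in (1, 3, 0, 2)])
```

With four neighbors the number of distinct equality structures is the number of ways to partition four positions into groups, which is 15, so the 256 raw contexts fall onto 15 patterns before boundary cases are added.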
By mapping contexts having the same structure to one pattern, they can be reduced to 15 basic patterns [...]. If the boundary of the index map is considered, another nine context patterns need to be added, and 24 context patterns are required in total. The letters in the four positions of each pattern denote the indexes in the left, left-up, up, and right-up positions, respectively, identifying their similarities and differences. Taking the neighborhoods ({3111}) and ({2000}) of a current index as examples, we can see that both belong to the same context pattern, in which the indexes in the left-up, up, and right-up positions are the same but differ from the left one. Before being sent to the context-based entropy encoder, the current index is first remapped to a new index related to its context pattern, adapting it to the reduced context patterns. For simplicity of description, the letters also denote the index values for a concrete context pattern. The remapping rule from the current index to the new index is given in (17). If the current index is the same as some index in the neighborhood, it is directly encoded under its context pattern. However, if it is not the same as any neighboring index, it is treated as one of the letters outside that pattern and remapped accordingly. For the ({3111}) example: if the current index is 1, which matches one letter of the pattern, it is remapped to 0; if it is 3, matching the other letter, it is remapped to 1; when it is 0, which differs from all the neighboring indexes, it is remapped to 2, since 0 takes the first letter outside that pattern; if it is 2, larger than 0, it is remapped to 3, corresponding to the next letter. The remapped index is then entropy coded under its context pattern.

After the coding, the actual rate cost of the index map is found. The rate cost for each candidate base color number is obtained fairly straightforwardly in this way. These costs are used in (11) to decide the optimal base color number. Furthermore, once the base color number is known, the base colors and the index map are coded by the same method and put into the final bit stream.

Instead of using the binarization schemes in H.264, such as Exp-Golomb, the base color number, the base color values, and the remapped indexes are all simply expressed in their binary format. By exploiting the correlation among bits when coding each symbol with a binary arithmetic coder, as proposed in [36], even better performance than that of a multisymbol arithmetic codec is achieved. Besides, to save bits on coding the values of base colors, we cut the size of their symbol set from 256 (0-255) to 33 (0, 8, 16, ..., 248, 255) by the mapping in (18), which relates the base color values before and after the operation. This saves bits because it decreases the size of the symbol set and concentrates the distribution of the symbols. However, it introduces more distortion because of the precision loss. To balance the two effects, this processing is performed only on [...] blocks.

Fig. 7. Six testing images/documents used in our experiments: (a) Web Page; (b) Various; (c) Slide; (d) Files; (e) Natural Image; (f) Compound1.

C. Mode Selection and Mode Structure

Each mode has its advantages in dealing with blocks of different features. One question that arises here is how to take full advantage of each mode in the proposed scheme. It can be solved by the RDO algorithm that has been adopted by H.264. The mode and block partition with the minimum rate-distortion cost are selected to compress the current block.
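Returning to the index-map coding of the BCIM mode, the reduction of contexts to patterns and the remapping of the current index can be sketched as follows. Several points are assumptions rather than the paper's definitions: patterns are labelled with the letters a-d in order of first appearance over (left, left-up, up, right-up); the hypothetical `remap` ranks the neighboring values in ascending order and then the unused values in ascending order, an ordering chosen only because it reproduces the worked ({3111}) example, whereas the authoritative rule is (17); and the nine boundary patterns are not modelled.

```python
from itertools import product


def context_pattern(neighbors):
    """Relabel the (left, left-up, up, right-up) indexes by order of first
    appearance, so that neighborhoods with the same structure coincide."""
    letters = {}
    for v in neighbors:
        if v not in letters:
            letters[v] = "abcd"[len(letters)]
    return "".join(letters[v] for v in neighbors)


def remap(current, neighbors, num_colors):
    """Hypothetical remapping: values present in the neighborhood come
    first, followed by the absent values, each group in ascending order."""
    order = sorted(set(neighbors))
    order += [v for v in range(num_colors) if v not in order]
    return order.index(current)


# All 4^4 interior contexts of a four-color index map collapse to 15 patterns.
patterns = {context_pattern(c) for c in product(range(4), repeat=4)}
print(len(patterns))                                       # 15
print(context_pattern((3, 1, 1, 1)))                       # abbb
print(context_pattern((2, 0, 0, 0)))                       # abbb
print([remap(x, (3, 1, 1, 1), 4) for x in (1, 3, 0, 2)])   # [0, 1, 2, 3]
```

The count of 15 is the number of ways to partition four positions into groups of equal values, which is why it does not depend on the actual index values.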
All modes in the proposed scheme can be categorized into two types: spatial domain (SD) and DCT frequency domain (FD). A flag in the bit stream distinguishes them. The organized structure of all the modes is shown in Fig. 6. FD indicates the original intramodes in H.264, where the compression is performed in the DCT domain. SD indicates our proposed RSQ and BCIM modes. To adapt to the local nonstationary property of compound images, the SD modes are applied to 16×16, 8×8, and 4×4 block sizes, as are the FD modes. The best mode in the spatial domain is compared with the best mode in the DCT frequency domain for the same block size in the rate-distortion sense, and the better one is selected.

For 8×8 and 4×4 blocks, eight prediction directions are designed for the RSQ mode, as shown in Fig. 4(a). The DC mode in the spatial domain is replaced by the BCIM mode in the stream syntax. For those small blocks, the BCIM mode is performed only on the luminance component in our scheme for simplicity. In 16×16 blocks, the BCIM mode takes the place of the DC intramode. The better one between Dim3 and Dim1 is selected based on the rate-distortion criterion. For the Dim3 case, when the input image format is YUV 4:2:0, interpolation is performed on the U and V color planes to bring them to the same size as the Y plane and so facilitate the 3-D clustering. To be consistent with the block size of the DCT transform, the selection between direct quantization and transform coding of the residual block, which is obtained by prediction with three possible directions, is performed on 4×4 subblocks.

Fig. 8. Experimental results on rate-distortion curves.

Fig. 9. Visual quality comparison of part of the decoded image Various at about 0.37 bpp: (a) part of the original image; (b) decoded by JPEG2000, PSNR-Y [...] dB; (c) decoded by original H.264, PSNR-Y [...] dB; (d) decoded by Proposed, PSNR-Y [...] dB.
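The rate-distortion comparison between SD and FD candidates described in this subsection can be sketched as a Lagrangian cost minimization, J = D + λR, which is the usual form of RDO in H.264 encoders. The candidate names, the distortion and rate figures, and the value of λ below are invented for illustration and are not measurements from the paper.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * rate_bits


def select_mode(candidates, lam):
    """Return the (name, distortion, rate_bits) tuple with the lowest cost."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lam))


# Hypothetical candidates for one block: (mode name, SSE distortion, bits).
candidates = [
    ("FD_intra", 900.0, 120),
    ("SD_RSQ", 400.0, 150),
    ("SD_BCIM", 350.0, 90),
]
best = select_mode(candidates, lam=5.0)
print(best[0])  # SD_BCIM
```

With these numbers the costs are 1500, 1150, and 800, so the BCIM candidate wins; a smaller λ (higher bit rate) shifts the balance toward low-distortion candidates even when they cost more bits.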


Fig. 10. Comparisons of Ding's approach, the scheme with the BCIM mode, and the proposed scheme.

IV. EXPERIMENTAL RESULTS

We integrate the proposed methods into the H.264/MPEG-4 AVC reference software JM14.0 [37] to evaluate their performance. Since the de-blocking filter in H.264 often blurs decoded compound images, it is disabled in our experiments. As shown in Fig. 7, five captured screen images and a compound document are used as test images: (a) is a web page, (b) is a snapshot of a typical screen scene, (c) is a slide, (d) is a combination of files, and (e) is a natural image. Their size is [...]. (f) is a compound document with a size of [...]. Text and graphics differ between compound images and between regions of the same image. Some text and graphics blocks have no transition, whereas others have rich shadows. Some symbols on them are small and occupy only a [...] or smaller block, whereas others may take up several blocks.

Experimental results of the PSNR on the luminance signal versus bit-rate curves are depicted in Fig. 8. The quantization parameter is set from 47 to 7 with a step of 5. The results of H.264 integrated with only the proposed RSQ mode are marked by RSQ, and those integrated with only the BCIM mode are marked by BCIM. The curves of H.264 with the RSQ and BCIM modes together represent our proposed scheme and are marked by Proposed. Two other schemes are selected for comparison: JPEG2000 and H.264 intraframe coding. For JPEG2000, we compress only the luminance plane, spending all bits on it and leaving the chrominance planes uncompressed. For the same bit budget, the results should therefore be better than those obtained when all three color planes are compressed. As depicted in Fig. 8, the BCIM mode performs better at low and middle bit rates, and the RSQ mode performs better at middle and high bit rates, for the first four compound images.
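For readers who want to reproduce points on such curves, the two quantities plotted can be computed as in the sketch below. It uses the standard PSNR definition for 8-bit samples and a plain bits-per-pixel ratio; the helper names and the sample values are hypothetical, and the measurement code actually used for the experiments is not described in this section.

```python
import math


def psnr_y(original, decoded, peak=255):
    """PSNR in dB between two equally sized luminance planes (lists of rows)."""
    sq_errors = [(a - b) ** 2
                 for row_o, row_d in zip(original, decoded)
                 for a, b in zip(row_o, row_d)]
    mse = sum(sq_errors) / len(sq_errors)
    if mse == 0:
        return float("inf")  # identical planes
    return 10.0 * math.log10(peak * peak / mse)


def bits_per_pixel(total_bits, width, height):
    """Bit rate of a coded image in bits per pixel (bpp)."""
    return total_bits / (width * height)


original = [[100, 100], [100, 100]]
decoded = [[101, 99], [100, 102]]
print(round(psnr_y(original, decoded), 2))
print(bits_per_pixel(37000, 400, 250))
```

Each quantization parameter setting yields one (bpp, PSNR) pair, and joining the pairs gives a rate-distortion curve.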
By selecting the RSQ mode or the BCIM mode through RDO, the proposed scheme performs well at all bit rates. Compared with H.264 intraframe coding and JPEG2000, a gain of more than 10 dB is achieved at most bit rates for Web Page. Similar results are observed for the other compound images. For the fifth image, which is a natural image, the proposed scheme performs comparably to H.264. For Compound1, a gain of about 10 dB is obtained with the RSQ mode or the BCIM mode integrated, similar to the performance gain of the layer-based scheme in [14].

TABLE I. PERCENTAGES OF THE RSQ AND BCIM MODES AT DIFFERENT RATES

The percentages of the RSQ and BCIM modes in all testing images are given in Table I. They are obtained at three different rates: 0.4, 0.8, and 1.2 bpp. One can observe that the percentage of the RSQ mode increases as the rate increases, whereas the percentage of the BCIM mode decreases. This trend is not clear in Natural Image, and it cannot be observed in Compound1 because of its many binary texts.

To evaluate the visual quality, magnified parts of the reconstructed image Various are shown in Fig. 9, at about 0.37 bpp. The images decoded by JPEG2000 and H.264 [(b) and (c), respectively] show severe blur and ringing artifacts on the text and graphics parts. With the two proposed modes, the perceptual quality is greatly improved and is close to that of the original image.

The proposed scheme is also compared with Ding's approach [26]. Although the BCIM mode is based on a similar idea, its technology is improved significantly. In Ding's approach, the rate cost is not considered and the base colors are directly selected by tree-structure vector quantization, so it cannot achieve optimized rate-distortion performance. In the BCIM mode, the rate cost is considered and the base colors are selected in the RDO sense with a dynamic programming clustering method. Moreover, the BCIM mode adapts to different block sizes and component combinations.
Experimental results of Ding's approach, the scheme with the BCIM mode only, and the proposed scheme are shown in Fig. 10. It demonstrates that the BCIM mode outperforms Ding's approach with a 1-4 dB gain at low and middle bit rates and a gain of more than 3 dB at high bit rates. Furthermore, the integration of the RSQ mode enables the proposed scheme to achieve a much higher performance at middle and high bit rates.

Finally, the complexity of the proposed scheme is discussed. Since the transform is skipped in the proposed RSQ and BCIM modes, the decoding complexity of the proposed modes is lower than that of the H.264 intramodes. At the encoder, the complexity of the RSQ mode is also low for the same reason. The proposed BCIM mode, however, needs to select base colors by clustering, and its complexity is a little higher than that of the H.264 intramodes. In the mode selection, the rate-distortion costs of all modes are calculated and the mode with the minimum cost is selected for coding. Because of the proposed modes, the mode selection needs to check twice as many choices as it does in H.264 intracoding. This cost can, however, be significantly decreased by fast mode selection in the future.

V. CONCLUSION

We propose a compound image compression scheme that fully explores the spatial domain properties of compound images. Two spatial domain modes, called residual scalar quantization (RSQ) and base colors and the index map (BCIM), are integrated into H.264 intraframe coding; they achieve significant gains at all bit rates. The RSQ mode copes with complicated text and graphics blocks in a simple way, which is just to quantize the intraprediction residues without a transform. The BCIM mode achieves a large performance improvement through its efficient representation of text/graphics blocks. Both modes preserve the spatial structures of the text and graphics parts, which are important to visual quality.
A rate-distortion optimal method, similar to that in H.264, simplifies the mode selection and avoids the performance loss caused by inaccurate segmentation. In short, this paper points out a good way to extend H.264 to compress compound images with simple technical extensions and a moderate complexity increase due to the additional mode selections.

ACKNOWLEDGMENT

The authors would like to thank Dr. Y. Lu, Dr. B. Qin, and W. Ding for their valuable discussions.

REFERENCES

[1] Cloud Computing [Online]. Available: Cloud_computing
[2] W. P. Pennebaker and J. L. Mitchell, JPEG: Still Image Compression Standard. New York: Van Nostrand Reinhold,
[3] D. Taubman and M. Marcellin, JPEG2000: Image Compression Fundamentals, Standards, and Practice. Norwell, MA: Kluwer,
[4] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264/ISO/IEC AVC),, 2003, JVT of ISO/IEC MPEG and ITU-T, JVTG-050.
[5] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp , Jul
[6] W. Ding, F. Wu, X. Wu, S. Li, and H. Li, "Adaptive directional lifting-based wavelet transform for image coding," IEEE Trans. Image Process., vol. 16, no. 2, pp , Feb
[7] C. L. Chang and B. Girod, "Direction-adaptive discrete wavelet transform for image compression," IEEE Trans. Image Process., vol. 16, no. 5, pp , May
[8] H. Xu, J. Xu, and F. Wu, "Lifting-based directional DCT-like transform for image coding," IEEE Trans. Circuits Syst. Video Technol., pp , Oct
[9] B. Zeng and J. J. Fu, "Directional discrete cosine transforms - A new framework for image coding," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 3, pp , Oct
[10] K. Konstantinides and D. Tretter, "A method for variable quantization in JPEG for improved text quality in compound documents," in Proc. Int. Conf. Image Processing, Chicago, IL, Oct. 1998, vol. 2, pp
[11] A. Zaghetto and R. L. de Queiroz, "Segmentation-driven compound document coding based on H.264/AVC-INTRA," IEEE Trans. Circuits Syst. Video Technol., vol. 16, pp , Jul
[12] Mixed Raster Content (MRC) ITU-T Study Group 8, Draft Recommendation T.44,
[13] L. Bottou, P. Haffner, P. Howard, P. Simard, Y. Bengio, and Y. LeCun, "High quality document image compression using DjVu," J. Electron. Imag., vol. 7, no. 3, pp , Jul
[14] A. Zaghetto and R. L. de Queiroz, "MRC compression of compound documents using H.264/AVC-I," presented at the XXV Simpósio Brasileiro de Telecomunicações, Recife, Brazil, Sep
[15] R. L. de Queiroz, "Compressing compound documents," in The Document and Image Compression Handbook, M. Barni, Ed. New York: Marcel-Dekker,
[16] R. L. de Queiroz, Z. Fan, and T. Tran, "Optimizing block-thresholding segmentation for multilayer compression of compound images," IEEE Trans. Circuits Syst. Video Technol., vol. 9, pp , Oct
[17] H. Cheng and C. A. Bouman, "Document compression using rate-distortion optimized segmentation," J. Electron. Imag., vol. 10, pp , Apr
[18] R. L. de Queiroz, "On data-filling algorithms for MRC layers," in Proc. Int. Conf. Image Processing, Vancouver, BC, Canada, Sep. 2000, vol. 2, pp
[19] G. Lakhani and R. Subedi, "Optimal filling of FG/BG layers of compound document images," in Proc. Int. Conf. Image Processing, Atlanta, GA, Oct. 2006, pp
[20] R. L. de Queiroz, "Pre-processing for MRC layers of scanned images," in Proc. Int. Conf. Image Processing, Atlanta, GA, Oct. 2006, pp
[21] T. Lin and P. Hao, "Compound image compression for real-time computer screen image transmission," IEEE Trans. Image Process., vol. 14, no. 7, pp , Jul
[22] A. Said and A. Drukarev, "Simplified segmentation for compound image compression," in Proc. Int. Conf. Image Processing, Kobe, Japan, 1999, vol. 1, pp
[23] X. Li and S. Lei, "Block-based segmentation and adaptive coding for visually lossless compression of scanned documents," in Proc. Int. Conf. Image Processing, Oct. 1999, vol. I, pp
[24] W. Ding, D. Liu, Y. He, and F. Wu, "Block-based fast compression for compound images," in Proc. Int. Conf. Multimedia and Expo (ICME), 2006, pp
[25] D. Mukherjee, C. Chrysafis, and A. Said, "Low complexity guaranteed fit compound document compression," in Proc. I, Sep. 2002, vol. 1, pp
[26] W. Ding, Y. Lu, and F. Wu, "Enable efficient compound image compression in H.264/AVC intra coding," in Proc. Int. Conf. Image Processing, Oct. 2007, vol. 2, pp
[27] Y. Lee, K. Han, and G. Sullivan, "Improved lossless intra coding for H.264/MPEG-4 AVC," IEEE Trans. Image Process., vol. 15, pp ,
[28] M. Mrak, S. Grgic, and M. Grgic, "Picture quality measures in compression systems," in Proc. Int. IEEE Region 8 Conf. Computer as a Tool EUROCON, Sep. 2003, vol. 1, pp
[29] M. Narroschke and J. Ostermann, Adaptive Prediction Error Coding in Spatial and Frequency Domain for H.264/AVC ITU-T Q.6/GS16, doc. VCEG-AB06. Bangkok, Thailand, Jan
[30] Y. Wang, J. Ostermann, and Y. Zhang, Digital Video Processing and Communications. Upper Saddle River, NJ: Prentice-Hall,
[31] N. Jayant, Signal Compression, Coding of Speech, Audio, Image and Video. Singapore: World Scientific, 1997.

[32] G. J. Sullivan, "Adaptive quantization encoding technique using an equal expected-value rule," presented at the JVT Meeting Contribution JVT-N011, Hong Kong, Jan
[33] Key Technical Area Software of the ITU-T, 2008 [Online]. Available: http://iphome.hhi.de/suehring/tml/download/kta/, version jm11.0kta1.8
[34] X. Wu, "Color quantization by dynamic programming and principal analysis," ACM Trans. Graph., vol. 11, no. 4, pp , Oct
[35] J. D. Bruce, "Optimum Quantization," D.Sc. thesis, Massachusetts Inst. Technol., Cambridge, May 14, 1964.
[36] C. S. Lee and H. W. Park, "Multisymbol data compression using a binary arithmetic coder," Electron. Lett., vol. 38, no. 3, pp , Jan
[37] Reference Software of H.264/AVC 2008 [Online]. Available: iphome.hhi.de/suehring/tml/download/old_jm/, version jm14.0

Cuiling Lan received the B.Sc. degree in electrical engineering from Xidian University, China, in [...]. She is currently pursuing the Ph.D. degree at the College of Electrical Engineering, Xidian University. Her research interests focus on image and video compression.

Guangming Shi (M'07) received the B.S. degree in electronic engineering in 1985, the M.S. degree in automation in 1988, and the Ph.D. degree in intelligent signal processing in 2002, all from Xidian University, China. His research interests include image and video compression, compressed sensing theory and its application to UWB, signal sampling and processing theory, inverse problems in imaging, and optimization.

Feng Wu (M'99, SM'06) received the B.S. degree in electrical engineering from Xidian University, China, in [...]. He received the M.S. and Ph.D. degrees in computer science from the Harbin Institute of Technology, China, in 1996 and 1999, respectively.
He joined Microsoft Research Asia, formerly named Microsoft Research China, as an Associate Researcher in [...]. He has been a researcher with Microsoft Research Asia since 2001 and is now a Senior Researcher/Research Manager. His research interests include image and video representation, media compression and communication, as well as computer vision and graphics. He has authored or coauthored over 150 papers published in various IEEE journals and other international conferences and forums, e.g., MOBICOM, INFOCOM, ICIP, ICME, and ISCAS. Dr. Wu received (as a coauthor) the best paper award at IEEE CSVT 2009, PCM 2008, and SPIE VCIP 2007.


An Efficient Fully Exploiting Spatial Correlation of Compress Compound Images in Advanced Video Coding
Ali Mohsin Kaittan*1, President of the Association of Scientific Research and Development in Iraq
Abstract