AMONG various transform techniques for image compression,

Size: px

Start display at page:

Download "AMONG various transform techniques for image compression,"

Abner Brian Beasley
6 years ago
Views:

1 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 7, NO. 3, JUNE A Cost-Effective Architecture for 8 8 Two-Dimensional DCT/IDCT Using Direct Method Yung-Pin Lee, Student Member, IEEE, Thou-Ho Chen, Liang-Gee Chen, Senior Member, IEEE, Mei-Juan Chen, Student Member, IEEE, and Chung-ei Ku, Student Member, IEEE Abstract In this paper, we develop a novel twodimensional (2-D) discrete cosine transform/inverse discrete cosine transform (DCT/IDCT) architecture based on the direct 2-D approach and the rotation technique. The computational complexity is reduced by taking advantage of the special attribute of complex number. Both the parallel and the folded architectures are proposed. Unlike other approaches, the proposed architecture is regular and economically allowable for VLSI implementation. Compared to the row-column method, less internal wordlength is needed in order to meet the error requirement of IDCT, and the throughput of the proposed architecture can achieve two times that of the row-column method with 30% hardware increased. Index Terms DCT/IDCT, direct 2-D approach, finite wordlength analysis. I. INTRODUCTION AMONG various transform techniques for image compression, the discrete cosine transform (DCT) [1] is the most popular and effective one in practical image and video coding applications, such as high-definition television (HDTV). This is due to the fact that it can give an almost optimal performance and can be implemented at an acceptable cost. There are three methods to realize two-dimensional (2-D) DCT: 1) indirect method by the row-column decomposition [2] [5], 2) direct method [6] [9], such as using polynomial transform [6], and 3) using other transforms, such as discrete Fourier transform (DFT) and discrete Harltley transform (DHT) [10]. The indirect method has the advantage of regularity for very large scale integration (VLSI) implementation. Therefore, most chips for 2-D DCT had been implemented by indirect method [2] [5]. In [2], an D DCT/inverse DCT (IDCT) processor chip that can be used for high data rate image and video coding, higher than 55 MHz while using lowcost 2- m CMOS technology, is presented. To be applicable to the real-time processing of HDTV signals, [3] develops a 100-MHz 2-D DCT core processor which introduces a fast DCT algorithm and multiplier accumulators based on distributed arithmetic to reduce the hardware amount and enhance the speed performance. Reference [5] also proposes Manuscript received February 6, 1996; revised September 9, This paper was recommended by Associate Editor N. Demassieux. This work was supported by National Science Council under Contract #NSC E , Taiwan, R.O.C. Y.-P. Lee, L.-G. Chen, M.-J. Chen, and C.-. Ku are with the DSP/IC Design Lab, Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, R.O.C. T.-H. Chen is with the Department of Electronic Engineering, Nan-Tai College, Tainan, Taiwan, R.O.C. Publisher Item Identifier S (97) a 100-MHz 2-D 8 8 DCT/IDCT processor, in which the architecture is entirely multiplierless and highly parallel, but the chip area is only half that of [3]. However, the computation amount of the indirect method is more than that of the direct method. The direct method requires fewer computations, but it incurs the irregularity. Reference [6] provides a direct DCT algorithm based on polynominal transform techniques with the lowest computation amounts known so far based on algorithms. To improve the regularity, [8] presents a fast 2-D DCT algorithm which requires only one-dimensional (1-D) DCT s and additions, instead of using 1-D DCT s, as in the conventional row-column approach. In fact, those direct methods are not suitable for 8 8 DCT VLSI implementation. Nevertheless, the feature of low computation complexity is still attractive. This fact motivated that a low-computation and regular 2-D DCT structure has been researched recently. In this paper, we propose a cost-effective architecture for D DCT architecture which bears both the advantages of high regularity and less computation amount. At first, the real number input is mapped into complex number in the 2-D DCT [6]. Then the computation complexity can be reduced by the rotation techniques in the complex number system. For D DCT, further modification is required to make the architecture more regular, and this results in the fact that the architecture can be folded to an economically allowable size for VLSI implementation. The finite wordlength analysis for IDCT demonstrates that the proposed architecture requires fewer internal bits than other methods. In the following section, we illustrate that an 2-D DCT/IDCT can be realized by only -point 1-D DCT/IDCT s and some additional summations. ith some modifications in Section III, we can obtain a more regular architecture for D DCT/IDCT s. Finally, we analyze the internal wordlength problem and compare the hardware complexity between the row-column method and the proposed design. II. METHODOLOGY A. The Mapping of Input Data The 2-D DCT [1] of an real signal is defined as (1) /97$ IEEE

2 460 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 7, NO. 3, JUNE 1997 and For convenience, we introduce factor so that for by neglecting the kernel Fig. 1. The mapping from x n ;n to y n ;t when N = 4. and (2a) (2b) In the following, we will assume to be a power of two. Using the permutation [6], signal can be permuted as shown at the bottom of the page. Equation (2a) can be rewritten as Now consider the following expression: where (3) (4) e can easily compute from by the following set of expressions: Re Im Im Re (5) Note that (5) requires in (4) to be computed for all and only a sufficient subset of such that cover all possible values of [6]. By the following relation where (6) the signal is mapped as. If is fixed, the mapping from to is one-to-one. However, with different, the mapping order is not the same. For example, the case of when is 0, maps to ; when is 1, maps to. Fig. 1 shows the mapping of inputs from to when. The modulo operation in (6) disappears because the period of is just. By substituting (6) into (4), (4) can be rewritten as (7a) (7b) In (7b), is no longer in the range from 0 to, which is a common attribute of the ordinary transform. Consider the following relation: where integer and By substituting the above relation into (7b), we can obtain (8) Let the computation of s summation be represented by. Then we can find B. The Proposed 2-D DCT Algorithm As shown in (4), the exponential term could be treated as a rotation. In row-column method, the term should be rotated twice separately, one is for row computation, and another is for column computation. However, with some relation between and, the term can be realized by rotating only once to reduce computations.

3 LEE et al.: COST-EFFECTIVE ARCHITECTURE FOR D DCT/IDC 461 Although is a complex number, its real part is indeed an -point 1-D DCT, and its imaginary part can be obtained by the relation Im Im Re This reveals that can be achieved by calculating - point 1-D DCT. Since multiplying does not need any multiplication, but only affects the addition, an 2-D DCT can therefore be realized by -point 1-D DCT s with some additions. Nevertheless, the row-column method needs 2 -point 1-D DCT s. Similar results have been deduced in [8] with a different approach, but its structure is not regular, and so is unsuitable for VLSI implementation. To overcome this problem, the proposed algorithm develops a regular architecture as illustrated in the following section. (a) III. ARCHITECTURE OF AN D DCT To realize the 2-D DCT, the additions are always irregular, especially when is large. However, most proposed video compression standards such as H.261, JPEG [12], MPEG-1, and MPEG-2 [13] need only D DCT/IDCT. In this section, we present a regular structure for D DCT/IDCT and further can fold the architecture to one forth of original size. It should be noted that the IDCT architecture can be derived by the same deduction. A. The Parallel D DCT Architecture As mentioned in [6], when 8, it is only to compute for all but and. Then the summation of in (7a) with different can be expanded as shown in (9a) (9e) at the bottom of the page. For implementation, we partition the computation from (9a) (9e) together with (5) into three stages. Stage 1) Pre-addition: Computing the values of, and according to (9a) (9e), respectively. The architecture for realizing the pre-addition is described in Fig. 2. This Fig. 2. (b) Stage 1 Preaddition when n 1 is (a) even and (b) odd. architecture includes butterfly computations and two constant multiplications (multiplied by ) for each. For the case of the term in (9d) and (9e) when computing and, both substructures for even and odd are different, as shown in Fig. 2(a) and (b). where (9a) where (9b) where (9c) where (9d) where (9e)

4 462 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 7, NO. 3, JUNE 1997 TABLE I RELATION AMONG k 1 ; k2; a,and b Stage 2) Complex DCT and Rotation: Computing, and according to (9a) (9e), respectively. At first, consider, and. In three such cases, the inputs are complex numbers, and the computations can be concluded as Fig. 3. Data flow for Stage 2. where (10) The above equation is like 1-D DCT, except the input is a complex number and the exponential term is not in the range from 0 to. If the term is replaced by, where and are integers, and, then we can derive Fig. 4. Stage 3 Postaddition. (13) Thus, we can easily obtain and from by (11) The relation among, and is illustrated in Table I. Therefore, the computation of can be achieved by the data flow in Fig. 3 which consists of the following three steps. Step a) Calculate the summation of in (11) for the values of from zero to seven. Step b) The output from a) is multiplied by. Step c) According to, the output from b) is rotated to make the output order to be the increasing order of. Second, both inputs of and, and, are real numbers. Therefore, the real part of and is an even function, and the imaginary part of and is an odd function. Let (12) (14) where ( ) denotes the process of complex conjugation, and the index implies that the index in (12) is replaced by. Equation (12) can be realized by step a). Based on the same concept as (11), the relation between and is concluded as when when Stage 3) Postaddition: Computing and in (5). Apparently, it is a butterfly operation, as shown in Fig. 4. Fig. 5 demonstrates the parallel architecture to compute for all values of and. As mentioned in Section II, signal is in the original order of the input, but signal is according to the permuted order of the.itis clear that Stage 1 is very regular and local. Between the Stages

5 LEE et al.: COST-EFFECTIVE ARCHITECTURE FOR D DCT/IDC 463 Fig. 5. Parallel architecture of the D DCT. 1 and 2, the interconnection is not local, but it is still regular. This feature of regularity can be utilized to fold the architecture, as described in the next subsection. The Stage 2 requires mainly four eight-point complex DCT s, but the computation of with 0, 4 needs an additional butterfly stage. Nevertheless, this additional butterfly to compute (14) is the same as Fig. 4. Stage 3 is composed of butterfly adders, as shown in Fig. 4. Based on the above implementation strategy, the proposed architecture reveals that it is more regular than [8].

6 464 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 7, NO. 3, JUNE 1997 Fig. 6. Folded architecture for D DCT. B. Folding the Parallel Architecture of D DCT Fig. 5 describes the proposed parallel architecture with 64 input data and 64 output data. To be suitable for VLSI implementation, the inherent problems of bandwidth limitation and chip size should be solved. Folding technique is introduced to cope with the problem in the architecture. The folded D DCT/IDCT structure, as shown in Fig. 6, is then obtained from the parallel architecture in Fig. 5 through folding strategy. The input data is now grouped into four sets:,, and, and fed to the folded architecture in a sequential way. Thus, an additional re-ordering circuit is needed to transfer the input with the input order into the desired order. In the re-ordering circuit, Fig. 7, there are three register files (RF0, RF1, RF2), and each register file can store eight input data. For example, the number 0 in Fig. 7(a) means that the data is temporarily stored in the register file. The desired order is than derived by the appropriate control signal. Similarly, the output port also needs another reordering circuit to rearrange the output sequence. To reduce the cost of post re-ordering, this re-ordering circuit can be merged into the zigzag circuit when the DCT is applied to video standards. In Fig. 6, Stage 1 is referred to Fig. 2. The interconnection in Fig. 5 is now realized by four 4-by-4 transpose memories. To illustrate the implementation of the eight-point complex DCT, let the eight-point complex DCT be complex number Re Im This computation requires two 1-D DCT s together with 16 additions. The existing fast 1-D DCT algorithms or suitable 1-D DCT architectures can be adopted. The multiplexer and circular shifter are dedicated to implement the steps b) and c) in Stage 2 as mentioned in the previous subsection. The circular shifter has four selection ways: shift 0, 5, 2, or 1. To compute with to and in Stage 3 of Fig. 5, an extra computation of a butterfly adder is Fig. 7. (a) (b) Prereorder. (a) Prereorder process and (b) circuit of prereorder. needed in Fig. 6. The folded size is now reasonable for chip implementation. Since the pipeline scheme will be introduced into the proposed architecture, the critical path will appear in the 1-D DCT. That is, the clock rate of the proposed architecture can be as fast as that of the existed 1-D DCT architecture. IV. FINITE ORDLENGTH ANALYSIS By reversing the data flow of the above DCT, we can obtain the IDCT architecture. hen implementing an IDCT architecture, there are two inherent errors which will reduce the accuracy of the output; one is quantization error of coefficients and another is finite internal wordlength. Therefore, the Joint CCITT/ISO committee has established a specification to evaluate the errors caused by finite wordlength in IDCT [12]. According to this specification, the proposed IDCT architecture will require blocks of random numbers in the range from 256 to 255 as the input. The plots of overall mean square error versus internal wordlength with different coefficient wordlengths are shown

7 LEE et al.: COST-EFFECTIVE ARCHITECTURE FOR D DCT/IDC 465 (a) (b) (c) (d) Fig. 8. Finite wordlength analysis. (a) Overall mean square error analysis. (b) Peak mean square error analysis. (c) Overall mean error analysis. (d) Peak mean error analysis. in Fig. 8(a). The horizontal dashed line is the upper bound for overall mean square error. Fig. 8(b), (c), and (d) describe the analysis of peak mean square error, overall mean error, and peak mean error, respectively. The above analysis implies that 11-bit coefficient wordlength and 17-bit internal wordlength will be enough to satisfy all the four error requirements. Compared to the row-column approach as proposed by [2] in which the required internal wordlength is 22 b, the proposed approach needs fewer bits. This is because that the proposed architecture requires fewer multiplications. Besides, the input of the row-column method needs to be multiplied by the coefficients twice in a cascade way. Anyway, most inputs of the proposed method need only to be multiplied once. Based on the same analysis, Table II shows the results for three different input ranges ([ 256: 255], [ 5: 5], and [ 300: 300], which are also mentioned by Joint CCITT/ISO), with 12-b coefficient wordlength and 18-b internal wordlength. It is clear that almost all the values are much smaller than the specifications. TABLE II ACCURACY ANALYSIS FOR THREE DIFFERENT INPUT RANGES V. HARDARE COMPARISON ITH RO-COLUMN METHOD According to the Fig. 6, the proposed design can be implemented majorly by two 1-D DCT s and four 4 4 transpose memory, which is just required by row-column design method (requiring one 8 8 transpose memory). In addition, the proposed folded architecture requires 76 extra adders and four extra constant multipliers.

8 466 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 7, NO. 3, JUNE 1997 TABLE III TRANSISTOR COUNT FOR BOTH THE PROPOSED METHOD AND THE RO-COLUMN METHOD irregularity; on the other hand, the row-column method is more regular with the penalty of requiring more computations. However, the proposed architecture has both the advantages of low computation complexity and high regularity. According to the specification by Joint CCITT/ISO committee on the IDCT, the proposed design needs only a coefficient wordlength of 12 b and an internal wordlength of 18 b. Compared with the row-column method, the evaluated chip size of the proposed architecture is over about 30%, but the throughput is twice that of the row-column method. The above results show that the proposed 2-D DCT/IDCT architecture is more attractive than other methods. REFERENCES To calculate the transistor count for both the row-column method and the proposed method, we adopt the architecture of the FCT [11] for the 1-D DCT block. The transistor count for each element (FA, DFF, etc.) is referred to [14]. According to Section IV, to satisfy all the error requirements of IDCT, the internal and coefficient wordlengths are 18-b and 12-b, respectively, for the proposed method. But the internal wordlength of the row-column method is 22 b. The circular shifter is now implemented by multiplexers. The postreorder is merged into the zigzag hardware and hence can be ignored. The evaluation result shows that the proposed method needs about 30% transistor overhead compared to the row-column method, as illustrated in Table III. However, the proposed design can be fed with 16 inputs at a time, while the row-column method can only be fed with eight inputs. This implies that if the proposed architecture and the rowcolumn architecture have the same internal clock rate, the throughput of the proposed architecture is twice that of the row-column architecture. To double the pixel rate, two input ports or doubling the external clock for pixel input can be used. Nowadays, a 1-D DCT architecture can run at least 100 MHz. Therefore, the proposed architecture can run at least a pixel rate of 200 MHz. It should be noted that if the bit-serial word-parallel is adopted for implementation, the total transistors will be far smaller than these data in Table III. VI. CONCLUSIONS This research analyzes the 2-D DCT/IDCT algorithm using direct method. The required multiplications are about halved to the row-column algorithm. In order to overcome the irregular problem, a regular architecture derived from the proposed algorithm for the 8 8 DCT/IDCT is developed. After folding the architecture, the chip size will become very allowable economically for VLSI implementation. Traditionally, direct method has less computation complexity but [1] N. Ahmed, T. Natarajan, and K. R. Rao, Discrete cosine transform, IEEE Trans. Commun., vol. COM-23, pp , Jan [2] D. Slawecki and. Li, DCT/IDCT processor design for high data rate image coding, IEEE Trans. Circuits Syst. Video Technol., vol. 2, pp , June [3] S. Uramoto et al., A 100 MHz 2-D discrete cosine transform core processor, IEEE J. Solid-State Circuits,vol. 27,pp ,Apr [4] T. Miyazaki et al., DCT/IDCT processor for HDTV developed with DSP silicon complier, J. VLSI Signal Processing, vol. 5, pp , June [5] A. Madisetti and A. N. illson, A 100 MHz 2-D DCT/IDCT processor for HDTV applications, IEEE Trans. Circuits Syst. Video Technol., vol. 5, pp , Apr [6] P. Duhamel and C. Guillemot, Polynomial transform computation of 2-D DCT, in Proc. ICASSP 90, Apr. 1990, pp [7] M. Vetterli, Fast 2-D discrete cosine transform, in Proc. ICASSP 85, Mar. 1985, pp [8] N. I. Cho and S. U. Lee, Fast algorithm and implementation of 2-D discrete cosine transform, IEEE Trans. Circuits Syst., vol. CAS-38, pp , Mar [9] N. I. Cho, I. Y. Dong, and S. U. Lee, On the regular structure for the fast 2-D DCT algorithm, IEEE Trans. Circuits Syst. II, vol. 40, pp , Apr [10] J. H. Hsiao, L. G. Chen, T. D. Chiueh, and C. T. Chen, High throughput CORDIC-based systolic array design for the Discrete Cosine Transform, IEEE Trans. Circuits Syst. Video Technol., vol. 5, pp , June [11] B. G. Lee, A new algorithm to compute the discrete cosine transform, IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp , Dec [12] ISO/IEC JTC1/SC29/G10, JPEG Committee Draft CD 10918, [13] ISO/IEC JTC1/SC29/G11, MPEG Committee Draft CD 11172, [14] Compass 0.6-Micron, 5-Volt, High-Performance, Standard Cell Library, June Yung-Pin Lee (S 91) was born in Taipei, Taiwan, in He received the B.S. degree in electrical engineering from National Taiwan University, Taipei, in He is currently a Ph.D. candidate in electrical engineering from National Taiwan University and will graduate in His current research interests include video and audio coding systems, DSP architecture design, video signal processor design, and VLSI signal processing. Thou-Ho Chen was born in Kaohsiung, Taiwan, R.O.C., in He received the B.S. and M.S. degrees in electrical engineering from National Cheng Kung University in 1985 and 1989, respectively, and the Ph.D. degree in electrical engineering from National Taiwan University, Taipei, in Currently, he is an Associate Professor in the Department of Electronics Engineering at Nan-Tai Institute of Technology. He is also the chief consultant of Arphic Technology Co. His research interests are fault-tolerant computing, VLSI design in DSP/DIP architecture, and applications of computer graphics.

9 LEE et al.: COST-EFFECTIVE ARCHITECTURE FOR D DCT/IDC 467 Liang-Gee Chen (S 84 M 86 SM 94) was born in Yun-Lin, Taiwan, in He received the B.S., M.S., and Ph.D. degrees in electrical engineering from National Cheng Kung University in 1979, 1981, and 1986, respectively. He was an Instructor ( ) and an Associate Professor ( ) in the Department of Electrical Engineering, National Cheng Kung University. In the military service during 1987 and 1988, he was an Associate Professor in the Institute of Resource Management, Defense Management College. In 1988, he joined the Department of Electrical Engineering, National Taiwan University, Taipei. During 1993 to 1994 he was Visiting Consultant of DSP Research Department, AT&T Bell Lab, Murray Hill, NJ. Currently, he is Professor of National Taiwan University. His current research interests are DSP architecture design, video processor design, and video coding system. Dr. Chen is a senior member of the Association for Computing Machinery. He is also a member of the honor society Phi Tau Phi. He received the Best Paper Award from ROC Computer Society in 1990 and In 1991, 1992, 1993, and 1994, he received Long-Term Paper Awards annually. In 1992, he received the Best Paper Award of the 1992 Asia-Pacific Conference on Circuits and Systems in VLSI design track. In 1993, he received the Annual Paper Award of Chinese Engineer Society. Mei-Juan Chen (S 91) was born in Taipei, Taiwan, in She received the B.S., M.S. and Ph.D. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1991, 1993, and 1997, respectively. Since 1993, she has been an instructor at St. John and St. Mary Institute of Technology. Her major research interests include image processing, video compression, and VLSI design. Dr. Chen is a member of the honor society Phi Tau Phi. In 1993, she received the Long-Term Paper Award and Xerox Paper Award. Chung-ei Ku (S 91) was born in Taipei, Taiwan, in He received the B.S. degree in electrical engineering from National Taiwan University, Taipei, in He is currently a Ph.D. candidate in electrical engineering from National Taiwan University. His current research interests include visual signal representation, very low bit-rate video coding, multimedia telephony, and VLSI design for DSP.

DUE to the high computational complexity and real-time

DUE to the high computational complexity and real-time IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 15, NO. 3, MARCH 2005 445 A Memory-Efficient Realization of Cyclic Convolution and Its Application to Discrete Cosine Transform Hun-Chen