DCT/IDCT Constant Geometry Array Processor for Codec on Display Panel


JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.18, NO.4, AUGUST, 2018

Jaehee You

Manuscript received Aug. 3, 2017; accepted Jul. 3, 2018. School of Electronic and Electrical Engineering, Hongik University, 7-1 Sangsudong, Mapogu, Seoul, Korea. E-mail: jaeheeu@hongik.ac.kr

Abstract: Algorithms, architectures and an 8 x 8 processing element (PE) chip design are discussed based on constant geometry 2-Dim. DCT and IDCT. Both DCT and IDCT can be computed with the same hardware for image codec applications. An array of identical PEs is used for the butterfly stages as well as the recursive addition stages; only the regular interconnections between stages are programmed, which provides fault tolerance for system on panel or flexible display panels. Methodologies to optimize the computation speed and the amount of required hardware are also discussed. An efficient distributed arithmetic ROM for coefficient multiplication is proposed to minimize chip area and to facilitate PE programming. Chip area, computation time and image quality are evaluated, and the advantages are discussed.

Index Terms: DCT, IDCT, system on panel, constant geometry, fault tolerance, distributed arithmetic

I. INTRODUCTION

Nowadays, mobile devices such as smart phones need higher-resolution images, which require higher data rates and consume more power. System on Panel (SOP) [1] and System on Glass (SOG) [2] approaches are being investigated to integrate all the peripheral systems on the display panel, eliminating the overheads due to interconnections. They make high-bandwidth image data transfer from the peripherals [3] possible with on-chip interconnections. The weight and system size due to cables, connectors and the controller can also be reduced. For a flexible display panel, total integration of all the peripherals is necessary. However, SOP has an inherent drawback in the yield of processing technologies such as LTPS and organic semiconductors.
It can be mitigated with fault tolerance techniques such as redundancy. So far, mainly the immediate peripheral circuits such as the data driver, gate driver and DC-DC converter have been considered for integration on the display panel [4]. An experimental microprocessor with a small-scale frame buffer and audio circuits integrated on the display panel has also been reported [1, 5, 6]. These results suggest that more systems, such as an image codec or frame memory, will be integrated on the display panel in the future [7]. Considering the low-yield processing technologies and the power, size and weight constraints, DCT and IDCT are the first candidates for an SOP image codec, since [8] found that a 2-Dim. discrete cosine transform (DCT) implemented as an ASIC reduces energy use by 36.8% with compression rates of a factor of ten to thirty. Since most mobile devices need an image codec for multimedia messaging, it is essential to implement both DCT and IDCT [9, 10]. Fig. 1 compares a conventional display system with the proposed SOP with on-chip DCT/IDCT. Next to the LCD panel backlight and the CPU, off-chip bus communication and the frame buffer are the major sources of power consumption [9, 10], and they can be minimized by on-chip implementation. If, in addition, IDCT is used for frame memory data decompression, the frame memory bus data rate, the size of the frame buffer and the corresponding power consumption can be reduced by a factor of ten to thirty, which is the compression rate based on DCT. Clearly, this would be beneficial especially for mobile display systems [11].

Fig. 1. System on panel with compressed image storage and transfer.

The low yield of SOP processing technology can be alleviated by fault tolerance capability. The existing DCT and IDCT architectures have difficulty realizing fault tolerance and optimizing the speed and the amount of hardware for each application area, because their architectures are irregular [12]. Most 2-D DCT or IDCT VLSI algorithms can be categorized as 2-D direct DCT [13] or 2-D DCT with row-column transposition [14]. Both have irregular interconnection structures among the computation units, which is not suitable for fault tolerance. Systolic array architectures [15] require processing elements interconnected with irregular data delay stages to control the data flow timing. To solve these problems, the author proposed a high-speed DCT algorithm that computes the DCT from decomposed matrices with the same structure for arbitrary radices [16]. In [17], the author presented constant geometry DCT and IDCT VLSI algorithms to facilitate fault tolerance and testability. However, the butterfly stage and the recursive addition stage still have two different structures, which must be made identical before they can be implemented with identical hardware. Distributed arithmetic [18, 19] can replace multipliers to reduce hardware, which is advantageous for low-yield processing technologies. In [20], a bit-serial/bit-parallel architecture is used to implement a 16 x 16 DCT chip with distributed arithmetic to optimize the amount of hardware. However, it also lacks modularity and flexibility, so it is not suitable for redundancy replacement.
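The idea behind distributed arithmetic, replacing coefficient multipliers by a lookup table plus shift-and-add, can be illustrated with a minimal sketch. The function names (`build_da_rom`, `da_inner_product`), the 8-bit word length and the integer coefficients are assumptions chosen for illustration; this is not the ROM organization proposed in this paper.

```python
def build_da_rom(coeffs):
    """Precompute every partial sum of the coefficients (2**K entries)."""
    rom = []
    for addr in range(1 << len(coeffs)):
        # Bit k of the address selects whether coeffs[k] is included.
        rom.append(sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1))
    return rom


def da_inner_product(rom, xs, bits=8):
    """Bit-serial evaluation of sum(coeffs[k] * xs[k]) without multipliers.

    xs are signed integers that fit in `bits`-bit two's complement.
    """
    acc = 0
    for b in range(bits):
        # Gather bit b of every input word into one ROM address.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        partial = rom[addr] << b  # shift-and-add replaces multiplication
        # The sign bit of two's complement has negative weight.
        acc += -partial if b == bits - 1 else partial
    return acc


coeffs = [3, -5, 7, 2]        # hypothetical integer coefficients
xs = [10, -20, 127, -128]     # hypothetical 8-bit input samples
rom = build_da_rom(coeffs)
result = da_inner_product(rom, xs)
direct = sum(c * x for c, x in zip(coeffs, xs))
```

In this sketch the ROM grows as 2^K with the number of inputs K, which is one reason small butterflies keep a distributed arithmetic table compact.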
In this paper, a VLSI architecture and its implementation methodologies are introduced for a processing element that is programmable for both DCT and IDCT with constant geometry and is suitable for SOP. Chapter II describes a DCT and IDCT constant geometry algorithm with identical butterfly and recursive addition stages. A DCT/IDCT VLSI architecture and VLSI implementation methodologies, with experimental 8 x 8 processing element chip design results, are described in chapters III and IV, respectively. Performance analyses and advantages are discussed in chapter V.

II. 2-DIM. CONSTANT GEOMETRY DCT/IDCT ALGORITHM

As shown in Fig. 1, as one possible SOP display panel system, a decoder with IDCT can be integrated on the display panel so that only the compressed image data are transferred to the display panel, saving more than 5% of energy, which will be discussed in detail later. An on-chip embedded frame buffer is also used, with an on-chip bus for the transfer of the large amount of uncompressed frame data. The size of the frame buffer can be reduced by using an expanded line buffer to minimize the time difference between the decompression and the fetching of the decompressed data. To implement the DCT and IDCT in display panels, an array of identical PEs with redundancies can alleviate the low yield problem. A constant geometry architecture for both the butterfly and the recursive addition stages therefore allows DCT and IDCT to be implemented using the same PEs, which brings the advantages of testability and fault tolerance. For a unified architecture for DCT and IDCT, a 2-Dim. IDCT algorithm can be obtained as shown in (), basically by taking the inverse of the DCT. The shuffling for the inputs ($S \otimes S$), the outputs ($Q \otimes Q$) and the interstage interconnections (SF) can be implemented by programming the cross points of horizontal and vertical interconnections as shown in Fig. 2.
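The remark that the IDCT is obtained by inverting the DCT can be illustrated with a small numerical sketch. It assumes the common orthonormal DCT-II normalization, for which the inverse of the transform matrix is simply its transpose; the paper's own scaling and decomposition may differ, and the helper names below are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix C (assumed normalization): X = C x."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct2d(block):
    """2-Dim. DCT as row and column 1-Dim. DCTs: Y = C X C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2d(coeffs):
    """Inverse transform: X = C^T Y C, valid because C is orthogonal."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


# Hypothetical 8 x 8 test block; any values work for the round trip.
block = [[(3 * r + 5 * c) % 17 for c in range(8)] for r in range(8)]
restored = idct2d(dct2d(block))
max_error = max(abs(restored[r][c] - block[r][c])
                for r in range(8) for c in range(8))
```

The round-trip error stays at floating-point rounding level, which is the property a shared DCT/IDCT datapath relies on.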
Only the high-yield interconnections between butterfly stages, excluding the low-yield semiconductor devices, need to be programmed, and they need no redundancy. Furthermore, since interconnections are less susceptible to yield loss, if the entire DCT and IDCT can be built from the same PEs, the image codec system can be implemented with shared redundancies, using only interconnection programming, to alleviate the low yield

Fig. 2. Constant geometry DCT/IDCT PE array with redundancies.

problem. The proposed array architecture is advantageous for parallel computation, which enhances the system speed to compensate for low-speed SOP devices. The butterflies can also be computed recursively with reduced hardware. The architecture therefore offers scalability between the amount of required hardware and the computation speed.

Fig. 3. Procedures to make $I_{N/p} \otimes T$ from $A_q$ and $A_t$ (a) Duplication and row shuffling for $A_q$ to $V_q$, (b) Column shuffling for $V_{t-1}$ to $B_{t-1}$ ($q = t - 1$), (c) Column shuffling for $B_q$ to $I_{N/p} \otimes T_q$ ($1 \le q \le t - 1$), (d) The structure of $A_t$ and $I_{N/p} \otimes T_t$ ($q = t$), (e) The structure of $I_{N/p} \otimes T_q$ ($1 \le q \le t$).

The Appendix shows the definitions and a 2-Dim. constant geometry DCT and IDCT algorithm [16] for general radices and input lengths, which is the starting point of the algorithm development. The algorithm in [16] has the drawback that the recursive addition stages and the butterfly stages differ in structure, which requires separate hardware. In this paper, the recursive addition stage is revised to have the same structure as the butterfly stage. Shuffling stages are decomposed to utilize symmetry for VLSI implementation. All the derivations also apply to the IDCT algorithm, giving a unified DCT/IDCT processor with identical hardware. First, for the general case of radix $p$ and $N \times N$ points, the 1-Dim. DCT and IDCT matrices are derived to have identical submatrices in the diagonal direction by Kronecker products and data shuffling operations based on (1)-(6) [16], as shown in Fig. 3(a)-(e). As shown in Fig. 3(a), the structure of the submatrices of $A_q$ is irregular across the stages $q$ ($1 \le q \le t$). Hence, it is hard to implement all decomposed matrices with an identical chip structure.
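The target structure of this derivation, identical $p \times p$ submatrices repeated along the diagonal, is what a Kronecker product with an identity matrix produces. The following minimal sketch shows this for radix 2 and $N = 8$; the 2 x 2 matrix `t` is a hypothetical stand-in for a butterfly submatrix, not a coefficient set from the paper.

```python
def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def is_block_diagonal(m, p):
    """True if every nonzero entry lies inside a p x p diagonal block."""
    return all(m[i][j] == 0 or i // p == j // p
               for i in range(len(m)) for j in range(len(m)))


p, n = 2, 8
t = [[1, 1], [3, -3]]               # hypothetical 2 x 2 butterfly submatrix
stage = kron(identity(n // p), t)   # I_{N/p} (x) T, an 8 x 8 matrix
other = kron(t, identity(n // p))   # T (x) I_{N/p}: same entries, spread out
```

Here `stage` consists of four copies of `t` on the diagonal, so one PE design can serve every block, whereas `other` holds the same entries in an interleaved layout that needs reordering first.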
The methodologies that make the structure of the decomposed matrices identical for any stage $q$ ($1 \le q \le t$) will be explained in detail later; they rearrange the elements at the same position of each submatrix, which repeat regularly in each hierarchical stage. The rearranging methods for $A_q$ ($1 \le q \le t - 1$) and $A_q$ ($q = t$) are described in Fig. 3(a)-(c) and Fig. 3(d), respectively; they differ because the internal structures of the submatrix $D$ are different. The following procedures change $A_q$ into the 1-Dim. DCT decomposed matrix $I_{N/p} \otimes T_q$, which has $p \times p$ submatrices arranged in the diagonal direction, through $V_q$ and $B_q$. First, $D$ is changed to an $N \times N$ decomposed matrix by duplicating the elements inside $D$ using the Kronecker product in (3). Then, the elements at the same position in every submatrix that is vertically arranged in $D$ are collected (reordering (I)) to get $V$, as shown with the thick lines in the middle of Fig. 3(a). $V$ is changed to $B$ by the column shuffling in (4), which collects the elements located at the same position in the submatrices arranged diagonally in $V$ (reordering (II)), as shown with the thick lines in Fig. 3(b). The column shuffling in (5) then collects the submatrices located at the same position inside every submatrix that is horizontally arranged in $B$ (reordering (III)), as shown at left side

5 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.18, NO.4, AUGUST, of Fig. 3(c) inside thick line box. Finally, I N/p T q can be obtained with p x p submatrices arranged diagonally, which correspond to decomposed matrices of 1-Dim. DCT. The detailed procedures of reordering I, II, III will be discussed later. For the case of A t, according to (6), A t = I N/p D t as shown in the right side of Fig. 3(d), can be obtained having the same structure of 1-Dim. DCT decomposed matrix derived above without going through the intermediate stage matrices described later, which is different from the case of A q described above using (6). [ ( )] = ( ) [ ( )] where, Q = (Q ), = / ( /, / ) = / = / (, ),, (, ),, (, ) ( 1, 1) = / (1 1) D D, D = D, D = D, D D, / D,, = D,, D,, D,,, D,, (, ) ( / ) = cos (a diagonal matrix of order / ) j p 1, s p 1, n 1, 1 m p/, L = p (1) D,, (, ) πj(pn + 4sN/L + m 1) = cos pn/l ( / ) where, k = / = cos kπ N The detailed descriptions of the rearrangement procedures discussed above are as follows. For A q (1 q t - 1), for the reordering(Ⅰ), the hierarchical structure of submatrix D can be described as follows. [D q ] has 3 hierarchical stages as shown in Fig. 3(a). First, D q j (1 j p - 1) are arranged in column direction inside of [D q ]. And [D q j, m ] (1 m p/) are arranged diagonally in D q j. And [D q j, m, s ] (1 s p - 1) are arranged in row direction inside of each [D q j, m ]. And lastly, as shown in upper right part of Fig. 3(a), D,, (, ) = cos ( / ) / () n 1 are arranged diagonally from top left with n changing from to u - 1 for every [D q j, m, s ]. [X(k)] = Q (V V ) E [x(n)] Where, V = π(p, N) A π(p, N) ( ) = π(p, N) D I / 1 q t (3) By I L/p in (3) shown in the right side of Fig. 3(a), D of size pn pn is expanded into D L L q I L/p of size N N as shown in the middle of Fig. 3(a), by duplicating j, m, D s q (n, located inside L/p times, maintaining the n) matrix structure of D with respect to j, m and s. 
Then, by reordering(Ⅰ), as shown from the middle to bottom in Fig. 3(a), D q I L/p is changed to V q using π(p, N) according to Eq. (3). First, make a p x 1 submatrix by collecting elements located at the same position in every submatrix D q j arranged in column direction from top to bottom as shown in the thick box of D q I L/p. And by arranging p x 1 submatrices diagonally to make V q m, s and arranging V q m, s horizontally to make V q m (1 m p/) and arranging V q m diagonally, V q can be obtained finally maintaining the same structure of D q j, m, s (n, n) with respect to m and s collecting with respect to only j. [X(k)] = Q (B B ) S [x(n)] Where, B = V B = V π(p/, p /) I / B = (π(p/, N/L) I ) V π(p/, pn/l) I / S = (π(p/, N/) I ) E, L = p, 1 q t (4) In Fig. 3(b), V q m, s shown at the bottom of Fig. 3(a), is represented as (m, s) for simplicity. B q can be derived from V q by reordering(Ⅱ) as follows. As shown in (4), (π(p/,p /) T I N/p ) at the right side and both (π(p/, N/) I L ) and (π(p/, pn/l) T I L/p ) at left and right side, are used for the case of V t-1 (q = t - 1) and V (1 q t - 1), respectively. As shown inside the thick box at the left side of Fig. 3(b), for every submatrix V q m ((m, ) (m, s) (m, p-1)) arranged diagonally, N N p diagonal submatrix((1, ), p (, ) (/p, )) are rearranged in diagonal direction from the left top while maintaining s but changing m to

6 48 JAEHEE YOU : DCT/IDCT CONSTANT GEOMETRY ARRAY PROCESSOR FOR CODEC ON DISPLAY PANEL obtain B with p times of N N/p submatrices arranged horizontally. B = (I / T ) π(p, N) (5) By reordering (Ⅲ), as shown in Fig. 3(c), by π(p, N) -1 at the right side of (5), I N/p T q can be derived with p x p submatrix arranged diagonally. The p x p submatrix is obtained by collecting p x 1 matrix located at the same diagonal position in every N N/p submatrix in row direction sequentially, which is shown by the thick box in B. I / D = I / T where A = I / D (q = t) V = π(p, N) A π(p, N) ( ) = B = (I / T ) π(p, N) π(p, N) A π(p, N) ( ) = I / T π(p, N) π(p, N) A (π(p, N) ) = (I / T ) A = I / D = (I / T ) (6) For the case of A t, since the submatrix D t in A t is already composed of p x p matrices as shown in Fig. 3(d), using A t = I N/p D t = I N/p T t in (6), I N/p T t like I N/p T q can be obtained without going through V q and B matrices, which are to obtain A q described above. According to the procedures above, 1-Dim. DCT decomposed matrix can be obtained with the structure of N x N I N/p T q with p x p submatrix arranged diagonally for all q from the decomposed matrices A q and A t which have different structures. This means that 1-Dim. DCT can be computed by PEs with the same structure, with a butterfly corresponding to the p x p submatrix for radix p. Also, radix p can be adjusted to optimize the complexity of the PE that computes a butterfly, the number of butterflies in a butterfly stage and the degree of parallel processing accordingly, which will be discussed later. In Fig. 4, for the example of radix (p = ), 8 x 8 point (N = 8), the procedures are described to make the structure of submatrix D q of A q identical to I N/p T q for every q(1 q t)th stage by both kronecker product in (1) to (6) and the reordering described above. Since this paper is to implement a DCT/IDCT PE chip for p =, N = 8 8, first, the procedures are described to make a 1-Dim. 
DCT decomposed matrix identical for the case of $p = 2$, $N = 8$. The coefficients in the matrices shown in Fig. 4 are indexed by $j$, $s$, $n$, $m$ in (1) for the case of $p = 2$, $N = 8$, and each case is shown inside the box in terms of $k$ according to (2). The cases of $A_q$ ($1 \le q \le t - 1$) and $A_t$ ($q = t$) above are described in Fig. 4(a)-(b) and Fig. 4(c), respectively. First, for the case of $A_q$ ($1 \le q \le t - 1$), as in (1), the internal matrix $D_q$ ($1 \le q \le 2$) is composed of the 8 x 8 $D_1$ and the 4 x 4 $D_2$, shown at the left side of Fig. 4(a). Fig. 4(a) shows the procedure to obtain the 8 x 8 decomposed matrix by repeating the internal elements of $D_1$ and $D_2$ with $I_{L/p}$ in (3). The subsequent procedures are shown in Fig. 4(b) only for the case of $q = 2$. To illustrate how $V_2$ is obtained, the top of Fig. 4(b) shows how a 2 x 1 matrix is formed by collecting the elements at the same vertical location in every 4 x 8 submatrix arranged vertically, using $\pi(p, N)$ in (3) for reordering (I) as in the bottom of Fig. 3(a), and how the 2 x 1 matrices are rearranged according to the locations of the elements in the 4 x 8 submatrix. For reordering (II) in Fig. 3(b), because of the range of $m$, the repetition count of $V$ in the diagonal direction equals 1 for $p = 2$, as shown in Fig. 4(b), so there is only one $V$ diagonally. Therefore, as shown from the top right to the bottom left matrices of Fig. 4(b), there is no change between $V$ and $B$. For reordering (III) as in Fig. 3(c), by $\pi(p, N)^{-1}$ at the right side of (5), as shown at the bottom of Fig. 4(b), $I_4 \otimes T_2$ is obtained by collecting the nonzero 2 x 1 submatrices at the same location in every 8 x 4 submatrix arranged horizontally and then rearranging the resulting 2 x 2 submatrices diagonally. In the case of $A_t$, as shown in Fig. 4(c), by (6), $I_4 \otimes T_3$ with 2 x 2 submatrices arranged diagonally is obtained without the procedures to get $V_q$ and $B_q$ that were explained earlier for $A_q$ ($1 \le q \le t - 1$).
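The effect of such a reordering, gathering elements that sit at the same position of every submatrix into small diagonal blocks, can be sketched with a stride permutation. The definition of `stride_perm` below (a perfect shuffle sending index $a \cdot N/p + b$ to $b \cdot p + a$) is an assumption that may differ from the Appendix definition of $\pi(p, N)$, and the matrix `t` is a hypothetical 2 x 2 submatrix.

```python
def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def stride_perm(p, n):
    """sigma[i] for the assumed shuffle: a*(n//p) + b  ->  b*p + a."""
    m = n // p
    return [(i % m) * p + (i // m) for i in range(n)]


def permute(mat, sigma):
    """Apply the same index permutation to the rows and the columns."""
    n = len(mat)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[sigma[i]][sigma[j]] = mat[i][j]
    return out


p, n = 2, 8
t = [[1, 2], [3, 4]]                  # hypothetical 2 x 2 submatrix
scattered = kron(t, identity(n // p))  # entries of t spread N/p apart
sigma = stride_perm(p, n)
gathered = permute(scattered, sigma)   # 2 x 2 blocks along the diagonal
```

For $p = 2$, $N = 8$ the permutation is `[0, 2, 4, 6, 1, 3, 5, 7]`, and `gathered` equals `kron(identity(4), t)`: only wiring changes, no arithmetic, which is why such shuffles can be realized by interconnection programming.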
Z = I / π(p, N) I B B I / π(p, N) I = I / π(p, N) I I / T π(p, N) I / T π(p, N) I / π(p, N) I = I / T T I / π(p, N) I π(p, N) π(p, N) I / π(p, N) I = K SF (7)
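The Kronecker product of a pair of 1-Dim. decomposed matrices at the center of (7) reflects the standard way a separable 2-Dim. transform is built from a 1-Dim. one. The sketch below checks that identity numerically: applying a 1-Dim. transform matrix `c` to the rows and columns of a block equals multiplying the row-major flattened block by `kron(c, c)`. The integer matrix `c` is a hypothetical stand-in for a DCT matrix, chosen so the comparison is exact.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def flatten(m):
    """Row-major vectorization of a matrix."""
    return [x for row in m for x in row]


c = [[1, 2, 0], [0, 1, -1], [2, 0, 1]]   # hypothetical 1-Dim. transform
x = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]   # hypothetical input block

# Row-column form: Y = C X C^T, then flattened.
separable = flatten(matmul(matmul(c, x), transpose(c)))

# Single-matrix form: (C (x) C) applied to the flattened input.
vec_x = flatten(x)
kron_form = [sum(w * v for w, v in zip(row, vec_x)) for row in kron(c, c)]
```

Both forms give the same vector, so any factorization of the 1-Dim. matrix into stages carries over to the 2-Dim. case stage by stage, up to the shuffling needed to regroup the factors.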

Fig. 4. Procedures to make $I_{N/p} \otimes T_q$ from $D_q$ ($1 \le q \le t$) (a) Expansion process from $D_q$ to $D_q \otimes I_{L/p}$ ($1 \le q \le t - 1$), (b) Column and row shuffling from $D_{t-1} \otimes I_{L/p}$ to $I_{N/p} \otimes T_{t-1}$ ($q = t - 1$), (c) Structure of $A_t$ and $I_{N/p} \otimes T_t$ ($q = t$).

Fig. 5. Procedures to obtain 2-Dim. DCT/IDCT from 1-Dim. DCT/IDCT (a) A way to make 1-Dim. IDCT decomposed matrices from 1-Dim. DCT, (b) 2-Dim. DCT decomposed matrices, (c) Procedures to obtain 2-Dim. IDCT decomposed matrices.

Fig. 5 shows how to obtain the 1-Dim. IDCT matrix based on the 1-Dim. DCT matrix shown in Fig. 3(e). Based on these procedures, the 2-Dim. DCT/IDCT matrix can be obtained using the Kronecker product and the permutations in (5) and (7) from the 1-Dim. DCT decomposed matrix $B$, whose nonzero elements are shown at the left side of Fig. 3(c). The method to obtain the 2-Dim. IDCT from the 2-Dim. DCT is the same as obtaining the 1-Dim. IDCT from the DCT, which

is shown in Fig. 5(a). The 2-Dim. DCT/IDCT matrix also has decomposed submatrices of identical size arranged diagonally for every stage, like the 1-Dim. DCT/IDCT matrix shown in Fig. 3(e). All the 1-Dim. DCT matrices ($I_{N/p} \otimes T_q$) shown in Fig. 3(d) have the same size and structure, but the values of the diagonally arranged submatrices $T_q$ differ among the butterfly stages. To show the differences, the $u$ submatrices in the $q$-th butterfly stage are defined as $T_{qn}$ ($1 \le n \le u$). The 1-Dim. IDCT matrix is composed of the inverse of ($I_{N/p} \otimes T_{qn}$) as shown in Fig. 5(a), and the diagonal elements of $(I_{N/p} \otimes T_{qn})^{-1}$ become the inverses of the submatrices $T$, which are the diagonal elements of the DCT described above, using Appendix Def. 4. Comparing the DCT and IDCT in Fig. 5(a), both have the same structure of diagonally arranged submatrices, except for the scaling factors corresponding to the submatrix determinant $|T_{qn}|$ and the sign and location of the matrix elements, which are easily programmable in hardware. Therefore, both DCT and IDCT can be implemented with identical PE chips. Fig. 5(b) shows the structure of the $N \times N$ 2-Dim. DCT based on the $N \times N$ 1-Dim. DCT matrix with the $p \times p$ submatrices arranged diagonally. As explained in [16], to reduce the global interconnections of the 2-Dim. DCT/IDCT system, like the decomposed matrix $Z$ in (6), the decomposed matrix can be restructured with shuffling matrices at both the left and right sides and the Kronecker product of a pair of $B_q$, i.e. ($B_q \otimes B_q$), at the center, which corresponds to the 1-Dim. DCT decomposed matrix shown in (4). For the regularity of the PE chip, as in the 1-dimensional case, with the permutation in (5) in the middle of (7), $B$ can be restructured into $I_{N/p} \otimes T_q$, which has its submatrices arranged diagonally. Using Lemma 1 and 3, the decomposed matrix $Z$ can be restructured into the 2-Dim.
DCT matrix $I_{N^2/p^2} \otimes T_q \otimes T_q$ and the shuffling matrices at the bottom of (7), which will be explained later in detail. Like the input shuffling at the left side of the butterfly stage in Fig. 6, the shuffling matrices can be implemented by the interconnections between PEs. As shown by $\pi(p, N)$ at the bottom of (7), for all the butterfly stages, all the decomposed matrices depend only on the variables $p$ and $N$, which are constant, enabling the use of an identical chip structure. This is advantageous for semiconductor chip implementation, which needs only repeated use of PE chips of the same structure.

Fig. 6. 8 x 8 constant geometry DCT and IDCT butterfly stage diagram (a) 8 x 8 DCT, (b) 8 x 8 IDCT.

Comparing the 2-Dim. DCT matrix $I_{N^2/p^2} \otimes T_q \otimes T_q$ with the submatrix of the 1-Dim. DCT, $I_{N/p} \otimes T_{qn}$, shown in Fig. 3(d), the 2-Dim. DCT matrix has the submatrix $T_{qn} \otimes T_{qv}$, the Kronecker product of the $T_{qn}$ and $T_{qv}$

pair, instead of $T_{qn}$ for the 1-Dim. DCT, in the diagonal direction. The $p^2 \times p^2$ submatrix of the 2-Dim. DCT matrix in the $q$-th butterfly stage is defined as $T_{qn} \otimes T_{qv}$ ($n, v \le u - 1$) for two of all $u$ cases of $T_q$ ($T_{qn}$ and $T_{qv}$), as shown in Fig. 5(b), based on the definition of $T_{qn}$, the submatrix of the 1-Dim. DCT matrix. The 2-Dim. IDCT can be obtained from the 2-Dim. DCT as shown in Fig. 5(c), in the same way that the 1-Dim. IDCT is obtained from the 1-Dim. DCT. $(I_{N^2/p^2} \otimes T_{qn} \otimes T_{qv})^{-1}$ for the 2-Dim. IDCT, shown at the top of Fig. 5(c), can be represented as $p^2 \times p^2$ submatrices $(T_{qn} \otimes T_{qv})^{-1}$ arranged in the diagonal direction, as shown in the middle of Fig. 5(c), using Appendix Def. 3 as in the 1-Dim. IDCT case. Using Appendix Def. 3 again, each submatrix can be represented as $T_{qn}^{-1} \otimes T_{qv}^{-1}$, the Kronecker product of the $p \times p$ submatrices $T_{qn}^{-1}$ and $T_{qv}^{-1}$ for the 1-Dim. IDCT, shown at the bottom of Fig. 5(c). The 1-Dim. DCT submatrix $T_{qn}$ shown in Fig. 3(d) can be implemented with LUTs. The 1-Dim. IDCT, which has the same structure as the 1-Dim. DCT, can be implemented with the same LUT and simple additional hardware for scaling, sign reversal and shuffling of the elements. In the 2-Dim. IDCT submatrix $T_{qn}^{-1} \otimes T_{qv}^{-1}$ shown at the bottom of Fig. 5(c), the determinants $|T_{qn}|$ and $|T_{qv}|$ can be realized by a scaling factor, and the rest can be implemented with LUTs without any change of the coefficients in the LUT, only with a change of access order and sign reversal of the coefficients. Therefore, the 2-Dim. IDCT submatrix can be realized by the 2-Dim. DCT submatrix PE. The detailed PE chip design structure will be described later.

Proposed architectures based on the constant geometry DCT and IDCT algorithms and detailed VLSI implementation methodologies are described with the radix-2, 8 x 8 DCT as an example. They are easily applicable to the IDCT, since DCT and IDCT have the same computation structure.
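The claim that the inverse submatrices reuse the forward coefficients with only scaling, sign reversal and reordering follows from two standard facts: the inverse of a 2 x 2 matrix is its adjugate divided by the determinant, and the inverse of a Kronecker product is the Kronecker product of the inverses. A minimal exact-arithmetic sketch is given below; the matrices `t_n` and `t_v` are hypothetical 2 x 2 submatrices, not values from the paper.

```python
from fractions import Fraction


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def inv2(t):
    """Inverse of a 2 x 2 matrix from its own four entries."""
    (a, b), (c, d) = t
    det = Fraction(a * d - b * c)
    # Diagonal entries swap places, off-diagonal entries change sign,
    # and everything is scaled by 1/det.
    return [[d / det, -b / det], [-c / det, a / det]]


t_n = [[1, 1], [3, -3]]    # hypothetical 1-Dim. submatrix
t_v = [[1, 1], [5, -5]]    # hypothetical 1-Dim. submatrix
forward = kron(t_n, t_v)                 # 4 x 4 2-Dim. submatrix
inverse = kron(inv2(t_n), inv2(t_v))     # built from 1-Dim. inverses only
product = matmul(inverse, forward)
```

`product` is the 4 x 4 identity, so a table holding the forward coefficients is enough for the inverse as well, given a scaling step and programmable sign and access order.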
8 x 8 DCT can be derived as shown at the right side of (8), where each butterfly stage, Z q in (7) is decomposed into K q and SF. [X(k N + k )] = (Q Q) (I π(,8) I ) (K SF) (K SF) (K SF) (I π(,8) I ) (S S) [X(n N + n )] Where, Q = (Q Q Q ), Q = I / R π(n/l, pn/l) K = (I T T ), T = twiddle factor matrix SF = (I π(,8) I ) π(,8) π(,8) (I π(,8) I ) k, k, n, n =, 1,,,63 i = 1,, log 8, kronecker product (8) By regrouping the shuffling for the input and output as well as inter-butterfly stage, the inter-butterfly stage shuffling can be moved after each butterfly computation, K with OS and IS as in (9). [X(k N + k )] = (Q Q) OS (SF K ) (SF K ) (SF K ) IS [x(n N + n )] where, OS = (I π(,8) I ) SF, IS = SF (I π(,8) I ) (S S) (9) To devise an architecture to exploit constant geometry for both butterfly and recursive addition stage, firstly, the structure of K i SF (i th [X(k N + k )] butterfly stage) is analyzed. = RA RA OS (SF K ) (SF K ) (SF K ) IS [x(n N + n )] Where, RA = π(8,64) (I Q) (1) -Dim. IDCT can be represented as (11) by the inverse matrices of -Dim. DCT in (1). [x(n N + n )] = IS (K SF ) (K SF ) (K SF ) OS (Q Q) [X(k N + k )] (11) Thanks to constant geometry properties, SF (butterfly stage input shuffling) is the same for all butterfly stages as shown in (8). Both SF and SF -1 can be implemented by interconnections only. It can be implemented by programming cross points of horizontal and vertical for interstage interconnections. Only the twiddle factors in T T inside K i are different for butterflies, which can be programmable which is discussed later. The inverse matrices of constant geometry DCT in (9) are constant geometry, too. A butterfly stage (SF K i ) for 8 x 8 DCT in (9) is shown in Fig. 6. To compute the butterfly and recursive addition stages with identical PE s, those stages are made identical to be implemented with a simple butterfly structure. As shown in (9) and Fig. 6, in case a butterfly

is computed by PEs serially with inputs fed by a shift register, SF, the shuffling of the 64 input data, has the disadvantage of requiring complex hardware with large latencies, because the input data need to be shuffled over large distances. Therefore, a stage-by-stage pipeline is used with a shared bus to feed the shuffled inputs to the PEs. The computation structure of the butterfly stage is described in Fig. 7(a). An identical butterfly in a 2-Dim. DCT and IDCT butterfly stage can be computed by a 4 x 4 submatrix with 4 input data. In the 4 x 4 submatrix $T_{qn} \otimes T_{qv}$ of $I_{16} \otimes T_q \otimes T_q$ at the top of Fig. 7(a), each row is composed of identical nonzero element values, except for a regular pattern of sign changes, for all butterfly stages. Each 4 x 4 submatrix requires 12 additions and 4 multiplications. The regular pattern of sign changes of the 4 x 4 matrix can be formulated with $S_0$ and $S_1$ as in Table 1, according to the order of the 2 x 2 matrices $T_{qn}$ and $T_{qv}$ in each stage, as shown at the bottom of Fig. 7(a). The methodologies to restructure $T_{qn} \otimes T_{qv}$ with $I_p \otimes T_{qv}$, $I_p \otimes T_{qn}$ ($p = 2$) and the shuffling according to (12) are presented to take advantage of repeated hardware usage utilizing the butterfly structure. The 4 x 4 matrices $I_2 \otimes T_{qn}$ and $I_2 \otimes T_{qv}$ can therefore be restructured as two regular 2 x 2 matrices $T_{qn}$ and $T_{qv}$ using Lemma 1 and 2. The 2 x 2 matrix $T$ that computes a 1-Dim. DCT butterfly stage can be represented as a butterfly that processes its input data with identical matrix elements, except for the sign difference, in each butterfly stage, as shown in the matrix in the middle of Fig. 7(a). Fig. 7(b) shows the structures of the 4 x 4 and 2 x 2 butterflies based on the butterfly computation matrix above. In conclusion, the butterfly that computes 4 input data is composed of regular repetitions of two butterflies, each computing 2 inputs, and the shuffling shown in (12). Each 2-input butterfly for the 1-Dim.
DCT has computation amount of additions and 1 multiplication. As shown at the bottom of Fig. 7(a), the final x butterfly can be realized by identical PE structure except one regular sign change according to S, S 1 in Table 1. In this paper, based on this regularity explained above, a methodology will be presented to compute the DCT/IDCT with repeated use of identical PE, a minimum computation unit for x butterfly for input data. Table 1. Sign patterns of 4 4 matrix T T = (, ) T T (, ) = (, ) T T (, ) = (, ) T T (, ) = T (, ) T (, ) = (, ) (, ) T (, ) T (, ) (a) (b) Fig. 7. Butterfly and computation structure (a) x, 4 x 4 matrices and sign patterns, (b) Butterfly computation. = ( T ) (, ) ( T ) (, ) (1) Larger (Smaller) radix increases (decreases) parallelism in a butterfly stage with higher (lower) throughput. The PE stage to compute a butterfly stage can be recursively used to compute multiple butterfly stages saving hardware with lower throughput and vice versa. Detailed performance evaluations will be discussed in V. Recursive addition stage, (Q Q) in (9) is composed of irregular structures of additions and complex I/O patterns as shown in Fig. 8. This makes DCT/IDCT system implementation with redundancies complex, requiring multiple types of PE`s for butterfly and recursive addition stages. Therefore, (Q Q) is

12 414 JAEHEE YOU : DCT/IDCT CONSTANT GEOMETRY ARRAY PROCESSOR FOR CODEC ON DISPLAY PANEL Fig. 8. Structure of recursive addition stage. decomposed into two matrices with identical structure as shown in (13) to realize constant geometry for entire DCT stages and to make (Q Q) to be computed by the same PE for the butterflies. = ( )( ) = (8,64) (8,64) ( ) (8,64) (8,64) ( ) = (8,64)( ) (8,64)( ) = (13) ( ) = (14) where, Q = Q.5.5 = Since all of the coefficients of Q are, +1, -1, +, -, 4 and RA is the shuffled version of Q coefficient rows, RA is basically composed of n with integer n. For recursive addition for IDCT, (Q Q) -1 = (RA RA) -1, where RA -1 is composed of only -n with integer n. Therefore, the recursive additions for both DCT and IDCT need only adders and shifters without a multiplier. The final 8 8 DCT algorithm used for PE chip design is formulated as (1). Taking the inverse of (1), 8 x 8 IDCT algorithm can be derived as (11). As shown in (1), the same PE can implement all of DCT/IDCT butterfly as well as recursive addition stage computations. It is particularly useful and efficient for redundancy replacement to alleviate low yield problem of SOP processing technologies. To make the butterfly stage and the recursive addition stage identical in implementation structure, the following modification is made. As shown in (14), all of them have 8 fold symmetries which will be utilized in the following section. Inverses of butterfly and recursive addition computation have 8 fold symmetries, too. K = I (T T ) = I (I T T ) (15) RA = (I Q) (16) The recursive and inverse of recursive addition for - Dim. DCT and IDCT are composed of two identical stages of RA RA in (13) and RA -1 RA -1 in (14), respectively. For simplicity, only one stage 64 x 64 RA and RA -1 are described. To implement both butterfly stage and recursive addition stage with identical PE s, RA and RA -1 are restructured like x butterfly structure discussed above. As shown at the left side of Fig. 
9(a), RA in (16) is composed of groups, each of which has 8 or 4 computations of 3 types (1 x 1, 2 x 1 and 3 x 1). As shown in the thick box in Fig. 9(a), $RA^{-1}$ is composed of 8 repetitions of a group composed of 3 types of computation: 1 x 1, 2 x 1 and 4 x 1. Therefore, it is enough to describe the restructuring method for one group, or for the computations that are repeated within a group. Fig. 9(b) and (c) show the structure of the computation units: 1 x 1, 2 x 1 and 3 x 1 for RA, and 1 x 1, 2 x 1 and 4 x 1 for $RA^{-1}$. As shown in Fig. 9(b) and (c), to compute RA and $RA^{-1}$ with PEs that compute the 2 x 2 butterfly, which is the computation unit of a butterfly stage, the first two inputs are computed with one PE, two of the remaining inputs are computed with another PE, and so on through multiple pipeline stages sequentially. The first type, the 1 x 1 computation, can be realized just by an interconnection that feeds the input forward to the output without any PE. The second type, the 2 x 1 computation, uses only one output of a 2 x 2 butterfly PE. As shown in Fig. 9(b), the third type, the 3 x 1 computation, uses two 2 x 2

13 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.18, NO.4, AUGUST, (a) (b) (c) Fig. 9. Recursive addition computation structure (a) Structure of recursive addition and inverse recursive addition matrix, (b) Recursive addition computation, (c) Inverse recursive addition computation, (d) Structure of physical layout of RA and inverse RA. (d) butterfly stages in cascade, which requires PE s. Similarly, the fourth type, 4 x 1 computation uses three stages in cascade, which requires 3 PE s. Also, as shown in Fig. 9(b) and (c), the coefficients required in RA and RA -1 computations are the type of which can be implemented only with simple shifter that will be explained later. Therefore, RA and RA -1 stages shown in Fig. 9(a) can be implemented with identical PE utilizing the constant geometry. In Fig. 9(d), utilizing constant geometry, all the butterfly stages as well as RA and RA -1 can be computed with 3 identical PE s in a stage. One PE for each butterfly and a couple of PE s for RA and RA -1 are allocated with vertically running cross-point interconnections to feed the signals from any points to any points between stages horizontally. Due to the pipeline computation structure explained above, each RA and RA -1 requires 48 PE s compared to 3 PE s for a butterfly stage, which causes difference in height in chip layout. Therefore, as shown in Fig. 9(d), each RA and RA -1 computation is divided into two parts, 1 and. Each part is composed of 3 PE s for the butterfly stage. In case of RA shown in Fig. 9(d), from bottom to top, sixteen groups of two PE s for 3 x 1, eight PE s for x 1 are placed and 8 feed-forward lines for 1 x 1 computation are implemented at the place that corresponds to unused eight PE s. And in case of RA -1, to implement three PE groups for 4 x 1 computation, a vertical output to input feedback interconnections are used perpendicular to RA -1 part shown in Fig. 9(d). x 1 and 1 x 1 computation can be implemented similarly as RA. As shown in Fig. 
6, SF and $SF^{-1}$ have a block matrix structure that repeats every 4 or 8 rows of output pixels. As shown in (8) and (9), both the butterfly stage and the recursive addition stage have the form $(I_8 \otimes \cdot)$. (I_16 T T) and $Q$ are computed 8 times per stage for the butterfly and the recursive addition. All the decomposed DCT/IDCT matrices for the shuffling and butterfly computation have 8 fold symmetries in each stage. Also, the architecture for each stage shown in Fig. 9 has 8 identical, regular sections. Therefore, for the 8 x 8 DCT and IDCT, it is enough to investigate only 8 rows. The structure of computations with the array of identical PEs and the shuffling interconnections will be discussed later for the butterfly and the recursive additions. Each PE computes a radix-2 butterfly and a recursive addition for DCT and IDCT. A PE has 4 inputs and outputs for a butterfly and for the recursive addition. To make the throughput the same, the recursive addition needs two PEs in parallel, as shown in Fig. 7.

III. 8 X 8 CONSTANT GEOMETRY DCT VLSI ARCHITECTURES

Based on the constant geometry DCT/IDCT algorithm in (10), the overall 8 x 8 DCT/IDCT system and the PE chip implementations are discussed.

1. Overall Architecture

The 8 x 8 DCT/IDCT system is composed of butterfly stages $SF \cdot K_i$ and a recursive addition stage, RA. Both the input shuffler (IS in (10)) and the output shuffler (OS in (10)) are composed only of interconnections, which have higher yield than semiconductor devices. As explained in the previous section, there is 8 fold symmetry in the butterfly and the recursive addition stages. 8 PEs or fewer are used per stage to exploit the symmetry. Also, the PE array for a stage can be used recursively to reduce hardware. Based on the constant geometry properties, an identical PE with programmable interconnections can compute two outputs of a radix-2 butterfly for DCT or IDCT as well as the recursive addition for RA or $RA^{-1}$, as shown in Fig. 9. As shown in Fig.
9, each PE computes the butterflies with a multiplier and computes the $2^n$ coefficients for RA and $RA^{-1}$ with shifters, as shown in Fig. 9. Considering speed and power requirements, the number of PEs can be optimized for the required degree of serial and parallel computation. As shown in Fig. 9(a), a PE computes 2 pixel outputs from a maximum of 7 pel inputs located 1, 2, 16 and 32 distances away. As shown in Fig. 10(b), a PE column array computes a butterfly or a recursive addition stage. The BCU (Bus Control Unit) programs the interconnections between the vertical bus and the PE input ports accordingly. The number of PE columns and the number of PEs in a PE column can be optimized according to the application area. A maximum of 7 pels is required to compute two outputs of a butterfly stage ($SF \cdot K_i$ and RA for DCT, $K_i^{-1} \cdot SF^{-1}$ and $RA^{-1}$ for IDCT). Considering two rows (two outputs) of the matrices above, inputs can be fed into a PE by the programmable interconnections, with a programmable row start point and pel location offset inside one of the 8 fold sections, as shown in Fig. 9. Considering the low yield of SOP, additional PEs can be used in a stage for redundancy replacement. For example, input pels are 1, 2, 16 and 32 pels away from the row start pel location and the row start pel locations are, 4, 8, 9 from the start locations of the 8 fold section, which can easily be programmed by the bus control unit for the 7 buses in Fig. 10. Furthermore, a redundancy PE can easily be implemented using the same bus architecture, as shown in Fig. 10.

Fig. 10. 8 x 8 Constant Geometry DCT architecture based on 8 fold symmetry (a) PE, (b) PE I/O, (c) PE array architecture to compute DCT and IDCT butterfly and recursive addition.

2. PE Architecture

Fig. 11 shows a detailed PE architecture that computes both the butterfly and the recursive addition stage. A PE consists of a distributed arithmetic multiplier to multiply input pixels by twiddle factors, shifters for the $2^n$ type coefficient multiplication of RA and $RA^{-1}$, and adders (subtractors) to sum the results afterwards. The same adders (subtractors) are used for the butterfly and the recursive addition stage without any extra hardware. The only differences are in the input shuffling bus, which does not increase hardware complexity either. As shown in Fig. 11, for the butterfly computation, $(x_0 + x_1)$, $(x_0 - x_1)$ and $c(x_0 - x_1)$ are computed by adder1, adder2 and the distributed arithmetic multiplier, respectively. The shifter shifts the input pixels, and adder1, adder2 and adder/subtractor3 add a maximum of four shifted pixels for the recursive addition, and also for $K_i$, SF, RA for DCT and $SF^{-1}$, $K_i^{-1}$, $RA^{-1}$ for IDCT of (10), as shown in Fig. 11. The PE can be pipelined to obtain higher throughput. Each BU receives a maximum of 7 inputs and computes a butterfly and recursive addition stage to generate 2 outputs. The multiplication for the recursive additions can be done with simple shifts, since all the entries are powers of two, as discussed before. The multiplication by $2^n$ can be pre-programmed for RA and $RA^{-1}$.

Fig. 11. Programmable PE architecture.

IV. VLSI IMPLEMENTATIONS

VLSI implementation details, including the distributed arithmetic multiplier using ROM, are discussed based on the architecture in Section III.

1. Multiplication ROM

The multiplier for the inner product of the DCT/IDCT needs the most hardware in a PE, so it must be minimized for low yield SOP. The multiplication ROM is designed to reduce the PE chip area by utilizing the fact that the number of twiddle factors necessary for the DCT/IDCT is limited, as follows.

A. Accuracy Estimation

Since twiddle factors are expressed with a limited word length, inaccuracy may occur. The required word length is simulated in the C language, with the word lengths of the ROM and the datapath as variables. Fig. 12 shows the MSE. As shown in Fig. 12, word lengths of 15 bits for the datapath and 13 bits for the ROM satisfy the CCITT standard [21] sufficiently. For IEEE 1180, word lengths of only 13 bits for the datapath and 11 bits for the ROM are required.

B. Modified Partial Sum Method

Two methodologies to reduce the multiplication ROM are presented to save chip area. First, the partial sum method [20] is modified to use a dual port ROM instead of two separate ROMs. Second, a sign processing method using the characteristics of 2's complement is used. For formal derivation, the multiplication with 2's complement input and output is formulated as (17). $ROM_i(c, x_{(a:b)})$ is the ROM output for the partial multiplication result of twiddle factor $c$ and the input $x$ bits from the $a$-th to the $b$-th bit. The bits from the $a$-th to the $b$-th bit are used as the ROM address. (17) shows the typical partial sum technique. The input bits are divided into upper and lower bits in order to reduce the ROM size, and the outputs of the two ROMs are then added to produce the final result. However, since the ROM for the sign bit needs to be included in $ROM_1$ or $ROM_2$ in (17), two different ROMs are required. The ROM size increases with increasing word length. Therefore, in order to decrease the ROM size and to use the same coefficients for the dual port ROM, (17) is modified as (18).
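The word-length study in subsection A can be imitated with a short experiment. The following is a minimal sketch, not the author's C simulation: it assumes an orthonormal 8-point DCT-II matrix, models only the ROM (coefficient) word length by rounding coefficients to a given number of fractional bits, and leaves the datapath in floating point. The names `dct_matrix`, `quantise` and `round_trip_mse` are hypothetical.

```python
import math
import random

N = 8  # block size of the 8 x 8 DCT


def dct_matrix():
    """Orthonormal 8-point DCT-II matrix; row k is the k-th basis vector."""
    rows = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        rows.append([scale * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                     for n in range(N)])
    return rows


def quantise(m, frac_bits):
    """Round each coefficient to frac_bits fractional bits (assumed ROM word length)."""
    step = 2.0 ** frac_bits
    return [[round(v * step) / step for v in row] for row in m]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def round_trip_mse(frac_bits, blocks=50, seed=1):
    """MSE between random blocks and (quantised forward DCT, exact inverse DCT)."""
    rng = random.Random(seed)
    exact = dct_matrix()
    coarse = quantise(exact, frac_bits)
    err = 0.0
    for _ in range(blocks):
        x = [[rng.randint(-256, 255) for _ in range(N)] for _ in range(N)]
        y = matmul(matmul(coarse, x), transpose(coarse))   # Y = C X C^T, quantised C
        z = matmul(matmul(transpose(exact), y), exact)     # X' = C^T Y C, exact C
        err += sum((z[i][j] - x[i][j]) ** 2 for i in range(N) for j in range(N))
    return err / (blocks * N * N)


if __name__ == "__main__":
    for bits in (5, 9, 11, 13):
        print(bits, round_trip_mse(bits))
```

Sweeping `frac_bits` reproduces the general shape of an MSE-versus-word-length curve; the absolute values depend on the test data and on the error model, so they should not be read as the figures reported in the paper.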

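The partial sum idea of subsection B can be sketched in integer arithmetic. This is an illustration under stated assumptions, not the paper's equations (17) to (19): the coefficient is an already-quantised integer `c_q`, the input has an odd word length `n`, the lower half has `k = (n - 1) / 2` bits, and the upper product is shifted left by `k` bits (equivalent, up to scaling, to shifting the lower ROM output right). The function names are hypothetical. The sign-magnitude variant excludes the single upper value whose magnitude does not fit in `k` bits.

```python
def split(x, n):
    """Split an n-bit two's complement integer into signed upper and unsigned lower parts."""
    k = (n - 1) // 2
    x_lo = x & ((1 << k) - 1)   # unsigned lower k bits
    x_hi = x >> k               # signed upper k + 1 bits (arithmetic shift)
    return x_hi, x_lo, k


def make_rom(c_q, k):
    """One ROM bank: the product c_q * a for every unsigned k-bit address a."""
    return [c_q * a for a in range(1 << k)]


def conventional_multiply(c_q, x, n):
    """Conventional partial sum: a signed upper ROM and a separate unsigned lower ROM."""
    x_hi, x_lo, k = split(x, n)
    rom_hi = {a: (c_q * a) << k for a in range(-(1 << k), 1 << k)}  # holds the sign
    rom_lo = make_rom(c_q, k)
    return rom_hi[x_hi] + rom_lo[x_lo]


def dual_port_multiply(rom, x, n):
    """Sign-magnitude variant: both halves address the same ROM contents (dual port)."""
    x_hi, x_lo, k = split(x, n)
    mag = -x_hi if x_hi < 0 else x_hi       # sign unit: magnitude of the upper half
    if mag >= (1 << k):
        raise ValueError("upper magnitude does not fit in k bits")
    upper = rom[mag] << k                   # port 1
    lower = rom[x_lo]                       # port 2
    return lower - upper if x_hi < 0 else lower + upper


if __name__ == "__main__":
    n, c_q = 13, 2009                       # c_q: arbitrary example coefficient
    rom = make_rom(c_q, (n - 1) // 2)
    print(dual_port_multiply(rom, -1234, n), c_q * -1234)
```

In this sketch the conventional scheme stores 2^(k+1) signed words plus 2^k unsigned words, whereas the shared ROM stores 2^k words read through two ports; the exact saving in the paper depends on the word lengths in (20).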

[Equation (17): conventional partial sum decomposition of $y = c \cdot x$ into the outputs of two ROMs addressed by the upper and lower input bits]

[Equation (18): modified decomposition in which both ROMs hold identical coefficients]

From (17) and (18), the multiplication result $y$ can be shown as (19).

[Equation (19): $y$ as the sum or the difference of two ROM outputs, with one case for each value of the sign bit]

Fig. 12. MSE Evaluation (a) Data word length vs. MSE, (b) ROM word length vs. MSE.

(19) shows that the multiplication can be replaced with the addition or subtraction of ROM outputs if the ROM bank is selected according to the coefficient. The ROM addresses are decided according to the input bit patterns. Also, since each ROM holds completely identical coefficients, a dual port ROM efficiently halves the ROM size by utilizing linearity. A dual port ROM is easy to implement compared to a RAM, since a ROM does not need write circuits and the hardwired internal data can drive the two ports easily. If the ROM word lengths in the conventional partial sum method and in the proposed one are $w_p$ and $w_{mp}$, respectively, the ratio of ROM cell counts can be shown as (20).

[Equation (20): ROM cell count ratio]

In the case of $w_p = w_{mp}$, about 3% of ROM cells can be saved. Furthermore, since usually $w_{mp} < w_p$, the efficiency can be greatly increased.

C. Multiplication ROM Implementation

Table 2 shows the summary of the PE chip design. Fig. 14 shows the photograph of the PE chip with a low transistor count. Assuming 0.18 um processing technology, the chip area is very small compared to existing DCT/IDCT chips [9, 19]. Even with low yield SOP processing technologies, the PE can be implemented efficiently.

Fig. 13 shows the structure of the multiplication ROM. The input conversion unit divides the N input bits into (N + 1)/2 bits and (N - 1)/2 bits, and the upper (N + 1)/2 bits are converted into (N - 1)/2 bits by the sign unit. The resulting two (N - 1)/2 bit inputs are sent to the dual port ROM. If the value is positive, the sign unit generates the bits excluding the most significant sign bit, and if the value is negative, it generates the 2's complement of the value. Since the 8 x 8 DCT/IDCT needs a total of 7 coefficients, the dual port ROM uses a 3-bit bank control signal to select one of the seven coefficients according to the location of the PE as well as the input signals shown in Fig. 13,




which is the only difference inside a PE. The adder/subtractor shown as adder3/subtractor3 in Fig. 11 receives the two outputs of the ROM, and if the sign bit is positive (negative), addition (subtraction) is carried out. The adder (subtractor) adds (subtracts) the ROM output shifted right by (N - 1)/2 bits. As discussed above, other twiddle factors can be synthesized by multiplying basic twiddle factors, so the number of ROM banks can be minimized further.

Fig. 13. Structure of multiplication ROM.

Fig. 14. PE chip photograph.

Table 2. Summary of PE chip

| Item | Value |
| Transistor count | 3K |
| Chip size (including pads) | 1867λ x 2534λ |
| Chip size for λ = 0.5 um | 0.93 mm x 1.27 mm = 1.18 mm² |
| Chip size for λ = 0.18 um | 0.34 mm x 0.46 mm = 0.15 mm² |
| Throughput | 4 pixels/cycle |

V. PERFORMANCE ANALYSIS AND ADVANTAGES

The performance of the proposed general radix $p$, N x N point constant geometry DCT/IDCT system is analyzed as shown in Table 3. Assume that $r$ PEs are assigned per $1/s$ of the total number of butterfly stages, $\log_p N$. Also, a PE is assumed to be pipelined. Since the recursive addition requires only shifts and additions and therefore consumes less time than the butterfly computation, it is excluded from the throughput analysis. If one PE computes all the butterfly and recursive addition stages, then $r = s = 1$. For the completely parallel case, $r = N/p$, $s = \log_p N$, and N outputs are computed every clock. As $r$ and $s$ increase, the degree of parallel processing increases. The other architectures used for a fair comparison are chosen according to the criterion of having no transposition memory [23]. The other architectures except [23] have irregular or global interconnections, which make VLSI implementation difficult.

Table 3. Performance evaluation for proposed 2-Dim. N x N point DCT (rows: Throughput, Number of Multipliers, Area X Time).

Fig. 15. Comparisons of Area X Time.

Fig. 15 shows the Area X Time (a performance metric for speed sensitive application areas) of the radix $p$, N x N DCT/IDCT system proposed in this paper compared to other architectures. Both show that the proposed DCT/IDCT system has one of the lowest values among the architectures. The performance values become smaller as the radix and the degree of parallel processing increase and as the degree of recursive processing decreases. Also, unlike other existing architectures, they can be optimized over a wide range of computation performance vs. hardware amount trade-offs by varying the radix $p$, the degree of parallel processing $r$ and $s$, and the degree of recursive processing. The performance of the PE chip is as follows. Each PE


chip processes two radix-2 butterfly stage or recursive addition input data in a single clock cycle, so a throughput of 4 pixels/cycle can be obtained.

Fig. 16. Performance evaluations of proposed DCT/IDCT system according to radix, PE count and degree of recursiveness for Area X Time.

The advantages of the proposed DCT/IDCT processor can be summarized as follows.

1. Only simple hardware, without transposition memory or multipliers, is required.

2. Although existing DCT systems such as [7] need separate hardware for the recursive addition in addition to the butterfly, the structures of each butterfly stage, of the recursive addition stage and of the interconnections between stages are all the same in the proposed DCT/IDCT system. Duplicated use of simple PE chip hardware with high yield programmable interconnections is sufficient. This also eases redundancy replacement for fault tolerance, even in low yield SOP processing technologies. Furthermore, by programming the order of the butterfly and recursive addition stages with the interstage interconnections, DCT and IDCT computation can be multiplexed on the same hardware, which greatly reduces the hardware amount for mobile codec systems.

3. Based on the regularity, the degree of parallel processing in a stage as well as the degree of recursive processing over the entire set of butterfly stages can easily be optimized with respect to speed and hardware amount, according to processing technology yield and application area, as shown in Fig. 16.

4. Compared to the partial sum method used for existing multiplication ROMs, fewer ROM cells can perform the same multiplication necessary for the DCT/IDCT by using the proposed dual port ROM, which can be decomposed into two identical parts.

VI. CONCLUSIONS

VLSI algorithm, architecture and PE chip design methodologies are introduced to implement all the butterfly and recursive addition stages of both DCT and IDCT with only a single PE chip and interconnection programming between stages. The approach is suitable for an image codec system on panel, with efficient redundancy in the case of low yield SOP processing technologies. Speed vs. hardware amount can easily be optimized through the degree of serial-parallel-recursive processing. A distributed arithmetic multiplication dual port ROM is proposed to save chip area. An experimental low transistor count 8 x 8 DCT/IDCT PE chip is designed to illustrate the proposed algorithms and architectures. Finally, the performance is analyzed together with the advantages.

ACKNOWLEDGMENT

I appreciate Jiseong Yoon, Junkyung Kim and Jeehwan Kim for their efforts on matlab coding and data analysis.

APPENDIX

Def. 1. A Kronecker product is defined for two matrices A, B as the block matrix

$$A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix}$$

Def. 2. If N = pm, the shuffling $\pi(p, N)$ is defined as follows.

[Definition of the shuffling $\pi(p, N)$ applied to $x(0), x(1), \ldots, x(N-1)$]

A constant geometry DCT algorithm [16] proposed by the author is composed of identical butterfly stages for general DCT input lengths and radices, as shown in (21). The 2-Dim. constant geometry DCT algorithm with input length N and radix p can be shown as (21) by using the definitions in the appendix.
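The Kronecker product of Def. 1 and a shuffling in the sense of Def. 2 can be checked numerically against the two-stage recursive addition structure, in which $(Q \otimes Q)$ is written as two identical shuffled stages. The sketch below is illustrative only: the 3 x 3 matrix `Q` is a hypothetical stand-in (not the paper's recursive addition matrix), and `stride_perm(n)` assumes the square case N = n², where the shuffle reads an n x n block in transposed order.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rb, cb = len(b), len(b[0])
    out = [[0] * (len(a[0]) * cb) for _ in range(len(a) * rb)]
    for i, row_a in enumerate(a):
        for j, a_ij in enumerate(row_a):
            for k in range(rb):
                for l in range(cb):
                    out[i * rb + k][j * cb + l] = a_ij * b[k][l]
    return out


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def stride_perm(n):
    """Permutation matrix P with (P x)[i*n + j] = x[j*n + i] (assumed shuffle, N = n*n)."""
    size = n * n
    p = [[0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            p[i * n + j][j * n + i] = 1
    return p


# Hypothetical small matrix with shift-friendly entries (0, +-1, +-2, 4).
Q = [[1, 1, 0],
     [0, 2, -1],
     [4, 0, 1]]
n = len(Q)

P = stride_perm(n)
RA = matmul(P, kron(identity(n), Q))   # one shuffled stage: P (I ⊗ Q)

if __name__ == "__main__":
    print(matmul(RA, RA) == kron(Q, Q))
```

The check relies on two standard facts: the mixed product property, which gives $Q \otimes Q = (Q \otimes I)(I \otimes Q)$, and the fact that conjugating by this shuffle swaps the factors of a Kronecker product of square matrices of equal size.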

[Equation (21): constant geometry DCT algorithm, expressed with the shufflings $\pi(\cdot,\cdot)$, the unit matrices $I$, the twiddle factor matrices $T$ and the recursive addition matrices $R$]

$R_q$ is a square matrix of order pN/l, used to repeat the additions for the following N/p outputs so that they apply to all the shuffled outputs. Also, $I_N$ is a unit matrix of order N x N, $T_q$ is a twiddle factor matrix of order q x q and S is a shuffling matrix.

By taking the inverse of (21), the IDCT can be derived as shown in (22).

[Equation (22): constant geometry IDCT algorithm, the inverse of (21)]

Def. 3. (Lemma 4.2.10 in [28]) The product of two Kronecker products yields another Kronecker product:

$$(A \otimes B)(C \otimes D) = AC \otimes BD \qquad (23)$$

Def. 4. (Corollary in [28]) If A, B are non-singular, then $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$. This property follows directly from the mixed product property, Def. 3.

Lemma 1. [Identity relating $\pi(\cdot,\cdot)(A \otimes B)\,\pi(\cdot,\cdot)$ to a Kronecker product]

REFERENCES

[1] H. G. Walton, M. Brownlow, "Invited Paper: The System Integrated LCD," Euro Display 2005.
[2] Y. Kida, Y. Toyoshima, et al., "Development of System-on-glass Display for Narrow Frame width with Integrated Color Management Function," Euro Display 2005.
[3] S. Y. Lee, J. H. You, "Design Considerations on Partition of SOP, CMOS and PCB technologies for Mobile Display System Implementation," IMID 2007 Digest.
[4] Y. Matsueda, Y. S. Park, et al., "Trend of System on Panel," IMID 2005 Digest.
[5] N. Karaki, T. Nanmoto, et al., "A Flexible 8-bit Asynchronous Microprocessor Based on Low-Temperature Poly-Silicon (LTPS) TFT Technology," SID 2005 Digest.
[6] T. Nishibe, H. Nakamura, "Value-Added Circuit and Function Integration for SOG (System-on Glass) Based on LTPS Technology," SID 2006 Digest.
[7] M. Murase, Y. Kida, et al., "Narrow-Frame System-On-Glass Liquid Crystal Display with Low Voltage Interface Circuitry," IDW 2006.
[8] M. Verderber, A. Zemba, A. Trost, "HW/SW Codesign of the MPEG-2 Video Decoder," Proceedings of Int. Parallel and Distributed Processing Symposium, Apr. 2003.
[9] D. Gong, Y. He, and Z. Cao, "New cost-effective VLSI implementation of a 2-D discrete cosine transform and its inverse," IEEE Trans. Circuits and System on Video Technology, vol. 14, no. 4, Apr. 2004.
[10] A. Setyono, A. Md. Jahangir, C. Eswaran, "Development and Implementation of Compression and Split Techniques for Multimedia Messaging Service Applications," International Journal of Computer Theory and Engineering, Vol. 6, No. 1, Feb. 2014.
[11] J. H. Kim, "Mobile terminal capable of controlling various operations using a multi-fingerprint-touch input and method of controlling the operation of the mobile terminal," U.S. Patent, issued June 3, 2014.
[12] K. Gaedke, J. Franzen, P. Pirsch, "A Fault-tolerant DCT-Architecture based on Distributed Arithmetic," IEEE Int. Symp. on Circuits and Systems, May.
[13] A. Hatim, S. Belkouch, T. Sadiki, M. M. Hassani, "Efficient hardware architecture for direct 2D DCT computation and its FPGA Implementation," 25th International Conference on Microelectronics (ICM), Dec. 2013.
[14] Q. Shang, Y. Fan, W. Shen, S. Shen, "Single-Port SRAM-Based Transpose Memory With Diagonal Data Mapping for Large Size 2-D DCT/IDCT," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 22, No. 11, Nov. 2014.
[15] S. F. Hsiao, W. R. Shiue, "A New Hardware-Efficient Algorithm and Architecture for Computation of 2-D DCT on a Linear Systolic Array," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 11, No. 11, Nov. 2001.
[16] J. Kwak, J. You, "1-D and 2-D Constant geometry fast cosine transform algorithms and architectures," IEEE Trans. on Signal Processing, vol. 47, No. 7, pp. 2023-2034, July.
[17] J. You, "Unified Constant Geometry DCT/IDCT for Image Codec System on Display Panel," IEICE Trans. on Fundamentals of Electronics, Communication and Computer Sciences, Vol. E95-A, No. 12, Dec. 2012.
[18] B. E. Caroline, G. Sheeba, J. Jeyarani, F. S. Roseline Mary, "A Reconfigurable DCT/IDCT architecture for video codec: A Review," Third International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 2012.
[19] S. Ghosh, S. Venigalla, M. Bayoumi, "Design and Implementation of a 2D-DCT Architecture using Coefficient Distributed Arithmetic," IEEE Computer Society Annual Symp. on VLSI New Frontier in VLSI Design, 2005.
[20] M. T. Sun, T. C. Chen, A. M. Gottlieb, "VLSI implementation of 16 x 16 Discrete Cosine Transform," IEEE Trans. on Circuits and Systems, vol. 36, No. 4, Apr.
[21] CCITT SGXV Working Party XV/1 Specialists Group on Coding for Visual Telephony, Document 584, Nov.
[22] J. I. Guo, C. M. Liu, C. W. Jen, "A New Array Architecture for Prime-Length Discrete Cosine Transform," IEEE Trans. on Signal Processing, Vol. 41, No. 1, Jan.
[23] C. Wang and C. Chen, "High-throughput VLSI Architectures for the 1-D and 2-D discrete cosine transform," IEEE Trans. Circuits System on Video Technology, vol. 5, No. 1, pp. 31-40, Feb.
[24] N. Cho and S. Lee, "Fast algorithm and implementation of 2-D Discrete cosine transform," IEEE Trans. on Circuits System, vol. 38, No. 3, Mar.
[25] W. Ma, "2-D DCT systolic array implementation," Electron. Lett., vol. 27, pp. 201-202, Jan.
[26] C. T. Chiu and K. J. R. Liu, "Real-time parallel and fully pipelined two-dimensional DCT lattice structures with application to HDTV systems," IEEE Trans. on Circuits and System on Video Technology, vol. 2, pp. 25-37, Mar. 1992.
[27] N. Cho, D. Yun, S. Lee, "On the regular structure for the fast 2-D DCT algorithm," IEEE Trans. on Circuits and Systems, Vol. 40, No. 4, Apr.
[28] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis, Cambridge University Press, Cambridge.

Jaehee You received his B.S. degree in Electronics Engineering from Seoul National University, Seoul, Korea. He received his M.S. and Ph.D. degrees in Electrical Engineering from Cornell University, Ithaca, NY, in 1987 and 1990, respectively. In 1990, he joined Texas Instruments, Dallas, TX, as a Member of Technical Staff. In 1991, he joined the faculty of the School of Electrical Engineering, Hongik University in Seoul, Korea, where he is now supervising the Semiconductor Integrated System Laboratory. He has served as an Executive Director of the Drive Technology and System research group of the Korean Information Display Society. He was a recipient of the Korean Ministry of Strategy and Finance, KEIT Chairman Award for Excellence in 2011. He has worked as a technical consultant for various companies such as Samsung Semiconductor, SK Hynix, Global Communication Technologies, P&K, Penta Micro, Nexia device and Primenet.
His current research interests include integrated system design for display image signal processing and perceptual image quality enhancement.
