DCT/IDCT Constant Geometry Array Processor for Codec on Display Panel


JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.18, NO.4, AUGUST, 2018 ISSN(Print) ISSN(Online)

Jaehee You

Manuscript received Aug. 3, 2017; accepted Jul. 3, 2018. School of Electronic and Electrical Engineering, Hongik University, 7-1 Sangsudong, Mapogu, Seoul, Korea. E-mail: jaeheeu@hongik.ac.kr

Abstract: Algorithms, architectures and an 8 x 8 processing element (PE) chip design are discussed based on constant geometry 2-Dim. DCT and IDCT. Both DCT and IDCT can be computed with the same hardware for image codec applications. An array of identical PEs is used for the butterfly stages as well as the recursive addition stages; only the regular interconnections between stages are programmed, which provides fault tolerance for system on panel or flexible display panels. Methodologies to optimize the computation speed and the amount of required hardware are also discussed. An efficient distributed arithmetic ROM for coefficient multiplication is proposed to minimize chip area and to facilitate PE programming. Chip area, computation time and image quality are evaluated, and the advantages are shown.

Index Terms: DCT, IDCT, system on panel, constant geometry, fault tolerance, distributed arithmetic

I. INTRODUCTION

Nowadays, mobile devices such as smart phones need higher resolution images, which require higher data rates and large power consumption. System on Panel (SOP) [1] or System on Glass (SOG) [2] approaches are being investigated to integrate all the peripheral systems on the display panel, eliminating the overheads due to interconnections. These approaches make high bandwidth image data transfer from peripherals possible [3] with on-chip interconnections. The weight and system size due to cables, connectors and the controller can also be reduced. For flexible display panels, total integration of all the peripherals is necessary. However, SOP has an inherent drawback in the yield of processing technologies such as LTPS and organic semiconductors.
It can be mitigated with fault tolerance techniques such as redundancy. So far, mainly the immediate peripheral circuits such as the data driver, gate driver and DC-DC converter have been considered for integration on the display panel [4]. An experimental microprocessor with a small-scale frame buffer and audio circuits integrated on the display panel has been reported [1, 5, 6]. These results suggest that more systems, such as an image codec or frame memory, are expected to be integrated on the display panel in the future [7]. Considering the low-yield processing technologies and the power, size and weight constraints, DCT and IDCT are the first candidates for an SOP image codec, since [8] found that an ASIC implementation of the 2-Dim. discrete cosine transform (DCT) reduces energy use by 36.8% with compression rates of a factor of ten to thirty. Since most mobile devices need an image codec for multimedia messaging, it is essential to implement both DCT and IDCT [9, 10]. Fig. 1 shows a comparison between a conventional display system and the proposed SOP with on-chip DCT/IDCT. Next to the LCD panel backlight and the CPU, off-chip bus communication and the frame buffer are the major sources of power consumption [9, 10], which need to be minimized by on-chip implementation. If, in addition, IDCT can be used for frame memory data decompression, it would be possible to reduce the frame memory bus data rate, the size of the frame buffer and the corresponding power consumption by a factor of ten to thirty, which is the compression rate based on DCT. Clearly, this would be beneficial especially for mobile display systems [11].

Fig. 1. System on panel with compressed image storage and transfer.

The low yield of SOP processing technologies can be alleviated by fault tolerance capability. Because of their irregular architecture, the existing DCT and IDCT architectures have difficulties in realizing fault tolerance and in optimizing the speed and the amount of hardware according to the application area [12]. Most 2D DCT or IDCT VLSI algorithms can be categorized as 2D direct DCT [13] or 2D DCT with row-column transposition [14]. Both have an irregular interconnection structure among the computation units, which is not suitable for fault tolerance. Systolic array architectures [15] require processing elements interconnected with irregular data delay stages to control the data flow timing. To solve these problems, the author proposed a high speed DCT algorithm that makes it possible to compute the DCT based on decomposed matrices with the same structure for arbitrary radices [16]. In [17], the author presented constant geometry DCT and IDCT VLSI algorithms to facilitate fault tolerance and testability. However, the two different structures of the butterfly stage and the recursive addition stage still need to be made identical so that both can be implemented with identical hardware. Distributed arithmetic [18, 19] can be used for the multipliers to reduce hardware, which is advantageous for low-yield processing technologies. In [20], a bit-serial/bit-parallel architecture is used to implement a 16 x 16 DCT chip with distributed arithmetic to optimize the amount of hardware. However, it also lacks modularity and flexibility, which makes it unsuitable for redundancy replacement.
In this paper, a VLSI architecture and its implementation methodologies are introduced for a processing element that is programmable for both DCT and IDCT with constant geometry and is suitable for SOP. In chapter II, a DCT and IDCT constant geometry algorithm with identical butterfly and recursive addition stages is described. A DCT/IDCT VLSI architecture and VLSI implementation methodologies with experimental 8 x 8 processing element chip design results are described in chapters III and IV, respectively. Performance analyses and the resulting advantages are discussed in chapter V.

II. 2-DIM. CONSTANT GEOMETRY DCT/IDCT ALGORITHM

As shown in Fig. 1, as one possible SOP display panel system, a decoder with IDCT can be integrated on the display panel so that only the compressed image data are transferred to the display panel, saving more than 5% of energy, which will be discussed in detail later. Also, an on-chip embedded frame buffer is used with an on-chip bus for the transfer of the large amount of uncompressed frame data. The size of the frame buffer can be reduced by using an expanded line buffer to minimize the time difference between the decompression and the fetching of the decompressed data. When the DCT and IDCT are implemented in display panels, an array of identical PEs can alleviate the low yield problem with redundancies. Therefore, a constant geometry architecture for both the butterfly and the recursive addition stages allows the DCT and IDCT to be implemented using the same PEs, which has the advantages of testability and fault tolerance. For a unified architecture for DCT and IDCT, a 2-Dim. IDCT algorithm can be obtained as shown in (), basically by taking the inverse of the DCT. The shuffling for the inputs ($S \otimes S$), the outputs ($Q \otimes Q$) and the interstage interconnections (SF) can be implemented by programming the cross points of the horizontal and vertical interconnections as shown in Fig. 2.
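The cross-point programming described above can be illustrated with a small software model. This is only a sketch under stated assumptions: it assumes that the shuffle is a stride permutation of the kind usually written π(p, N) (the paper's exact definition is in its Appendix), and the names `stride_permutation`, `program_crossbar`, `route` and `inverse_permutation` are hypothetical helpers introduced here for illustration.

```python
def stride_permutation(p, n):
    """Assumed form of a stride-p shuffle of n items: out[i] = inp[perm[i]].

    The inputs are read with stride p: 0, p, 2p, ..., then 1, p+1, ...
    """
    assert n % p == 0
    group = n // p
    return [(i % group) * p + i // group for i in range(n)]


def program_crossbar(perm):
    """Close exactly one cross point per output wire (row)."""
    n = len(perm)
    return [[1 if perm[row] == col else 0 for col in range(n)]
            for row in range(n)]


def route(crossbar, inputs):
    """Each output wire picks up the input wire whose cross point is closed."""
    return [sum(c * x for c, x in zip(row, inputs)) for row in crossbar]


def inverse_permutation(perm):
    """Programming for the inverse shuffle (used on the IDCT side)."""
    inv = [0] * len(perm)
    for out_idx, in_idx in enumerate(perm):
        inv[in_idx] = out_idx
    return inv


perm = stride_permutation(2, 8)
shuffled = route(program_crossbar(perm), list(range(8)))
restored = route(program_crossbar(inverse_permutation(perm)), shuffled)
print(shuffled)   # [0, 2, 4, 6, 1, 3, 5, 7]
print(restored)   # [0, 1, 2, 3, 4, 5, 6, 7]
```

In this model, both a shuffle and its inverse are obtained by changing only which cross points are closed, which is the property the text relies on when it says that the interconnections alone need to be programmed.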
Only the high-yield interconnections between butterfly stages need to be programmed, without redundancy; the low-yield semiconductor devices are excluded from the programming. Furthermore, since interconnections are less susceptible to yield loss, if the entire DCT and IDCT can be made of the same PEs, an image codec system can be implemented with shared redundancies only by interconnection programming to alleviate the low yield

Fig. 2. Constant geometry DCT/IDCT PE array with redundancies.

problem. The proposed array architecture is advantageous for parallel computation, which enhances the system speed to overcome the low speed of SOP devices. The butterflies can also be computed recursively with reduced hardware. Therefore, the architecture offers scalability in the amount of required hardware and the computation speed. The Appendix shows the definitions and a 2-Dim. constant geometry DCT and IDCT algorithm [16] for general radices and input lengths, which is the starting point of the algorithm development. The algorithm in [16] has the drawback that the structures of the recursive addition stages and the butterfly stages differ, which requires separate hardware. In this paper, the recursive addition stage is revised to have the same structure as the butterfly stage. The shuffling stages are decomposed to utilize symmetry for VLSI implementation. All the derivations can also be applied to the IDCT algorithm to obtain a unified DCT/IDCT processor with identical hardware. First, for the general case of radix p and N x N points, the 1-Dim. DCT and IDCT matrices are derived to have identical submatrices in the diagonal direction by Kronecker product and data shuffling operations based on (1)-(6) [16], as shown in Fig. 3(a)-(e). As shown in Fig. 3(a), the structure of the submatrices of $A_q$ is irregular for the stages q ($1 \le q \le t$). Hence, it is hard to implement all decomposed matrices with an identical chip structure.

Fig. 3. Procedures to make $I_{N/p} \otimes T$ from $A_q$ and $A_t$: (a) duplication and row shuffling for $A_q$ to $V_q$, (b) column shuffling for $V_{t-1}$ to $B_{t-1}$ ($q = t - 1$), (c) column shuffling for $B_q$ to $I_{N/p} \otimes T_q$ ($1 \le q \le t - 1$), (d) the structure of $A_t$ and $I_{N/p} \otimes T_t$ ($q = t$), (e) the structure of $I_{N/p} \otimes T_q$ ($1 \le q \le t$).
The methodologies that make the structure of the decomposed matrices identical for any stage q ($1 \le q \le t$) will be explained in detail later. They rearrange the elements located at the same position of each submatrix, which are repeated regularly in each hierarchical stage. The rearranging methods for $A_q$ ($1 \le q \le t - 1$) and $A_q$ ($q = t$) are described in Fig. 3(a)-(c) and Fig. 3(d), respectively, since the two cases are not identical because of the different internal structures of the submatrix $D$. The procedures to change $A_q$ into the 1-Dim. DCT decomposed matrix $I_{N/p} \otimes T_q$, which has a structure of p x p submatrices arranged in the diagonal direction, through $V_q$ and $B_q$ are as follows. First, $D$ is changed to an N x N decomposed matrix by duplicating the elements inside $D$ using the Kronecker product in (3). Then, the elements located at the same position in every submatrix that is vertically arranged in $D$ are collected (reordering (I)) to get $V$, as shown with the thick lines in the middle of Fig. 3(a). Next, $V$ is changed to $B$ by the column shuffling in (4), which collects the elements located at the same position in the submatrices that are diagonally arranged in $V$ (reordering (II)), as shown with the thick lines in Fig. 3(b). Then the column shuffling in (5) collects the submatrices located at the same position inside every submatrix that is horizontally arranged in $B$ (reordering (III)), as shown at left side

5 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.18, NO.4, AUGUST, of Fig. 3(c) inside thick line box. Finally, I N/p T q can be obtained with p x p submatrices arranged diagonally, which correspond to decomposed matrices of 1-Dim. DCT. The detailed procedures of reordering I, II, III will be discussed later. For the case of A t, according to (6), A t = I N/p D t as shown in the right side of Fig. 3(d), can be obtained having the same structure of 1-Dim. DCT decomposed matrix derived above without going through the intermediate stage matrices described later, which is different from the case of A q described above using (6). [ ( )] = ( ) [ ( )] where, Q = (Q ), = / ( /, / ) = / = / (, ),, (, ),, (, ) ( 1, 1) = / (1 1) D D, D = D, D = D, D D, / D,, = D,, D,, D,,, D,, (, ) ( / ) = cos (a diagonal matrix of order / ) j p 1, s p 1, n 1, 1 m p/, L = p (1) D,, (, ) πj(pn + 4sN/L + m 1) = cos pn/l ( / ) where, k = / = cos kπ N The detailed descriptions of the rearrangement procedures discussed above are as follows. For A q (1 q t - 1), for the reordering(Ⅰ), the hierarchical structure of submatrix D can be described as follows. [D q ] has 3 hierarchical stages as shown in Fig. 3(a). First, D q j (1 j p - 1) are arranged in column direction inside of [D q ]. And [D q j, m ] (1 m p/) are arranged diagonally in D q j. And [D q j, m, s ] (1 s p - 1) are arranged in row direction inside of each [D q j, m ]. And lastly, as shown in upper right part of Fig. 3(a), D,, (, ) = cos ( / ) / () n 1 are arranged diagonally from top left with n changing from to u - 1 for every [D q j, m, s ]. [X(k)] = Q (V V ) E [x(n)] Where, V = π(p, N) A π(p, N) ( ) = π(p, N) D I / 1 q t (3) By I L/p in (3) shown in the right side of Fig. 3(a), D of size pn pn is expanded into D L L q I L/p of size N N as shown in the middle of Fig. 3(a), by duplicating j, m, D s q (n, located inside L/p times, maintaining the n) matrix structure of D with respect to j, m and s. 
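The expansion step above, in which a matrix is enlarged by a Kronecker product with an identity matrix so that each of its elements is duplicated while the block structure is maintained, can be checked on a toy example. The sketch below is not the paper's $D_q$ (the values and sizes are placeholders chosen for illustration), and `kron` and `identity` are hypothetical helper names.

```python
def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [[a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
             for j in range(len(a[0]) * cols_b)]
            for i in range(len(a) * rows_b)]


# Placeholder 2 x 2 matrix standing in for a small D; not the paper's values.
d = [[1, 2],
     [3, 4]]

# D (x) I_2: every element of D is duplicated twice along the diagonal of
# its own 2 x 2 block, so the arrangement of the elements of D is kept.
expanded = kron(d, identity(2))
for row in expanded:
    print(row)
# [1, 0, 2, 0]
# [0, 1, 0, 2]
# [3, 0, 4, 0]
# [0, 3, 0, 4]
```

The subsequent reorderings (I) to (III) only permute rows and columns of such an expanded matrix, so they move these duplicated elements without changing their values.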
Then, by reordering(Ⅰ), as shown from the middle to bottom in Fig. 3(a), D q I L/p is changed to V q using π(p, N) according to Eq. (3). First, make a p x 1 submatrix by collecting elements located at the same position in every submatrix D q j arranged in column direction from top to bottom as shown in the thick box of D q I L/p. And by arranging p x 1 submatrices diagonally to make V q m, s and arranging V q m, s horizontally to make V q m (1 m p/) and arranging V q m diagonally, V q can be obtained finally maintaining the same structure of D q j, m, s (n, n) with respect to m and s collecting with respect to only j. [X(k)] = Q (B B ) S [x(n)] Where, B = V B = V π(p/, p /) I / B = (π(p/, N/L) I ) V π(p/, pn/l) I / S = (π(p/, N/) I ) E, L = p, 1 q t (4) In Fig. 3(b), V q m, s shown at the bottom of Fig. 3(a), is represented as (m, s) for simplicity. B q can be derived from V q by reordering(Ⅱ) as follows. As shown in (4), (π(p/,p /) T I N/p ) at the right side and both (π(p/, N/) I L ) and (π(p/, pn/l) T I L/p ) at left and right side, are used for the case of V t-1 (q = t - 1) and V (1 q t - 1), respectively. As shown inside the thick box at the left side of Fig. 3(b), for every submatrix V q m ((m, ) (m, s) (m, p-1)) arranged diagonally, N N p diagonal submatrix((1, ), p (, ) (/p, )) are rearranged in diagonal direction from the left top while maintaining s but changing m to

6 48 JAEHEE YOU : DCT/IDCT CONSTANT GEOMETRY ARRAY PROCESSOR FOR CODEC ON DISPLAY PANEL obtain B with p times of N N/p submatrices arranged horizontally. B = (I / T ) π(p, N) (5) By reordering (Ⅲ), as shown in Fig. 3(c), by π(p, N) -1 at the right side of (5), I N/p T q can be derived with p x p submatrix arranged diagonally. The p x p submatrix is obtained by collecting p x 1 matrix located at the same diagonal position in every N N/p submatrix in row direction sequentially, which is shown by the thick box in B. I / D = I / T where A = I / D (q = t) V = π(p, N) A π(p, N) ( ) = B = (I / T ) π(p, N) π(p, N) A π(p, N) ( ) = I / T π(p, N) π(p, N) A (π(p, N) ) = (I / T ) A = I / D = (I / T ) (6) For the case of A t, since the submatrix D t in A t is already composed of p x p matrices as shown in Fig. 3(d), using A t = I N/p D t = I N/p T t in (6), I N/p T t like I N/p T q can be obtained without going through V q and B matrices, which are to obtain A q described above. According to the procedures above, 1-Dim. DCT decomposed matrix can be obtained with the structure of N x N I N/p T q with p x p submatrix arranged diagonally for all q from the decomposed matrices A q and A t which have different structures. This means that 1-Dim. DCT can be computed by PEs with the same structure, with a butterfly corresponding to the p x p submatrix for radix p. Also, radix p can be adjusted to optimize the complexity of the PE that computes a butterfly, the number of butterflies in a butterfly stage and the degree of parallel processing accordingly, which will be discussed later. In Fig. 4, for the example of radix (p = ), 8 x 8 point (N = 8), the procedures are described to make the structure of submatrix D q of A q identical to I N/p T q for every q(1 q t)th stage by both kronecker product in (1) to (6) and the reordering described above. Since this paper is to implement a DCT/IDCT PE chip for p =, N = 8 8, first, the procedures are described to make a 1-Dim. 
DCT decomposed matrix identical for the case of p = 2, N = 8. The coefficients in the matrices shown in Fig. 4 are given by j, s, n, m in (1) for the case of p = 2, N = 8, and each case is shown inside the box in terms of k according to (). The cases of $A_q$ ($1 \le q \le t - 1$) and $A_t$ ($q = t$) above are described in Fig. 4(a)-(b) and Fig. 4(c), respectively. First, for the case of $A_q$ ($1 \le q \le t - 1$), as in (1), the internal matrix $D_q$ ($1 \le q \le 2$) is composed of the 8 x 8 $D_1$ and the 4 x 4 $D_2$, which are shown at the left side of Fig. 4(a). Fig. 4(a) also shows the procedures to obtain the 8 x 8 decomposed matrix by repeating the internal elements of $D_1$ and $D_2$ with $I_{L/p}$ in (3). The subsequent procedures are shown in Fig. 4(b) only for the case of q = 2. To illustrate the procedures to get $V_2$ as an example, the top of Fig. 4(b) shows how to obtain a 2 x 1 matrix by collecting the elements at the same vertical location in every 4 x 8 submatrix located vertically, using $\pi(p, N)$ in (3) for the reordering (I) as in the bottom of Fig. 3(a), and by rearranging the 2 x 1 matrices according to the locations of the elements in the 4 x 8 submatrix. For the reordering (II) in Fig. 3(b), because of the range of m, the repetition count of $V$ in the diagonal direction equals 1 for p = 2, as shown in Fig. 4(b); that is, there is only one $V$ diagonally. Therefore, as shown from the top right to the bottom left matrices of Fig. 4(b), there is no change between $V$ and $B$. For the reordering (III) as in Fig. 3(c), by $\pi(p, N)^{-1}$ at the right side of (5), as shown at the bottom of Fig. 4(b), $I_4 \otimes T_2$ can be obtained by collecting the nonzero 2 x 1 submatrices at the same location in every 8 x 4 submatrix arranged horizontally and then by diagonally rearranging the resulting 2 x 2 submatrices. In the case of $A_t$, as shown in Fig. 4(c), by (6), $I_4 \otimes T_3$ with 2 x 2 submatrices arranged diagonally can be obtained without the procedures to get $V_q$ and $B_q$ that were explained earlier for the case of $A_q$ ($1 \le q \le t - 1$).
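The practical consequence of the block-diagonal form $I_{N/p} \otimes T_q$ is that one p x p butterfly, applied to consecutive groups of p inputs, computes the whole stage. The following sketch checks this equivalence numerically for p = 2, N = 8. The 2 x 2 matrix `t` is a placeholder, not one of the paper's twiddle factor matrices, and all function names are hypothetical.

```python
def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [[a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
             for j in range(len(a[0]) * cols_b)]
            for i in range(len(a) * rows_b)]


def matvec(m, x):
    return [sum(c * v for c, v in zip(row, x)) for row in m]


def butterfly_stage(t, x):
    """Apply the same p x p butterfly t to each consecutive group of p inputs."""
    p = len(t)
    out = []
    for start in range(0, len(x), p):
        out.extend(matvec(t, x[start:start + p]))
    return out


t = [[1, 1],
     [1, -1]]            # placeholder butterfly, assumed for illustration
x = list(range(8))       # N = 8 input samples

full_matrix_result = matvec(kron(identity(4), t), x)   # (I_4 (x) T) x
pe_array_result = butterfly_stage(t, x)                # four identical PEs
print(pe_array_result)   # [1, -1, 5, -1, 9, -1, 13, -1]
```

Because every group is processed by the same small matrix, the four group computations can be assigned to four identical PEs in parallel or to one PE used four times, which is the hardware and speed trade-off the text refers to.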
Z = I / π(p, N) I B B I / π(p, N) I = I / π(p, N) I I / T π(p, N) I / T π(p, N) I / π(p, N) I = I / T T I / π(p, N) I π(p, N) π(p, N) I / π(p, N) I = K SF (7)

Fig. 4. Procedures to make $I_{N/p} \otimes T_q$ from $D_q$ ($1 \le q \le t$): (a) expansion process from $D_q$ to $D_q \otimes I_{L/p}$ ($1 \le q \le t - 1$), (b) column and row shuffling from $D_{t-1} \otimes I_{L/p}$ to $I_{N/p} \otimes T_{t-1}$ ($q = t - 1$), (c) structure of $A_t$ and $I_{N/p} \otimes T_t$ ($q = t$).

Fig. 5. Procedures to obtain 2-Dim. DCT/IDCT from 1-Dim. DCT/IDCT: (a) a way to make 1-Dim. IDCT decomposed matrices from 1-Dim. DCT, (b) 2-Dim. DCT decomposed matrices, (c) procedures to obtain 2-Dim. IDCT decomposed matrices.

Fig. 5 shows how to obtain the 1-Dim. IDCT matrix based on the 1-Dim. DCT matrix shown in Fig. 3(e). Based on these procedures, the 2-Dim. DCT/IDCT matrix can be obtained using the Kronecker product and the permutations in (5) and (7) from the 1-Dim. DCT decomposed matrix $B$ with the nonzero elements shown at the left side of Fig. 3(c). The method to obtain the 2-Dim. IDCT from the 2-Dim. DCT is the same as that for obtaining the 1-Dim. IDCT from the DCT, which

is shown in Fig. 5(a). The 2-Dim. DCT/IDCT matrices also have decomposed submatrices of identical size arranged diagonally for every stage, like the 1-Dim. DCT/IDCT matrices shown in Fig. 3(e). All the 1-Dim. DCT matrices ($I_{N/p} \otimes T_q$) shown in Fig. 3(d) have the same size and structure, but the values of the diagonally arranged submatrices $T_q$ differ among the butterfly stages. Therefore, to show the differences, the u submatrices in the q-th butterfly stage are defined as $T_{qn}$ ($1 \le n \le u$). The 1-Dim. IDCT matrix is composed of the inverse of ($I_{N/p} \otimes T_{qn}$) as shown in Fig. 5(a), and the diagonal elements of $(I_{N/p} \otimes T_{qn})^{-1}$ become the inverses of the submatrices $T$, which are the diagonal elements of the DCT described above, using Appendix Def. 4. Comparing the DCT and the IDCT in Fig. 5(a), they have the same structure of submatrices arranged diagonally, except for the scaling factors of the 1-Dim. DCT corresponding to the submatrix determinant $|T_{qn}|$, the signs and the locations of the matrix elements, all of which are easily programmable in hardware. Therefore, both DCT and IDCT can be implemented with identical PE chips. In Fig. 5(b), the structure of the $N^2 \times N^2$ 2-Dim. DCT is shown based on the N x N 1-Dim. DCT matrix with the p x p submatrices arranged diagonally. As explained in [16], to reduce the global interconnections of the 2-Dim. DCT/IDCT system, like the decomposed matrix Z in (6), the decomposed matrix can be restructured with shuffling matrices at both the left and right sides and the Kronecker product of a $B_q$ pair, i.e. ($B_q \otimes B_q$), at the center, where $B_q$ corresponds to the 1-Dim. DCT decomposed matrix shown in (4). For the regularity of the PE chip, as in the 1-dimensional case, with the permutation in (5) in the middle of (7), $B$ can be restructured into $I_{N/p} \otimes T_q$, which has submatrices arranged diagonally. Using Lemma 1 and 3, the decomposed matrix Z can then be restructured into the 2-Dim. DCT matrix $I_{N^2/p^2} \otimes T_q \otimes T_q$ and the shuffling matrices at the bottom of (7), which will be explained later in detail. Like the input shuffling at the left side of the butterfly stage in Fig. 6, the shuffling matrices can be implemented by the interconnections between PEs. As shown by $\pi(p, N)$ at the bottom of (7), for all the butterfly stages, the decomposed matrices depend only on the variables p and N, which are constant, enabling the use of an identical chip structure. Therefore, the algorithm is advantageous for semiconductor chip implementation, since only repeated use of a PE chip with the same structure is required.

Fig. 6. 8 x 8 constant geometry DCT and IDCT butterfly stage diagram: (a) 8 x 8 DCT, (b) 8 x 8 IDCT.

Comparing the 2-Dim. DCT matrix $I_{N^2/p^2} \otimes T_q \otimes T_q$ with the 1-Dim. DCT matrix $I_{N/p} \otimes T_{qn}$ shown in Fig. 3(d), the 2-Dim. DCT matrix has the submatrix $T_{qn} \otimes T_{qv}$, the Kronecker product of the $T_{qn}$ and $T_{qv}$

pair, instead of $T_{qn}$ for the 1-Dim. DCT, in the diagonal direction. The $p^2 \times p^2$ submatrix of the 2-Dim. DCT matrix in the q-th butterfly stage is defined as $T_{qn} \otimes T_{qv}$ ($0 \le n, v \le u - 1$) for two of the u cases of $T_q$ ($T_{qn}$ and $T_{qv}$), as shown in Fig. 5(b), based on the definition of $T_{qn}$ as the submatrix of the 1-Dim. DCT matrix. The 2-Dim. IDCT can be obtained from the 2-Dim. DCT as shown in Fig. 5(c), in the same way as the 1-Dim. IDCT is obtained from the 1-Dim. DCT. $(I_{N^2/p^2} \otimes T_{qn} \otimes T_{qv})^{-1}$ for the 2-Dim. IDCT, shown at the top of Fig. 5(c), can be represented as $p^2 \times p^2$ submatrices $(T_{qn} \otimes T_{qv})^{-1}$ arranged in the diagonal direction, as shown in the middle of Fig. 5(c), using Appendix Def. 3 as in the 1-Dim. IDCT case. Using Appendix Def. 3 again, each submatrix can be represented as $T_{qn}^{-1} \otimes T_{qv}^{-1}$, the Kronecker product of the p x p submatrices $T_{qn}^{-1}$ and $T_{qv}^{-1}$ for the 1-Dim. IDCT, shown at the bottom of Fig. 5(c). The 1-Dim. DCT submatrix $T_{qn}$ shown in Fig. 3(d) can be implemented with LUTs. The 1-Dim. IDCT, which has the same structure as the 1-Dim. DCT, can be implemented with the same LUTs and simple additional hardware for scaling, sign reversal and the shuffling of the elements. In the 2-Dim. IDCT submatrix $T_{qn}^{-1} \otimes T_{qv}^{-1}$ shown at the bottom of Fig. 5(c), the determinants $|T_{qn}|$ and $|T_{qv}|$ can be realized by a scaling factor, and the rest can be implemented with LUTs without any change of the coefficients in the LUTs, only by changing the access order and reversing the signs of the coefficients. Therefore, the 2-Dim. IDCT submatrix can be realized by the 2-Dim. DCT submatrix PE. The detailed PE chip design structure will be described later. The proposed architectures based on the constant geometry DCT and IDCT algorithms and the detailed VLSI implementation methodologies are described with the radix 2, 8 x 8 DCT as an example. They are easily applicable to the IDCT, since the DCT and IDCT have the same computation structure.
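The two facts used above can be verified on small matrices: the inverse of a 2 x 2 submatrix needs only a determinant scaling, sign reversals and a relocation of the same coefficients, and the inverse of a Kronecker product is the Kronecker product of the inverses. The sketch below uses placeholder integer matrices rather than the paper's twiddle factor matrices, and all function names are hypothetical.

```python
from fractions import Fraction


def inverse_2x2(t):
    """Inverse of [[a, b], [c, d]] from the same four coefficients:
    relocate (swap a and d), reverse signs (of b and c), scale by 1/det."""
    (a, b), (c, d) = t
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    scale = Fraction(1) / det
    return [[d * scale, -b * scale],
            [-c * scale, a * scale]]


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [[a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
             for j in range(len(a[0]) * cols_b)]
            for i in range(len(a) * rows_b)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Placeholder 2 x 2 submatrices standing in for T_qn and T_qv.
t_n = [[1, 2], [3, 4]]
t_v = [[2, 1], [1, 1]]

forward = kron(t_n, t_v)                              # T_qn (x) T_qv
backward = kron(inverse_2x2(t_n), inverse_2x2(t_v))   # T_qn^-1 (x) T_qv^-1
product = matmul(forward, backward)
print(product == [[1 if i == j else 0 for j in range(4)] for i in range(4)])
# True
```

Exact rational arithmetic (`fractions.Fraction`) is used only so that the identity check is exact; a hardware PE would instead apply the scaling as a fixed factor, as the text describes.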
8 x 8 DCT can be derived as shown at the right side of (8), where each butterfly stage, Z q in (7) is decomposed into K q and SF. [X(k N + k )] = (Q Q) (I π(,8) I ) (K SF) (K SF) (K SF) (I π(,8) I ) (S S) [X(n N + n )] Where, Q = (Q Q Q ), Q = I / R π(n/l, pn/l) K = (I T T ), T = twiddle factor matrix SF = (I π(,8) I ) π(,8) π(,8) (I π(,8) I ) k, k, n, n =, 1,,,63 i = 1,, log 8, kronecker product (8) By regrouping the shuffling for the input and output as well as inter-butterfly stage, the inter-butterfly stage shuffling can be moved after each butterfly computation, K with OS and IS as in (9). [X(k N + k )] = (Q Q) OS (SF K ) (SF K ) (SF K ) IS [x(n N + n )] where, OS = (I π(,8) I ) SF, IS = SF (I π(,8) I ) (S S) (9) To devise an architecture to exploit constant geometry for both butterfly and recursive addition stage, firstly, the structure of K i SF (i th [X(k N + k )] butterfly stage) is analyzed. = RA RA OS (SF K ) (SF K ) (SF K ) IS [x(n N + n )] Where, RA = π(8,64) (I Q) (1) -Dim. IDCT can be represented as (11) by the inverse matrices of -Dim. DCT in (1). [x(n N + n )] = IS (K SF ) (K SF ) (K SF ) OS (Q Q) [X(k N + k )] (11) Thanks to constant geometry properties, SF (butterfly stage input shuffling) is the same for all butterfly stages as shown in (8). Both SF and SF -1 can be implemented by interconnections only. It can be implemented by programming cross points of horizontal and vertical for interstage interconnections. Only the twiddle factors in T T inside K i are different for butterflies, which can be programmable which is discussed later. The inverse matrices of constant geometry DCT in (9) are constant geometry, too. A butterfly stage (SF K i ) for 8 x 8 DCT in (9) is shown in Fig. 6. To compute the butterfly and recursive addition stages with identical PE s, those stages are made identical to be implemented with a simple butterfly structure. As shown in (9) and Fig. 6, in case a butterfly

is computed by PEs serially with inputs fed by a shift register, SF, the shuffling of the 64 input data, has the disadvantage of requiring complex hardware with large latencies, because the input data need to be shuffled over large distances. Therefore, a stage-by-stage pipeline is used with a shared bus to feed the shuffled inputs to the PEs. The computation structure of the butterfly stage is described in Fig. 7(a). An identical butterfly in a 2-Dim. DCT and IDCT butterfly stage can be computed by a 4 x 4 submatrix with 4 input data. As shown above, in the 4 x 4 submatrix $T_{qn} \otimes T_{qv}$ of $I_{16} \otimes T_q \otimes T_q$ at the top of Fig. 7(a), each row is composed of identical non-zero element values, except for a regular pattern of sign changes, for all butterfly stages. Each 4 x 4 submatrix requires 12 additions and 4 multiplications. The regular pattern of sign changes of the 4 x 4 matrix can be formulated with $S_0$ and $S_1$ as in Table 1, according to the order of the 2 x 2 matrices $T_{qn}$ and $T_{qv}$ in each stage, as shown at the bottom of Fig. 7(a). The methodologies to restructure $T_{qn} \otimes T_{qv}$ with $I_p \otimes T_{qv}$ and $I_p \otimes T_{qn}$ (p = 2) and the shuffling according to (12) are presented to take advantage of repeated hardware usage by utilizing the butterfly structure. The 4 x 4 matrices $I_2 \otimes T_{qn}$ and $I_2 \otimes T_{qv}$ can therefore be restructured as two regular 2 x 2 matrices $T_{qn}$ and $T_{qv}$ using Lemma 1 and 2. The 2 x 2 matrix $T$, which computes a 1-Dim. DCT butterfly stage, can be represented as a butterfly that processes the input data with identical matrix elements except for the sign difference in each butterfly stage, as shown in the matrix in the middle of Fig. 7(a). In Fig. 7(b), the structures of the 4 x 4 and 2 x 2 butterflies are shown based on the butterfly computation matrix above. In conclusion, the butterfly that computes 4 input data is composed of the regular repetition of two butterflies that compute 2 inputs each and the shuffling shown in (12). And each 2-input butterfly for 1-Dim.
DCT has computation amount of additions and 1 multiplication. As shown at the bottom of Fig. 7(a), the final x butterfly can be realized by identical PE structure except one regular sign change according to S, S 1 in Table 1. In this paper, based on this regularity explained above, a methodology will be presented to compute the DCT/IDCT with repeated use of identical PE, a minimum computation unit for x butterfly for input data. Table 1. Sign patterns of 4 4 matrix T T = (, ) T T (, ) = (, ) T T (, ) = (, ) T T (, ) = T (, ) T (, ) = (, ) (, ) T (, ) T (, ) (a) (b) Fig. 7. Butterfly and computation structure (a) x, 4 x 4 matrices and sign patterns, (b) Butterfly computation. = ( T ) (, ) ( T ) (, ) (1) Larger (Smaller) radix increases (decreases) parallelism in a butterfly stage with higher (lower) throughput. The PE stage to compute a butterfly stage can be recursively used to compute multiple butterfly stages saving hardware with lower throughput and vice versa. Detailed performance evaluations will be discussed in V. Recursive addition stage, (Q Q) in (9) is composed of irregular structures of additions and complex I/O patterns as shown in Fig. 8. This makes DCT/IDCT system implementation with redundancies complex, requiring multiple types of PE`s for butterfly and recursive addition stages. Therefore, (Q Q) is

12 414 JAEHEE YOU : DCT/IDCT CONSTANT GEOMETRY ARRAY PROCESSOR FOR CODEC ON DISPLAY PANEL Fig. 8. Structure of recursive addition stage. decomposed into two matrices with identical structure as shown in (13) to realize constant geometry for entire DCT stages and to make (Q Q) to be computed by the same PE for the butterflies. = ( )( ) = (8,64) (8,64) ( ) (8,64) (8,64) ( ) = (8,64)( ) (8,64)( ) = (13) ( ) = (14) where, Q = Q.5.5 = Since all of the coefficients of Q are, +1, -1, +, -, 4 and RA is the shuffled version of Q coefficient rows, RA is basically composed of n with integer n. For recursive addition for IDCT, (Q Q) -1 = (RA RA) -1, where RA -1 is composed of only -n with integer n. Therefore, the recursive additions for both DCT and IDCT need only adders and shifters without a multiplier. The final 8 8 DCT algorithm used for PE chip design is formulated as (1). Taking the inverse of (1), 8 x 8 IDCT algorithm can be derived as (11). As shown in (1), the same PE can implement all of DCT/IDCT butterfly as well as recursive addition stage computations. It is particularly useful and efficient for redundancy replacement to alleviate low yield problem of SOP processing technologies. To make the butterfly stage and the recursive addition stage identical in implementation structure, the following modification is made. As shown in (14), all of them have 8 fold symmetries which will be utilized in the following section. Inverses of butterfly and recursive addition computation have 8 fold symmetries, too. K = I (T T ) = I (I T T ) (15) RA = (I Q) (16) The recursive and inverse of recursive addition for - Dim. DCT and IDCT are composed of two identical stages of RA RA in (13) and RA -1 RA -1 in (14), respectively. For simplicity, only one stage 64 x 64 RA and RA -1 are described. To implement both butterfly stage and recursive addition stage with identical PE s, RA and RA -1 are restructured like x butterfly structure discussed above. As shown at the left side of Fig. 
9(a), RA in (16) is composed of groups, each of which has 8 or 4 computations of 3 types (1 x 1, 2 x 1 and 3 x 1). As shown in the thick box in Fig. 9(a), $RA^{-1}$ is composed of 8 repetitions of groups composed of 3 types of computation: 1 x 1, 2 x 1 and 4 x 1. Therefore, it is enough to describe the restructuring method for a single group, which is repeated in a stage, and for a single computation, which is repeated in a group. Fig. 9(b) and (c) show the structure of the computation units: 1 x 1, 2 x 1 and 3 x 1 for RA, and 1 x 1, 2 x 1 and 4 x 1 for $RA^{-1}$. As shown in Fig. 9(b) and (c), to compute RA and $RA^{-1}$ with PEs that compute the 2 x 2 butterfly, which is the computation unit of a butterfly stage, the first two inputs are computed with one PE, the next two of the remaining inputs are computed with another PE, and so on, sequentially through multiple pipeline stages. The first type, the 1 x 1 computation, can be realized just by an interconnection that feeds the input forward to the output without any PE. The second type, the 2 x 1 computation, uses only one output of a 2 x 2 butterfly PE. As shown in Fig. 9(b), the third type, the 3 x 1 computation, uses two 2 x 2

13 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.18, NO.4, AUGUST, (a) (b) (c) Fig. 9. Recursive addition computation structure (a) Structure of recursive addition and inverse recursive addition matrix, (b) Recursive addition computation, (c) Inverse recursive addition computation, (d) Structure of physical layout of RA and inverse RA. (d) butterfly stages in cascade, which requires PE s. Similarly, the fourth type, 4 x 1 computation uses three stages in cascade, which requires 3 PE s. Also, as shown in Fig. 9(b) and (c), the coefficients required in RA and RA -1 computations are the type of which can be implemented only with simple shifter that will be explained later. Therefore, RA and RA -1 stages shown in Fig. 9(a) can be implemented with identical PE utilizing the constant geometry. In Fig. 9(d), utilizing constant geometry, all the butterfly stages as well as RA and RA -1 can be computed with 3 identical PE s in a stage. One PE for each butterfly and a couple of PE s for RA and RA -1 are allocated with vertically running cross-point interconnections to feed the signals from any points to any points between stages horizontally. Due to the pipeline computation structure explained above, each RA and RA -1 requires 48 PE s compared to 3 PE s for a butterfly stage, which causes difference in height in chip layout. Therefore, as shown in Fig. 9(d), each RA and RA -1 computation is divided into two parts, 1 and. Each part is composed of 3 PE s for the butterfly stage. In case of RA shown in Fig. 9(d), from bottom to top, sixteen groups of two PE s for 3 x 1, eight PE s for x 1 are placed and 8 feed-forward lines for 1 x 1 computation are implemented at the place that corresponds to unused eight PE s. And in case of RA -1, to implement three PE groups for 4 x 1 computation, a vertical output to input feedback interconnections are used perpendicular to RA -1 part shown in Fig. 9(d). x 1 and 1 x 1 computation can be implemented similarly as RA. As shown in Fig. 
6, SF and SF -1 have the structure of block matrix repetition every 4 or 8 rows for output pixels. As shown in (8) and (9), both the butterfly

14 416 JAEHEE YOU : DCT/IDCT CONSTANT GEOMETRY ARRAY PROCESSOR FOR CODEC ON DISPLAY PANEL stage and the recursive addition stage have (I 8 ). (I 16 T T ) and Q are computed 8 times per stage for butterfly and recursive addition. All the DCT/IDCT decomposed matrices for the shuffling and butterfly computation have 8 fold symmetries in each stage. Also, the architecture for each stage as shown in Fig. 9 has 8 identical sections with regularity. Therefore, for 8 x 8 DCT and IDCT, only 8 rows are enough to be investigated. The structure of computations with the array of identical PE s and the shuffling interconnections will be discussed later for butterfly and recursive additions. Each PE computes a radix butterfly and a recursive addition for DCT and IDCT. A PE has 4 inputs and outputs for a butterfly and for the recursive addition. To make the throughput the same, the recursive addition needs two PE s in parallel as shown in Fig. 7. (a) (b) III. 8 X 8 CONSTANT GEOMETRY DCT VLSI ARCHITECTURES Based on constant geometry DCT/IDCT algorithm in (1), 8 x 8 DCT/IDCT overall system and PE chip implementations are discussed. 1. Overall Architecture 8 8 DCT/IDCT system is composed of butterfly stages SF K i and recursive addition stage, RA. Both of the input shuffler (IS in (1)) and output shuffler (OS in (1)) are only composed of interconnections with higher yield compared to semiconductor devices. As explained in the previous section, there is 8 fold symmetry in the butterfly and the recursive addition stage. 8 PEs or less are used per stage to exploit the symmetry. Also, PE arrays for a stage can be recursively used to reduce hardware. Based on constant geometry properties, an identical PE can compute two outputs for a radix butterfly for DCT or IDCT as well as recursive addition for RA or RA -1 as shown in Fig. 9 with programmable interconnections. As shown in Fig. 
9, each PE computes the butterflies by a multiplier and by shifters to compute n for RA and RA -1 as shown in Fig. 9. Considering speed and power requirements, the number of PE s can be optimized for the required degree of serial and parallel computation. As shown in Fig. 9(a), PE computes pixel outputs from maximum 7 pel inputs with 1,, 16 and (c) Fig x 8 Constant Geometry DCT architecture based on fold symmetry (a) PE, (b) PE I/O, (c) PE array architecture to compute DCT and IDCT butterfly and recursive addition. 3istances away. As shown in Fig. 1(b), PE column array computes a butterfly or recursive addition stage. BCU (Bus Control Unit) programs the interconnections between vertical bus and PE input ports accordingly. The number of PE columns and PE s in the PE column can be optimized according to application areas. Maximum 7 pels are required to compute two outputs of a butterfly stage (SF K i and RA for DCT, K i -1 SF -1 and RA -1 for IDCT). Considering two rows (two outputs) of the matrices above, inputs can be fed into a PE by the programmable interconnections with programmable row start point and pel location offset inside one of 8 fold sections as shown in Fig. 9. Considering low yield of SOP, additional PEs can be used a stage for redundancy

15 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.18, NO.4, AUGUST, IV. VLSI IMPLEMENTATIONS VLSI implementation details including the distributed arithmetic multiplier using ROM are discussed based on the architecture in Ⅲ. 1. Multiplication ROM Fig. 11. Programmable PE architecture. replacement. For example, input pels are 1,, 16 and 3 pels away from row start pel location and the row start pel locations are, 4, 8, 9 from the start locations of 8 fold section, which can be easily programmable by the bus control unit for 7 buses in Fig. 1. Furthermore, redundancy PE can be easily implemented using the same bus architectures as shown in Fig PE Architecture Fig. 11 shows a detailed PE architecture to compute both butterfly and recursive addition stage. A PE consists of distributed arithmetic multiplier to multiply twiddle factors to input pixels, shifters for n type coefficient multiplication for RA, RA -1 and adders (subtractor) to sum up the results afterwards. The same adders (subtractor) are used for butterfly and recursive addition stage without any extra hardware. There are differences only in input shuffling bus, which doesn t increase hardware complexity, either. As shown in Fig. 11 for butterfly computation, (x + x 1 ), (x - x 1 ) and c(x - x 1 ) are computed by adder1, adder and distributed arithmetic multiplier, respectively. Shifter shifts input pixels and adder1, adder and adder/subtractor3 add maximum four shifted pixels for recursive addition and also for both K i, SF, RA for DCT and SF -1, K -1 i, RA -1 for IDCT of (1) as shown in Fig. 11, respectively. It can be pipelined to get higher throughput. Each BU receives maximum 7 inputs and computes a butterfly and recursive addition stage to generate outputs. The multiplication for the recursive additions can be done with simple shifts since all the entries are multiples of two as discussed before. The multiplication by n can be pre-programmed for RA and RA -1. 
A multiplier for the inner product of DCT/IDCT needs the most hardware in a PE, which needs to be minimized for low yield SOP. The multiplication ROM is designed to reduce PE chip area by exploiting the fact that the number of twiddle factors needed for DCT/IDCT is limited, as follows.

A. Accuracy Estimation

Since twiddle factors are expressed in limited word length, inaccuracy may occur. The required word length is simulated in C, with the word lengths of the ROM and the datapath as variables. Fig. 12 shows the MSE. As shown in Fig. 12, 15 bit and 13 bit word lengths for the datapath and ROM, respectively, sufficiently satisfy the CCITT standard [21]. For IEEE 1180, only 13 bit and 11 bit word lengths for the datapath and ROM are required.

B. Modified Partial Sum Method

Two methodologies to reduce the multiplication ROM are presented to save chip area. First, the partial sum method [20] is modified to use a dual port ROM instead of two separate ROMs. Second, a sign processing method using the characteristics of 2's complement is used. For a formal derivation, the multiplication with 2's complement input and output is formulated as (17). ROM_i(c, x(a:b)) is the ROM output for the partial multiplication result of twiddle factor c and the bits of input x from the a-th to the b-th bit. The bits from the a-th to the b-th bit are used as the ROM address. (17) shows the typical partial sum technique: the input bits are divided into upper and lower bits in order to reduce ROM size, and the outputs of the ROMs are added to produce the final result. However, since the ROM for the sign bit needs to be included in ROM_1 or ROM_2 in (17), two different ROMs are required, and the ROM size increases with increasing word length. Therefore, in order to decrease ROM size and to use the same coefficients for the dual port ROM, (17) is modified as (18).
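The two ideas in this subsection can be tried out in a few lines of Python. The sketch below is only an illustration under stated assumptions, not the paper's circuit: the word length `N`, the coefficient `C_FIXED`, and all function names are hypothetical, and the ROM entries hold exact integer products so that the result can be compared with ordinary multiplication. It also shifts the upper partial product left, whereas the hardware described here shifts the other ROM output right; the two are equivalent in exact arithmetic. `multiply_conventional` follows the typical partial sum idea with two different tables, and `multiply_shared_rom` is one possible reading of the sign processing variant, where a single table is read twice.

```python
N = 9                  # input word length in bits (2's complement); odd, as the split assumes
H = (N - 1) // 2       # address width of one partial ROM
C_FIXED = 181          # hypothetical fixed-point coefficient, round(cos(pi/4) * 256)


def build_unsigned_rom(c, addr_bits):
    """ROM bank holding c * a for every unsigned address a.

    One extra entry (a == 2**addr_bits) is kept so that the magnitude of the
    most negative upper part stays addressable. That extra entry is a
    simplification of this sketch, not a detail taken from the paper.
    """
    return [c * a for a in range((1 << addr_bits) + 1)]


def build_signed_rom(c, addr_bits):
    """ROM bank addressed by a 2's complement pattern (sign bit included)."""
    half = 1 << (addr_bits - 1)
    return [c * (a if a < half else a - 2 * half) for a in range(2 * half)]


def split(x):
    """Split a signed integer into its lower H bits and upper H + 1 bits."""
    word = x & ((1 << N) - 1)          # 2's complement bit pattern
    return word & ((1 << H) - 1), word >> H


def multiply_conventional(x, rom_low, rom_high_signed):
    """Conventional partial sum: two different ROMs, the upper one absorbs the sign."""
    low, high = split(x)
    return rom_low[low] + (rom_high_signed[high] << H)


def multiply_shared_rom(x, rom):
    """Sign-processed partial sum: one table read twice, as a dual port ROM would allow."""
    low, high = split(x)
    negative = high >> H               # sign bit of the input
    # Sign unit: pass the upper bits through, or take their 2's complement.
    magnitude = (1 << (H + 1)) - high if negative else high
    upper = rom[magnitude] << H
    return rom[low] - upper if negative else rom[low] + upper


ROM_LOW = build_unsigned_rom(C_FIXED, H)
ROM_HIGH_SIGNED = build_signed_rom(C_FIXED, H + 1)
ROM_SHARED = build_unsigned_rom(C_FIXED, H)

if __name__ == "__main__":
    for x in (-256, -100, -1, 0, 77, 255):
        assert multiply_conventional(x, ROM_LOW, ROM_HIGH_SIGNED) == C_FIXED * x
        assert multiply_shared_rom(x, ROM_SHARED) == C_FIXED * x
    print(multiply_shared_rom(-100, ROM_SHARED))   # -18100
```

In the conventional version the upper table must be addressed with the sign bit included, so it differs from the lower table. In the shared version both reads use the same unsigned table, and the sign only decides whether the final adder adds or subtracts.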

[Eqs. (17) and (18): partial sum expansions of y = c·x in terms of ROM outputs; not recoverable from the extraction.]

From (17) and (18), the multiplication result y can be shown as (19).

[Eq. (19): two cases, one for each value of the sign bit; not recoverable from the extraction.]

(19) shows that the multiplication can be replaced with the addition or subtraction of ROM outputs if the ROM bank is selected according to the coefficient. The ROM addresses are decided according to the input bit patterns. Also, since each ROM contains completely identical coefficients, a dual port ROM efficiently halves the ROM size by utilizing linearity. A dual port ROM can be implemented more easily than a dual port RAM, since a ROM does not need write circuits and the hardwired internal data can drive the two ports easily. If the ROM word lengths in the conventional partial sum method and the proposed one are w_p and w_mp, respectively, the ratio of ROM cell counts can be shown as (20).

[Eq. (20): ROM cell count ratio; not recoverable from the extraction.]

In the case of w_p = w_mp, about 3% of ROM cells can be saved. Furthermore, since usually w_mp < w_p, the efficiency can be greatly increased.

Fig. 12. MSE evaluation: (a) data word length vs. MSE, (b) ROM word length vs. MSE.

C. Multiplication ROM Implementation

Table 2 shows the summary of the PE chip design. Fig. 14 shows the photograph of the PE chip with low transistor counts. Assuming 0.18 um processing technologies, the chip area is very small compared to existing DCT/IDCT chips [9, 19]. Even with low yield SOP processing technologies, the PE can be efficiently implemented.

Fig. 13 shows the structure of the multiplication ROM. The input conversion unit divides the N input bits into (N + 1)/2 bits and (N - 1)/2 bits, and the upper (N + 1)/2 bits are converted into (N - 1)/2 bits by the sign unit.
The resulting two (N - 1)/2-bit inputs are sent to the dual port ROM. If the value is positive, the sign unit generates the bits excluding the most significant sign bit, and if the value is negative, it generates the 2's complement of the value. Since the 8 x 8 DCT/IDCT needs a total of 7 coefficients, the dual port ROM uses a 3-bit bank control signal to select one of the seven coefficients according to the location of the PE as well as the input signals shown in Fig. 13, which is the only difference inside a PE. The adder/subtractor shown as adder3/subtractor3 in Fig. 11 receives the two outputs of the ROM; if the sign bit is positive (negative), addition (subtraction) is carried out. The adder (subtractor) adds (subtracts) the ROM output shifted right by (N - 1)/2 bits. As discussed above, other twiddle factors can be synthesized by the multiplication of basic twiddle factors, so the number of ROM banks can be minimized further.

Table 2. Summary of PE chip
Transistor count: 3K
Chip size (including pads): 1867λ x 2534λ (for λ = 0.5 um: 0.93 mm x 1.27 mm = 1.18 mm^2; for λ = 0.18 um: 0.34 mm x 0.46 mm = 0.15 mm^2)
Throughput: 4 pixels/cycle

Fig. 13. Structure of multiplication ROM.

Fig. 14. PE chip photograph.

V. PERFORMANCE ANALYSIS AND ADVANTAGES

The performance of the proposed general radix-p, N x N point constant geometry DCT/IDCT system is analyzed as shown in Table 3. Assume r PEs are assigned per 1/s of the total butterfly stages, log_p N. Also, a PE is assumed to be pipelined. Since the recursive addition, which requires only shifts and additions, consumes less time than the butterfly computation, it is excluded from the throughput analysis. If one PE computes all the butterfly and recursive addition stages, then r = s = 1. For the completely parallel case, r = N/p, s = log_p N and N outputs are computed every clock. As r and s increase, the degree of parallel processing increases. The architectures chosen for a fair comparison satisfy the criterion of no transposition memory [23]. Other architectures except [23] have irregular or global interconnections, making VLSI implementation difficult.

Table 3. Performance evaluation for the proposed 2-Dim. N x N point DCT (rows: throughput, number of multipliers, area x time; the entries are not recoverable from the extraction).

Fig. 15. Comparisons of Area X Time.

Fig. 15 shows Area X Time (a performance metric for speed-sensitive application areas) of the radix-p, N x N DCT/IDCT system proposed in this paper compared to other architectures. The comparison shows that the proposed DCT/IDCT system has one of the lowest values among the compared architectures. Performance values become smaller as the radix and the degree of parallel processing increase and the degree of recursive processing decreases. Also, they can be optimized using various radix p, the degrees of parallel processing r and s, and the degree of recursive processing, for a wide range of computation performance vs. hardware amount trade-offs, unlike other existing architectures.

Fig. 16. Performance evaluations of proposed DCT/IDCT system according to radix, PE count and degree of recursiveness for Area X Time.

The performance of the PE chip is as follows. Each PE chip processes two radix-2 butterfly stage or recursive addition input data in a single clock cycle, and a throughput of 4 pixels/cycle can be obtained. The advantages of the proposed DCT/IDCT processor can be summarized as follows.

1. Only simple hardware without transposition memory or multipliers is required.

2. Although existing DCT systems such as [27] need separate hardware for recursive addition in addition to the butterfly, the structures of each butterfly stage, the recursive addition stage and the interconnections between stages are all the same in the proposed DCT/IDCT system. Only duplicated use of the simple PE chip hardware with high yield programmable interconnections is sufficient. This also eases redundancy replacement for fault tolerance even in low yield SOP processing technologies. Furthermore, by programming the order of the butterfly and recursive addition stages with interstage interconnections, DCT and IDCT computation can be multiplexed on the same hardware, which greatly reduces the hardware amount for mobile codec systems.

3. Based on the regularity, the degree of parallel processing in a stage as well as the degree of recursive processing across all butterfly stages can be easily optimized with respect to speed and hardware amount, according to processing technology yield and application areas, as shown in Fig. 16.

4. Compared to the partial sum method used for existing multiplication ROMs, less ROM can perform the same multiplication necessary for DCT/IDCT using the proposed dual port ROM, which can be decomposed into two identical parts.

VI. CONCLUSIONS
VLSI algorithms, architectures and PE chip design methodologies are introduced to implement all the butterfly and recursive addition stages of both DCT and IDCT with only a single PE chip and interconnection programming between stages. The approach is suitable for an image codec system on panel, with efficient redundancy in the case of low yield SOP processing technologies. Speed vs. hardware amount can be easily optimized based on the degree of serial-parallel-recursive processing. A distributed arithmetic multiplication dual port ROM is proposed to save chip area. An experimental low transistor count 8 x 8 DCT/IDCT PE chip is designed to illustrate the proposed algorithms and architectures. Finally, performance is analyzed along with the advantages.

ACKNOWLEDGMENT

I appreciate Jiseong Yoon, Junkyung Kim and Jeehwan Kim for their efforts on MATLAB coding and data analysis.

APPENDIX

Def. 1. The Kronecker product of two matrices A and B is defined as the block matrix A ⊗ B = [a_ij B], whose (i, j) block is a_ij B.

Def. 2. If N = pm, the shuffling π(p, N) is defined as follows.

π(p, N)(x(0), x(1), ..., x(N - 1)) = (x(0), x(p), x(2p), ..., x((m - 1)p), x(1), x(p + 1), ...)

A constant geometry DCT algorithm [16] proposed by the author is composed of identical butterfly stages for general DCT input lengths and radices, as shown in (21). The 2-Dim. constant geometry DCT algorithm with input length N and radix p can be shown as (21) by using the definitions in this appendix.
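Definitions 1 and 2 are easy to check numerically. The following pure-Python sketch is an illustration only: the helper names are hypothetical, matrices are plain lists of rows, and the shuffle follows the reading of Def. 2 in which the outputs are x(0), x(p), x(2p), ... followed by x(1), x(p + 1), and so on. It also checks two standard Kronecker product facts: shuffling rows and columns swaps the two factors, and for a square factor Q the product Q ⊗ Q can be written as the same "shuffle after I ⊗ Q" stage applied twice, which is the kind of two-identical-stage structure the recursive addition relies on.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows (Def. 1)."""
    return [[a_ij * b_kl for a_ij in row_a for b_kl in row_b]
            for row_a in a for row_b in b]


def shuffle_indices(p, n):
    """Source index of every output of pi(p, n), assuming n = p * m (Def. 2)."""
    m = n // p
    return [j * p + i for i in range(p) for j in range(m)]


def shuffle(p, x):
    """Apply pi(p, len(x)) to a sequence: x(0), x(p), ..., x(1), x(p + 1), ..."""
    return [x[k] for k in shuffle_indices(p, len(x))]


def shuffle_matrix(p, n):
    """Permutation matrix P of pi(p, n), so that P times x equals shuffle(p, x)."""
    src = shuffle_indices(p, n)
    return [[int(col == src[row]) for col in range(n)] for row in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    print(shuffle(2, list(range(8))))        # [0, 2, 4, 6, 1, 3, 5, 7]

    a = [[1, 2], [3, 4]]                     # m x m with m = 2
    b = [[1, 0, 2], [0, 3, 1], [4, 1, 0]]    # p x p with p = 3
    p_mat = shuffle_matrix(3, 6)
    # Shuffling rows and columns swaps the Kronecker factors.
    assert matmul(matmul(p_mat, kron(a, b)), transpose(p_mat)) == kron(b, a)

    q = [[1, 1], [2, -1]]                    # small stand-in for Q
    s = shuffle_matrix(2, 4)                 # pi(2, 4) is its own inverse
    stage = matmul(s, kron(identity(2), q))  # shuffle applied after I (x) Q
    assert matmul(stage, stage) == kron(q, q)
```

The 2 x 2 matrix `q` is only a stand-in; the same check works with an 8 x 8 matrix and `shuffle_matrix(8, 64)`, at the cost of a slower pure-Python matrix product.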

[Eq. (21): 2-Dim. constant geometry DCT of input length N and radix p, written with Kronecker products and the shufflings π(·, ·), together with the definitions of its stage matrices; not recoverable from the extraction.]

R_q is a square matrix of order pn/l that repeats the additions for the following N/p outputs so that they apply to all the shuffled outputs.

[Defining relation of R_q; not recoverable from the extraction.]

Also, I_N is a unit matrix of order N x N, T_q is a twiddle factor matrix of order q x q, and S is a shuffling matrix as follows.

[Definition of S; not recoverable from the extraction.]

By taking the inverse of (21), the IDCT can be derived as shown in (22).

[Eq. (22): inverse of (21); not recoverable from the extraction.]

Def. 3. (Lemma 4.2.10 in [28]) The product of two Kronecker products yields another Kronecker product:

(A ⊗ B)(C ⊗ D) = AC ⊗ BD (23)

Def. 4. (Corollary in [28]) If A and B are non-singular, then (A ⊗ B)^-1 = A^-1 ⊗ B^-1. This property follows directly from the mixed product property, Def. 3.

Lemma 1. [Statement relating π(·, ·) to a Kronecker product; not recoverable from the extraction.]

REFERENCES

[1] H. G. Walton, M. Brownlow, "Invited Paper: The System Integrated LCD," Euro Display 2005.
[2] Y. Kida, Y. Toyoshima, et al., "Development of System-on-glass Display for Narrow Frame width with Integrated Color Management Function," Euro Display 2005.
[3] S. Y. Lee, J. H. You, "Design Considerations on Partition of SOP, CMOS and PCB technologies for Mobile Display System Implementation," IMID 2007 Digest.
[4] Y. Matsueda, Y. S. Park, et al., "Trend of System on Panel," IMID 2005 Digest.
[5] N. Karaki, T. Nanmoto, et al., "A Flexible 8-bit Asynchronous Microprocessor Based on Low-Temperature Poly-Silicon (LTPS) TFT Technology," SID 2005 Digest.
[6] T. Nishibe, H. Nakamura, "Value-Added Circuit and Function Integration for SOG (System-on Glass) Based on LTPS Technology," SID 2006 Digest.
[7] M. Murase, Y. Kida, et al., "Narrow-Frame System-On-Glass Liquid Crystal Display with Low Voltage Interface Circuitry," IDW 2006.
[8] M. Verderber, A. Zemba, A. Trost, "HW/SW Codesign of the MPEG-2 Video Decoder," Proceedings of Int. Parallel and Distributed Processing Symposium, Apr. 2003.
[9] D. Gong, Y. He, and Z. Cao, "New cost-effective VLSI implementation of a 2-D discrete cosine transform and its inverse," IEEE Trans. Circuits and System on Video Technology, vol. 14, no. 4, Apr. 2004.
[10] A. Setyono, A. Md. Jahangir, C. Eswaran, "Development and Implementation of Compression and Split Techniques for Multimedia Messaging Service Applications," International Journal of Computer Theory and Engineering, Vol. 6, No. 1, Feb. 2014.
[11] J. H. Kim, "Mobile terminal capable of controlling various operations using a multi-fingerprint-touch input and method of controlling the operation of the mobile terminal," U.S. Patent, issued June 3, 2014.
[12] K. Gaedke, J. Franzen, P. Pirsch, "A Fault-tolerant DCT-Architecture based on Distributed Arithmetic," IEEE Int. Symp. on Circuits and Systems, May.
[13] A. Hatim, S. Belkouch, T. Sadiki, M. M. Hassani, "Efficient hardware architecture for direct 2D DCT computation and its FPGA Implementation," 25th International Conference on Microelectronics (ICM), Dec. 2013.
[14] Q. Shang, Y. Fan, W. Shen, S. Shen, "Single-Port SRAM-Based Transpose Memory With Diagonal Data Mapping for Large Size 2-D DCT/IDCT," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 22, No. 11, Nov. 2014.
[15] S. F. Hsiao, W. R. Shiue, "A New Hardware-Efficient Algorithm and Architecture for Computation of 2-D DCT on a Linear Systolic Array," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 11, No. 11, Nov. 2001.
[16] J. Kwak, J. You, "1-D and 2-D Constant geometry fast cosine transform algorithms and architectures," IEEE Trans. on Signal Processing, vol. 47, No. 7, pp. 2023-2034, July.
[17] J. You, "Unified Constant Geometry DCT/IDCT for Image Codec System on Display Panel," IEICE Trans. on Fundamentals of Electronics, Communication and Computer Sciences, Vol. E95-A, No. 12, Dec. 2012.
[18] B. E. Caroline, G. Sheeba, J. Jeyarani, F. S. Roseline Mary, "A Reconfigurable DCT/IDCT architecture for video codec: A Review," Third International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 2012.
[19] S. Ghosh, S. Venigalla, M. Bayoumi, "Design and Implementation of a 2D-DCT Architecture using Coefficient Distributed Arithmetic," IEEE Computer Society Annual Symp. on VLSI: New Frontier in VLSI Design, 2005.
[20] M. T. Sun, T. C. Chen, A. M. Gottlieb, "VLSI implementation of 16 x 16 Discrete Cosine Transform," IEEE Trans. on Circuits and Systems, vol. 36, No. 4, Apr.
[21] CCITT SGXV Working Party XV/1 Specialists Group on Coding for Visual Telephony, Document 584, Nov.
[22] J. I. Guo, C. M. Liu, C. W. Jen, "A New Array Architecture for Prime-Length Discrete Cosine Transform," IEEE Trans. on Signal Processing, Vol. 41, No. 1, Jan.
[23] C. Wang and C. Chen, "High-throughput VLSI Architectures for the 1-D and 2-D discrete cosine transform," IEEE Trans. Circuits System on Video Technology, vol. 5, No. 1, pp. 31-42, Feb.
[24] N. Cho and S. Lee, "Fast algorithm and implementation of 2-D Discrete cosine transform," IEEE Trans. on Circuits System, vol. 38, No. 3, Mar.
[25] W. Ma, "2-D DCT systolic array implementation," Electron. Lett., vol. 27, pp. 201-202, Jan.
[26] C. T. Chiu and K. J. R. Liu, "Real-time parallel and fully pipelined two-dimensional DCT lattice structures with application to HDTV systems," IEEE Trans. on Circuits and System on Video Technology, vol. 2, pp. 25-37, Mar. 1992.
[27] N. Cho, D. Yun, S. Lee, "On the regular structure for the fast 2-D DCT algorithm," IEEE Trans. on Circuits and Systems, Vol. 40, No. 4, Apr.
[28] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis, Cambridge University Press, Cambridge.

Jaehee You received his B.S. degree in Electronics Engineering from Seoul National University, Seoul, Korea. He received his M.S. and Ph.D. degrees in Electrical Engineering from Cornell University, Ithaca, NY, in 1987 and 1990, respectively. In 1990, he joined Texas Instruments, Dallas, TX, as a Member of Technical Staff. In 1991, he joined the faculty of the School of Electrical Engineering, Hongik University in Seoul, Korea, where he is now supervising the Semiconductor Integrated System Laboratory. He has served as an Executive Director of the Drive Technology and System research group of the Korean Information Display Society. He was a recipient of the Korean Ministry of Strategy and Finance, KEIT Chairman Award for Excellence in 2011. He has worked as a technical consultant for various companies such as Samsung Semiconductor, SK Hynix, Global Communication Technologies, P&K, Penta Micro, Nexia Device and Primenet. His current research interests include integrated system design for display image signal processing and perceptual image quality enhancement.


More information

FFT/IFFTProcessor IP Core Datasheet

FFT/IFFTProcessor IP Core Datasheet System-on-Chip engineering FFT/IFFTProcessor IP Core Datasheet - Released - Core:120801 Doc: 130107 This page has been intentionally left blank ii Copyright reminder Copyright c 2012 by System-on-Chip

More information

1. NUMBER SYSTEMS USED IN COMPUTING: THE BINARY NUMBER SYSTEM

1. NUMBER SYSTEMS USED IN COMPUTING: THE BINARY NUMBER SYSTEM 1. NUMBER SYSTEMS USED IN COMPUTING: THE BINARY NUMBER SYSTEM 1.1 Introduction Given that digital logic and memory devices are based on two electrical states (on and off), it is natural to use a number

More information

Chapter 4 Implementation of a Test Circuit

Chapter 4 Implementation of a Test Circuit Chapter 4 Implementation of a Test Circuit We use a simplified cost model (which is the number of transistors) to evaluate the performance of our BIST design methods. Although the simplified cost model

More information

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Yojana Jadhav 1, A.P. Hatkar 2 PG Student [VLSI & Embedded system], Dept. of ECE, S.V.I.T Engineering College, Chincholi,

More information

MCM Based FIR Filter Architecture for High Performance

MCM Based FIR Filter Architecture for High Performance ISSN No: 2454-9614 MCM Based FIR Filter Architecture for High Performance R.Gopalana, A.Parameswari * Department Of Electronics and Communication Engineering, Velalar College of Engineering and Technology,

More information

The MorphoSys Parallel Reconfigurable System

The MorphoSys Parallel Reconfigurable System The MorphoSys Parallel Reconfigurable System Guangming Lu 1, Hartej Singh 1,Ming-hauLee 1, Nader Bagherzadeh 1, Fadi Kurdahi 1, and Eliseu M.C. Filho 2 1 Department of Electrical and Computer Engineering

More information

RECENTLY, researches on gigabit wireless personal area

RECENTLY, researches on gigabit wireless personal area 146 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 55, NO. 2, FEBRUARY 2008 An Indexed-Scaling Pipelined FFT Processor for OFDM-Based WPAN Applications Yuan Chen, Student Member, IEEE,

More information

FPGA Matrix Multiplier

FPGA Matrix Multiplier FPGA Matrix Multiplier In Hwan Baek Henri Samueli School of Engineering and Applied Science University of California Los Angeles Los Angeles, California Email: chris.inhwan.baek@gmail.com David Boeck Henri

More information

IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 05, 2016 ISSN (online):

IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 05, 2016 ISSN (online): IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 05, 2016 ISSN (online): 2321-0613 A Reconfigurable and Scalable Architecture for Discrete Cosine Transform Maitra S Aldi

More information

VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier. Guntur(Dt),Pin:522017

VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier. Guntur(Dt),Pin:522017 VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier 1 Katakam Hemalatha,(M.Tech),Email Id: hema.spark2011@gmail.com 2 Kundurthi Ravi Kumar, M.Tech,Email Id: kundurthi.ravikumar@gmail.com

More information

Design and Implementation of Effective Architecture for DCT with Reduced Multipliers

Design and Implementation of Effective Architecture for DCT with Reduced Multipliers Design and Implementation of Effective Architecture for DCT with Reduced Multipliers Susmitha. Remmanapudi & Panguluri Sindhura Dept. of Electronics and Communications Engineering, SVECW Bhimavaram, Andhra

More information

Efficient Implementation of Low Power 2-D DCT Architecture

Efficient Implementation of Low Power 2-D DCT Architecture Vol. 3, Issue. 5, Sep - Oct. 2013 pp-3164-3169 ISSN: 2249-6645 Efficient Implementation of Low Power 2-D DCT Architecture 1 Kalyan Chakravarthy. K, 2 G.V.K.S.Prasad 1 M.Tech student, ECE, AKRG College

More information

Tradeoff Analysis and Architecture Design of High Throughput Irregular LDPC Decoders

Tradeoff Analysis and Architecture Design of High Throughput Irregular LDPC Decoders IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 1, NO. 1, NOVEMBER 2006 1 Tradeoff Analysis and Architecture Design of High Throughput Irregular LDPC Decoders Predrag Radosavljevic, Student

More information

FIR Filter Synthesis Algorithms for Minimizing the Delay and the Number of Adders

FIR Filter Synthesis Algorithms for Minimizing the Delay and the Number of Adders 770 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 48, NO. 8, AUGUST 2001 FIR Filter Synthesis Algorithms for Minimizing the Delay and the Number of Adders Hyeong-Ju

More information

Parallel-computing approach for FFT implementation on digital signal processor (DSP)

Parallel-computing approach for FFT implementation on digital signal processor (DSP) Parallel-computing approach for FFT implementation on digital signal processor (DSP) Yi-Pin Hsu and Shin-Yu Lin Abstract An efficient parallel form in digital signal processor can improve the algorithm

More information

PIPELINE AND VECTOR PROCESSING

PIPELINE AND VECTOR PROCESSING PIPELINE AND VECTOR PROCESSING PIPELINING: Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates

More information

OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER.

OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER. OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER. A.Anusha 1 R.Basavaraju 2 anusha201093@gmail.com 1 basava430@gmail.com 2 1 PG Scholar, VLSI, Bharath Institute of Engineering

More information

THE orthogonal frequency-division multiplex (OFDM)

THE orthogonal frequency-division multiplex (OFDM) 26 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 57, NO. 1, JANUARY 2010 A Generalized Mixed-Radix Algorithm for Memory-Based FFT Processors Chen-Fong Hsiao, Yuan Chen, Member, IEEE,

More information

DIGITAL TELEVISION 1. DIGITAL VIDEO FUNDAMENTALS

DIGITAL TELEVISION 1. DIGITAL VIDEO FUNDAMENTALS DIGITAL TELEVISION 1. DIGITAL VIDEO FUNDAMENTALS Television services in Europe currently broadcast video at a frame rate of 25 Hz. Each frame consists of two interlaced fields, giving a field rate of 50

More information

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO 2402 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 6, JUNE 2016 A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO Antony Xavier Glittas,

More information

AT40K FPGA IP Core AT40K-FFT. Features. Description

AT40K FPGA IP Core AT40K-FFT. Features. Description Features Decimation in frequency radix-2 FFT algorithm. 256-point transform. -bit fixed point arithmetic. Fixed scaling to avoid numeric overflow. Requires no external memory, i.e. uses on chip RAM and

More information

A full-pipelined 2-D IDCT/ IDST VLSI architecture with adaptive block-size for HEVC standard

A full-pipelined 2-D IDCT/ IDST VLSI architecture with adaptive block-size for HEVC standard LETTER IEICE Electronics Express, Vol.10, No.9, 1 11 A full-pipelined 2-D IDCT/ IDST VLSI architecture with adaptive block-size for HEVC standard Hong Liang a), He Weifeng b), Zhu Hui, and Mao Zhigang

More information

IMPLEMENTATION OF AN ADAPTIVE FIR FILTER USING HIGH SPEED DISTRIBUTED ARITHMETIC

IMPLEMENTATION OF AN ADAPTIVE FIR FILTER USING HIGH SPEED DISTRIBUTED ARITHMETIC IMPLEMENTATION OF AN ADAPTIVE FIR FILTER USING HIGH SPEED DISTRIBUTED ARITHMETIC Thangamonikha.A 1, Dr.V.R.Balaji 2 1 PG Scholar, Department OF ECE, 2 Assitant Professor, Department of ECE 1, 2 Sri Krishna

More information

Improving Memory Repair by Selective Row Partitioning

Improving Memory Repair by Selective Row Partitioning 200 24th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems Improving Memory Repair by Selective Row Partitioning Muhammad Tauseef Rab, Asad Amin Bawa, and Nur A. Touba Computer

More information

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST SAKTHIVEL Assistant Professor, Department of ECE, Coimbatore Institute of Engineering and Technology Abstract- FPGA is

More information

EECS150 - Digital Design Lecture 09 - Parallelism

EECS150 - Digital Design Lecture 09 - Parallelism EECS150 - Digital Design Lecture 09 - Parallelism Feb 19, 2013 John Wawrzynek Spring 2013 EECS150 - Lec09-parallel Page 1 Parallelism Parallelism is the act of doing more than one thing at a time. Optimization

More information

Abstract. Literature Survey. Introduction. A.Radix-2/8 FFT algorithm for length qx2 m DFTs

Abstract. Literature Survey. Introduction. A.Radix-2/8 FFT algorithm for length qx2 m DFTs Implementation of Split Radix algorithm for length 6 m DFT using VLSI J.Nancy, PG Scholar,PSNA College of Engineering and Technology; S.Bharath,Assistant Professor,PSNA College of Engineering and Technology;J.Wilson,Assistant

More information

An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary Common Sub-Expression Elimination Algorithm

An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary Common Sub-Expression Elimination Algorithm Volume-6, Issue-6, November-December 2016 International Journal of Engineering and Management Research Page Number: 229-234 An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary

More information

Structure of Computer Systems

Structure of Computer Systems 288 between this new matrix and the initial collision matrix M A, because the original forbidden latencies for functional unit A still have to be considered in later initiations. Figure 5.37. State diagram

More information

ISSN (Online), Volume 1, Special Issue 2(ICITET 15), March 2015 International Journal of Innovative Trends and Emerging Technologies

ISSN (Online), Volume 1, Special Issue 2(ICITET 15), March 2015 International Journal of Innovative Trends and Emerging Technologies VLSI IMPLEMENTATION OF HIGH PERFORMANCE DISTRIBUTED ARITHMETIC (DA) BASED ADAPTIVE FILTER WITH FAST CONVERGENCE FACTOR G. PARTHIBAN 1, P.SATHIYA 2 PG Student, VLSI Design, Department of ECE, Surya Group

More information

Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications

Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications , Vol 7(4S), 34 39, April 204 ISSN (Print): 0974-6846 ISSN (Online) : 0974-5645 Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications B. Vignesh *, K. P. Sridhar

More information

Digital Computer Arithmetic

Digital Computer Arithmetic Digital Computer Arithmetic Part 6 High-Speed Multiplication Soo-Ik Chae Spring 2010 Koren Chap.6.1 Speeding Up Multiplication Multiplication involves 2 basic operations generation of partial products

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

Chapter 4. Combinational Logic

Chapter 4. Combinational Logic Chapter 4. Combinational Logic Tong In Oh 1 4.1 Introduction Combinational logic: Logic gates Output determined from only the present combination of inputs Specified by a set of Boolean functions Sequential

More information

PART A (22 Marks) 2. a) Briefly write about r's complement and (r-1)'s complement. [8] b) Explain any two ways of adding decimal numbers.

PART A (22 Marks) 2. a) Briefly write about r's complement and (r-1)'s complement. [8] b) Explain any two ways of adding decimal numbers. Set No. 1 IV B.Tech I Semester Supplementary Examinations, March - 2017 COMPUTER ARCHITECTURE & ORGANIZATION (Common to Electronics & Communication Engineering and Electronics & Time: 3 hours Max. Marks:

More information

DLD VIDYA SAGAR P. potharajuvidyasagar.wordpress.com. Vignana Bharathi Institute of Technology UNIT 3 DLD P VIDYA SAGAR

DLD VIDYA SAGAR P. potharajuvidyasagar.wordpress.com. Vignana Bharathi Institute of Technology UNIT 3 DLD P VIDYA SAGAR DLD UNIT III Combinational Circuits (CC), Analysis procedure, Design Procedure, Combinational circuit for different code converters and other problems, Binary Adder- Subtractor, Decimal Adder, Binary Multiplier,

More information

Pipelined Fast 2-D DCT Architecture for JPEG Image Compression

Pipelined Fast 2-D DCT Architecture for JPEG Image Compression Pipelined Fast 2-D DCT Architecture for JPEG Image Compression Luciano Volcan Agostini agostini@inf.ufrgs.br Ivan Saraiva Silva* ivan@dimap.ufrn.br *Federal University of Rio Grande do Norte DIMAp - Natal

More information

SAE5C Computer Organization and Architecture. Unit : I - V

SAE5C Computer Organization and Architecture. Unit : I - V SAE5C Computer Organization and Architecture Unit : I - V UNIT-I Evolution of Pentium and Power PC Evolution of Computer Components functions Interconnection Bus Basics of PCI Memory:Characteristics,Hierarchy

More information

Design of 2-D DWT VLSI Architecture for Image Processing

Design of 2-D DWT VLSI Architecture for Image Processing Design of 2-D DWT VLSI Architecture for Image Processing Betsy Jose 1 1 ME VLSI Design student Sri Ramakrishna Engineering College, Coimbatore B. Sathish Kumar 2 2 Assistant Professor, ECE Sri Ramakrishna

More information

Implementation of Two Level DWT VLSI Architecture

Implementation of Two Level DWT VLSI Architecture V. Revathi Tanuja et al Int. Journal of Engineering Research and Applications RESEARCH ARTICLE OPEN ACCESS Implementation of Two Level DWT VLSI Architecture V. Revathi Tanuja*, R V V Krishna ** *(Department

More information

FPGA Implementation of Low Complexity Video Encoder using Optimized 3D-DCT

FPGA Implementation of Low Complexity Video Encoder using Optimized 3D-DCT FPGA Implementation of Low Complexity Video Encoder using Optimized 3D-DCT Rajalekshmi R Embedded Systems Sree Buddha College of Engineering, Pattoor India Arya Lekshmi M Electronics and Communication

More information

Implementation of Lifting-Based Two Dimensional Discrete Wavelet Transform on FPGA Using Pipeline Architecture

Implementation of Lifting-Based Two Dimensional Discrete Wavelet Transform on FPGA Using Pipeline Architecture International Journal of Computer Trends and Technology (IJCTT) volume 5 number 5 Nov 2013 Implementation of Lifting-Based Two Dimensional Discrete Wavelet Transform on FPGA Using Pipeline Architecture

More information

An efficient multiplierless approximation of the fast Fourier transform using sum-of-powers-of-two (SOPOT) coefficients

An efficient multiplierless approximation of the fast Fourier transform using sum-of-powers-of-two (SOPOT) coefficients Title An efficient multiplierless approximation of the fast Fourier transm using sum-of-powers-of-two (SOPOT) coefficients Author(s) Chan, SC; Yiu, PM Citation Ieee Signal Processing Letters, 2002, v.

More information

Matrix Multiplication on an Experimental Parallel System With Hybrid Architecture

Matrix Multiplication on an Experimental Parallel System With Hybrid Architecture Matrix Multiplication on an Experimental Parallel System With Hybrid Architecture SOTIRIOS G. ZIAVRAS and CONSTANTINE N. MANIKOPOULOS Department of Electrical and Computer Engineering New Jersey Institute

More information

An Efficient High Speed VLSI Architecture Based 16-Point Adaptive Split Radix-2 FFT Architecture

An Efficient High Speed VLSI Architecture Based 16-Point Adaptive Split Radix-2 FFT Architecture IJSTE - International Journal of Science Technology & Engineering Volume 2 Issue 10 April 2016 ISSN (online): 2349-784X An Efficient High Speed VLSI Architecture Based 16-Point Adaptive Split Radix-2 FFT

More information

A Image Comparative Study using DCT, Fast Fourier, Wavelet Transforms and Huffman Algorithm

A Image Comparative Study using DCT, Fast Fourier, Wavelet Transforms and Huffman Algorithm International Journal of Engineering Research and General Science Volume 3, Issue 4, July-August, 15 ISSN 91-2730 A Image Comparative Study using DCT, Fast Fourier, Wavelet Transforms and Huffman Algorithm

More information

Fixed Point Streaming Fft Processor For Ofdm

Fixed Point Streaming Fft Processor For Ofdm Fixed Point Streaming Fft Processor For Ofdm Sudhir Kumar Sa Rashmi Panda Aradhana Raju Abstract Fast Fourier Transform (FFT) processors are today one of the most important blocks in communication systems.

More information

A Novel VLSI Architecture for Digital Image Compression using Discrete Cosine Transform and Quantization

A Novel VLSI Architecture for Digital Image Compression using Discrete Cosine Transform and Quantization International Journal of Electronics and Communication Engineering. ISSN 0974-2166 Volume 4, Number 4 (2011), pp. 425-442 International Research Publication House http://www.irphouse.com A Novel VLSI Architecture

More information

Basic Processing Unit: Some Fundamental Concepts, Execution of a. Complete Instruction, Multiple Bus Organization, Hard-wired Control,

Basic Processing Unit: Some Fundamental Concepts, Execution of a. Complete Instruction, Multiple Bus Organization, Hard-wired Control, UNIT - 7 Basic Processing Unit: Some Fundamental Concepts, Execution of a Complete Instruction, Multiple Bus Organization, Hard-wired Control, Microprogrammed Control Page 178 UNIT - 7 BASIC PROCESSING

More information

Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology

Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology Senthil Ganesh R & R. Kalaimathi 1 Assistant Professor, Electronics and Communication Engineering, Info Institute of Engineering,

More information

MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION

MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION Maheshwari.U 1, Josephine Sugan Priya. 2, 1 PG Student, Dept Of Communication Systems Engg, Idhaya Engg. College For Women, 2 Asst Prof, Dept Of Communication

More information

Analysis of Different Multiplication Algorithms & FPGA Implementation

Analysis of Different Multiplication Algorithms & FPGA Implementation IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 4, Issue 2, Ver. I (Mar-Apr. 2014), PP 29-35 e-issn: 2319 4200, p-issn No. : 2319 4197 Analysis of Different Multiplication Algorithms & FPGA

More information

Digital System Design Using Verilog. - Processing Unit Design

Digital System Design Using Verilog. - Processing Unit Design Digital System Design Using Verilog - Processing Unit Design 1.1 CPU BASICS A typical CPU has three major components: (1) Register set, (2) Arithmetic logic unit (ALU), and (3) Control unit (CU) The register

More information

CHAPTER 4. DIGITAL DOWNCONVERTER FOR WiMAX SYSTEM

CHAPTER 4. DIGITAL DOWNCONVERTER FOR WiMAX SYSTEM CHAPTER 4 IMPLEMENTATION OF DIGITAL UPCONVERTER AND DIGITAL DOWNCONVERTER FOR WiMAX SYSTEM 4.1 Introduction FPGAs provide an ideal implementation platform for developing broadband wireless systems such

More information

FAST FOURIER TRANSFORM (FFT) and inverse fast

FAST FOURIER TRANSFORM (FFT) and inverse fast IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 39, NO. 11, NOVEMBER 2004 2005 A Dynamic Scaling FFT Processor for DVB-T Applications Yu-Wei Lin, Hsuan-Yu Liu, and Chen-Yi Lee Abstract This paper presents an

More information

Using Streaming SIMD Extensions in a Fast DCT Algorithm for MPEG Encoding

Using Streaming SIMD Extensions in a Fast DCT Algorithm for MPEG Encoding Using Streaming SIMD Extensions in a Fast DCT Algorithm for MPEG Encoding Version 1.2 01/99 Order Number: 243651-002 02/04/99 Information in this document is provided in connection with Intel products.

More information

ISSN (ONLINE): , VOLUME-3, ISSUE-1,

ISSN (ONLINE): , VOLUME-3, ISSUE-1, PERFORMANCE ANALYSIS OF LOSSLESS COMPRESSION TECHNIQUES TO INVESTIGATE THE OPTIMUM IMAGE COMPRESSION TECHNIQUE Dr. S. Swapna Rani Associate Professor, ECE Department M.V.S.R Engineering College, Nadergul,

More information

Efficient Radix-4 and Radix-8 Butterfly Elements

Efficient Radix-4 and Radix-8 Butterfly Elements Efficient Radix4 and Radix8 Butterfly Elements Weidong Li and Lars Wanhammar Electronics Systems, Department of Electrical Engineering Linköping University, SE581 83 Linköping, Sweden Tel.: +46 13 28 {1721,

More information

Memory System Design. Outline

Memory System Design. Outline Memory System Design Chapter 16 S. Dandamudi Outline Introduction A simple memory block Memory design with D flip flops Problems with the design Techniques to connect to a bus Using multiplexers Using

More information

RISC IMPLEMENTATION OF OPTIMAL PROGRAMMABLE DIGITAL IIR FILTER

RISC IMPLEMENTATION OF OPTIMAL PROGRAMMABLE DIGITAL IIR FILTER RISC IMPLEMENTATION OF OPTIMAL PROGRAMMABLE DIGITAL IIR FILTER Miss. Sushma kumari IES COLLEGE OF ENGINEERING, BHOPAL MADHYA PRADESH Mr. Ashish Raghuwanshi(Assist. Prof.) IES COLLEGE OF ENGINEERING, BHOPAL

More information

A Reduced Routing Network Architecture for Partial Parallel LDPC decoders

A Reduced Routing Network Architecture for Partial Parallel LDPC decoders A Reduced Routing Network Architecture for Partial Parallel LDPC decoders By HOUSHMAND SHIRANI MEHR B.S. (Sharif University of Technology) July, 2009 THESIS Submitted in partial satisfaction of the requirements

More information

Low Power VLSI Implementation of the DCT on Single

Low Power VLSI Implementation of the DCT on Single VLSI DESIGN 2000, Vol. 11, No. 4, pp. 397-403 Reprints available directly from the publisher Photocopying permitted by license only (C) 2000 OPA (Overseas Publishers Association) N.V. Published by license

More information

High Speed Special Function Unit for Graphics Processing Unit

High Speed Special Function Unit for Graphics Processing Unit High Speed Special Function Unit for Graphics Processing Unit Abd-Elrahman G. Qoutb 1, Abdullah M. El-Gunidy 1, Mohammed F. Tolba 1, and Magdy A. El-Moursy 2 1 Electrical Engineering Department, Fayoum

More information

STUDY OF A CORDIC BASED RADIX-4 FFT PROCESSOR

STUDY OF A CORDIC BASED RADIX-4 FFT PROCESSOR STUDY OF A CORDIC BASED RADIX-4 FFT PROCESSOR 1 AJAY S. PADEKAR, 2 S. S. BELSARE 1 BVDU, College of Engineering, Pune, India 2 Department of E & TC, BVDU, College of Engineering, Pune, India E-mail: ajay.padekar@gmail.com,

More information

Area and Power efficient MST core supported video codec using CSDA

Area and Power efficient MST core supported video codec using CSDA International Journal of Science, Engineering and Technology Research (IJSETR), Volume 4, Issue 6, June 0 Area and Power efficient MST core supported video codec using A B.Sutha Sivakumari*, B.Mohan**

More information

A General Sign Bit Error Correction Scheme for Approximate Adders

A General Sign Bit Error Correction Scheme for Approximate Adders A General Sign Bit Error Correction Scheme for Approximate Adders Rui Zhou and Weikang Qian University of Michigan-Shanghai Jiao Tong University Joint Institute Shanghai Jiao Tong University, Shanghai,

More information

AMONG various transform techniques for image compression,

AMONG various transform techniques for image compression, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 7, NO. 3, JUNE 1997 459 A Cost-Effective Architecture for 8 8 Two-Dimensional DCT/IDCT Using Direct Method Yung-Pin Lee, Student Member,

More information

Vertex Shader Design I

Vertex Shader Design I The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only

More information

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal

More information

MRT based Fixed Block size Transform Coding

MRT based Fixed Block size Transform Coding 3 MRT based Fixed Block size Transform Coding Contents 3.1 Transform Coding..64 3.1.1 Transform Selection...65 3.1.2 Sub-image size selection... 66 3.1.3 Bit Allocation.....67 3.2 Transform coding using

More information

END-TERM EXAMINATION

END-TERM EXAMINATION (Please Write your Exam Roll No. immediately) END-TERM EXAMINATION DECEMBER 2006 Exam. Roll No... Exam Series code: 100919DEC06200963 Paper Code: MCA-103 Subject: Digital Electronics Time: 3 Hours Maximum

More information