Design and Implementation of Effective Architecture for DCT with Reduced Multipliers

Susmitha. Remmanapudi & Panguluri Sindhura
Dept. of Electronics and Communications Engineering, SVECW, Bhimavaram, Andhra Pradesh, India
E-mail : susmitha.in@gmail.com, sindhupanguluri275@gmail.com

Abstract
One of the most important operations in digital signal and image processing is the 2-D Discrete Cosine Transform (DCT). The main objective of this paper is to explore an architecture that optimizes one or both of two design constraints, hardware and power, so that it can be selected as the best fit for a given requirement. The 2-D DCT is implemented by row-column decomposition using the proposed 1-D DCT architecture. The architecture is designed in Verilog, synthesized using Xilinx tools and implemented on an FPGA. The comparison results indicate considerable power and hardware savings for the presented architecture relative to the systolic architecture.

Keywords: Discrete Cosine Transform, floating point multiplication, floating point addition, systolic.

I. INTRODUCTION
The discrete cosine transform (DCT), proposed by Ahmed et al. in 1974 [1], has become an increasingly important tool for image, audio and video signal processing applications due to its utility and its adoption in standards such as Joint Photographic Experts Group (JPEG), Moving Picture Experts Group (MPEG), and CCITT H.261 [2]-[4]. The DCT is a computation-intensive operation, and its direct implementation requires a large number of adders and multipliers. The conventional approach to the 2-D DCT is the row-column method. This method requires 2N 1-D DCTs to compute an N x N DCT, together with a complex matrix transposition architecture, which increases both the computational complexity and the chip area.
On the other hand, designing the DCT processor with a polynomial approach [5, 6] reduces the order of computation and the number of adders and multipliers, so a considerable area reduction can be achieved. The DCT has a very good energy packing property, meaning that it carries most of the information in a small number of coefficients, and, as it is the real part of the DFT, its computational complexity is also lower. Because of these two properties, the DCT is preferred over the DFT.

We also introduce the floating point arithmetic operations needed for the implementation, and a 32-bit floating point adder and multiplier are implemented.

A systolic array is an arrangement of processors in which data flows synchronously across the array between neighbors, usually with different data flowing in different directions. In computer architecture, a systolic array is a pipelined network of processing units called cells. It is a specialized form of parallel computing in which the cells (i.e. processors) compute data and store it independently of each other. In this paper, the performance of the systolic architecture is compared with that of the proposed architecture for the DCT. The ISim M6.3c simulator is used for simulation, the Xilinx ISE synthesis tool for synthesis, and the Xilinx Spartan 3E as the target platform.

This paper is organized as follows. Section II gives the background of the Discrete Cosine Transform. Section III covers the calculation of the DCT and the reduced-multiplier case. The proposed architecture is explained in Section IV. The systolic architecture is described in Section V. The results and the final comparison are given in Section VI, and conclusions are drawn in the last section.
II. 2D DCT
The Discrete Cosine Transform (DCT) is a computation-intensive algorithm with many electronic applications [7]. The DCT transforms information from the time or space domain to the frequency domain to provide compact representation, fast transmission, memory savings and so on. The DCT algorithm is very effective due to its symmetry and simplicity [8]. For 2-D data $X(i,j)$, $0 \le i \le 7$ and $0 \le j \le 7$, the 8x8 2-D DCT is given by

$$F(u,v) = \frac{2}{8}\, C(u)\, C(v) \sum_{i=0}^{7} \sum_{j=0}^{7} X(i,j)\, \cos\frac{(2i+1)u\pi}{16}\, \cos\frac{(2j+1)v\pi}{16} \qquad (1)$$

where $0 \le u \le 7$, $0 \le v \le 7$, and $C(u), C(v) = 1/\sqrt{2}$ for $u, v = 0$, $C(u), C(v) = 1$ otherwise.

The implementation cost is reduced by decomposing (1) into two 8x1 1-D DCTs, each given by

$$F(u) = \frac{1}{2}\, C(u) \sum_{i=0}^{7} X(i)\, \cos\frac{(2i+1)u\pi}{16} \qquad (2)$$

III. IMPLEMENTATION OF 1D DCT
For the 2-D DCT of 8x8 2-D data, a row-wise 8x1 1-D DCT is first taken for all rows, followed by a column-wise 8x1 1-D DCT for all columns. Intermediate results of the 1-D DCT are stored in a transposition memory. Following [9], equation (2) can be simplified as

$$F(0) = (X_0 + X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + X_7)\,P \qquad (2a)$$
$$F(1) = (X_0 - X_7)\,A + (X_1 - X_6)\,B + (X_2 - X_5)\,C + (X_3 - X_4)\,D \qquad (2b)$$
$$F(2) = (X_0 - X_3 - X_4 + X_7)\,M + (X_1 - X_2 - X_5 + X_6)\,N \qquad (2c)$$
$$F(3) = (X_0 - X_7)\,B - (X_1 - X_6)\,D - (X_2 - X_5)\,A - (X_3 - X_4)\,C \qquad (2d)$$
$$F(4) = (X_0 - X_1 - X_2 + X_3 + X_4 - X_5 - X_6 + X_7)\,P \qquad (2e)$$
$$F(5) = (X_0 - X_7)\,C - (X_1 - X_6)\,A + (X_2 - X_5)\,D + (X_3 - X_4)\,B \qquad (2f)$$
$$F(6) = (X_0 - X_3 - X_4 + X_7)\,N - (X_1 - X_2 - X_5 + X_6)\,M \qquad (2g)$$
$$F(7) = (X_0 - X_7)\,D - (X_1 - X_6)\,C + (X_2 - X_5)\,B - (X_3 - X_4)\,A \qquad (2h)$$

where

$$M = \tfrac{1}{2}\cos\frac{\pi}{8}, \quad N = \tfrac{1}{2}\cos\frac{3\pi}{8}, \quad P = \tfrac{1}{2}\cos\frac{\pi}{4},$$
$$A = \tfrac{1}{2}\cos\frac{\pi}{16}, \quad B = \tfrac{1}{2}\cos\frac{3\pi}{16}, \quad C = \tfrac{1}{2}\cos\frac{5\pi}{16}, \quad D = \tfrac{1}{2}\cos\frac{7\pi}{16}.$$

The input combinations appearing in (2a) to (2h) can be collected as

$$a_1 = X_0 + X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + X_7, \quad a_2 = X_0 - X_1 - X_2 + X_3 + X_4 - X_5 - X_6 + X_7,$$
$$b_1 = X_0 - X_7, \quad b_2 = X_1 - X_6, \quad b_3 = X_2 - X_5, \quad b_4 = X_3 - X_4,$$
$$c_1 = X_0 - X_3 - X_4 + X_7, \quad c_2 = X_1 - X_2 - X_5 + X_6.$$

Using these coefficients, equations (2a) to (2h) become

$$F(0) = a_1 P, \qquad F(4) = a_2 P,$$
$$F(1) = b_1 A + b_2 B + b_3 C + b_4 D,$$
$$F(3) = b_1 B - b_2 D - b_3 A - b_4 C,$$
$$F(5) = b_1 C - b_2 A + b_3 D + b_4 B,$$
$$F(7) = b_1 D - b_2 C + b_3 B - b_4 A,$$
$$F(2) = c_1 M + c_2 N, \qquad F(6) = c_1 N - c_2 M.$$

A. IEEE Floating Point Representation
The IEEE single-precision floating point format is a binary number format that occupies 4 bytes (32 bits) in computer memory. In IEEE 754-2008 the 32-bit base-2 format is officially referred to as binary32; it was called single in IEEE 754-1985. The sign bit determines the sign of the number, which is also the sign of the significand. The exponent is either viewed as an 8-bit signed integer from -127 to 128 or stored as an 8-bit unsigned integer from 0 to 255, which is the biased form accepted in the IEEE 754 binary32 definition. In this biased form, a stored exponent value of 127 represents an actual exponent of zero. The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit (to the left of the binary point) with value 1, unless the exponent is stored with all zeros.
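As a worked illustration of the field layout described above, the value of a normalized binary32 word can be written in closed form. The symbols $s$, $e$ and $f$ are introduced here only for convenience and are not the paper's notation: $s$ is the sign bit, $e$ the stored (biased) exponent with $1 \le e \le 254$, and $f$ the 23-bit fraction field read as an unsigned integer.

$$x = (-1)^{s} \times 2^{\,e-127} \times \left(1 + \frac{f}{2^{23}}\right)$$

For example, $s = 0$, $e = 126$, $f = 0$ gives $x = 2^{-1} \times 1 = 0.5$, and $s = 1$, $e = 127$, $f = 2^{22}$ gives $x = -1 \times 2^{0} \times 1.5 = -1.5$.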
B. Algorithm for multiplier block
Two 32-bit binary numbers are multiplied using the floating-point multiplication algorithm. The algorithm takes two 32-bit binary numbers as inputs and produces one 32-bit binary number as output. There are three steps in multiplying two 32-bit binary numbers:
Step 1: Calculation of the resultant sign bit.
Step 2: Calculation of the resultant exponent bits.
Step 3: Calculation of the resultant floating point bits (fractional part).

Fig. 1 : Flowchart for IEEE 754 floating point multiplication

C. Algorithm for Adder Block
Two 32-bit binary numbers are added using the floating point addition algorithm:
Step 1: Compare the exponents of the two numbers and calculate the absolute value of the difference between the two exponents. Take the larger exponent as the tentative exponent of the result.
Step 2: Shift the significand of the number with the smaller exponent right by a number of bit positions equal to the exponent difference. Two of the shifted-out bits of the aligned significand are retained as the guard (G) and round (R) bits, so for a p-bit significand the effective width of the aligned significand must be (p + 2) bits. Append a third bit, the sticky bit (S), at the right end of the aligned significand. The sticky bit is the logical OR of all shifted-out bits.
Step 3: Add the two signed-magnitude significands using a (p + 3)-bit adder. Let the result be SUM.
Step 4: Check SUM for a carry-out (C_out) from the MSB position during addition. If a carry-out is detected, shift SUM right by one bit position and increment the tentative exponent by 1. Evaluate exception conditions, if any.
Step 5: Round the result if the logical condition R(M0 + S) is true, where M0 and R represent the p-th and (p + 1)-th bits from the left end of the normalized significand. The new sticky bit (S) is the logical OR of all bits to the right of the R bit. If the rounding condition is true, a 1 is added at the p-th bit (from the left side) of the normalized significand. If the p MSBs of the normalized significand are all 1s, rounding can generate a carry-out. In that case, normalization (Step 4) has to be done again.

Fig. 2 : Flowchart for IEEE 754 floating point addition
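The multiplier and adder steps listed above can be sketched in software as follows. This is a minimal behavioural model in Python, not the Verilog design of this paper; the function names (`fp32_mul`, `fp32_add`, `to_bits`, `from_bits`) are hypothetical, only normal (non-zero, finite) operands are handled, and round-to-nearest-even is assumed as the rounding mode.

```python
import struct

BIAS = 127  # binary32 exponent bias


def to_bits(x):
    """Round a Python float to binary32 and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def from_bits(w):
    """Interpret a 32-bit pattern as a binary32 value."""
    return struct.unpack(">f", struct.pack(">I", w))[0]


def _split(w):
    """Return (sign, biased exponent, 24-bit significand) of a normal word."""
    sign, exp, frac = w >> 31, (w >> 23) & 0xFF, w & 0x7FFFFF
    if exp in (0, 255):
        raise ValueError("this sketch handles normal operands only")
    return sign, exp, frac | 0x800000  # restore the implicit leading 1


def _pack(sign, exp, sig):
    if not 0 < exp < 255:
        raise ValueError("result is outside the normal range")
    return (sign << 31) | (exp << 23) | (sig & 0x7FFFFF)


def fp32_mul(a, b):
    sa, ea, ma = _split(a)
    sb, eb, mb = _split(b)
    sign = sa ^ sb                       # Step 1: sign bit
    exp = ea + eb - BIAS                 # Step 2: exponent bits
    prod = ma * mb                       # Step 3: 48-bit significand product
    shift = 24 if prod >> 47 else 23     # product in [2, 4) needs one extra shift
    exp += shift - 23
    sig = prod >> shift
    rem, half = prod & ((1 << shift) - 1), 1 << (shift - 1)
    if rem > half or (rem == half and sig & 1):   # round to nearest even
        sig += 1
        if sig >> 24:                    # rounding carried out: renormalize
            sig >>= 1
            exp += 1
    return _pack(sign, exp, sig)


def fp32_add(a, b):
    if (a & 0x7FFFFFFF) < (b & 0x7FFFFFFF):
        a, b = b, a                      # make a the larger magnitude
    sa, ea, ma = _split(a)
    sb, eb, mb = _split(b)
    ma <<= 3                             # room for guard, round, sticky bits
    mb <<= 3
    d = ea - eb                          # Step 1: exponent difference
    sticky = 1 if mb & ((1 << d) - 1) else 0
    mb = (mb >> d) | sticky              # Step 2: align, keep G, R and S
    total = ma + mb if sa == sb else ma - mb      # Step 3: add magnitudes
    if total == 0:
        return 0                         # exact cancellation gives +0
    exp = ea
    if total >> 27:                      # Step 4: carry-out, shift right
        total = (total >> 1) | (total & 1)
        exp += 1
    while not total >> 26:               # left-normalize after cancellation
        total <<= 1
        exp -= 1
    sig, grs = total >> 3, total & 7     # Step 5: round to nearest even
    if grs > 4 or (grs == 4 and sig & 1):
        sig += 1
        if sig >> 24:
            sig >>= 1
            exp += 1
    return _pack(sa, exp, sig)
```

For instance, `from_bits(fp32_add(to_bits(1.0), to_bits(-0.75)))` evaluates to `0.25`. The left-normalization loop covers effective subtraction, which Step 4 above does not spell out; it is included here as an assumption so that operands of opposite sign are handled.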
IV. PROPOSED 1D DCT ARCHITECTURE
The DCT is a computation-intensive operation. Calculating the 1-D DCT directly requires 56 multiplications and 56 additions/subtractions. The manual simplification of Section III reduces this to 22 multiplications in the architecture described above, and the proposed architecture reduces it further to 10 multiplications. The proposed architecture is shown in Fig. 3. The architecture above uses seven constant terms, whereas the proposed architecture uses only four. It uses the same floating point multiplication and addition blocks as the architecture above. In all, the proposed architecture has 28 additions/subtractions and 10 multiplications (by C1, C3, etc.).

Fig. 3 : Proposed DCT architecture

V. SYSTOLIC ARCHITECTURE FOR DCT
The N-point discrete transform is decomposed into even- and odd-numbered frequency samples, which are computed independently and at the same time, so parallel processing is possible. The unified systolic array architecture can compute the DCT by defining the coefficient values specific to each transform. The basic cell shown in Fig. 4 requires two real multiplications and two real additions, which are computed at the same time, so it needs a single clock cycle.

Fig. 4 : Basic cell

Fig. 5 : Systolic architecture using basic cells

In the architecture for computing the eight-point DCT shown in Fig. 5, the input data sequence is fed into the PU from left to right, whereas the DCT coefficient values are stored in the PEs. The systolic array can also be used to compute the DST and DHT by changing the kernel values in the registers of the PEs. The unified systolic array shown in Fig. 5 requires $N^2/4$ basic cells for the N-point transform, and two real multipliers are needed in each PE.

VI. RESULTS
For the floating point blocks of Fig. 1 and Fig. 2, consider the two fractional numbers 0.3535 and 0.707.
In Fig. 6 and Fig. 7, the two inputs, represented in IEEE 754 32-bit floating point format, are
in1 = 00111110101101001111110111110100
in2 = 00111111001101001111110111110100
respectively. The corresponding floating point addition gives the result
out = 00111111000011110111110011101110
In the same way, the inputs in1 and in2 are applied to the floating point multiplier:
in1 = 00111110101101001111110111110100
in2 = 00111111001101001111110111110100
The corresponding floating point multiplication gives the result
out = 00111110011111111111011000011011
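The words listed above can be inspected in software. The following Python sketch, which is not part of the simulated design, splits each 32-bit string into its sign, stored exponent and fraction fields and decodes it to a decimal value; the helper names (`fields`, `bits_to_float`, `float_to_bits`) are hypothetical and the bit strings are copied from the text.

```python
import struct


def bits_to_float(bits):
    """Decode a 32-character bit string as an IEEE 754 binary32 value."""
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]


def float_to_bits(x):
    """Round a Python float to binary32 and return it as a bit string."""
    return format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")


def fields(bits):
    """Return (sign, stored exponent, fraction field) of a binary32 word."""
    return int(bits[0]), int(bits[1:9], 2), int(bits[9:], 2)


# Words quoted in Section VI.
IN1 = "00111110101101001111110111110100"
IN2 = "00111111001101001111110111110100"
ADD_OUT = "00111111000011110111110011101110"
MUL_OUT = "00111110011111111111011000011011"

if __name__ == "__main__":
    for name, word in [("in1", IN1), ("in2", IN2),
                       ("add out", ADD_OUT), ("mul out", MUL_OUT)]:
        print(name, fields(word), bits_to_float(word))
    # Software reference values for the same operands, for comparison.
    a, b = bits_to_float(IN1), bits_to_float(IN2)
    print("software sum    ", float_to_bits(a + b), a + b)
    print("software product", float_to_bits(a * b), a * b)
```

Decoding `in1` and `in2` this way recovers the operands 0.3535 and 0.707 to within single-precision rounding, and printing the software sum and product next to the decoded `out` words gives a quick cross-check of a simulated result.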
The simulation results for the floating point addition/subtraction and multiplication are shown in Fig. 6 and Fig. 7.

Fig. 6 : Simulation results of floating point addition

Fig. 7 : Simulation results of floating point multiplication

Fig. 8 shows the simulation result for the DCT with reduced multipliers. Here the 8-bit input is also represented in the 32-bit floating point format, and the IEEE 754 floating point addition and multiplication are used to obtain the desired output.

Fig. 8 : Simulation results for DCT with 22 multipliers

Fig. 9 shows the simulation results for the proposed DCT architecture. Here the input is taken in 8-bit format and the output is obtained in 32-bit format.

Fig. 9 : Simulation results for the proposed DCT architecture

Fig. 10 presents the simulation results for the implementation of the DCT using the systolic architecture.

Fig. 10 : Simulation result for systolic architecture

The synthesis reports for the systolic and proposed architectures were obtained using the Xilinx ISE synthesis tool and are shown in Fig. 11 and Fig. 12. The performance of both architectures was evaluated using the ISim simulator. Since the systolic architecture is implemented for a 1-bit 1-D DCT, its synthesis results are as shown in Fig. 11. As the proposed architecture deals with an 8-bit 1-D DCT, its synthesis results are shown in Fig. 12. If the systolic architecture were implemented for an 8-bit DCT, its hardware and power utilization would exceed those of the proposed architecture. The 2-D DCT is obtained by applying the 1-D DCT along the rows and then along the columns (row-column decomposition).
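The row-column decomposition and the 22-multiplication grouping of equations (2a) to (2h) can be checked numerically with a short software model. The Python sketch below is illustrative only and is not the hardware description used in this work: the function names are hypothetical, double-precision arithmetic replaces the 32-bit floating point blocks, and the constants A, B, C, D, M, N and P are the half-cosines defined in Section III.

```python
import math

# Constants of Section III: half-cosines of multiples of pi/16.
A, B, C, D = (0.5 * math.cos(k * math.pi / 16) for k in (1, 3, 5, 7))
M = 0.5 * math.cos(math.pi / 8)
N = 0.5 * math.cos(3 * math.pi / 8)
P = 0.5 * math.cos(math.pi / 4)


def dct8_direct(x):
    """8-point 1-D DCT evaluated term by term from equation (2)."""
    out = []
    for u in range(8):
        cu = 1 / math.sqrt(2) if u == 0 else 1.0
        acc = sum(x[i] * math.cos((2 * i + 1) * u * math.pi / 16)
                  for i in range(8))
        out.append(0.5 * cu * acc)
    return out


def dct8_fast(x):
    """8-point 1-D DCT from the grouped sums, using 22 multiplications."""
    x0, x1, x2, x3, x4, x5, x6, x7 = x
    a1 = x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7
    a2 = x0 - x1 - x2 + x3 + x4 - x5 - x6 + x7
    b1, b2, b3, b4 = x0 - x7, x1 - x6, x2 - x5, x3 - x4
    c1 = x0 - x3 - x4 + x7
    c2 = x1 - x2 - x5 + x6
    return [
        a1 * P,                                   # F(0)
        b1 * A + b2 * B + b3 * C + b4 * D,        # F(1)
        c1 * M + c2 * N,                          # F(2)
        b1 * B - b2 * D - b3 * A - b4 * C,        # F(3)
        a2 * P,                                   # F(4)
        b1 * C - b2 * A + b3 * D + b4 * B,        # F(5)
        c1 * N - c2 * M,                          # F(6)
        b1 * D - b2 * C + b3 * B - b4 * A,        # F(7)
    ]


def dct2d(block, dct1d=dct8_fast):
    """8x8 2-D DCT by row-column decomposition; returns out[u][v]."""
    rows = [dct1d(list(r)) for r in block]           # 1-D DCT of every row
    cols = [dct1d(list(c)) for c in zip(*rows)]      # 1-D DCT of every column
    return [list(r) for r in zip(*cols)]             # transpose back
```

As a usage example, `dct8_fast([1.0] * 8)` returns `8 * P` in the first position and zeros elsewhere, since a constant input has only a DC component. The transposition performed by `zip(*rows)` plays the role of the transposition memory mentioned in Section III.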
Fig. 11 : Device Utilization Summary for systolic architecture

Fig. 12 : Device Utilization Summary for proposed architecture

VII. CONCLUSION
For a VLSI or parallel hardware implementation of an algorithm, reducing the number of multipliers is very important because they occupy a large area of the chip. Regularity and modularity of the computation structure and the complexity of the data access scheme are also important considerations. In this context, the architecture proposed in this paper reduces the number of multipliers from 56 to 10. A comparison of the different architectures is also presented.

VIII. REFERENCES
[1] M. P. Leong and Philip H. W. Leong, "A Variable-Radix Digit-Serial Design Methodology and its Application to the Discrete Cosine Transform," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 11, No. 1, February 2003.
[2] Clay Gloster, Jr., Wanda Gay, Michaela Amoo, and Mohamed Chouikha, "Optimizing the Design of a Configurable Digital Signal Processor for Accelerated Execution of the 2-D Discrete Cosine Transform," Proceedings of the 39th Hawaii International Conference on System Sciences, 2006.
[3] Cai Ken, Liang Xiaoying, Liu Chuanju, "SOPC based flexible architecture for JPEG encoder," Proceedings of 2009 4th International Conference on Computer Science & Education, 2009.
[4] Thuyen Le and Manfred Glesner, "Flexible Architectures for DCT of Variable-Length Targeting Shape-Adaptive Transform," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 10, No. 8, December 2000.
[5] Nam Ik Cho, Sang Uk Lee, "Fast Algorithm and Implementation of 2-D Discrete Cosine Transform," IEEE Transactions on Circuits and Systems, Vol. 38, No. 3, March 1991.
[6] M. Vetterli, "Fast 2-D Discrete Cosine Transform," in Proc. ICASSP 85, Mar. 1985.
[7] Peng Chungan, Cao Xixin, Yu Dunshan, Zhang Xing, "A 250MHz optimized distributed architecture of 2D 8x8 DCT," 7th International Conference on ASIC, pp. 189-192, Oct. 2007.
[8] Roger Endrigo Carvalho Porto, Luciano Volcan Agostini, "Project Space Exploration on the 2-D DCT Architecture of a JPEG Compressor Directed to FPGA implementation," IEEE, 2004.
[9] VijayaPrakash and K. S. Gurumurthy, "A Novel VLSI Architecture for Digital Image Compression Using Discrete Cosine Transform and Quantization," IJCSNS, September 2010.