Design and Implementation of Effective Architecture for DCT with Reduced Multipliers

Susmitha Remmanapudi & Panguluri Sindhura
Dept. of Electronics and Communications Engineering, SVECW, Bhimavaram, Andhra Pradesh, India
E-mail: susmitha.in@gmail.com, sindhupanguluri275@gmail.com

Abstract: One of the most important operations in digital signal and image processing is the 2-D Discrete Cosine Transform. The main objective of this paper is to explore an architecture that optimizes one or both of the given constraints (hardware and power) and is best suited to the requirement. The 2-D DCT is implemented using row-column decomposition with the proposed 1-D DCT architecture. The architecture is designed in Verilog, synthesized using Xilinx tools, and implemented on an FPGA. The comparison results indicate considerable power and hardware savings in the presented architecture when compared with the systolic architecture.

Keywords: Discrete Cosine Transform, floating point multiplication, floating point addition, systolic.

I. INTRODUCTION

The discrete cosine transform (DCT), proposed by Ahmed et al. in 1974 [1], has become an increasingly important tool for image, audio filter, and video signal processing applications due to its utility and its adoption in standards such as Joint Photographic Experts Group (JPEG), Moving Picture Experts Group (MPEG), and CCITT H.261 [2]-[4]. The DCT is a computation-intensive operation, and its direct implementation requires a large number of adders and multipliers. The conventional approach to the 2-D DCT is the row-column method. This method requires 2N 1-D DCTs for the computation of an N×N DCT, together with a complex matrix transposition architecture, which increases the computational complexity as well as the area of the chip.

On the other hand, designing the DCT processor with the polynomial approach [5, 6] reduces the order of computation as well as the number of adders and multipliers used in the DCT processor, so a considerable area reduction can be achieved. The DCT has a very good energy packing property, meaning that it holds most of the information in a small number of coefficients. Since it also corresponds to the real part of the DFT, its computational complexity is lower. Because of these two properties, the DCT is preferred over the DFT. We also introduce the floating point arithmetic operations that are necessary for the implementation, and a 32-bit floating point adder and multiplier are implemented.

A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbors, usually with different data flowing in different directions. In computer architecture, a systolic array is a pipe network arrangement of processing units called cells. It is a specialized form of parallel computing, where cells (i.e. processors) compute data and store it independently of each other. In this paper, the performance of the systolic architecture is compared with that of the proposed architecture for the DCT. The ISim M6.3c simulator is used for simulation, the Xilinx ISE synthesis tool is used for synthesis, and a Xilinx Spartan 3E is the target platform.

This paper is organized as follows. Section II deals with the background of the Discrete Cosine Transform. Section III covers the calculations of the DCT and the reduced-multiplier case. The proposed architecture is explained in Section IV. The systolic architecture is explained in Section V. The results and the final comparison are given in Section VI, and conclusions are drawn in the last section.

II. 2D DCT

The Discrete Cosine Transform (DCT) is a computation-intensive algorithm with many electronic applications [7]. The DCT transforms information from the time or space domain to the frequency domain to provide compact representation, fast transmission, memory savings, and so on. The DCT algorithm is very effective due to its symmetry and simplicity [8]. For 2-D data $X(i,j)$, $0 \le i \le 7$ and $0 \le j \le 7$, the 8x8 2-D DCT is given by

$$F(u,v) = \frac{2}{8}\, C(u)\, C(v) \sum_{i=0}^{7} \sum_{j=0}^{7} X(i,j) \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16} \tag{1}$$

where $0 \le u \le 7$, $0 \le v \le 7$, and $C(u), C(v) = 1/\sqrt{2}$ for $u, v = 0$ and $C(u), C(v) = 1$ otherwise.

The implementation cost is reduced by decomposing (1) into two 8x1 1-D DCTs given by

$$F(u) = \frac{1}{2}\, C(u) \sum_{i=0}^{7} X(i) \cos\frac{(2i+1)u\pi}{16} \tag{2}$$

III. IMPLEMENTATION OF 1D DCT

For the 2-D DCT computation of 8x8 2-D data, a row-wise 8x1 1-D DCT is first taken for all rows, followed by a column-wise 8x1 1-D DCT for all columns. Intermediate results of the 1-D DCT are stored in a transposition memory. From [9], equation (2) can be simplified as

$$F(0) = (X_0 + X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + X_7)\,P \tag{2a}$$
$$F(1) = (X_0 - X_7)A + (X_1 - X_6)B + (X_2 - X_5)C + (X_3 - X_4)D \tag{2b}$$
$$F(2) = (X_0 - X_3 - X_4 + X_7)M + (X_1 - X_2 - X_5 + X_6)N \tag{2c}$$
$$F(3) = (X_0 - X_7)B - (X_1 - X_6)D - (X_2 - X_5)A - (X_3 - X_4)C \tag{2d}$$
$$F(4) = (X_0 - X_1 - X_2 + X_3 + X_4 - X_5 - X_6 + X_7)\,P \tag{2e}$$
$$F(5) = (X_0 - X_7)C - (X_1 - X_6)A + (X_2 - X_5)D + (X_3 - X_4)B \tag{2f}$$
$$F(6) = (X_0 - X_3 - X_4 + X_7)N - (X_1 - X_2 - X_5 + X_6)M \tag{2g}$$
$$F(7) = (X_0 - X_7)D - (X_1 - X_6)C + (X_2 - X_5)B - (X_3 - X_4)A \tag{2h}$$

where

$$M = \tfrac{1}{2}\cos\tfrac{\pi}{8}, \quad N = \tfrac{1}{2}\cos\tfrac{3\pi}{8}, \quad P = \tfrac{1}{2}\cos\tfrac{\pi}{4},$$
$$A = \tfrac{1}{2}\cos\tfrac{\pi}{16}, \quad B = \tfrac{1}{2}\cos\tfrac{3\pi}{16}, \quad C = \tfrac{1}{2}\cos\tfrac{5\pi}{16}, \quad D = \tfrac{1}{2}\cos\tfrac{7\pi}{16}.$$

The minus signs are those obtained by expanding equation (2). The common terms in equations (2a) to (2h) can be collected as

$$a_1 = X_0 + X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + X_7, \quad a_2 = X_0 - X_1 - X_2 + X_3 + X_4 - X_5 - X_6 + X_7,$$
$$b_1 = X_0 - X_7, \quad b_2 = X_1 - X_6, \quad b_3 = X_2 - X_5, \quad b_4 = X_3 - X_4,$$
$$c_1 = X_0 - X_3 - X_4 + X_7, \quad c_2 = X_1 - X_2 - X_5 + X_6.$$

Using these terms, equations (2a) to (2h) become

$$F(0) = a_1 P, \quad F(4) = a_2 P,$$
$$F(1) = b_1 A + b_2 B + b_3 C + b_4 D,$$
$$F(3) = b_1 B - b_2 D - b_3 A - b_4 C,$$
$$F(5) = b_1 C - b_2 A + b_3 D + b_4 B,$$
$$F(7) = b_1 D - b_2 C + b_3 B - b_4 A,$$
$$F(2) = c_1 M + c_2 N, \quad F(6) = c_1 N - c_2 M.$$

A. IEEE Floating Point Representation

The IEEE single-precision floating point format is a binary computing format that occupies 4 bytes (32 bits) in computer memory. In IEEE 754-2008 the 32-bit base-2 format is officially referred to as binary32. It was called single in IEEE 754-1985. The sign bit determines the sign of the number, which is the sign of the significand as well. The exponent is either an 8-bit signed integer from −127 to 128 or an 8-bit unsigned integer from 0 to 255, which is the accepted biased form in the IEEE 754 binary32 definition. In the biased form, a stored exponent value of 127 represents an actual exponent of zero. The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit (to the left of the binary point) with value 1 unless the exponent is stored with all zeros.
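As a numerical check of the factorization in Section III, the following Python sketch evaluates the 8-point 1-D DCT twice: directly from equation (2), and through the intermediate terms a1, a2, b1 to b4, c1, c2 with the constants A, B, C, D, M, N, P. The function names are illustrative and do not come from the paper, and the minus signs are the ones obtained by expanding equation (2). The factored form uses 2 + 16 + 4 = 22 multiplications.

```python
import math

# Constants of Section III: half-cosines of multiples of pi/16.
M = 0.5 * math.cos(math.pi / 8)
N = 0.5 * math.cos(3 * math.pi / 8)
P = 0.5 * math.cos(math.pi / 4)
A = 0.5 * math.cos(math.pi / 16)
B = 0.5 * math.cos(3 * math.pi / 16)
C = 0.5 * math.cos(5 * math.pi / 16)
D = 0.5 * math.cos(7 * math.pi / 16)


def dct8_direct(x):
    """8-point 1-D DCT evaluated term by term from equation (2)."""
    out = []
    for u in range(8):
        cu = 1 / math.sqrt(2) if u == 0 else 1.0
        acc = sum(x[i] * math.cos((2 * i + 1) * u * math.pi / 16) for i in range(8))
        out.append(0.5 * cu * acc)
    return out


def dct8_factored(x):
    """Same transform via the shared sums and differences (22 multiplications)."""
    x0, x1, x2, x3, x4, x5, x6, x7 = x
    a1 = x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7
    a2 = x0 - x1 - x2 + x3 + x4 - x5 - x6 + x7
    b1, b2, b3, b4 = x0 - x7, x1 - x6, x2 - x5, x3 - x4
    c1 = x0 - x3 - x4 + x7
    c2 = x1 - x2 - x5 + x6
    return [
        a1 * P,                                   # F(0)
        b1 * A + b2 * B + b3 * C + b4 * D,        # F(1)
        c1 * M + c2 * N,                          # F(2)
        b1 * B - b2 * D - b3 * A - b4 * C,        # F(3)
        a2 * P,                                   # F(4)
        b1 * C - b2 * A + b3 * D + b4 * B,        # F(5)
        c1 * N - c2 * M,                          # F(6)
        b1 * D - b2 * C + b3 * B - b4 * A,        # F(7)
    ]


if __name__ == "__main__":
    sample = [52, 55, 61, 66, 70, 61, 64, 73]     # arbitrary 8-bit test row
    for u, (d, f) in enumerate(zip(dct8_direct(sample), dct8_factored(sample))):
        print(u, round(d, 6), round(f, 6))
```

Both columns of the printout should agree to within floating point rounding for any input row.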

B. Algorithm for multiplier block

The multiplication of two 32-bit binary numbers is done using the floating-point multiplication algorithm. The algorithm takes two 32-bit binary numbers as inputs and produces one 32-bit binary number as output. There are three steps for multiplying two 32-bit binary numbers:

Step 1: Calculation of the resultant sign bit.
Step 2: Calculation of the resultant exponent bits.
Step 3: Calculation of the resultant significand bits (fractional part).

Fig. 1: Flowchart for IEEE 754 floating point multiplication

C. Algorithm for Adder Block

Two 32-bit binary numbers are added using the floating point addition algorithm:

Step 1: Compare the exponents of the two numbers and calculate the absolute value of the difference between the two exponents. Take the larger exponent as the tentative exponent of the result.
Step 2: Shift the significand of the number with the smaller exponent to the right by a number of bit positions equal to the exponent difference. Two of the shifted-out bits of the aligned significand are retained as the guard (G) and round (R) bits, so for a p-bit significand the effective width of the aligned significand must be (p + 2) bits. Append a third bit, the sticky bit (S), at the right end of the aligned significand. The sticky bit is the logical OR of all shifted-out bits.
Step 3: Add the two signed-magnitude significands using a (p + 3)-bit adder. Let the result be SUM.
Step 4: Check SUM for a carry out (Cout) from the MSB position during addition. If a carry out is detected, shift SUM right by one bit position and increment the tentative exponent by 1. Evaluate exception conditions, if any.
Step 5: Round the result if the logical condition R(M0 + S) is true, where M0 and R represent the p-th and (p + 1)-th bits from the left end of the normalized significand. The new sticky bit (S) is the logical OR of all bits to the right of the R bit. If the rounding condition is true, a 1 is added at the p-th bit (from the left side) of the normalized significand. If the p MSBs of the normalized significand are all 1s, rounding can generate a carry out. In that case, normalization (Step 4) has to be done again.

Fig. 2: Flowchart for IEEE 754 floating point addition
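The multiplier and adder steps of Sections III-B and III-C can be sketched in software as follows. This is a minimal model and not the authors' Verilog. It assumes normal (non-zero, non-subnormal, finite) binary32 operands, and the adder handles operands of equal sign only, because the steps listed in the text cover the carry-out case but not cancellation. All function names are hypothetical. Rounding is round-to-nearest-even, expressed with the condition R(M0 + S) from Step 5.

```python
import struct

BIAS = 127  # exponent bias of binary32


def float_to_bits(x):
    """Round a Python float to binary32 and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_float(bits):
    """Interpret a 32-bit pattern as a binary32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def split(bits):
    """Return (sign, biased exponent, 24-bit significand with the hidden 1)."""
    return bits >> 31, (bits >> 23) & 0xFF, (bits & 0x7FFFFF) | 0x800000


def round_and_pack(sign, exp, ext):
    """Round a 27-bit value (24-bit significand plus 3 extra bits) and pack it."""
    sig, extra = ext >> 3, ext & 0b111
    r, s, m0 = extra >> 2, extra & 0b11, sig & 1   # R bit, sticky, LSB
    if r and (m0 or s):                            # condition R(M0 + S)
        sig += 1
        if sig >> 24:                              # rounding carry: renormalize
            sig >>= 1
            exp += 1
    if not 0 < exp < 255:
        raise OverflowError("result is outside the normal binary32 range")
    return (sign << 31) | (exp << 23) | (sig & 0x7FFFFF)


def fp32_multiply(a_bits, b_bits):
    sa, ea, ma = split(a_bits)
    sb, eb, mb = split(b_bits)
    sign = sa ^ sb                                 # Step 1: sign bit
    exp = ea + eb - BIAS                           # Step 2: exponent
    prod = ma * mb                                 # Step 3: 48-bit product
    shift = 20
    if prod >> 47:                                 # product in [2, 4): normalize
        shift, exp = 21, exp + 1
    sticky = 1 if prod & ((1 << shift) - 1) else 0
    return round_and_pack(sign, exp, (prod >> shift) | sticky)


def fp32_add_same_sign(a_bits, b_bits):
    sa, ea, ma = split(a_bits)
    sb, eb, mb = split(b_bits)
    if sa != sb:
        raise ValueError("this sketch only adds operands of equal sign")
    if ea < eb:                                    # Step 1: larger exponent first
        ea, eb, ma, mb = eb, ea, mb, ma
    diff, exp = ea - eb, ea
    big, small = ma << 3, mb << 3                  # room for G, R and S
    sticky = 1 if small & ((1 << diff) - 1) else 0
    small = (small >> diff) | sticky               # Step 2: align, keep sticky
    total = big + small                            # Step 3: (p + 3)-bit add
    if total >> 27:                                # Step 4: carry out
        total = (total >> 1) | (total & 1)
        exp += 1
    return round_and_pack(sa, exp, total)          # Step 5: round


if __name__ == "__main__":
    a, b = float_to_bits(1.5), float_to_bits(2.25)
    print(bits_to_float(fp32_add_same_sign(a, b)))   # 3.75
    print(bits_to_float(fp32_multiply(a, b)))        # 3.375
```

The sticky bit is folded into the lowest of the three extra bits, so the rounding decision only needs to inspect three bits below the significand, which mirrors the (p + 3)-bit datapath described in Step 3.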

IV. PROPOSED 1D DCT ARCHITECTURE

The DCT is a computation-intensive operation. Calculating the 1-D DCT requires 56 multiplications and 56 additions/subtractions. These are reduced to 22 multiplications in the architecture of Section III by manual calculation, and they are further reduced to 10 multiplications in the proposed architecture, which is shown in Fig. 3. The architecture of Section III has seven constant terms, whereas the proposed architecture has only four. Like the architecture of Section III, it includes floating point multiplication and addition blocks. The proposed architecture has in all 28 additions/subtractions and 10 multiplications (C1, C3, etc.). The basic cell shown in Fig. 4 requires two real multiplications and two real additions, which are computed at the same time, so it needs a single clock cycle.

Fig. 3: Proposed DCT architecture

V. SYSTOLIC ARCHITECTURE FOR DCT

The N-point discrete transform is decomposed into even- and odd-numbered frequency samples, which are computed independently at the same time, so parallel processing is possible. The unified systolic array architecture can compute the DCT by defining different coefficient values specific to each transform.

Fig. 4: Basic cell

Fig. 5: Systolic architecture using basic cells

In the architecture for computation of the eight-point DCT shown in Fig. 5, the input data sequence is fed into the PU from left to right, whereas the DCT coefficient values are stored in the PEs. The systolic array can also be used for computation of the DST and DHT by changing the kernel values in the registers of the PEs. The unified systolic array shown in Fig. 5 requires $N^2/4$ basic cells for the N-point transform, and two real multipliers are needed in each PE.

VI. RESULTS

For the architectures in Fig. 1 and Fig. 2, consider the two fractional numbers 0.3535 and 0.707. In Fig. 6 and Fig. 7, the two inputs represented in IEEE 754 32-bit floating point format are

in1 = 00111110101101001111110111110100
in2 = 00111111001101001111110111110100

respectively. The corresponding floating point addition gives the result

out = 00111111000011110111110011101110

In the same way, the inputs in1 and in2 are applied to the floating point multiplier:

in1 = 00111110101101001111110111110100
in2 = 00111111001101001111110111110100

The corresponding floating point multiplication gives the result

out = 00111110011111111111011000011011
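The two input bit patterns quoted above can be related to the decimal values 0.3535 and 0.707 with a short Python sketch. It uses the standard `struct` module to convert between a Python float and its binary32 encoding. The helper names `decode_binary32` and `encode_binary32` are illustrative assumptions, not part of the paper.

```python
import struct


def decode_binary32(bit_string):
    """Split a 32-character bit string into sign, stored exponent, fraction, value."""
    bits = int(bit_string, 2)
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF          # biased exponent (bias 127)
    fraction = bits & 0x7FFFFF              # 23 fraction bits
    value = struct.unpack(">f", bits.to_bytes(4, "big"))[0]
    return sign, exponent, fraction, value


def encode_binary32(x):
    """Round x to binary32 and return the 32-character bit string."""
    return format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")


if __name__ == "__main__":
    in1 = "00111110101101001111110111110100"
    in2 = "00111111001101001111110111110100"
    for name, pattern in (("in1", in1), ("in2", in2)):
        sign, exponent, fraction, value = decode_binary32(pattern)
        # Stored exponents are 125 and 126, i.e. actual exponents -2 and -1.
        print(name, sign, exponent - 127, format(fraction, "023b"), value)
    print(encode_binary32(0.3535) == in1, encode_binary32(0.707) == in2)
```

Both inputs share the same fraction field because 0.707 is exactly twice 0.3535, so only the exponent field differs by one.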

The simulation results for the floating point addition/subtraction and multiplication are shown in Fig. 6 and Fig. 7.

Fig. 6: Simulation results of floating point addition

Fig. 7: Simulation results of floating point multiplication

Fig. 8 shows the simulation result for the DCT with reduced multipliers. For this, the 8-bit input is represented in the 32-bit floating point format, and the IEEE 754 floating point addition and multiplication are used to obtain the desired output.

Fig. 8: Simulation results for DCT with 22 multipliers

Fig. 9 shows the simulation results for the proposed DCT architecture. Here the input is taken in 8-bit format and the output is obtained in 32-bit format.

Fig. 9: Simulation results for the proposed DCT architecture

Fig. 10 presents the simulation results for the implementation of the DCT using the systolic architecture.

Fig. 10: Simulation result for systolic architecture

Using the Xilinx ISE synthesis tool, the synthesis reports for the systolic and proposed architectures were obtained as shown in Fig. 11 and Fig. 12. The performance of both the systolic and the proposed architecture was evaluated using the ISim simulator. Since the systolic architecture is implemented for a 1-bit 1-D DCT, its synthesis results are as shown in Fig. 11. As the proposed architecture deals with an 8-bit 1-D DCT, its synthesis results are shown in Fig. 12. If the systolic architecture were implemented for an 8-bit DCT, its hardware and power utilization would exceed that of the proposed architecture. The 2-D DCT can be obtained from the 1-D DCT by row-column decomposition, applying the 1-D DCT first to the rows and then to the columns.
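The row-column decomposition mentioned above can be illustrated with a small Python sketch that builds the 8x8 2-D DCT from sixteen 1-D transforms (eight on rows, eight on columns) and compares it with a direct evaluation of equation (1). The function names are illustrative, and the plain list transpose stands in for the transposition memory of the hardware design.

```python
import math


def c(k):
    """Scale factor C(k) of equations (1) and (2)."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct8(x):
    """8-point 1-D DCT of equation (2)."""
    return [
        0.5 * c(u) * sum(x[i] * math.cos((2 * i + 1) * u * math.pi / 16) for i in range(8))
        for u in range(8)
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct8x8_row_column(block):
    """Row transforms, transpose, column transforms, transpose back."""
    rows = [dct8(r) for r in block]                 # 1-D DCT on every row
    cols = [dct8(col) for col in transpose(rows)]   # 1-D DCT on every column
    return transpose(cols)                          # result[u][v] = F(u, v)


def dct8x8_direct(block):
    """Double sum of equation (1), used only as a reference."""
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            acc = 0.0
            for i in range(8):
                for j in range(8):
                    acc += (block[i][j]
                            * math.cos((2 * i + 1) * u * math.pi / 16)
                            * math.cos((2 * j + 1) * v * math.pi / 16))
            out[u][v] = (2 / 8) * c(u) * c(v) * acc
    return out


if __name__ == "__main__":
    block = [[(8 * i + j) % 13 for j in range(8)] for i in range(8)]  # arbitrary data
    fast = dct8x8_row_column(block)
    ref = dct8x8_direct(block)
    print(max(abs(fast[u][v] - ref[u][v]) for u in range(8) for v in range(8)))
```

The printed maximum difference should be at the level of floating point rounding error, since the two 1/2 scale factors of equation (2) multiply to the 2/8 factor of equation (1).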

Fig. 11: Device Utilization Summary for systolic architecture

Fig. 12: Device Utilization Summary for proposed architecture

VII. CONCLUSION

For a VLSI or hardware parallel implementation of an algorithm, reducing the number of multipliers is very important because they occupy a large area of the chip. Other important considerations are regularity and modularity in the computation structure and the complexity of the data access scheme. In this context, the architecture proposed in this paper reduces the number of multipliers from 56 to 10. A comparison of the different architectures is also presented.

VIII. REFERENCES

[1] M. P. Leong and Philip H. W. Leong, A Variable-Radix Digit-Serial Design Methodology and its Application to the Discrete Cosine Transform, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 11, No. 1, February 2003.
[2] Clay Gloster, Jr., Wanda Gay, Michaela Amoo, and Mohamed Chouikha, Optimizing the Design of a Configurable Digital Signal Processor for Accelerated Execution of the 2-D Discrete Cosine Transform, Proceedings of the 39th Hawaii International Conference on System Sciences, 2006.
[3] Cai Ken, Liang Xiaoying, Liu Chuanju, SOPC based flexible architecture for JPEG encoder, Proceedings of 2009 4th International Conference on Computer Science & Education, 2009.
[4] Thuyen Le and Manfred Glesner, Flexible Architectures for DCT of Variable-Length Targeting Shape-Adaptive Transform, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 10, No. 8, December 2000.
[5] Nam Ik Cho, Sang Uk Lee, Fast Algorithm and Implementation of 2-D Discrete Cosine Transform, IEEE Transactions on Circuits and Systems, Vol. 38, No. 3, March 1991.
[6] M. Vetterli, Fast 2-D Discrete Cosine Transform, in Proc. ICASSP 85, Mar. 1985.
[7] Peng Chungan, Cao Xixin, Yu Dunshan, Zhang Xing, A 250MHz optimized distributed architecture of 2D 8x8 DCT, 7th International Conference on ASIC, pp. 189-192, Oct. 2007.
[8] Roger Endrigo Carvalho Porto, Luciano Volcan Agostini, Project Space Exploration on the 2-D DCT Architecture of a JPEG Compressor Directed to FPGA implementation, IEEE, 2004.
[9] VijayaPrakash and K. S. Gurumurthy, A Novel VLSI Architecture for Digital Image Compression Using Discrete Cosine Transform and Quantization, IJCSNS, September 2010.