Implementation of a Unified DSP Coprocessor


Vol. (), Jan,, pp 3-43, ISSN: 35-543

Mojdeh Mahdavi
Department of Electronics, Shahr-e-Qods Branch, Islamic Azad University, Tehran, Iran
*Corresponding author's E-mail: Mahdavi@Shahryariau.ac.ir

Abstract

The DFT, the DHT, the DCT and the DST are standard choices in the signal processing domain. This paper describes the implementation of a unified coprocessor of transform length 8, as a synchronous design, on the XC3S4A-4FG484 FPGA device of Xilinx. The operating frequency of MHz is achieved. The paper presents the trade-offs involved in designing the architecture, the design-for-performance issues and the possibilities for future development.

Keywords: Coprocessor, Discrete Transforms, Implementation.

1. Introduction

Memory-based Field Programmable Gate Arrays (FPGAs) have the advantage of real-time in-circuit reconfigurability, as opposed to other gate arrays of similar gate density. This advantage translates into unlimited in-circuit flexibility, reconfigurability and reliability, facilitating the prototyping of complex electronic designs []. The high capacity and performance that FPGAs have achieved in recent years allow them to accelerate digital signal processing (DSP) tasks. FPGA devices have been used to implement custom DSPs since the beginning of this decade []. Usually, FPGAs are used as a VLSI replacement in low-volume production, or as prototyping devices for designs that are eventually to be implemented as ASICs. Their 100% testability and the possibility of achieving a high degree of fault coverage make them increasingly attractive for complex designs with multiple (and of course limited) iterations in their design cycles []. FPGA devices have benefited from the improvements in VLSI technology, leading to higher speed and capability as well as lower power consumption [].

The discrete transform algorithms are well known and, owing to their versatility and simple hardware implementation, are widely used in VLSI digital signal processing systems. The discrete Hartley transform (DHT) is similar to the DFT, the only difference being that it involves only real computation. The discrete cosine transform (DCT) has long been used in image and speech processing. The JPEG standard used the DCT as its basis function until JPEG 2000. The discrete sine transform (DST) is useful for spectrum analysis, data compression, speech processing, biomedical signal processing and many other applications. These basic transforms are required in almost all phases of image and signal processing and cover a large range of biomedical signal and image processing, for various imaging techniques and spectral analysis of signals [4]. A number of architectures have been proposed for the realization of these transforms [-7]. However, a unified architecture that can compute all of these transforms can serve the purpose of a general DSP chip; therefore a unified architecture has been adopted here to obtain all the transforms in a single FPGA chip. The basic structures of the DFT, DCT, DHT and DST are almost equivalent, and this property is exploited in the design of the unified architecture.

2. Discrete Transforms

This section presents the transforms in detail and the possibility of implementing them as the basic processing elements. For a real sample sequence x(n), where n = 0, 1, ..., N-1, the discrete transforms (the DFT, the DHT, the DCT and the DST) can be defined as follows.

DFT:

$$F(k) = \sum_{n=0}^{N-1} x(n)\left[\cos\frac{2\pi kn}{N} - j\sin\frac{2\pi kn}{N}\right], \quad k = 0, 1, \ldots, N-1; \qquad F(k) = F_x(k) - jF_y(k) \tag{1}$$

DHT:

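$$H(k) = \sum_{n=0}^{N-1} x(n)\left[\cos\frac{2\pi kn}{N} + \sin\frac{2\pi kn}{N}\right], \quad k = 0, 1, \ldots, N-1 \tag{2}$$

DCT:

$$C(k) = \frac{2}{N}\, c_k \sum_{n=0}^{N-1} x(n)\cos\frac{(2n+1)k\pi}{2N}, \quad k = 0, 1, \ldots, N-1, \qquad c_k = \begin{cases} 1/\sqrt{2}, & k = 0 \\ 1, & \text{otherwise} \end{cases} \tag{3}$$

DST:

$$S(k) = \frac{2}{N}\, p_k \sum_{n=0}^{N-1} x(n)\sin\frac{(2n+1)k\pi}{2N}, \quad k = 1, 2, \ldots, N, \qquad p_k = \begin{cases} 1/\sqrt{2}, & k = N \\ 1, & \text{otherwise} \end{cases} \tag{4}$$

2.1. DHT based on Direct Algorithm

Let x(n) and H(k) be an 8-length Hartley transform pair. The corresponding formulation in matrix form is [H] = [T].[X], where [T] is an 8 x 8 cas (cosine and sine) matrix [6]. Let

$$S_i^{(0)} = x(i), \quad i = 0, 1, 2, \ldots, 7 \tag{5}$$

The transform matrix for the 8-DHT is therefore:

$$\begin{bmatrix} H(0)\\ H(1)\\ H(2)\\ H(3)\\ H(4)\\ H(5)\\ H(6)\\ H(7) \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1\\ 1 & 1.414 & 1 & 0 & -1 & -1.414 & -1 & 0\\ 1 & 1 & -1 & -1 & 1 & 1 & -1 & -1\\ 1 & 0 & -1 & 1.414 & -1 & 0 & 1 & -1.414\\ 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1\\ 1 & -1.414 & 1 & 0 & -1 & 1.414 & -1 & 0\\ 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1\\ 1 & 0 & -1 & -1.414 & -1 & 0 & 1 & 1.414 \end{bmatrix} \begin{bmatrix} x(0)\\ x(1)\\ x(2)\\ x(3)\\ x(4)\\ x(5)\\ x(6)\\ x(7) \end{bmatrix} \tag{6}$$

We start by remarking that

$$\operatorname{cas}\frac{2\pi k(i + N/2)}{N} = \operatorname{cas}\left(\frac{2\pi ki}{N} + k\pi\right) = (-1)^k \operatorname{cas}\frac{2\pi ki}{N} \tag{7}$$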

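which follows from the addition-of-arcs formula

$$\operatorname{cas}(\alpha + \beta) = \cos\beta \,\operatorname{cas}\alpha + \sin\beta \,\operatorname{cas}'\alpha$$

where cas' is the complementary cas function, $\operatorname{cas}'(\alpha) = \cos(\alpha) - \sin(\alpha)$.

Clearly, the moduli of the components in the 2nd column are identical to those of the corresponding elements in the 6th column; the same is true for the 3rd and 7th columns. We can thus consider the new variables $x_1 + x_5$ and $x_1 - x_5$ instead of x(1) and x(5), $x_2 + x_6$ and $x_2 - x_6$ instead of x(2) and x(6), and so on:

$$\begin{aligned} S_0^{(1)} &= x(0) + x(4), & S_1^{(1)} &= x(0) - x(4),\\ S_2^{(1)} &= x(2) + x(6), & S_3^{(1)} &= x(2) - x(6),\\ S_4^{(1)} &= x(1) + x(5), & S_5^{(1)} &= x(1) - x(5),\\ S_6^{(1)} &= x(3) + x(7), & S_7^{(1)} &= x(3) - x(7) \end{aligned} \tag{8}$$

The first-order pre-additions defined above always yield at least a half of vanishing elements in the new transform matrix. Although such an implementation requires only two multiplications, we may go further and combine other columns. Therefore

$$\begin{aligned} S_0^{(2)} &= x(0) + x(4), & S_1^{(2)} &= x(0) - x(4),\\ S_2^{(2)} &= x(2) + x(6), & S_3^{(2)} &= x(2) - x(6),\\ S_4^{(2)} &= (x(1) + x(5)) + (x(3) + x(7)), & S_5^{(2)} &= (x(1) + x(5)) - (x(3) + x(7)),\\ S_6^{(2)} &= (x(1) - x(5)) + (x(3) - x(7)), & S_7^{(2)} &= (x(1) - x(5)) - (x(3) - x(7)) \end{aligned} \tag{9}$$

and the transform becomes:

$$\begin{bmatrix} H(0)\\ H(1)\\ H(2)\\ H(3)\\ H(4)\\ H(5)\\ H(6)\\ H(7) \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 1 & 0 & 0 & 0.707 & 0.707\\ 1 & 0 & -1 & 0 & 0 & 1 & 0 & 0\\ 0 & 1 & 0 & -1 & 0 & 0 & 0.707 & -0.707\\ 1 & 0 & 1 & 0 & -1 & 0 & 0 & 0\\ 0 & 1 & 0 & 1 & 0 & 0 & -0.707 & -0.707\\ 1 & 0 & -1 & 0 & 0 & -1 & 0 & 0\\ 0 & 1 & 0 & -1 & 0 & 0 & -0.707 & 0.707 \end{bmatrix} \begin{bmatrix} S_0^{(2)}\\ S_1^{(2)}\\ S_2^{(2)}\\ S_3^{(2)}\\ S_4^{(2)}\\ S_5^{(2)}\\ S_6^{(2)}\\ S_7^{(2)} \end{bmatrix} \tag{10}$$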
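The pre-addition scheme of this subsection can be checked numerically. The Python sketch below is illustrative only and is not the VHDL design described in this paper: the function names, the intermediate variable names and the particular grouping of the pre-added terms are assumptions made for the example. It compares an 8-point DHT computed directly from the cas definition with one computed from sums and differences of input pairs, using only two multiplications by 1/sqrt(2) (approximately 0.707).

```python
import math


def cas(theta):
    """cas(theta) = cos(theta) + sin(theta)."""
    return math.cos(theta) + math.sin(theta)


def dht_direct(x):
    """Direct DHT: H(k) = sum_n x(n) * cas(2*pi*k*n/N), k = 0..N-1."""
    n_pts = len(x)
    return [
        sum(x[n] * cas(2 * math.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def dht8_preadd(x):
    """8-point DHT from pre-additions (assumed grouping, for illustration)."""
    assert len(x) == 8
    # Sums and differences of samples that are N/2 = 4 apart.
    s0, s1 = x[0] + x[4], x[0] - x[4]
    s2, s3 = x[2] + x[6], x[2] - x[6]
    p, q = x[1] + x[5], x[1] - x[5]
    r, t = x[3] + x[7], x[3] - x[7]
    # Second level of pre-additions on the odd-indexed samples.
    s4, s5 = p + r, p - r
    s6, s7 = q + t, q - t
    # The only two multiplications in the whole transform.
    m6 = s6 / math.sqrt(2)
    m7 = s7 / math.sqrt(2)
    return [
        s0 + s2 + s4,       # H(0)
        s1 + s3 + m6 + m7,  # H(1)
        s0 - s2 + s5,       # H(2)
        s1 - s3 + m6 - m7,  # H(3)
        s0 + s2 - s4,       # H(4)
        s1 + s3 - m6 - m7,  # H(5)
        s0 - s2 - s5,       # H(6)
        s1 - s3 - m6 + m7,  # H(7)
    ]


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5, 6, 7, 8]
    for direct, fast in zip(dht_direct(sample), dht8_preadd(sample)):
        print(f"{direct:10.4f} {fast:10.4f}")
```

Running the script prints the two results side by side. In hardware the two multiplications would typically be constant multipliers, which is why the grouping is attractive for an FPGA mapping.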

2.2. An Algorithm for the DFT Implemented by DHT

According to the definitions of the DFT and the DHT, the DFT data sequence is given by the following relations:

$$\frac{DHT(N-k) + DHT(k)}{2} = \operatorname{Re}(DFT(k)) \tag{11}$$

$$\frac{DHT(N-k) - DHT(k)}{2} = \operatorname{Im}(DFT(k)) \tag{12}$$

2.3. Fast Cosine Transform based on Direct Algorithm

According to the definition of the DCT, for a given data sequence {x(n) : n = 0, 1, 2, ..., N-1}, the DCT data sequence {C(k) : k = 0, 1, 2, ..., N-1} is given by equation (3). The discrete cosine transform can be written as a matrix multiplication, as illustrated below [7-8]:

$$\begin{bmatrix} C(0) & C(4) & C(2) & C(6) & C(1) & C(5) & C(3) & C(7) \end{bmatrix}^{T} = [T_C] \cdot \begin{bmatrix} x(0) & x(2) & x(4) & x(6) & x(7) & x(5) & x(3) & x(1) \end{bmatrix}^{T} \tag{13}$$

where the entries of the 8 x 8 coefficient matrix $[T_C]$ are formed from the constants $\cos(\pi/8)$, $\sin(\pi/8)$, $\cos(\pi/16)$, $\cos(3\pi/16)$, $\sin(3\pi/16)$ and $\sin(\pi/16)$.

2.4. An Algorithm for the DST Implemented by DCT

In this part a method of composing the discrete sine transform from the discrete cosine transform is demonstrated. Let x(n), n = 0, 1, 2, ..., N-1, be a sequence of data values [9]. Substituting m = N-k, k = 1, 2, ..., N, into the discrete cosine transform (equation (3)) results in:

$$C(N-k) = \frac{2}{N}\, c_{N-k} \sum_{n=0}^{N-1} x(n)\cos\frac{(2n+1)(N-k)\pi}{2N}, \quad k = 1, 2, \ldots, N \tag{14}$$

Since

$$\cos\frac{(2n+1)(N-k)\pi}{2N} = \cos\frac{(2n+1)\pi}{2}\cos\frac{(2n+1)k\pi}{2N} + \sin\frac{(2n+1)\pi}{2}\sin\frac{(2n+1)k\pi}{2N} = (-1)^n \sin\frac{(2n+1)k\pi}{2N} \tag{15}$$

Therefore

$$C(N-k) = \frac{2}{N}\, c_{N-k} \sum_{n=0}^{N-1} (-1)^n x(n)\sin\frac{(2n+1)k\pi}{2N}, \quad k = 1, 2, \ldots, N \tag{16}$$

Let

$$\tilde{x}(n) = (-1)^n x(n), \quad n = 0, 1, 2, \ldots, N-1 \tag{17}$$

and let $\tilde{C}(m)$ be the discrete cosine transform of the sequence $\tilde{x}(n)$. Then:

$$\tilde{C}(N-k) = \frac{2}{N}\, c_{N-k} \sum_{n=0}^{N-1} (-1)^n \tilde{x}(n)\sin\frac{(2n+1)k\pi}{2N} = \frac{2}{N}\, c_{N-k} \sum_{n=0}^{N-1} x(n)\sin\frac{(2n+1)k\pi}{2N} = S(k), \quad k = 1, 2, \ldots, N \tag{18}$$

where S(k) is the discrete sine transform (DST) of the sequence x(n). Therefore, the procedure for obtaining the sine transform of the sequence x(n) is composed of three steps.

1. Change the signs of all odd-numbered data to the opposite sign to form a new sequence $\tilde{x}(n)$. (Notice that the sequence number is counted from zero.)
2. Compute the discrete cosine transform of the sequence $\tilde{x}(n)$.
3. Reverse the order of the data produced by step 2; the result is the discrete sine transform of the sequence x(n).

This procedure may be represented in the form of a matrix multiplication. Let $S_N$ and $C_N$ be the discrete sine and cosine transforms of order N, respectively [3-5]. Then

$$\bar{I}_N\, C_N\, D_N = S_N \tag{19}$$

where $\bar{I}_N$ is the opposite-diagonal identity matrix, and $D_N$ is the odd-sign-changing matrix defined as:

$$D_N = \operatorname{diag}(1, -1, 1, -1, \ldots, 1, -1) \tag{20}$$

3. Coprocessor Architecture

3.1. DCT_DST Block

The DCT block is first implemented according to the direct algorithm (equation (13)), and this DCT block is then used to implement the DCT_DST block (equation (19)). Figure 1 illustrates the proposed architecture for the DCT_DST block. If the "S" input signal has the logic value zero, the DCT is applied to the input data vector; if the "S" input signal has the logic value one, the DST is applied to the input data vector.

Fig. 1. Block diagram for proposed DCT_DST architecture
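The three-step DST procedure and the role of the "S" select input can be sketched in software. The following Python fragment is a behavioural illustration under stated assumptions, not the hardware described here: the function names, the `dct_dst_block` wrapper and the 2/N normalization with a 1/sqrt(2) weight on the k = 0 (DCT) and k = N (DST) terms are assumptions of this example. The sign-flip-and-reverse identity itself does not depend on the common scale factor.

```python
import math


def dct(x):
    """DCT, k = 0..N-1, with an assumed 2/N scaling and 1/sqrt(2) weight at k = 0."""
    n_pts = len(x)
    out = []
    for k in range(n_pts):
        c_k = 1 / math.sqrt(2) if k == 0 else 1.0
        acc = sum(
            x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
            for n in range(n_pts)
        )
        out.append(2.0 / n_pts * c_k * acc)
    return out


def dst_direct(x):
    """DST, k = 1..N, same assumed scaling, 1/sqrt(2) weight at k = N."""
    n_pts = len(x)
    out = []
    for k in range(1, n_pts + 1):
        p_k = 1 / math.sqrt(2) if k == n_pts else 1.0
        acc = sum(
            x[n] * math.sin((2 * n + 1) * k * math.pi / (2 * n_pts))
            for n in range(n_pts)
        )
        out.append(2.0 / n_pts * p_k * acc)
    return out


def dst_via_dct(x):
    """DST obtained from a DCT block by sign change and order reversal."""
    # Step 1: flip the sign of the odd-numbered samples (index counted from zero).
    x_tilde = [-v if n % 2 else v for n, v in enumerate(x)]
    # Step 2: cosine transform of the modified sequence.
    c_tilde = dct(x_tilde)
    # Step 3: reverse the output order, so element k-1 holds C~(N-k) = S(k).
    return c_tilde[::-1]


def dct_dst_block(x, s):
    """Behavioural model of a select input: s = 0 gives the DCT, s = 1 the DST."""
    return dst_via_dct(x) if s else dct(x)


if __name__ == "__main__":
    data = [3, -1, 4, 1, -5, 9, 2, -6]
    for direct, shared in zip(dst_direct(data), dct_dst_block(data, 1)):
        print(f"{direct:10.6f} {shared:10.6f}")
```

The point of the sketch is that the DST path reuses the DCT computation unchanged; only a sign inverter on the inputs and a reordering of the outputs are added, which matches the motivation for sharing one block between the two transforms.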

3.2. DHT_DFT Block

First, the DHT block is implemented according to the direct algorithm using its matrix form (equation (10)), and it is then used to implement the DHT_DFT block (equations (11) and (12)). Figure 2 shows the proposed architecture. The hardware is extracted from this data flow diagram. If the "S" input signal has the logic value zero, the DHT is applied to the input data vector; if the "S" input signal has the logic value one, the DFT is applied to the input data vector. The "I" signal is used to select the real or the imaginary part of the DFT. This signal is shown only to aid understanding of the block diagram and is ignored in the top module.

Fig. 2. Block diagram for proposed DHT_DFT architecture

Figure 3 shows the synthesized block diagram of the coprocessor. Figure 4 illustrates the simulation result of this module. During this simulation all four transforms of the coprocessor were applied to an eight-bit data input. If the "T_SEL" signal has the hexadecimal value "0", the outputs are zero. With the value "1", the "T_SEL" signal routes the DST to the output.

The DCT appears on the output when the "T_SEL" signal has the value "2". The values "3" and "4" route the DFT and the DHT to the output, respectively.

3.3. Implementation Results

The whole architecture, including the computation and data paths, is modeled at Register Transfer Level in VHDL, simulated and tested with a test bench in the ModelSim simulator, and implemented in the XC3S7An-4FG484 FPGA device of Xilinx. The simulation result of the proposed coprocessor is shown in Figure 4. The hardware description of this architecture for the DCT, DST, DHT and DFT of transform length 8 was synthesized using the Xilinx FPGA tool (ISE) and mapped onto the XC3S7An-4FG484 FPGA chip. In the 8-bit coprocessor implementation, the worst-case delay is about 48 ns and thus a frequency of MHz is achieved. The routed IP takes a total of 346 slices, which is 58 percent of the chip. The total number of I/Os used in the design is 38, which is 88 percent of the total I/Os of this chip.

4. Conclusion

This paper has proposed an efficient FPGA mapping of a common coprocessor. The DFT is implemented by means of the DHT, which is based on the direct algorithm. The direct fast DCT algorithm is presented, and a method of computing the discrete sine transform from the discrete cosine transform is then demonstrated. As future work, this coprocessor could be implemented using the DCT as the base transform for the other transforms, in order to obtain a further reduction in area.

References

[1] J. Davidson, "FPGA Implementation of a Reconfigurable Microprocessor", IEEE Custom Integrated Circuits Conference, pp. 3..-3..4, 9- May 1993.
[2] J. Valls, M. Kuhlmann, K. K. Parhi, "Efficient Mapping of CORDIC Algorithms on FPGA", IEEE Workshop on Signal Processing Systems, pp. 336-345, -3 Oct..

[3] B. Das, S. Banerjee, "Unified CORDIC-based chip to realize DFT/DHT/DCT/DST", IEE Proc. Comput. Digit. Tech., Vol. 49, No. 4, July.
[4] A. S. Dhar, S. Banerjee, "An array architecture for fast computation of discrete Hartley transform", IEEE Trans. Circuits Syst., Vol. 38, No. 9, pp. 95-98, 99.
[5] F. Angarita, A. Perez-Pascual, T. Sansaloni, Valls, "Efficient FPGA implementation of CORDIC algorithm for circular and linear coordinates", International Conference on Field Programmable Logic and Applications, pp. 535-538, 4-6 Aug. 5.
[6] A. C. Erickson, B. S. Fagin, "Calculating the FHT in hardware", IEEE Trans. Signal Processing, Vol. 4, pp. 34-353, June 99.
[7] H. EL-Bannai, A. A. EL-Fattah, W. Fakhr, "An efficient implementation of the D DCT using FPGA technology", ICM 3, Dec. 9-, Cairo, Egypt, 3.
[8] N. Ahmed, T. Natarajan, K. R. Rao, "Discrete cosine transform", IEEE Transactions on Computers, Vol. C- 3, Issue, pp. 9-93, Jan. 1974.
[9] J. Astola, D. Akopian, "Architecture-oriented regular algorithms for discrete sine and cosine transform", IEEE Trans. Signal Processing, Vol. 47, pp. 9-4, Apr. 1999.

Fig. 3. Synthesized circuit of the Coprocessor

Fig. 4. Simulation result of the Coprocessor
