Hardware Realization of FIR Filter Implementation through FPGA

NAME: ESHWARARAO BODDEPALLI, B. Tech E.C.E., (M. Tech) VLSI System Design.
NAME: LOKESHRAJU VYSYARAJU, M.Tech, Dept. of E.C.E., Assoc. Professor.
ADITYA INSTITUTE OF TECHNOLOGY AND MANAGEMENT, TEKKALI, A. P., INDIA

ABSTRACT: Distributed Arithmetic (DA) is an important technique for implementing digital signal processing (DSP) functions in FPGAs, and a powerful way of reducing the size of parallel hardware. When the DA algorithm is applied directly to a field programmable gate array (FPGA) to realize a finite impulse response (FIR) filter, it is difficult to achieve the best trade-off among the filter coefficients, the storage resources and the computing speed. To address this problem, the paper analyses and discusses the algorithm, the memory size and the look-up table speed in detail. The corresponding optimization and improvement measures are discussed, and a concrete hardware realization of the circuit is presented. The memory required by the improved algorithm is $2^{M/4} + 2^{M/4} = 2^{M/4+1}$, whereas the traditional algorithm requires $2^{M-1}$, so the memory scale is only $2^{-3M/4+2}$ times the original. Through the algorithm improvement, the hardware resources are reduced and the operation speed is increased. In this project a 16th-order FIR filter is proposed for implementation; design, implementation and verification are the aims of the project. Xilinx's Spartan 3E FPGA is targeted for this implementation. XILINX ISE Foundation (9.iSE (or) 0.iSE (or).ise) software is used for the FPGA design flow, which includes synthesis, translation, mapping, floor planning, placing and routing, post place and route simulation, and bit file generation. The simulation and test results show that this method greatly reduces the FPGA hardware resources and achieves high-speed filtering. The design is a substantial improvement over the traditional FPGA realization.
KEY TERMS: Improved DA algorithm, FPGA, Xilinx 0.SE, Look-Up Table and Bit Level Rearrangement.

1. INTRODUCTION: The Distributed Arithmetic (DA) algorithm was proposed by Croisier in 1973. It is an efficient technique for calculating sums of products, that is, for multiply and accumulate (MAC) applications. One advantage of DA is that it is well suited to the design of data-path circuits. A further advantage is that the required hardware can be reduced by up to 80% compared with a design that does not use DA; in some digital signal processing circuits, the total hardware requirement falls to less than 50%. Although DA is an old technique, it has regained importance because digital signal processing (DSP) circuits are now commonly implemented on field programmable gate arrays (FPGAs), where DA gives a great advantage in hardware implementation. For this reason DA is in great demand today. With DA we can implement a MAC system using the basic building blocks of the FPGA, namely its look-up tables (LUTs).

DESCRIPTION OF MAC OPERATION: MAC stands for multiply and accumulate. "Multiply" refers to the multiplication operation and "accumulate" refers to the addition.
When the multiplications and the accumulation are carried out together, the operation is known as the multiply and accumulate (MAC) operation.

All Rights Reserved 0 IJARECE 5

The following expression represents the MAC operation:
$$y = A_1 x_1 + A_2 x_2 + \dots + A_K x_K, \quad \text{i.e.} \quad y = \sum_{k=1}^{K} A_k x_k$$

where A is a vector of constant values and X is a vector of input variables. Each $A_k$ has M bits and each $x_k$ has N bits. y should be a memory element that is able to store the resulting value of the expression.

In a MAC unit, the first product ($A_1 x_1$) is calculated, then the second product ($A_2 x_2$), and the two results are added immediately. The third product is then calculated and added at once to the sum of the first two, and so on. DA is basically a bit-level rearrangement of this computation, in which read only memory (ROM) look-up tables supply the precomputed results.

Example: let A = [2, 4, 6, 8] and X = [1, 3, 5, 7], where K = 4.
Solution: y = 2x1 + 4x3 + 6x5 + 8x7 = 2 + 12 + 30 + 56 = 14 + 86 = 100.

The figure below shows the hardware required for the multiply and accumulate (MAC) unit.

POSSIBLE HARDWARE: Let A = [C1, C2, C3, C4] and X = [A, B, C, D], where K = 4. By using the DA algorithm, the multiplications are hidden behind ROM look-ups, which reduces the hardware requirement.

DA is usually defined as computation using a look-up table. The main application of DA is the dot-product computation of two vectors where one of the two vectors is constant (i.e. all of its elements are constant values). In this case, all additions that involve at least one element of the constant vector are precomputed and stored in a look-up table. At run time, the elements of the variable vector are used to address the look-up table and retrieve the partial sums in a bit-serial manner. One of the notable contributions to DA was made by White, who proposed the use of ROMs to store the precomputed values. The surrounding logic that accesses the ROM and retrieves the partial sums had to be implemented on a separate chip.
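As an illustration only, the MAC operation and the look-up-table form of DA described above can be sketched in Python. The function names (`mac`, `build_da_lut`, `da_inner_product`) are hypothetical, and the sketch assumes unsigned integer inputs rather than the two's-complement words a hardware design would use; it is not the circuit implemented in this paper. It uses the worked example A = [2, 4, 6, 8], X = [1, 3, 5, 7].

```python
def mac(coeffs, inputs):
    """Direct multiply and accumulate: y = sum(A_k * x_k)."""
    acc = 0
    for a, x in zip(coeffs, inputs):
        acc += a * x  # multiply, then add to the running sum
    return acc


def build_da_lut(coeffs):
    """Precompute every partial sum of the constant coefficients.

    Bit k of the address selects whether coeffs[k] is included.
    """
    return [
        sum(a for k, a in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_inner_product(coeffs, inputs, n_bits):
    """Bit-serial DA for unsigned n_bits-wide inputs (an assumption)."""
    lut = build_da_lut(coeffs)
    acc = 0
    for n in range(n_bits):              # one bit position per step
        addr = 0
        for k, x in enumerate(inputs):   # bit n of every input forms the address
            addr |= ((x >> n) & 1) << k
        acc += lut[addr] << n            # shift-and-add the looked-up partial sum
    return acc


A = [2, 4, 6, 8]
X = [1, 3, 5, 7]
print(mac(A, X))                  # 100
print(da_inner_product(A, X, 3))  # 100
```

Both routes give the same result; the DA version replaces the four multiplications with three table look-ups (one per input bit) followed by shifts and additions.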
Because the ROM and its surrounding logic had to sit on separate chips, this architecture saw little practical use and the DA method could not be applied successfully. With the appearance of SRAM (Static Random Access Memory) based FPGAs, DA became an interesting alternative for implementing signal processing applications in FPGAs: because SRAM is available in those FPGAs, the precomputed values can now be stored on the same chip as the surrounding logic. The implementation process is not always easy and can be time-consuming. In addition, a fixed-point format is normally used to represent real numbers, which results in a loss of accuracy and limits the range of the numbers. We have developed a framework to help designers develop signal processing applications using DA. Moreover, we are able to handle real numbers in the IEEE 754 floating-point format.

In the hardware described above, A, B, C and D are the shift registers.

DISTRIBUTED ARITHMETIC (DA) ALGORITHM: The DA algorithm is basically bit-serial in nature: the result bits are calculated serially. It operates on the basis of the multiply and accumulate (MAC) operation.

3. REDUCING THE MEMORY SIZE:

3.1 Memory Partitioning: One of several possible ways to reduce the memory size is to partition the memory into smaller memories whose outputs are added before the shift-accumulator. The amount of memory is reduced from $2^N$ words to $2 \cdot 2^{N/2}$ words if the original memory is partitioned into two parts. The figure below shows
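the arrangement of the partitioned memories in the hardware implementation.

3.2 Memory Coding: The second approach is based on a special coding of the ROM content. The memory size can be halved by using an ingenious scheme based on the identity

$$x = \tfrac{1}{2}\,[\,x - (-x)\,]$$

In two's-complement representation the identity can be rewritten in terms of the differences $(x_k - \bar{x}_k)$ between each data bit $x_k$ and its complement $\bar{x}_k$. Notice that $(x_k - \bar{x}_k)$ can only take on the values $-1$ or $+1$. Inserting this expression into the inner product yields a sum over the bit positions k of terms $F_k(x_{1k}, x_{2k}, \dots, x_{Nk})$, in which each coefficient $A_i$ is added when the bit $x_{ik}$ is 1 and subtracted when it is 0. The ROM stores these F values. The function $F_k$ is shown in the table below for N = 3.

x1  x2  x3  |  F
0   0   0   |  -A1 - A2 - A3
0   0   1   |  -A1 - A2 + A3
0   1   0   |  -A1 + A2 - A3
0   1   1   |  -A1 + A2 + A3
1   0   0   |  +A1 - A2 - A3
1   0   1   |  +A1 - A2 + A3
1   1   0   |  +A1 + A2 - A3
1   1   1   |  +A1 + A2 + A3

The table is anti-symmetric. Notice that only half the values are needed, since the other half can be obtained by changing the signs. If $a_i$ XOR $b_i$ = 1, the F values are applied directly to the accumulators, and if $a_i$ XOR $b_i$ = 0, the F values are interchanged. The F values are either added to, or subtracted from, the accumulators' registers depending on the data bits $a_i$ and $b_i$. Input samples that are multiplied by the same coefficient are added (or subtracted).

4. IMPROVED DESIGN OF THE DA ALGORITHM: In this section N denotes the word length of each input $x_k$, $b_{kn}$ denotes its n-th bit, and M denotes the number of terms of the inner product. The input $x_k$ can be expressed as Eq. (4):

$$x_k = \tfrac{1}{2}\,[\,x_k - (-x_k)\,] \qquad (4)$$

where $-x_k$ can be expressed as Eq. (5) according to the binary complement operation [3]:

$$-x_k = -\bar{b}_{k0} + \sum_{n=1}^{N-1} \bar{b}_{kn}\,2^{-n} + 2^{-(N-1)} \qquad (5)$$

Substituting Eq. (5) into Eq. (4) step by step gives

$$x_k = \tfrac{1}{2}\Big[-(b_{k0} - \bar{b}_{k0}) + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn})\,2^{-n} - 2^{-(N-1)}\Big] \qquad (6)$$

For convenience, two variables are defined as follows:

$$c_{k0} = -(b_{k0} - \bar{b}_{k0}), \qquad c_{kn} = b_{kn} - \bar{b}_{kn} \quad (n \neq 0)$$

As the value of $b_{kn}$ is 0 or 1, the value of $c_{kn}$ and $c_{k0}$ is $\pm 1$. Then Eq. (6) can be expressed as Eq. (7):

$$x_k = \tfrac{1}{2}\Big[\sum_{n=0}^{N-1} c_{kn}\,2^{-n} - 2^{-(N-1)}\Big] \qquad (7)$$

so that

$$y = \sum_{k=1}^{M} A_k x_k = \tfrac{1}{2}\sum_{n=0}^{N-1}\Big[\sum_{k=1}^{M} A_k c_{kn}\Big]2^{-n} - \tfrac{1}{2}\Big[\sum_{k=1}^{M} A_k\Big]2^{-(N-1)} \qquad (8)$$

The inner sum $\sum_{k=1}^{M} A_k c_{kn}$ can take $2^M$ different values. Because the value of $c_{kn}$ is $\pm 1$, these results show a positive and negative symmetry property.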
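The sign symmetry noted above can be checked with a small sketch. It follows the convention of the N = 3 table (coefficient added when the bit is 1, subtracted when it is 0); the function names and the coefficient values are hypothetical and chosen only for illustration, not taken from the filter in this paper.

```python
def signed_sum_table(coeffs):
    """Full table of F values: +A_i if bit x_i is 1, -A_i if it is 0.

    x1 is taken as the most significant address bit, as in the table.
    """
    n = len(coeffs)
    table = []
    for addr in range(1 << n):
        total = 0
        for i, a in enumerate(coeffs):
            bit = (addr >> (n - 1 - i)) & 1
            total += a if bit else -a
        table.append(total)
    return table


def half_table_lookup(half_table, addr, n):
    """Recover any F value from the lower half of the table only.

    Addresses in the upper half are complemented and the stored value
    is negated, which is the anti-symmetry F(~x) = -F(x).
    """
    if (addr >> (n - 1)) & 1 == 0:
        return half_table[addr]
    mask = (1 << n) - 1
    return -half_table[~addr & mask]


coeffs = [3, 5, 11]                   # hypothetical A1, A2, A3
full = signed_sum_table(coeffs)       # 8 entries
half = full[: len(full) // 2]         # only 4 entries need storing
recovered = [half_table_lookup(half, addr, 3) for addr in range(8)]
print(recovered == full)              # True
```

Storing only the lower half and negating on demand is one way to realize the halving of the memory; in hardware the negation would typically be an add/subtract control on the accumulator.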
If the positive and negative signs are not considered, there are only $2^{M-1}$ different results, so the size of the storage is reduced by half. With this symmetry taken into account, the output is evaluated as

$$y = \tfrac{1}{2}\sum_{n=0}^{N-1}\Big[\sum_{k=1}^{M} A_k c_{kn}\Big]2^{-n} - \tfrac{1}{2}\Big[\sum_{k=1}^{M} A_k\Big]2^{-(N-1)} \qquad (9)$$
In Eq. (9), the inner sum over k is divided at the indices a < b < ... < y < z, so an inner product operation with the scale of $2^M$ is realized through several LUTs of different or equal depth, together with adders. The scale of the memory is $2^{a} + 2^{b-a} + \dots + 2^{z-y} + 2^{M/2-z}$. For example, if two LUTs with a depth of $2^{M/4}$ and adders are used, the size of the memory is $2^{M/4} + 2^{M/4} = 2^{M/4+1}$. Compared with the memory size of $2^{M-1}$ before optimization, the memory scale is only $2^{-3M/4+2}$ times the original. The simplified hardware circuit structure is shown in Figure 1.

Figure 1: The circuit structure through the algorithm improvement

5. THE CIRCUIT DESIGN OF FIR FILTER:

A. Design Index and Parameters Extraction: A 16th-order FIR filter is designed. Its parameters are as follows: the sampling frequency is .5 MHz; the pass-band cut-off frequency is 00 Hz; the widths of the input data, the output data and the filter coefficients are 8, 6 and 0 bits respectively. The filter is designed with a Hamming window, and a MATLAB simulation is used to calculate its unit-sample response h(k) and simply it 6 times. The h(k) is as follows:

H(0) = H(15) = 98D
H(1) = H(14) = 578D
H(2) = H(13) = 364D
H(3) = H(12) = 78D
H(4) = H(11) = 4503D
H(5) = H(10) = 6400D
H(6) = H(9) = 7996D
H(7) = H(8) = 8908D

B. The Hardware Circuit Unit: The address maker circuit generates the LUT address. The upper half of the address looks up its corresponding pre-stored value. The hardware circuit is shown in Fig. 2.

Fig. 2: The circuit structure of FIR system

When the DA algorithm is used to implement the linear time-invariant system, the algorithm is optimized according to the method described above. The pre-stored value corresponding to the upper half of the LUT address range is the negative of the value for the lower half, so the LUT is halved by using this symmetry. Meanwhile, the address is used as the Ctrl signal of an add/subtract unit, which completes the positive and negative conversion between the pre-stored values of the upper and lower halves. According to the result of the improvement and optimization, the LUT is divided into two 4-input LUTs, and the address maker circuit divides the input signals into 4 segments in accordance with the 4-input LUT. The data buffer can be established according to the order of the filter. As the designed filter is a 16th-order one, the sampled serial data are sent to the 0 bits serial-in parallel-out shift register, and the data are then divided and sent to the LUT in turn.

C. Circuit Simulation and Testing: The input sequence is x(n) = [0, 3,,, 0,,, 4, 3,,, 0,,,, 3] and the simulation waveform is shown in Figure 3.

Fig. 3: The simulation waveform

The filter input and output in the waveform use hexadecimal representation. The results are consistent with what was expected. The FPGA-based filter is implemented with the DA algorithm and with the improved DA algorithm separately. The improved algorithm greatly reduces the hardware resources and improves the throughput efficiently. It meets the design requirements entirely.
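The idea of dividing one large LUT into two 4-input LUTs whose outputs are added can be sketched as follows. This is a minimal model under stated assumptions, not the circuit of this paper: the 8 coefficients and the input samples are hypothetical values, the inputs are treated as unsigned 3-bit words, and the function names are made up for the example. The result is compared with a direct convolution.

```python
def build_lut(coeffs):
    """All partial sums of a small group of coefficients."""
    return [
        sum(a for k, a in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_output(coeffs, window, n_bits, group=4):
    """One filter output using one LUT per group of `group` taps."""
    luts = [build_lut(coeffs[i:i + group]) for i in range(0, len(coeffs), group)]
    acc = 0
    for n in range(n_bits):
        partial = 0
        for g, lut in enumerate(luts):
            chunk = window[g * group:(g + 1) * group]
            addr = 0
            for k, x in enumerate(chunk):      # bit n of each sample in the group
                addr |= ((x >> n) & 1) << k
            partial += lut[addr]               # adder combining the small LUTs
        acc += partial << n                    # shift-accumulate
    return acc


def da_fir(coeffs, samples, n_bits):
    """Run the partitioned-LUT DA filter over a sample sequence."""
    window = [0] * len(coeffs)                 # window[0] is the newest sample
    out = []
    for s in samples:
        window = [s] + window[:-1]
        out.append(da_output(coeffs, window, n_bits))
    return out


def direct_fir(coeffs, samples):
    """Reference: y[n] = sum_k h[k] * x[n - k]."""
    return [
        sum(h * samples[n - k] for k, h in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]


h = [1, 3, 5, 7, 7, 5, 3, 1]          # hypothetical symmetric coefficients
x = [0, 3, 1, 2, 0, 1, 2, 4]          # hypothetical unsigned 3-bit samples
print(da_fir(h, x, 3) == direct_fir(h, x))   # True
print(2 ** len(h), 2 * 2 ** 4)               # 256 32
```

For these 8 taps a single LUT would need 256 words, while two 4-input LUTs need 32 words in total, at the cost of one extra adder.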
6. HARDWARE ARCHITECTURE: The figure below shows the internal architecture of the FIR filter design presented here. It gives a realistic view of the internal architecture of the FIR filter designed with the improved DA algorithm. The Field Programmable Gate Array (FPGA) implementation can be developed in the Verilog hardware description language using the Spartan 3E S350E hardware kit. It shows the hardware requirement of the FIR filter when it is developed with and without the improved DA algorithm.

Fig-4: Hardware design of design project

Fig-5: Clear cut view of hardware design of package

These figures show the programmed FPGA implementation of the Finite Impulse Response (FIR) filter. Here we can see that the improved Distributed Arithmetic algorithm can be utilized in the FIR filter design. By observing the two figures above, we can see that the hardware required to realize the FIR filter can be reduced using FPGA programming.

CONCLUSIONS: DA is a very efficient means of mechanizing computations that are dominated by inner products (convolution). DA has always fared well, not always (but often) best, and never poorly. If the performance/cost ratio is critical, DA should be seriously considered as a contender. When the DA algorithm is applied directly to realize an FIR filter, the complicated multiply and accumulate operation is converted into shifting and adding operations. Aiming at the problems of the best configuration of the filter coefficients, the storage resources and the calculating speed, the DA algorithm is optimized and improved in its algorithm structure, memory size and LUT speed. The arithmetic expression has a clearly layered derivation and the circuit structure is reasonable, which makes the memory size smaller and the operation speed faster. The design improves greatly on the conventional FPGA realization, and it can be flexibly applied to implement high-pass, low-pass and band-pass filters by changing the order and the LUT coefficients.

REFERENCES:
[01] A. Peled and B. Liu, "A New Hardware Realization of Digital Filters," IEEE Trans. on A.S.S.P., Vol. ASSP-22, pp. 456-462, December 1974.
[02] S. A. White, "Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review," IEEE ASSP Magazine, Vol. 6, No. 3, pp. 4-19.
[03] B. New, "A Distributed Arithmetic Approach to Designing Scalable DSP Chips," Electronic Design News, August 7, 1995.
[04] L. Mintzer, "FIR Filters with Xilinx FPGA," FPGA '92 ACM/SIGDA Workshop on FPGAs, pp. 129-134.
[05] W. Shang and B. W. Wah, "Dependence Analysis and Architecture Design for Bit-Level Algorithms," Intl. Conf. on Parallel Processing, vol. I, pp. 30-38, 1993.
[06] W. D. Little, "A Fast Algorithm for Digital Filters," IEEE Trans. on Communications, Vol. C-3, pp. 466-469, May 1974.
[07] C. S. Burrus, "Digital Filter Realization by Distributed Arithmetic," International Symposium on Circuits and Systems, Munich, April 1976.
[08] K. D. Kammeyer, "Digital Filter Realization in Distributed Arithmetic," Proc. European Conf. on Circuit Theory and Design, Genoa, Italy, September 1976.
[09] F. J. Taylor, "An Analysis of the Distributed Arithmetic Digital Filter," IEEE Trans. on A.S.S.P., Vol. ASSP-35, No. 5, pp. 1165-1170, Oct. 1986.
[10] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[11] L. Zhuo and V. K. Prasanna, "Sparse Matrix-Vector Multiplication on FPGAs," International Symposium on Field Programmable Gate Arrays (FPGA), Monterey, CA, 2005.
[12] K. Chapman, "Constant Coefficient Multipliers for the XC4000E," Xilinx Technical Report, 1996.
[13] M. J. Wirthlin, "Constant Coefficient Multiplication Using Look-Up Tables," Journal of VLSI Signal Processing, Vol. 36, pp. 7-15, 2004.
[14] Kalyani, "A Novel Distributed Arithmetic Based Algorithm and its Implementation for LTE Standard," European Journal of Scientific Research, ISSN 1450-216X, Vol. 70, No. 4 (0), pp. 68-636.
[15] V. Sudhakar, N. S. Murthy and L. Anjaneyulu, "Area Efficient Pipelined Architecture for Realization of FIR Filter Using Distributed Arithmetic," 0 International Conference on Industrial and Intelligent Information (ICIII 0), IPCSIT vol. 3 (0), IACSIT Press, Singapore.
[16] A. P. Ramesh, G. Nagarjuna and G. Siva Raam, "FPGA Based Design and Implementation of Higher Order FIR Filter Using Improved DA Algorithm," International Journal of Computer Applications (0975-8887), Volume 35, No. 9, December 2011.
[17] Suvarna Joshi and A. Bharathi, "FPGA Based FIR Filter," International Journal of Engineering Science and Technology, Vol. (), 00, 730-733.
[18] T. J. Moeller and D. R. Martinez, "Field Programmable Gate Array Based Radar Front-End Digital Signal Processing," Seventh Annual IEEE Symposium on FCCM '99, pp. 178-187, 1999.
[19] W. S. Song, "VLSI Bit-Level Systolic Array for Radar Front-End Signal Processing," Conference Record of the Twenty-Eighth Asilomar Conference on Signals, Systems and Computers, vol., pp. 407-4, 1994.
[20] C. L. Wang, C. H. Wei and S. H. Chen, "Efficient Bit-Level Systolic Array Implementation of FIR and IIR Digital Filters," IEEE Journal on Selected Areas in Communications, Vol. 6, Iss. 3, pp. 484-493, April 1988.
[21] Z. Wu, C. Luo, X. Su and X. Xu, "Digital Filter Implementation for Software Radio," IEEE VTC 00 Spring, Vol. 3, pp. 90-906, 00.
[22] L. Mintzer, "Digital Filtering in FPGAs," Conference Record of the 28th Asilomar Conference on Signals, Systems and Computers, vol., pp. 373-377, 1994.
[23] Altera Corporation, APEX 20K Programmable Logic Device Family Data Sheet, Ver. 4.3, Feb. 00.

Author Description: Eshwararao Boddepalli completed his Bachelor of Technology in Electronics and Communication Engineering and is pursuing a Master of Technology in VLSI System Design. His research area is VLSI and FPGA implementation of DSP. Lokeshraju Vysyaraju completed his Master of Technology and is working as an Associate Professor in the Department of Electronics and Communication Engineering, Aditya Institute of Technology and Management, Andhra Pradesh, India.