Two High Performance Adaptive Filter Implementation Schemes Using Distributed Arithmetic

Similar documents
Fast Block LMS Adaptive Filter Using DA Technique for High Performance in FGPA

Adaptive FIR Filter Using Distributed Airthmetic for Area Efficient Design

ISSN (Online), Volume 1, Special Issue 2(ICITET 15), March 2015 International Journal of Innovative Trends and Emerging Technologies

DESIGN AND IMPLEMENTATION OF DA- BASED RECONFIGURABLE FIR DIGITAL FILTER USING VERILOGHDL

A HIGH PERFORMANCE FIR FILTER ARCHITECTURE FOR FIXED AND RECONFIGURABLE APPLICATIONS

IMPLEMENTATION OF AN ADAPTIVE FIR FILTER USING HIGH SPEED DISTRIBUTED ARITHMETIC

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN

A Novel Distributed Arithmetic Multiplierless Approach for Computing Complex Inner Products

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression

COPY RIGHT. To Secure Your Paper As Per UGC Guidelines We Are Providing A Electronic Bar Code

Batchu Jeevanarani and Thota Sreenivas Department of ECE, Sri Vasavi Engg College, Tadepalligudem, West Godavari (DT), Andhra Pradesh, India

Implementation of High Speed FIR Filter using Serial and Parallel Distributed Arithmetic Algorithm

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter

DUE to the high computational complexity and real-time

FPGA Implementation of High Speed FIR Filters Using Add and Shift Method

Hardware Realization of FIR Filter Implementation through FPGA

A Novel Approach of Area-Efficient FIR Filter Design Using Distributed Arithmetic with Decomposed LUT

Fixed Point LMS Adaptive Filter with Low Adaptation Delay

Fault Tolerant Parallel Filters Based On Bch Codes

Critical-Path Realization and Implementation of the LMS Adaptive Algorithm Using Verilog-HDL and Cadence-Tool

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume 10 /Issue 1 / JUN 2018

Design of a Multiplier Architecture Based on LUT and VHBCSE Algorithm For FIR Filter

INTEGER SEQUENCE WINDOW BASED RECONFIGURABLE FIR FILTERS.

FPGA Implementation of 16-Point Radix-4 Complex FFT Core Using NEDA

Low-Power FIR Digital Filters Using Residue Arithmetic

The Efficient Implementation of Numerical Integration for FPGA Platforms

Introduction to Field Programmable Gate Arrays

An Efficient Implementation of Fixed-Point LMS Adaptive Filter Using Verilog HDL

Area And Power Efficient LMS Adaptive Filter With Low Adaptation Delay

Power and Area Efficient Implementation for Parallel FIR Filters Using FFAs and DA

Improving Area Efficiency of Residue Number System based Implementation of DSP Algorithms

Design Optimization Techniques Evaluation for High Performance Parallel FIR Filters in FPGA

An Efficient Design of Sum-Modified Booth Recoder for Fused Add-Multiply Operator

Low Complexity Opportunistic Decoder for Network Coding

ABSTRACT I. INTRODUCTION. 905 P a g e

Implementation Of Quadratic Rotation Decomposition Based Recursive Least Squares Algorithm

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST

International Journal for Research in Applied Science & Engineering Technology (IJRASET) IIR filter design using CSA for DSP applications

VLSI Implementation of Low Power Area Efficient FIR Digital Filter Structures Shaila Khan 1 Uma Sharma 2

Parallel FIR Filters. Chapter 5

Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications

Design of Fir Filter Architecture Using Manifold Steady Method

AnEfficientImplementationofDigitFIRFiltersusingMemorybasedRealization

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

Reduction of Latency and Resource Usage in Bit-Level Pipelined Data Paths for FPGAs

Vertical-Horizontal Binary Common Sub- Expression Elimination for Reconfigurable Transposed Form FIR Filter

Head, Dept of Electronics & Communication National Institute of Technology Karnataka, Surathkal, India

THE orthogonal frequency-division multiplex (OFDM)

Implementing FIR Filters

CHAPTER 4. DIGITAL DOWNCONVERTER FOR WiMAX SYSTEM

Implementation of a Low Power Decimation Filter Using 1/3-Band IIR Filter

TRACKING PERFORMANCE OF THE MMAX CONJUGATE GRADIENT ALGORITHM. Bei Xie and Tamal Bose

Image Compression System on an FPGA

HIGH SPEED REALISATION OF DIGITAL FILTERS

High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC

LOW COMPLEXITY SUBBAND ANALYSIS USING QUADRATURE MIRROR FILTERS

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics

FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas

High Speed Systolic Montgomery Modular Multipliers for RSA Cryptosystems

Implementation and Impact of LNS MAC Units in Digital Filter Application

Adaptive Filtering using Steepest Descent and LMS Algorithm

Abstract. Literature Survey. Introduction. A.Radix-2/8 FFT algorithm for length qx2 m DFTs

VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH THE EFFICIENT MULTIPLICATIVE INVERSE UNIT

Reduced Complexity Decision Feedback Channel Equalizer using Series Expansion Division

On Designs of Radix Converters Using Arithmetic Decompositions

Performance of Error Normalized Step Size LMS and NLMS Algorithms: A Comparative Study

HIGH PERFORMANCE QUATERNARY ARITHMETIC LOGIC UNIT ON PROGRAMMABLE LOGIC DEVICE

VHDL Implementation of Multiplierless, High Performance DWT Filter Bank

COMPARISON OF DIFFERENT REALIZATION TECHNIQUES OF IIR FILTERS USING SYSTEM GENERATOR

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier. Guntur(Dt),Pin:522017

Operators to calculate the derivative of digital signals

An Enhanced Mixed-Scaling-Rotation CORDIC algorithm with Weighted Amplifying Factor

A Review on Optimizing Efficiency of Fixed Point Multiplication using Modified Booth s Algorithm

Compact Clock Skew Scheme for FPGA based Wave- Pipelined Circuits

High Level Abstractions for Implementation of Software Radios

Volume 5, Issue 5 OCT 2016

Binary Adders. Ripple-Carry Adder

Key words: B- Spline filters, filter banks, sub band coding, Pre processing, Image Averaging IJSER

An Optimized Montgomery Modular Multiplication Algorithm for Cryptography

Fault Tolerant Parallel Filters Based on ECC Codes

Implementation of Reduce the Area- Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay

OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER.

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology

[Sahu* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

Delay Optimised 16 Bit Twin Precision Baugh Wooley Multiplier

Module 9 AUDIO CODING. Version 2 ECE IIT, Kharagpur

REAL TIME DIGITAL SIGNAL PROCESSING

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm

MODULO 2 n + 1 MAC UNIT

Low-Power, High-Throughput and Low-Area Adaptive Fir Filter Based On Distributed Arithmetic Using FPGA

Performance Analysis of Adaptive Filtering Algorithms for System Identification

LOW-POWER SPLIT-RADIX FFT PROCESSORS

FPGA IMPLEMENTATION OF HIGH SPEED DCT COMPUTATION OF JPEG USING VEDIC MULTIPLIER

ARITHMETIC operations based on residue number systems

OPTIMIZING THE POWER USING FUSED ADD MULTIPLIER

Design and Implementation of 3-D DWT for Video Processing Applications

Realization of Hardware Architectures for Householder Transformation based QR Decomposition using Xilinx System Generator Block Sets

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH

ERROR MODELLING OF DUAL FIXED-POINT ARITHMETIC AND ITS APPLICATION IN FIELD PROGRAMMABLE LOGIC

Transcription:

Two High Performance Adaptive Filter Implementation Schemes Using istributed Arithmetic Rui Guo and Linda S. ebrunner Abstract istributed arithmetic (A) is performed to design bit-level architectures for vector-vector multiplication with a direct application for the implementation of convolution, which is necessary for digital filters. In this work, two novel A based implementation schemes are proposed for adaptive FIR filters. ifferent from conventional A techniques, our proposed schemes use coefficients as addresses to access a series of look-up tables (LUTs) storing sums of delayed and scaled input samples. Two smart LUT updating methods are developed, and least mean square (LMS) adaptation is performed to update the weights, and minimize the mean square error between the estimated and desired output. Results show that our two high performance designs achieve high-speed, low computation complexities, and low area costs. Index Terms adaptive filter, distributed arithmetic (A), finite impluse response (FIR), least mean square (LMS), look-up table (LUT), multiply-accumulate (MAC), offset-binary coding (OBC). I. INTROUCTION Most portable electronic devices, such as cellular phones, PAs, and hearing aids, require digital signal processing (SP) for high performance. ue to the increased demand of implementation of sophisticated SP algorithms, low-cost designs, i.e., low-area and low-power cost, are needed to make these hand-held devices small with good performance. Various types of SP operations are employed in practice. Filtering is one of the most widely used complex signal processing operations []. For discrete finite impulse response (FIR) filters, the output y(n) is a linear convolution of weights w n, and inputs. For an N-order FIR filter, the generation of each output sample y(n) takes N multiply-accumulate (MAC) operations. Since general purpose multipliers require significant chip area, alternate methods of implementing multiplication are often used particularly when the coefficients values are known prior to implementation. istributed arithmetic (A) is one way to implement convolution multiplierlessly, where the MAC operations are replaced by a series of LUT accesses and summations. Techniques, such as ROM decomposition [] and offset-binary coding (OBC) [7] can reduce the LUT size, which would otherwise increase exponentially with the filter length N for conventional A. However, in many applications, such as echo cancellation and system identification, coefficient adaptation is needed. This adaptation makes it challenging to implement A-based adaptive filters with low-cost due to the necessity of updating of LUTs. Several approaches have been developed for Abased adaptive filters, from the point of view of reducing logic complexity [3] [6], [8]. Recently, a A-based FIR adaptive filter implementation scheme is presented in [5], [6], [8], which uses extra auxiliary LUTs to help in the updating; however, memory usage is doubled. In this paper, two novel LMS adaptation based A implementation schemes are proposed for FIR adaptive filter implementation. The first proposed algorithm updates the LUTs in a similar way as described in [5], [6], [8] but without the need for auxiliary LUTs. The second proposed algorithm incorporates an OBC-based LUT updating scheme that further reduces memory usage. It is shown that our two proposed schemes both outperform that described in [5], [6], [8], with the second proposed algorithm requiring less memory usage, but more computation cost than our first proposed algorithm. This brief is organized as follows. Section II describes the background of A and OBC. Then, we present our proposed schemes for A-based FIR adaptive filter in Section III. A performance comparison of different A-based implementations is made in Section IV. Our conclusions are given in Section V. II. BACKGROUN A. istributed Arithmetic (A) A was first studied by Croisier [9] in 973, and popularized by Peled and Liu []. Quantization effects in A system were analyzed by Kammeyer [] and Taylor []. Useful tutorials on A were provided by White [7] and Kammeyer [3]. A is used to design bit-level architecture for vector multiplications []. Traditionally, for filters implemented using A, the input samples are used as addresses to access a series of LUTs whose entries are sums of coefficients. Consider a discrete N-order FIR filter with constant coefficients, and input samples coded as B-bit two s complement numbers with only the sign bit to the left of the binary point x(n k) = x k x k () Using () to compute the FIR output gives y(n) = N w k x k [ N ] w k x k () With C = N w kx k, [, B ], and C = N w kx k, () can be rewritten as

y(n) = = C (3) The C values can be precomputed and stored in a LUT with the input used as the address. This technique allows the FIR filter with known coefficients to be implemented without general purpose multipliers. This implementation requires a LUT with a size that increases exponentially with the number of taps N, which results in a large time cost for accessing the LUT for a high order filter. So, reducing the LUT size improves system performance, as well as area cost. One possible way to reduce LUT size, called ROM decomposition, replaces a longer address by shorter addresses, and the data read from smaller LUTs is accumulated to generate the output. For a 64-tap FIR filter, by breaking the LUT with 64 entries into smaller LUTs with 4-bit addresses, only 64 4 4 = 8 entries are required. Table I LUT CONTENTS FOR 4-TAP FIR WITH OBC COING, MOIFIE FROM [] x x x x 3 LUT Contents -/(w w w w 3) -/(w w w w 3) -/(w w w w 3) -/(w w w w 3) -/(w w w w 3) -/(w w w w 3) -/(w w w w 3) -/(w w w w 3) /(w w w w 3) /(w w w w 3) /(w w w w 3) /(w w w w 3) /(w w w w 3) /(w w w w 3) /(w w w w 3) /(w w w w 3) x x... x K x S B. Offset-Binary Coding (OBC) OBC can be used to reduce the LUT size by a factor of to N [7]. By rewritting the input from (), OBC is derived as follows. with... ROM K words Y x(n k) = {x(n k) [ x(n k)]} (4) Shift right extra x(n k) = x k x k () (5) Substituting () and (5) into (4), x(n k) = [ (x k x k ) (x k x k ) () ] (6) By defining k as x k x k, the output from FIR filter can be written as: y(n) = N w k k w k k [ N efining E as N can be rewritten as w k k N k () = ] N w k () (7) w k k, and E extra as N w k, (7) y(n) = E E E extra () (8) The OBC scheme is described in (4) (8). The LUT contents for a 4-tap FIR filter are given in Table I. It can be observed that the first and second half of this LUT are mirrored vertically. Therefore, its size can be halved by using x to control the sign of each entry at the cost of a slightly increased hardware complexity. The hardware circuit for implementing a K-tap filter is shown in Fig.. Figure. A based bit-serial architecture for implementing K-tap FIR filter with OBC III. PROPOSE SCHEMES For an FIR filter with LMS adaptation, which involves the automatic update of filter weights in accordance with the estimation error, conventional A performance suffers from the intensive computation required to rebuild LUTs. Work has been done in [3] [6], [8], [4], [5] to reduce the computation workload for LUT updating by only recomputing a few LUT entries. In [5], [6], [8], the authors proposed an efficient LUT updating method that uses auxiliary LUTs, where only half of the entries need to be recomputed. The techniques proposed in this paper eliminate the need for auxiliary LUTs. Our first scheme uses a similar LUT updating method to [5], [6], [8], but with no need of auxiliary LUTs, since our proposed scheme stores the sums of delayed inputs in LUTs. In our second scheme, OBC is incorporated, and a new updating method is introduced to further reduce the memory usage. Unlike [5], the algorithms proposed in this paper implement the coefficient updating and filter operations concurrently. A. First Proposed Scheme Conventional A stores the sums of weights (coefficients) in LUTs and uses the inputs as addresses. This approach works very well for nonadaptive filters with constant coefficients. However, for adaptive filters, adaptation is necessary for the weights and the input registers must be updated. By exploiting the commutative property of convolution, the same filtering operation can be obtained by storing sums of delayed input S

3 samples in LUTs, and using the binary coefficients as addresses. With the inputs and coefficients having a common bit-width, as is often the case, this change does not increase latency; however, computation and memory cost are reduced for a bit-serial A design. If the inputs and coefficients have different lengths, there is some difference in latency, which we have considered in Section IV. Representing the weights w k in two s complement form using B bits gives w k = w k w k (9) Using (9) to calculate the output yields w3 w w w w3 w w w LUT Content LUT Content x( n) x( n ) x( n ) x( n) x( n ) x( n ) x( n) x( n ) x( n ) x( n ) x( n ) x( n) x( n 3) x( n 3) x( n) x( n 3) x( n ) x( n 3) x( n ) x( n) x( n ) x( n) x( n) x( n ) x( n ) x( n ) x( n ) x( n ) x( n) x( n ) x( n) x( n ) x( n ) x( n ) x( n ) x( n ) x( n) x( n ) x( n) x( n ) x( n 3) x( n ) iscarded, x( n ) x( n ) x( n 3) x( n ) x( n) due to x( n ) x( n ) x( n ) x( n 3) x( n ) x( n ) x( n 3) x( n ) x( n ) x( n) x( n 3) x( n ) x( n ) x( n) x( n ) x( n ) x( n) x( n ) t = n t = n Figure. LUT update from t = n to t = n, modified from [6] y(n) = N x(n k)w k [ N x(n k)w k ] () Similar to (), the term in square brackets has only N possible values, so a LUT can be used. The left table shown in Fig. gives the LUT values for a 4-tap FIR filter. When the time index t = n, () becomes, N y(n)= x(n k)w k [ N x(n k)w k ] () Fig. shows graphically how the LUTs can be updated. Specifically, it can be observed from the term in square brackets that the new input sample x(n) is not used for the new entries, whose least significant address bit (LSAB) w is. Since all the combinations of inputs x(n), x(n )... x(n N) are included in the old LUT at t = n, these new entries with LSAB being can be obtained directly by copying the corresponding entries from the old LUT, as indicated by the arrows in Fig.. A closer observation discloses that each of the rest of the new entries with LSAB being can be generated by adding x(n) to the prior entry in the new LUT. Mathematically, the new entries T i (n ) can be obtained from the old entries T i (n) by () and (3) with the entry index i [, N ]. T i (n ) = T i (n), i {i i mod = } () x[ n] Figure 3. x ( n ) Weights Updating A_LUT Entry Accumulating s y[ n] - d[ n] µ scaling Top-level circuit diagram for 4-tap adaptive FIR filter a a a a 3... clk 4 bit address ata _ in LUT with 4 entries clk ( a a a a ) 3 ( a a a ) 3 a p T i (n ) = T i (n ) x(n ), i {i i mod = } (3) In this paper, LMS adaptation is chosen to update the weights w k, as shown in (4), and (5). w k (n ) = w k (n) µe(n)x(n k) (4) e(n) = d(n) w(n)x(n) (5) where d(n) is the desired output, w(n) = [w (n) w (n)... w N (n)], x(n) = [x(n) x(n )... x(n N)], and e(n) is the error between the desired and estimated output. Figure 4. a etailed A_LUT block for 4-tap adaptive FIR filter The top-level circuit diagram of this proposed scheme for an example 4-tap FIR filter is shown in Fig. 3. The method for updating the A_LUT block in our first scheme is similar to that used for updating the auxiliary LUT in [5], [6], [8], which we refer to as A scheme. For our proposed first scheme, the A_LUT block could be implemented using a typical A implementation. Fig. 4 shows more details of the implementation. The Addr Gen block has to generate the addresses in the order as shown in Fig. 4. As shown in the

4 Area [mm ] 3 Proposed First Scheme A 8 6 3 64 (a) Proposed First Scheme A w w w3 w w w3 Figure 6. LUT Contents / [ x( n) x( n ) x( n ) x( n 3) ] / [ x( n) x( n ) x( n ) x( n 3) ] / [ x( n) x( n ) x( n ) x( n 3) ] / [ x( n) x( n ) x( n ) x( n 3) ] / [ x( n) x( n ) x( n ) x( n 3) ] / [ x( n) x( n ) x( n ) x( n 3) ] / [ x( n) x( n ) x( n ) x( n 3) ] / [ x( n) x( n ) x( n ) x( n 3) ] t = n x( n ) x( n 3) T = LUT update for 4-tap adaptive FIR filter LUT Contents [ ] - / x( n ) x( n) x( n ) x( n ) / [ x( n ) x( n) x( n ) x( n ) ] - / [ x( n ) x( n) x( n ) x( n ) ] / [ x( n ) x( n) x( n ) x( n ) ] - / [ x( n ) x( n) x( n ) x( n ) ] / [ x( n ) x( n) x( n ) x( n ) ] - / [ x( n ) x( n) x( n ) x( n ) ] / [ x( n ) x( n) x( n ) x( n ) ] t = n Figure 5. results. ALUTs (b) Logic Registers Area comparison. (a) Chip area and (b) 8-tap FIR filter synthesis weights update block in Fig. 3, the multiplication from (4) can be simplified as shifting by scaling the estimation error and assuming the step size to be a power of two. In contrast to A, our scheme uses no auxiliary LUTs, but only main LUTs. The updating of LUTs and weights used as addresses can be performed concurrently, which reduces latency. In A, two types of LUTs are necessary: the auxiliary LUTs need to be updated first, and then the updates of the main LUTs are executed. To compare our scheme with A, synthesis results from.8µm standard cell library for implementing FIR filters with 8-bit inputs and weights are presented in Fig. 5 (a). It is shown that since extra auxiliary LUTs are necessary for the A scheme, significant area savings can be achieved by our proposed scheme. Similar results are obtained in Fig. 5 (b) by implementing an 8-tap FIR filter using FPGA Stratix II EPS5F67I4. In addtion to the area advantage shown in Fig. 5, our proposed algorithm also has an advantage of reduced latency. B. Second Proposed Scheme In Section II-B, OBC is shown to reduce the number of LUT entries without increasing the number of LUTs required. In this section, we propose a new scheme that combines OBC with our first proposed scheme. Because of the commutative property, as in our first proposed approach, the sums of delayed input samples are stored in LUTs coded using OBC with binary coefficients as the address, as derived in (6) (9). (7) indicates that, with w k as the address, LUTs can still be used to store F, as shown on the left of Fig. 6 for a 4-tap FIR filter. Although the size of LUTs is reduced by applying OBC, the LUT updating still suffers from high computation cost, since the oldest sample is included in every entry, e.g., x(n 3) in Fig. 6. The second entry with address at time t = n, needs to be updated by adding x(n 3) x(n )x(n). The rest of the entries need to be updated with approximately the same computation cost. w k = w k w k (6) F = N x(n k)(w k w k ) (7) F extra = N x(n k) (8) y(n) = F F F extra () (9) To reduce the computation workload, we propose a smart updating algorithm. Fig. 6 shows how the update works for a 4-tap FIR filter, with precomputed T = x(n)x(n 3). For instance, the entry with address on the left for t = n is used to update the entry with address on the right as follows. T [x(n) x(n ) x(n ) x(n 3)] = [x(n ) x(n) x(n ) x(n )] () To update F extra, we could add together all of the first entries from the sub-luts used by ROM decomposition, which requires num additions, where num is the number of sub-luts. Another method is by subtracting half of the oldest input sample from F extra, and adding half of the newest sample to F extra, which requires operations every cycle, since division by can be realized by right shifting. The latter method is chosen to update F extra for this proposed scheme. Performances for the different schemes are compared in Section IV. IV. PERFORMANCE COMPARISON In this section, formulas for certain critical measurements are derived to compare the performances of our proposed two schemes with A. Since these three implementation schemes have a similar clock rate, the memory usage and computation cost for each scheme are derived and compared. The memory usage is measured by the numbers of LUT entries. If step size is assumed and estimation error is scaled to be a powerof-two to simplify multiplication as a shift, the only critical computation used for these three schemes is addition. The computation cost is estimated as the number of necessary additions per filter cycle, including LUT updating and data filtering. The memory usages for A, and our two proposed schemes, M A, M proposed, and M proposed, are estimated: M A = ( m )( K m ) ()

5 Savings over A by First Algorithm Savings over A by Second Algorithm.5.4.3. W/B=.5 W/B=.75 W/B= W/B=.5 W/B=.5. 5 5 5 3 35 K/m (a).4.3.. 5 5 5 3 35 K/m (b) Figure 7. (a) Savings by proposed first scheme and (b) Savings by proposed second scheme. V. CONCLUSION In this work, two different A based schemes are presented for FIR adaptive filter implementation. In contrast to conventional A based schemes, our schemes store the sums of delayed input samples in LUTs and use the binary coefficients as addresses. It is shown that since no auxiliary LUTs are required, our first proposed scheme needs exactly half the memory usage required by previous work, while the second proposed scheme only needs less than 3% of that required by previous work. In addition, our two proposed schemes both have low computation cost, with the second scheme requiring slightly more addition operations than the first one. Unlike previous work, in our schemes, the updating of LUTs and coefficients can be executed concurrently, which enables low latency. M proposed = ( m )( K m ) () M proposed = ( m )( K m ) (3) where K is the filter length and m is the number of bits required for the LUT address when ROM decomposition is used. K and m are assumed to be powers of two. Since our proposed first scheme does not use any auxiliary LUTs, its memory usage is exactly half of A s. In our second proposed scheme, the memory usage is further reduced by using OBC for the LUTs. With OBC coded LUTs, our second proposed scheme requires the least memory usage, which is less than 3% of that by A, since M proposed M A = 3 ( 4 ) = 6.6%. If the coefficients and inputs have different lengths, then the savings ratio will vary somewhat. If W and B are the bit-width of inputs and coefficients, respectively, then the numbers of necessary additions every filter cycle for these three schemes are estimated: A A = m ( K m ) m ( K m ) (K m ) W (4) A proposed = m ( K m ) K (K m ) B (5) A proposed = ( m )( K m ) K (K m ) B (6) The three addends in (4) (6) count the additions for updating the LUTs, coefficients, and summing up all the entries read from LUTs, respectively. To examine the effects of changing the ratio of the input width W and the coefficient width B, we plot the savings of additions for different values in Fig. 7. It is shown that our two proposed schemes require less addition costs than A, while the proposed second scheme needs slightly more additions than our first proposed approach, which is due to the precomputation of T, and updating of F extra every cycle. REFERENCES [] S.K. Mitra, igital Signal Processing: A Computer Based Approach, nd ed, New York: McGraw Hill Companies,. [] K.K. Parhi, VLSI igital Signal Processing Systems: esign and Implementation, John Wiley, 999. [3] C.H. Wei and J.J. Lou, Multimemory block structure for implementing a digital adaptive filter using distributed arithmetic, IEE Proceedings, vol. 33, Pt. G, no., pp. 9 6, February, 986. [4] C.F.N. Cowan and J. Mavor, New digital adaptive-filter implementation using distributed-arithmetic techniques, IEE Proceedings, vol. 8, Pt. F, no. 4, pp. 5 3, February, 98. [5].J. Allred, H. Yoo, V. Krishnan, W. Huang, and.v. Anderson, LMS adaptive filters using distributed arithmetic for high throughput, IEEE Transactions on Circuits and Systems, vol. 5, no. 7, pp. 37 337, July, 5. [6].J. Allred, H. Yoo, V. Krishnan, W. Huang, and.v. Anderson, A novel high performance distributed arithmetic adaptive filter implementation on an FPGA, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. V 6 4, 4. [7] S.A. White, Applications of distributed arithmetic to digital signal processing: A tutorial review, IEEE ASSP Magazine, vol. 6, pp. 4 9, July, 989. [8].J. Allred, H. Yoo, V. Krishnan, W. Huang, and.v. Anderson, An FPGA implementation for a high throughput adaptive filter using distributed arithmetic, th Annual IEEE Symposium on Field- Programmable Custom Computing Machines, pp. 34 35, 4. [9] A. Croisier,. Esteban, M. Levilion, and V. Rizo, igital filter for PCM encoded signals, U.S. Patent No. 3,777, 3, issued April, 973. [] A. Peled and B. Liu, A new hardware realization of digial filters, IEEE Transaction on Acoustics, Speech and Signal Processing, vol., pp. 456 46, ecember 974. [] K. Kammeyer, Quantization Error on the istributed Arithmetic, IEEE Transactions on Circuits and Systems, vol. 4, pp. 68 689, 98. [] F. Taylor, An Analysis of the istributed-arithmetic igital Filters, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 35, pp. 65 7, 986. [3] K. Kammeyer, igital Filter Realization in istributed Arithmetic, in Proc. European Conf. on Circuit Theory and esign, Genoa, Italy, 976. [4] C. F. N. Cowan, S. G. Smith, and J. H. Elliott, A digital adaptive filter using a memory-accumulator architecture: Theory and realization, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 3, pp. 54 549, June 983. [5] W. Huang, and.v. Anderson, Modified Sliding-Block istributed Arithmetic with Offset Binary Coding for Adaptive Filters, Journal of Signal Processing Systems, April 3,.