A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN

Xiaoying Li (1), Fuming Sun (2), Enhua Wu (1, 3)
(1) University of Macau, Macao, China
(2) University of Science and Technology Beijing, Beijing, China
(3) Institute of Software, Chinese Academy of Sciences, Beijing, China

ABSTRACT
In this paper, a hierarchical pipelined FIR filter structure is proposed and implemented on FPGA hardware. It is a flexible multi-rate structure. By running the computation clock several times faster than the sampling rate, the multiplications and additions can be carried out by shared components, which reduces the logic area. Only a few extra delay units are needed to separate the basic FIR filter structure into two levels: in-group and between-group. As the number of filter taps increases, the structure can be extended easily without increasing the delay of the critical path. A Simulink-to-FPGA flow with mixed HDL and Simulink blockset design entry is applied to this multi-rate FIR filter structure.

KEY WORDS
FIR filters, pipeline, FPGA, Simulink

1. INTRODUCTION
Finite Impulse Response (FIR) filters are one of the primary types of digital filters. Because they are stable and easy to implement, they are used in many Digital Signal Processing (DSP) applications such as audio signal processing, video convolution functions and telecommunications. A standard FIR filter design contains a large number of multiplications, which require a large silicon area, increase power consumption, and set an upper limit on the sampling rate. Earlier work has addressed replacing multiplications by decomposing them into simple operations such as addition, subtraction and shift and by sharing common sub-expressions [1], minimizing the delay and the number of adders [2], and the tradeoff between truncated multipliers and computational accuracy [3]. Application-specific FIR filters are frequently implemented on FPGAs [4].
In this paper, a new FIR filter structure is proposed and implemented on FPGA hardware. It is a flexible two-level architecture with two clock rates. By adopting a clock several times faster than the sampling rate, the multiply-and-add component can be heavily shared, which greatly reduces the number of multipliers and sustains high throughput without lengthening the critical path. Given N taps and a ratio M between the two clock rates, N/M - 1 additional delay units are inserted into the delay line, separating the computation into N/M groups.

The remainder of the paper is organized as follows. A review of general FIR filter structures is given in Section 2. In Section 3, the new FIR structure and its multi-rate timing are explained in detail. The FPGA design in Simulink is described in Section 4, together with experimental results. Finally, some discussion is presented in Section 5.

2. OVERVIEW OF FIR STRUCTURES
An FIR filter is essentially a discrete convolution of the input signal with a set of coefficients. In the time domain, the input-output relation of an FIR filter with N taps (that is, of order N-1) is defined by Eq. 1:

$$y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k] \qquad (1)$$

where x is the input data stream, h[k] is the k-th tap coefficient, and y is the output data stream. In general, such an FIR filter requires N multipliers and N-1 two-input adders.

The general FIR structures are the direct form and the transposed form. The direct form realization follows readily from the convolution sum (Eq. 1), as shown in Fig. 1(a) for five taps. In the direct form, the delay units sit between the multipliers. At any instant, the current filter input x[n] and the previous N-1 input samples are each applied to one input of a multiplier. The filter output y[n] is the sum of all multiplier products, accumulated by N-1 adders. In the transposed form shown in Fig. 1(b), however, the delay units are placed between the adders so that all multipliers are fed simultaneously.

Generally, the direct form is potentially better for high-frequency operation, but suffers from higher latency than the transposed form. In addition, in the direct form the input of every multiplier changes at each clock cycle as new samples move along the chain of taps. This causes relatively high switching activity within the multipliers and hence higher overall power consumption. In the transposed form, the data input remains unchanged for a substantial number of multiplications, corresponding to the order of the filter, so switching activity and power consumption are reduced. However, the input signal has to be multiplied and added to the accumulated value in a single pipeline stage, which limits the clock frequency. Moreover, the transposed form requires the input signal to drive every multiplier, and this high fan-out puts additional pressure on the implementation.

Figure 1. Two Basic Forms of FIR Filter (N=5): (a) Direct, (b) Transposed

The symmetry property of a linear-phase FIR filter can be exploited to reduce the number of multipliers to almost half in direct form implementations. Both odd and even order symmetric FIR structures are illustrated in Fig. 2. Other forms such as cascade, lattice and polyphase structures can also be used for more complex FIR filter structures.

Figure 2. Symmetric Coefficients FIR Filter Structure

3. HIERARCHICAL FIR FILTER DESIGN
Based on, but different from, the direct form, a two-level pipeline FIR filter structure is designed by inserting extra intermediate delay units among the N-1 delay units. These additional delay units separate the delay line into two levels: in-group and between-group. We adopt two different clock rates: the sampling rate for the delay units and a faster clock for the multiply and add computations. Suppose the computation clock runs at M times the sampling rate. Then an N-tap FIR filter in the direct form can be separated into N/M groups by adding N/M - 1 delay units, one after every M taps. In each group, the multiplications and additions are controlled by the faster clock. In other words, the M MAC (Multiply-Accumulate) operations of one group are finished in M cycles of the faster clock, which is one cycle of the sampling clock. Each group therefore needs only one shared multiplier and one accumulator, and the number of groups, N/M, determines the total number of MAC components. By comparison, the standard direct and transposed forms of an N-tap FIR filter need N multipliers and N-1 adders.

Inserting the additional delay units turns the structure into a hierarchical pipeline in which each group is one pipeline stage. The accumulation result of a group in the last fast clock cycle of one sampling period is fed into the next group in the first fast clock cycle of the next sampling period for further accumulation. After N/M stages, the final output y[n] corresponding to the input x[n] has been calculated.

As shown in Fig. 3, the ratio M of the two clock rates is set to eight, so a 32-tap FIR filter is separated into four groups by inserting three additional delay units. There are two levels of pipeline: the first level is in-group with eight stages, and the second is between-group with four stages. Each group shares a single component for its MAC computations.

Figure 3. Two-level Pipeline FIR Filter Structure (tap=32)

The timing diagram of the two-level pipeline FIR structure is illustrated in Fig. 4. In this example, N=8 and M=4, and one additional delay unit is inserted to separate the FIR filter into two groups. The control signal En can be generated by a counter using one-hot coding. Driven by En, the multiplexer in each group selects, in successive fast clock cycles (In-group1 and In-group2), the input sample from each delay unit as the multiplier input. In Fig. 4, X_i refers to the i-th input sample, and i in the In-group1 and In-group2 rows refers to the multiplexing of the i-th sample by the fast clock. Only one shared MAC component is used to perform the M computations in each group. Because of the inserted delay unit, the two groups are offset in time, as can be seen from Delay Group1 and Delay Group2 in Fig. 4. Each group can therefore be organized as one pipeline stage. After two stages, the MAC results of In-group1 and In-group2 have been summed to give the 8-tap output.

Figure 4. Timing Diagram (N=8)

Since the large number of multiplications in FIR filters consumes excessive area and power, previous work concentrates on how to simplify them. If the coefficients of an FIR filter are constant, decomposition is more efficient than employing multipliers. To minimize the number of additions and subtractions required for each coefficient multiplication, the coefficients can be restricted to powers of two, or expressed in CSD (Canonical Signed-Digit) or graph representation [5]. In our method, one contribution is the reduction of the number of MAC components. To further improve performance, the design of the MAC itself should be considered. For a high-performance ASIC structure, an optimized multiply-and-add component can be built using partial product reduction with the Booth algorithm. For a flexible FPGA design, the coefficients can be preset in embedded RAMs, and the available multipliers and accumulators can be used directly for simplicity.

4. MULTI-RATE DESIGN IN SIMULINK
With the continued growth in complexity of FPGA-based designs, more flexible, efficient and higher-level design methodologies are emerging to replace the traditional HDL-centric flows. MATLAB and Simulink form a well-known tool that allows designers to model a system at a high level and is well suited to diverse applications such as digital signal processing, automotive control, image processing and communication.
To build on the modeling and simulation capabilities of Simulink, major FPGA manufacturers have released products that integrate into Simulink as dedicated blocksets. Xilinx System Generator for DSP [6] and Altera DSP Builder [7] are popular examples, and AccelChip [8] also provides a DSP synthesis tool for FPGAs. These blocksets and tools support a full FPGA design flow from Simulink modeling through simulation to hardware [9, 10], transforming a Simulink model into synthesizable HDL code with a test bench. In this paper, we use Xilinx System Generator to implement the hierarchical FIR filter on FPGA hardware. For FIR filter design, various filters are already available in the Xilinx Reference Blockset in Simulink and can be easily customized and mapped to FPGA hardware by System Generator. For the newly proposed structure, we use mixed HDL and Simulink block modeling for the multi-rate design. System Generator provides a means to bring VHDL, Verilog, and EDIF into a design, and it provides HDL co-simulation interfaces to simulate the mixed-module system.
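Before committing a mixed design of this kind to hardware, it can help to have a software reference for the arithmetic the model is expected to reproduce. The following Python sketch is one reading of the two-level structure of Section 3, not the HDL used in this work. The function names `fir_direct` and `fir_two_level`, the zero initial state, and the restriction that N be a multiple of M are assumptions made for illustration. Each group performs M multiply-accumulate steps per input sample with a single accumulator, and hands its result to the next group one sample period later.

```python
def fir_direct(h, x):
    """Direct evaluation of Eq. 1 with zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_two_level(h, x, m):
    """Behavioural sketch of the grouped filter (hypothetical helper).

    Assumes len(h) is a multiple of m; one accumulator is modelled per group.
    """
    n_taps = len(h)
    if n_taps % m:
        raise ValueError("this sketch assumes N is a multiple of M")
    groups = n_taps // m
    # Current input, N-1 ordinary delays and (groups-1) inserted delays.
    line = [0] * (n_taps + groups - 1)
    carry = [0] * groups  # result of each group in the previous sample period
    out = []
    for sample in x:
        line = [sample] + line[:-1]          # one tick of the sampling clock
        new_carry = []
        for g in range(groups):
            # Partial sum handed over by the previous group, one period late.
            acc = carry[g - 1] if g > 0 else 0
            base = g * (m + 1)               # skip g inserted delay units
            for c in range(m):               # M fast-clock cycles, one MAC each
                acc += h[g * m + c] * line[base + c]
            new_carry.append(acc)
        carry = new_carry
        out.append(carry[-1])                # output of the last group
    return out


if __name__ == "__main__":
    h = [3, -1, 4, 1, -5, 9, 2, -6]          # arbitrary 8-tap example, M = 4
    x = list(range(1, 21))
    ref = fir_direct(h, x)
    out = fir_two_level(h, x, m=4)
    # Two groups: the grouped output lags the direct form by one sample.
    print(out[1:] == ref[:-1])
```

Under these assumptions the grouped output equals the direct-form output delayed by N/M - 1 sample periods, which is the latency of the between-group pipeline; with N=8 and M=4, as in Fig. 4, that is one sample period.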

Figure 5. Simulink-to-FPGA FIR Filter Structure

The model of the multi-rate hierarchical FIR filter is shown in Fig. 5 and corresponds to the timing diagram in Fig. 4 (tap=8). Two clocks control the delay units and the MAC component respectively. The shaded delay unit is the additionally inserted one. The MAC component is contained in the Mux-Mul-Acc black box described in HDL. The rate relation between the sampling and computing clocks is declared in the configuration M-function of the HDL module.

The experimental results are shown in Tab. 1 and Tab. 2. The target FPGA chip is a Xilinx Virtex-II xc2v2000. Tab. 1 lists the resources and performance of the FIR filter blocks provided by System Generator in Simulink, which can be parameterized and used directly. As the number of taps increases, area consumption increases almost linearly while the delay remains constant. Tab. 2 lists the FPGA logic area and speed of the proposed two-level FIR filter for comparison with Tab. 1. Because fewer MAC components are needed, the logic resources drop substantially: for the tap counts tested, the hierarchical filter uses between about 35% and 47% of the slices of the corresponding Simulink block (for example, 284 versus 656 slices at 32 taps). Since the MAC components are described directly in HDL in our method, the maximum frequency (about 121 to 123 MHz) is lower than that of the optimized Simulink blockset (238 MHz). Further improvement might be achieved if the MAC components were optimized.

Table 1. Statistics of FPGA Resource Consumption and Speed (Customized FIR Filter Block in Simulink)

Resource and Speed       No. of Taps
                         8       16      20      24      32
SLICES                   165     326     462     501     656
FLIP FLOPS               297     584     826     906     1192
LUTS                     159     352     524     580     793
Delay (ns)               4.2     4.2     4.2     4.2     4.2
Max. Frequency (MHz)     238     238     238     238     238
Other info: x is 8-bit, h is 10-bit, both are signed.

Table 2. Statistics of FPGA Resource Consumption and Speed (Hierarchical FIR Filter)

Resource and Speed       No. of Taps
                         8       16      20      24      32
SLICES                   72      154     163     203     284
FLIP FLOPS               101     205     213     264     364
LUTS                     68      144     162     196     269
Delay (ns)               8.13    8.16    8.2     8.2     8.3
Max. Frequency (MHz)     123     122.6   121.9   121.9   120.5
Other info: x is 8-bit, h is 10-bit, both are signed. M = Computing Clk / Sampling Rate = 4.

5. DISCUSSION
As FPGA technology develops, design methodology challenges the various EDA tools to keep pace. Starting from the standard development flow (Fig. 6), effort has shifted toward high-level design and synthesis. There are many conversion tools, such as C-to-FPGA, Stateflow diagram to VHDL (SF2VHD), and Matlab-to-FPGA (MATCH). The features of the Simulink-to-FPGA flow can be summarized as follows.

Friendly graphical interface. Although schematic entry is also graphical, Simulink makes it easier to organize input data and more convenient to observe the output in many ways.

Easy number format conversion. Conversion from double precision to fixed point is parameterized in the functional blocks, but the consistency of data types must be checked along the data flow.

Flexible modeling and simulation. The design can be organized into hierarchical modules and easily combined with other entry methods when making design decisions, and it is convenient to debug and simulate.

Fast time-to-market for DSP development. With the assistance of dedicated DSP blocks for FPGAs, the Simulink-to-FPGA flow can greatly shorten the development cycle from algorithm to hardware. The arithmetic blocksets might be further strengthened.

In this paper, a new FIR filter structure is presented and implemented by different methods. The basic direct form of the FIR filter is rebuilt as a hierarchical structure by inserting only a few additional delay units. This structure is flexible enough to meet different system requirements. Because the MAC components are shared, area consumption is considerably reduced. With growing interest in high-level hardware design, Simulink-to-FPGA modeling and simulation benefits from a good graphical interface and flexible design choices. For many DSP applications such as image processing and communication, more functional blocks will be encapsulated as FPGA-mapped blocks in Simulink, and their performance will continue to improve.

ACKNOWLEDGMENT
The research is supported by the Research Grant of University of Macau.

REFERENCES
[1] Y. C. Lim, J. B. Evans, and B. Liu, Decomposition of binary integers into signed power-of-two terms, IEEE Trans. Circuits Syst., vol. 38, 1991, 667-672.
[2] Hyeong-Ju Kang and In-Cheol Park, FIR filter synthesis algorithms for minimizing the delay and the number of adders, IEEE Trans. Circuits Syst., vol. 42, 2001, 770-777.
[3] E. G. Walters III, Design tradeoffs using truncated multipliers in FIR filter implementations, Master's Thesis, Lehigh University, May 2002.
[4] L. Mintzer, FIR Filters with FPGA, Journal of VLSI Signal Processing, 6, 1993, 119-127.
[5] H. Samueli, An improved search algorithm for the design of multiplierless FIR filters with powers-of-two coefficients, IEEE Trans. Circuits Syst., vol. 36, no. 7, 1989, 1044-1047.
[6] Xilinx, Xilinx System Generator, Version 6.2, Xilinx Inc., USA.
[7] Altera, Altera DSP Builder, Version 5.1, Altera Inc., USA.
[8] AccelChip, Integrating MATLAB Algorithms into FPGA Designs, Xcell Journal, 2005, 73-75.
[9] M. A. Shanblatt, B. Foulds, A Simulink-to-FPGA Implementation Tool for Enhanced Design Flow, Proceedings of the 2005 IEEE International Conference on Microelectronic Systems Education (MSE'05), 2005, 89-90.
[10] M. Haldar, A. Nayak, A. Choudhary, and P. Banerjee, A System for Synthesizing Optimized FPGA Hardware from MATLAB, Proceedings of the 2001 IEEE/ACM International Conference on Computer-Aided Design, 2001, 314-319.