A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN

Xiaoying Li 1, Fuming Sun 2, Enhua Wu 1,3
1 University of Macau, Macao, China
2 University of Science and Technology Beijing, Beijing, China
3 Institute of Software, Chinese Academy of Sciences, Beijing, China

ABSTRACT

In this paper, a hierarchical pipelined FIR filter structure is proposed and implemented on FPGA hardware. It is a flexible multi-rate structure. By running the computation clock several times faster than the sampling rate, the multiplications and additions can be performed on shared components, which reduces the logic area. Only a few additional delay units are needed to separate the basic FIR filter structure into two levels: in-group and between-group. As the number of filter taps increases, the structure can be extended easily without increasing the critical-path delay. A Simulink-to-FPGA flow is applied to the multi-rate FIR filter structure, with mixed HDL and Simulink blockset design entry.

KEY WORDS

FIR filters, pipeline, FPGA, Simulink

1. INTRODUCTION

Finite Impulse Response (FIR) filters are one of the primary types of digital filters. Owing to their stability and ease of implementation, they are used in various Digital Signal Processing (DSP) applications such as audio signal processing, video convolution functions and telecommunications. A standard FIR filter design contains a great number of multiplications, which require large silicon area, increase the power consumption, and set the upper limit of the sampling rate. Earlier work has addressed replacing multiplications by decomposing them into simple operations such as addition, subtraction and shift and by sharing common sub-expressions [1], minimizing the delay and the number of adders [2], and the tradeoffs between truncated multipliers and the accuracy of computation [3]. Various application-specific FIR filters are frequently implemented on FPGAs [4].
In this paper, a new FIR filter structure is proposed and implemented on FPGA hardware. It is a flexible two-level architecture with two clock rates. By adopting a computation clock several times faster than the sampling rate, the multiply-and-add component can be heavily shared, which greatly reduces the number of multipliers and achieves high throughput without lengthening the critical path. Given the number of taps N and the ratio M of the two clock rates, N/M - 1 additional delay units are inserted into the delay line, separating the computation into N/M groups.

The remainder of the paper is organized as follows. A review of general FIR filter structures is given in Section 2. In Section 3, the new FIR structure and its multi-rate timing are explained in detail. The FPGA design in Simulink is described in Section 4 together with experimental results. Finally, some discussion is presented in Section 5.

2. OVERVIEW OF FIR STRUCTURES

An FIR filter is essentially a discrete convolution of the input signal with a set of coefficients. In the time domain, the input-output relation of an FIR filter with N taps (that is, of order N-1) is defined by Eq. 1:

$$y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k] \qquad (1)$$

where x is the input data stream, h[k] is the k-th tap coefficient, and y is the output data stream. In general, such an FIR filter requires N multipliers and N-1 two-input adders.

The general FIR structures include the direct form and the transposed form. The direct-form realization of an FIR filter follows readily from the convolution sum (Eq. 1), as shown in Fig. 1(a) for N=5. In the direct form, delay units are placed between the multipliers. At any time, the current filter input x(n) and the previous N-1 input samples are each applied to one multiplier input. The filter output y(n) is the sum of all multiplier products, accumulated by N-1 adders. In the transposed form shown in Fig. 1(b), however, delay units are placed between the adders so that all multipliers are fed simultaneously.

Generally, the direct form is potentially better for high-frequency operation, but suffers from higher latency than the transposed form. In addition, the input of each multiplier changes at every clock cycle as each new data sample propagates through the chain of taps. This causes relatively high switching activity within the multipliers and therefore higher overall power consumption. In the transposed form, since the data input remains unchanged for a substantial number of multiplications, corresponding to the order of the filter, switching activity and power consumption are reduced. However, the input signal has to be multiplied and added to the accumulated value within a single pipeline stage, which limits the clock frequency. Moreover, the transposed form has the disadvantage of a high fan-out requirement on the input signal, which puts additional pressure on the implementation.

Figure 1. Two Basic Forms of FIR Filter (N=5): (a) Direct, (b) Transposed

The symmetry property of a linear-phase FIR filter can be exploited to reduce the number of multipliers to almost half in direct-form implementations. Both odd- and even-order symmetric FIR structures are illustrated in Fig. 2. Other forms such as cascade, lattice and poly-phase structures can also be used for more complex FIR filter structures.

Figure 2. Symmetric Coefficients FIR Filter Structure

3. HIERARCHICAL FIR FILTER DESIGN

Based on but different from the direct form, a two-level pipelined FIR filter structure is designed by inserting extra intermediate delay units among the N-1 delay units. Those additional delay units separate the delay line of the filter into two levels: in-group and between-group. We adopt two different clock rates: the sampling rate for the delay units and a faster clock for the multiply and add computations. Suppose the computation clock runs at M times the sampling rate. Then an N-tap FIR filter in the direct form can be separated into N/M groups by adding N/M - 1 delay units, one after every M taps. Within each group, the multiplications and additions are controlled by the faster clock. In other words, the M MAC (Multiply-Accumulate) operations of one group are finished in M cycles of the faster clock, that is, in one cycle of the sampling clock. Each group needs only one shared multiplier and accumulator, so the number of groups, N/M, determines the total number of MAC components. By comparison, the standard direct and transposed forms of an N-tap FIR filter need N multipliers and N-1 adders.

By inserting the additional delay units, the structure becomes a hierarchical pipeline in which each group is one pipeline stage. The accumulation result of the current group in the last fast-clock cycle of one sampling period is fed into the next group in the first fast-clock cycle of the next sampling period for further accumulation. After N/M stages, the final output y(n) corresponding to the input x(n) is available.

As shown in Fig. 3, the ratio of the two clock rates, M, is set to eight, so a 32-tap FIR filter is separated into four groups by inserting three additional delay units. There are two levels of pipelining. The first level is in-group with eight stages, and the second is between-group with four stages. Each group shares a single component for its MAC computations.

Figure 3. Two-level Pipeline FIR Filter Structure (tap=32)

The timing diagram of the two-level pipelined FIR structure is illustrated in Fig. 4. In this example, N=8 and M=4. One additional delay unit is inserted to separate the FIR filter into two groups. The control signal En can be generated by a counter using one-hot coding. Under the control of En, the multiplexer in each group selects the input sample from each delay unit in successive fast-clock cycles (In-group1 and In-group2) as the multiplier input. In Fig. 4, X i refers to the i-th input sample, and i in the In-group1 and In-group2 rows refers to the multiplexing of the i-th sample by the fast clock. Only one shared MAC component is used to perform the M computations in each group. Because of the inserted delay unit, the delay relation between the two groups can be seen from Delay Group1 and Delay Group2 in Fig. 4. Each group can therefore be organized as one pipeline stage. After two stages, the 8-tap MAC results from In-group1 and In-group2 are summed together as the output.

Figure 4. Timing Diagram (N=8)

Since the large number of multiplications in FIR filters consumes excessive area and power, previous works concentrate on how to simplify them. If the coefficients of an FIR filter are constant, decomposition is more efficient than employing multipliers. To minimize the number of additions and subtractions required in each coefficient multiplication, the coefficients can be restricted to powers of two, expressed in CSD (Canonical Signed-Digit) form, or described by a graph representation [5]. In our method, one contribution is the reduction of the number of MAC components. To further improve performance, the design of the MAC itself should be considered. For a high-performance ASIC structure, an optimized multiply-and-add component can be built using partial-product reduction with the Booth algorithm. For a flexible FPGA design, the coefficients can be preset in embedded RAMs, and multipliers and accumulators can be used directly for simplicity.

4. MULTI-RATE DESIGN IN SIMULINK

With the continued growth in the complexity of FPGA-based designs, more flexible, efficient and higher-level design methodologies are emerging to replace the traditional HDL-centric flows. MATLAB/Simulink is a well-known tool that allows designers to model a system at a high level and is suited to diverse applications such as digital signal processing, automotive control, image processing and communication.
To take advantage of the modeling and simulation functionality of Simulink, major FPGA manufacturers have released new products that are integrated into Simulink as dedicated blocksets. Xilinx System Generator for DSP [6] and Altera DSP Builder [7] are the popular ones. AccelChip [8] also provides a DSP synthesis tool for FPGAs. These blocksets and tools can implement a full FPGA design flow from Simulink modeling through simulation to hardware [9, 10]. They can transform a Simulink model into synthesizable HDL code with a test bench.

In this paper, we use the Xilinx System Generator tool to implement the hierarchical FIR filter on FPGA hardware. For FIR filter design, various filters are already available in the Xilinx Reference Blockset in Simulink, and they can be easily customized and mapped to FPGA hardware by System Generator. For the newly proposed structure, we explore mixed HDL and Simulink block modeling for this multi-rate design. System Generator provides a means to bring VHDL, Verilog, and EDIF into designs. It also provides HDL co-simulation interfaces to simulate the mixed-module system.
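Before the structure of Section 3 is entered as a mixed HDL and Simulink model, its arithmetic can be checked against Eq. 1 in software. The following Python sketch is a behavioral model under our own assumptions, not the HDL used in this work: the function names `direct_fir` and `hierarchical_fir` and the coefficient values are hypothetical, N is assumed to be a multiple of M, the delay line starts at zero, and the placement of the inserted delay units (one after every M taps) is our reading of the description. Under these assumptions the output equals the direct-form output delayed by N/M - 1 sampling periods.

```python
def direct_fir(x, h):
    """Reference model of Eq. 1: y[n] = sum_k h[k] * x[n-k], x[n] = 0 for n < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def hierarchical_fir(x, h, m):
    """Two-level pipelined FIR: len(h)/m groups, one shared MAC per group."""
    n_taps = len(h)
    if n_taps % m != 0:
        raise ValueError("this sketch assumes the tap count is a multiple of m")
    groups = n_taps // m
    # Delay line: m positions per group plus one inserted delay between groups.
    line = [0] * (n_taps + groups - 1)
    # Pipeline registers: each group's accumulated result from the previous
    # sampling period.
    stage = [0] * groups
    out = []
    for sample in x:                       # one iteration = one sampling period
        line = [sample] + line[:-1]        # shift the delay line by one sample
        new_stage = []
        for g in range(groups):
            # Start from the previous group's result of the last sampling period.
            acc = stage[g - 1] if g > 0 else 0
            base = g * (m + 1)             # skip the g inserted delay units
            for j in range(m):             # m fast-clock cycles on the shared MAC
                acc += h[g * m + j] * line[base + j]
            new_stage.append(acc)
        stage = new_stage
        out.append(stage[-1])              # result of the last pipeline stage
    return out


# Example matching the timing diagram parameters (N=8, M=4, two groups).
h = [3, -1, 4, 1, -5, 9, 2, -6]            # hypothetical signed coefficients
x = [1, 0, 0, 0, 2, -3, 5, 7, 0, 0, 0, 0]  # hypothetical signed input samples
latency = len(h) // 4 - 1                   # one sampling period per extra stage
y_ref = direct_fir(x, h)
y_hier = hierarchical_fir(x, h, m=4)
assert y_hier == [0] * latency + y_ref[:len(x) - latency]
```

Integer data is used so that the comparison is exact. The inner loop over `j` stands for the M fast-clock cycles in which one group reuses its single multiplier and accumulator, so the model needs N/M MAC units where the direct form needs N multipliers.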

Figure 5. Simulink-to-FPGA FIR Filter Structure

The model of the multi-rate hierarchical FIR filter is shown in Fig. 5 and corresponds to the timing diagram in Fig. 4 (tap=8). Two clocks control the delay units and the MAC component respectively. The shaded delay unit is the additionally inserted one. The MAC component is inside the Mux-Mul-Acc black box, which is described in HDL. The rate relation between the sampling and computing clocks is declared in the configuration M-function of the HDL module.

The experimental results are shown in Tab. 1 and Tab. 2. The target FPGA chip is a Xilinx Virtex-II xc2v2000. Tab. 1 lists the resources and performance of the FIR filter blocks provided by System Generator in Simulink, which can be parameterized and used directly. As the number of taps increases, area consumption increases almost linearly while the delay remains constant. Tab. 2 lists the FPGA logic area and speed of the proposed two-level FIR filter for comparison with Tab. 1 (the Simulink FIR filter blocks). Owing to the reduction of MAC components, the hardware logic resources decrease substantially, for example from 656 to 284 slices for 32 taps. Since the MAC components are described directly in HDL in our method, the maximum frequency (about 120 to 123 MHz) is lower than that of the optimized Simulink blockset (238 MHz). Further improvement might be achieved if the MAC components are optimized.

Table 1. Statistics of FPGA Resource Consumption and Speed (Customized FIR Filter Block in Simulink)

Resource and Speed      No. of Taps (#)
                        8       16      20      24      32
SLICES                  165     326     462     501     656
FLIP FLOPS              297     584     826     906     1192
LUTS                    159     352     524     580     793
Delay (ns)              4.2     4.2     4.2     4.2     4.2
Max. Frequency (MHz)    238     238     238     238     238
Other Info.: x is 8-bit, h is 10-bit, both are signed.

Table 2. Statistics of FPGA Resource Consumption and Speed (Hierarchical FIR Filter)

Resource and Speed      No. of Taps (#)
                        8       16      20      24      32
SLICES                  72      154     163     203     284
FLIP FLOPS              101     205     213     264     364
LUTS                    68      144     162     196     269
Delay (ns)              8.13    8.16    8.2     8.2     8.3
Max. Frequency (MHz)    123     122.6   121.9   121.9   120.5
Other Info.: x is 8-bit, h is 10-bit, both are signed. M = Computing Clk / Sampling Rate = 4.

5. DISCUSSION

As FPGA technology develops, design methodology challenges the various EDA tools to keep up. Starting from the standard development flow (Fig. 6), the initial design effort has shifted to high-level design and synthesis. There are many conversion tools such as C-to-FPGA, Stateflow diagram to VHDL (SF2VHD), and Matlab-to-FPGA (MATCH). The features of the Simulink-to-FPGA flow can be discussed as follows.

Friendly graphics interface. Although schematic entry also provides a GUI, Simulink makes it easier to organize input data and more convenient to observe the output in many ways.

Easy number format conversion. Double to fixed-point number conversion is parameterized in the functional blocks. However, the consistency of data types along the data flow must be watched.

Flexible modeling and simulation. The design can be organized into hierarchical modules, combined easily with other entry methods when making design decisions, and debugged and simulated conveniently.

Fast time-to-market for DSP development. With the assistance of dedicated DSP blocks for FPGAs, the Simulink-to-FPGA flow can greatly shorten the development cycle from algorithm to hardware. The arithmetic blocksets might be further reinforced.

In this paper, a new FIR filter structure is presented and implemented by different methods. The basic direct form of the FIR filter is rebuilt as a hierarchical structure by inserting only a few additional delay units. This structure is flexible enough to meet different system requirements. Owing to the sharing of MAC components, area consumption is reduced considerably. With regard to high-level hardware design, Simulink-to-FPGA modeling and simulation benefits from a good graphical interface and flexible design choices. For many DSP applications such as image processing and communication, more functional blocks will be encapsulated into FPGA-mapped blocks in Simulink, and performance will continue to improve in the future.

ACKNOWLEDGMENT

The research is supported by the Research Grant of University of Macau.

REFERENCES

[1] Y. C. Lim, J. B. Evans, and B. Liu, Decomposition of binary integers into signed power-of-two terms, IEEE Trans. Circuits Syst., vol. 38, 1991, 667-672.
[2] Hyeong-Ju Kang and In-Cheol Park, FIR filter synthesis algorithms for minimizing the delay and the number of adders, IEEE Trans. Circuits Syst., vol. 42, 2001, 770-777.
[3] E. G. Walters III, Design tradeoffs using truncated multipliers in FIR filter implementations, Master's Thesis, Lehigh University, May 2002.
[4] L. Mintzer, FIR Filters with FPGA, Journal of VLSI Signal Processing, 6, 1993, 119-127.
[5] H. Samueli, An improved search algorithm for the design of multiplierless FIR filters with powers-of-two coefficients, IEEE Trans. Circuits Syst., vol. 36, no. 7, 1989, 1044-1047.
[6] Xilinx, Xilinx System Generator, Version 6.2, Xilinx Inc., USA.
[7] Altera, Altera DSP Builder, Version 5.1, Altera Inc., USA.
[8] AccelChip, Integrating MATLAB Algorithms into FPGA Designs, Xcell Journal, 2005, 73-75.
[9] M. A. Shanblatt and B. Foulds, A Simulink-to-FPGA Implementation Tool for Enhanced Design Flow, Proceedings of the 2005 IEEE International Conference on Microelectronic Systems Education (MSE'05), 2005, 89-90.
[10] M. Haldar, A. Nayak, A. Choudhary, and P. Banerjee, A System for Synthesizing Optimized FPGA Hardware from MATLAB, Proceedings of the 2001 IEEE/ACM International Conference on Computer-Aided Design, 2001, 314-319.