An FFT/IFFT design versus Altera and Xilinx cores


2008 International Conference on Reconfigurable Computing and FPGAs

C. Gonzalez-Concejero, V. Rodellar, A. Alvarez-Marquina, E. Martinez de Icaya and P. Gomez-Vilda
Departamento de Arquitectura y Tecnología de Sistemas Informáticos. Grupo de investigación en informática aplicada al procesamiento de señal e imagen. Facultad de Informática, Universidad Politécnica de Madrid. Campus de Montegancedo s/n, Boadilla del Monte (Madrid), Spain
cconcejero@gmail.com; victoria@pino.datsi.fi.upm.es

Abstract

In this paper, a portable hardware design implementing a Fast Fourier Transform, oriented to its reusability as a core, is presented. The module has been developed using the radix-2 Decimation-In-Time algorithm. The design is described structurally in VHDL, which is used to model, simulate and implement it. The module is portable among different EDA tools and technology independent. It has been synthesized with Quartus II from Altera and ISE from Xilinx. Detailed performance results are presented, together with a comparison against the results obtained for the Altera and Xilinx FFT IP cores. They show that the proposed design makes better use of physical resources but has lower throughput than the commercial cores. In addition, the IP core from Xilinx shows better throughput than Altera's, but at a higher implementation cost.

1. Introduction

IP cores are part of the growing Electronic Design Automation (EDA) industry trend towards repeated use of previously designed components. IP cores offered by vendors are rigorously tested and optimized for the highest performance and lowest cost in programmable logic devices. These parameterized IP blocks can be implemented easily, reducing design and test time and also time-to-market, because they avoid the process of designing standardized functions from scratch. Ideally these blocks should be entirely portable among different EDA tools and fully parameterizable.
However, most vendors offer only their own non-portable IP cores, with many features and functionalities that are sometimes useless for a specific application.

The Fast Fourier Transform (FFT) and its inverse (IFFT) are fundamental blocks used in many applications in science and engineering, such as communications, spectrum analysis and implementations of digital signal processing. The main FPGA companies, such as Altera and Xilinx, offer FFT/IFFT cores that can be easily embedded in more complex designs with their design tools and that are supported and optimized for a wide range of their device families. The FFT/IFFT v5.0 from Xilinx allows transform sizes from 8 to [...] samples, data precision from 8 to 24 bits, floating point and unscaled and scaled fixed point arithmetic, four different architectures to choose from, block or distributed RAM, run-time programmability, etc. [2]. The FFT/IFFT v8.0 from Altera allows transform sizes from 64 to [...] samples depending on the type of architecture chosen, data precision from 8 to 32 bits, floating point and fixed point arithmetic, embedded memory, multiple I/O data flow modes, etc. [3].

In this paper, we present a radix-2 FFT/IFFT design that supports any number of points, fixed point arithmetic, a pipeline structure and a parameterized data format. The synthesis results of the proposed model are compared with those of the FFT/IFFT cores from the vendors mentioned above, and the advantages and disadvantages of each realization are discussed.

The next section describes the principles of the FFT structure and its mathematical formulation. The architectural design is presented in Section 3. Section 4 shows implementation and design results. Finally, conclusions are presented in Section 5.

2. The FFT algorithm

Audio and communications signal processing are well-developed fields, used massively nowadays in many applications and products.
Since digital communications is a very active field, the arithmetic complexity of the Discrete Fourier Transform (DFT) algorithm becomes a significant factor in overall computational cost. Cooley and Tukey [1] developed the well-known radix-2 Fast Fourier Transform (FFT) algorithm to reduce the computational load of the DFT. It lowers the arithmetic complexity from O(N^2) to O(N log N), and the regularity of the algorithm makes it suitable for VLSI implementation.

Among the different FFT approaches ([4], [5] and [6]), the fixed-radix and the split-radix methods are the two most widely used. A split-radix FFT is theoretically more efficient than a fixed-radix algorithm [7], since it has the lowest computational complexity among traditional FFT algorithms. However, its supporting structure makes it less suitable for implementation on digital signal processors. Unlike the irregular butterfly structure of the split-radix FFT, the fixed-radix FFT is simple to analyze and implement in hardware due to its structural regularity. Therefore, the fixed-radix FFT is by far the more widely used, although it involves more computations from the algorithmic point of view.

The N-point DFT of a sequence x(k) is defined as [8]:

\[ X(n) = \sum_{k=0}^{N-1} x(k)\, W_N^{nk}, \qquad n = 0, 1, \ldots, N-1 \tag{1} \]

where

\[ W_N = e^{-j 2\pi / N} = \cos\frac{2\pi}{N} - j \sin\frac{2\pi}{N} \tag{2} \]

is referred to as the twiddle factor, N is the transform size and j = \sqrt{-1}. The twiddle factor applied in each butterfly depends, in turn, on the stage and on the number of samples. Similarly, the Inverse Discrete Fourier Transform (IDFT) is expressed as:

\[ x(k) = \frac{1}{N} \sum_{n=0}^{N-1} X(n)\, W_N^{-nk} \tag{3} \]

The algorithm used in the present processor implementation is the Decimation-In-Time (DIT) version of the Cooley-Tukey FFT algorithm. The DIT algorithm first rearranges the input elements in bit-reversed order and then builds up the output transform. Figure 1 shows the form of this scrambling for an 8-point FFT; on the left, the input data samples are arranged in bit-reversed order. As can be seen, the N-point DIT-FFT algorithm consists of log2 N stages, each stage consisting of N/2 butterfly operations [9].
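The structure just described (bit-reversed input, then log2 N stages of N/2 butterflies computed in place) can be illustrated with a short software sketch. This is only a floating-point Python model written for this text, not the VHDL design of the paper; the function names bit_reverse, dft and fft_dit_radix2 are hypothetical, and the direct DFT of equation (1) is included only as a reference to check the result.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (e.g. 011 -> 110)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def dft(x):
    """Direct O(N^2) evaluation of equation (1), used only as a reference."""
    n_points = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for k in range(n_points))
        for n in range(n_points)
    ]


def fft_dit_radix2(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n_points = len(x)
    stages = n_points.bit_length() - 1
    if n_points < 1 or (1 << stages) != n_points:
        raise ValueError("transform size must be a power of two")

    # Input scrambling: sample i is stored at its bit-reversed address.
    data = [x[bit_reverse(i, stages)] for i in range(n_points)]

    # log2(N) stages, each made of N/2 butterflies computed in place.
    for stage in range(1, stages + 1):
        span = 1 << stage      # number of points covered by one group
        half = span // 2       # distance between the two butterfly inputs
        for start in range(0, n_points, span):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / span)  # twiddle factor
                a = data[start + k]
                b = data[start + k + half] * w
                data[start + k] = a + b
                data[start + k + half] = a - b
    return data
```

For an 8-point transform, `[bit_reverse(i, 3) for i in range(8)]` gives the input ordering 0, 4, 2, 6, 1, 5, 3, 7, and `fft_dit_radix2(x)` agrees with `dft(x)` up to rounding error.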
In the signal flow graph of Figure 1, the input data are multiplied by the twiddle factors, the solid dots represent addition/subtraction operations, and the outputs are arranged in their natural order.

Figure 1. Signal flow graph for 8-point DIT-FFT with input scrambling.

The DIT-FFT radix-2 butterfly is shown in Figure 2 [9]. It takes a pair of complex input data values A and B and produces a pair of complex outputs A' and B':

\[ A = x + jX \tag{4} \]
\[ B = y + jY \tag{5} \]

where x, y and X, Y are respectively the real and imaginary parts of the input data, and:

\[ A' = A + B\, W_N^{k} \tag{6} \]
\[ B' = A - B\, W_N^{k} \tag{7} \]

Figure 2. Radix-2 butterfly structure

Taking into consideration (2), (4) and (5), equations (6) and (7) may be written as:

\[ A' = \left[ x + y \cos\frac{2\pi k}{N} + Y \sin\frac{2\pi k}{N} \right] + j \left[ X + Y \cos\frac{2\pi k}{N} - y \sin\frac{2\pi k}{N} \right] \tag{8} \]
\[ B' = \left[ x - y \cos\frac{2\pi k}{N} - Y \sin\frac{2\pi k}{N} \right] + j \left[ X - Y \cos\frac{2\pi k}{N} + y \sin\frac{2\pi k}{N} \right] \tag{9} \]

3. Architectural design

The objective of this paper is to implement expressions (8) and (9) in an efficient way, bearing in mind the reusability of the resulting design as an embedded core in a wide range of possible applications. The design has been modeled in VHDL according to the restrictions and recommendations for high-level synthesis [10]. The design is portable among different EDA tools and technology independent. This module is designed to be integrated in a Speech Recognition System.

The FFT architecture consists of a single DIT-FFT radix-2 butterfly, a double-port RAM memory to hold the values of the input samples, intermediate operations and results, a control unit, an address generation unit and two ROM memories to store the twiddle factors. The block diagram of the FFT is depicted in Figure 3. The scheduling details are given in Table 1.

The architecture of the FFT processor can best be understood by inspecting its operation. The operation is partitioned into three main processes: DATA load, COMPUTE and RESULT unload. The operation cycle starts with the DATA load process, which consists of reading the sample data and loading them into the memory. During the COMPUTE process, the kernel butterfly operation is calculated. Finally, in the RESULT unload process the FFT results are made available at the output, ready to be used by another application. A brief description of the main blocks is given next.

Figure 3. Block diagram of the FFT/IFFT

A. ROMs and RAM

The ROM memories store the W coefficients. The size of each of these memories is N/4, due to the symmetry properties of the trigonometric functions: the amplitudes of the sine and cosine are the same in the four quadrants and differ only in sign. According to the system workflow, two data values must be read from the RAM with a cycle delay between them (Table 1, cycles 0 and 1) and loaded into the butterfly unit. Meanwhile, the two outputs of the butterfly block have to be written to the RAM with a cycle delay between them (Table 1 (cont), cycles 6 and 7).

B. Butterfly element

The butterfly is the core calculation. It takes two data values from memory and computes two new values from them.
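As a rough illustration of expressions (8) and (9) and of the N/4-entry coefficient memories, the following Python sketch computes one butterfly with four real multiplications and six additions/subtractions. It is a floating-point model written for this text, not the fixed-point VHDL implementation. The intermediate names M1 to M4 and S1 to S6 follow the style of the butterfly scheduling in Table 1, with S4 and S5 taken as x - S1 and X + S2, as required by (8) and (9). The helper names build_roms and twiddle_from_roms, and the way the second quadrant is mapped onto the first one, are assumptions made for the example.

```python
import cmath
import math


def build_roms(n_points):
    """Cosine and sine tables covering only the first quadrant (N/4 entries)."""
    quarter = n_points // 4
    cos_rom = [math.cos(2.0 * math.pi * i / n_points) for i in range(quarter)]
    sin_rom = [math.sin(2.0 * math.pi * i / n_points) for i in range(quarter)]
    return cos_rom, sin_rom


def twiddle_from_roms(k, cos_rom, sin_rom):
    """Return (cos(phi), sin(phi)) for phi = 2*pi*k/N and 0 <= k < N/2.

    Assumed mapping: in the second quadrant the two tables are swapped
    and the cosine changes sign, so N/4 entries per table are enough.
    """
    quarter = len(cos_rom)
    if k < quarter:
        return cos_rom[k], sin_rom[k]
    return -sin_rom[k - quarter], cos_rom[k - quarter]


def butterfly(a, b, cos_phi, sin_phi):
    """Radix-2 DIT butterfly following equations (8) and (9)."""
    x, X = a.real, a.imag   # A = x + jX
    y, Y = b.real, b.imag   # B = y + jY
    # Four multiplications.
    m1 = Y * cos_phi
    m2 = Y * sin_phi
    m3 = y * cos_phi
    m4 = y * sin_phi
    # Six additions/subtractions.
    s1 = m3 + m2            # real part of B*W
    s2 = m1 - m4            # imaginary part of B*W
    s3 = x + s1             # Re(A')
    s4 = x - s1             # Re(B')
    s5 = X + s2             # Im(A')
    s6 = X - s2             # Im(B')
    # Link the real and imaginary parts into the two complex outputs.
    return complex(s3, s5), complex(s4, s6)


def reference_butterfly(a, b, k, n_points):
    """Equations (6) and (7) evaluated directly with complex arithmetic."""
    w = cmath.exp(-2j * cmath.pi * k / n_points)
    return a + b * w, a - b * w
```

For N = 16, calling `butterfly(a, b, *twiddle_from_roms(k, *build_roms(16)))` for each k from 0 to 7 gives the same outputs as `reference_butterfly(a, b, k, 16)` up to rounding error.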
Results are written back to the same memory locations as the inputs, since an in-place algorithm is used. This makes efficient use of the available memory, as the transformed data overwrite the input data. The structure of the butterfly, employing a straightforward implementation of (8) and (9), requires four multipliers, three adders, three subtractors and two modules to link the real and imaginary parts of the data (Figure 4).

Figure 4. Butterfly processing architecture

The arithmetic operations involved in this block are performed with a pipelined data flow structure. The operations needed to calculate a butterfly take four time instants (cycles 2 to 5), as can be seen in the butterfly scheduling shown in Table 1.

C. Address generation and control units

The purpose of the Address Generation Unit (AGU) is to produce valid addresses for the RAM and the ROM blocks. It also keeps track of which butterfly is being computed in which stage. At block level, the AGU basically consists of a log2 N-bit up counter, a ram_index generator and a rom_index generator.

The counter output is used to address the RAM during the DATA load and RESULT unload processes. During the DATA load process the data addresses should be bit-reversed while the data are being written, but no extra hardware is required for the bit reversal; it can simply be carried out by wire reversal. Moreover, the counter keeps track of the current stage in the FFT computation and supplies the ram_index generator with the number of the stage that is currently being computed.

The ram_index generator is responsible for generating addresses for the RAM during the COMPUTE process. The input of the ram_index generator is the address provided by the counter. The addresses to read the data inputs, A and B, can be calculated as follows:

The address of B is calculated just by changing bit 1 to 0 in the fragment of the algorithm shown before. The rom_index generator is responsible for producing addresses for the ROM during the COMPUTE process. It only needs to know the current stage to generate the address.

The control unit is implemented as a finite state machine with twelve states. The sequence of events is determined by the control unit depending on the signals it receives from the corresponding units. It also generates other control signals to take care of housekeeping duties, i.e., incrementing and clearing counters.

Table 1. Butterfly scheduling (cycles 0-4)

Unit        Cycle 0   Cycle 1   Cycle 2     Cycle 3       Cycle 4
RAM read    A(x,X)    B(y,Y)    nextA_1     nextB_1       nextA_2
ROM                   cosφ                  nextcosφ_1
ROM                   sinφ                  nextsinφ_1
Mult                            M1=Ycosφ                  nextM1_1
Mult                            M2=Ysinφ                  nextM2_1
Mult                            M3=ycosφ                  nextM3_1
Mult                            M4=ysinφ                  nextM4_1
+/-                                         S1=M3+M2
+/-                                         S2=M1-M4
+/-                                                       S3=x+S1
+/-                                                       S4=x-S1
+/-                                                       S5=X+S2
+/-                                                       S6=X-S2
Link
Link

Table 1 (cont). Butterfly scheduling (cycles 5-9)

Unit        Cycle 5       Cycle 6     Cycle 7       Cycle 8     Cycle 9
RAM read    nextB_2       nextA_3     nextB_3       nextA_4     nextB_4
ROM         nextcosφ_2                nextcosφ_3                nextcosφ_4
ROM         nextsinφ_2                nextsinφ_3                nextsinφ_4
Mult                      nextM1_2                  nextM1_3
Mult                      nextM2_2                  nextM2_3
Mult                      nextM3_2                  nextM3_3
Mult                      nextM4_2                  nextM4_3
+/-         nextS1_1                  nextS1_2                  nextS1_3
+/-         nextS2_1                  nextS2_2                  nextS2_3
+/-                       nextS3_1                  nextS3_2
+/-                       nextS4_1                  nextS4_2
+/-                       nextS5_1                  nextS5_2
+/-                       nextS6_1                  nextS6_2
Link        A'=S3+jS5                 nextA'_1                  nextA'_2
Link        B'=S4+jS6                 nextB'_1                  nextB'_2
RAM write                 A'          B'            nextA'_1    nextB'_1

4. Implementation results

Generally speaking, it is very difficult to make a fair comparison of design performance because there is no standard benchmarking methodology for FPGAs. Current CAD tools provide a settings menu that allows different trade-offs to be explored among design performance, logic resources demanded, power consumption, memory usage and compilation time. Additionally, user constraints can be included to guide the CAD tool towards better performance results. However, the settings producing the best results for one design may not be appropriate for another.
The compilation results presented next were obtained with default settings and no constraints. Our model has been synthesized with Quartus II v8.0 and ISE v10.1, and the results have been compared against the FFT/IFFT cores available in the DSP libraries of these CAD tools, which are v8.0 for Altera and v5.0 for Xilinx. These cores have been generated using the MegaWizard Plug-In Manager tool for Altera and the CoreGen tool for Xilinx. Their structures and detailed pin counts can be found in [2][3].

The device selection is also critical due to the differences in the internal architectures of the FPGAs: some designs are easily implemented in a specific architecture while others are not. To take this aspect into consideration, we have chosen device families from the two vendors that may be considered technologically comparable. Concerning the choice of the specific devices, the criterion has been to choose a device large enough to support a real-time speech recognition system (our goal application), in which the FFT IP will be an embedded block. The target devices used for the designs have been the Stratix II EP2S15F484C3 from Altera and the Virtex IV xc4vlx15-12sf363 from Xilinx.

Commercial IP cores offer different configuration possibilities (arithmetic, radix, architectures, number of butterfly engines, I/O modes, etc.) that must be carefully selected to match the characteristics of our design as closely as possible, in order to obtain comparable performance results. The characteristics of our design are, in summary: Decimation-In-Time (DIT) FFT algorithm, radix-2, fixed point two's complement arithmetic, single butterfly engine, pipeline structure, parameterized number of samples (N), data size and twiddle factors, and a structure implemented with 4 multipliers/6 adders. The options chosen for the commercial IP core generation were:

Xilinx: unscaled (full precision) fixed-point arithmetic, burst I/O architecture because it uses the DIT method and radix-2, output in natural order, and data and twiddle factors in RAM.

Altera: block floating-point arithmetic, burst I/O data flow because it is the only option that generates a single-output FFT engine, number of parallel engines = 1, and the 4 multipliers/2 adders implementation.

The synthesis results of our design versus the FFT cores from Altera and Xilinx are shown in Table 2 and Table 3. The results presented in both tables are for N = 64, 128, 256, 512, 1024, 2048 and 4096 samples. The data and twiddle sizes are 16 bits in all cases. For each value of N, the upper row contains the results for our design and the lower row contains the results for the vendor core (VC). The comparisons have been carried out in terms of physical resources, number of pins, memory occupation, DSPs and Fmax. The number of resources available in the devices and the amount required for each particular implementation are also indicated. It must be noted that the percentages shown in the tables are provided by the tools, which round up or down depending on the case.

N      Impl.   ALUTs (12480)   Registers (12480)   Pins (343)   Mem. bits (419328)   DSP (96)   Fmax (MHz)
64     Ours    [...] (2%)      478 (4%)            67 (20%)     3190 (<1%)           8 (8%)     [...]
       VC      [...] (5%)      1365 (11%)          85 (25%)     2560 (<1%)           8 (8%)     [...]
128    Ours    [...] (2%)      483 (4%)            67 (20%)     6271 (1%)            8 (8%)     [...]
       VC      [...] (6%)      1486 (12%)          85 (25%)     4864 (2%)            8 (8%)     [...]
256    Ours    [...] (2%)      488 (4%)            67 (20%)     12424 (3%)           8 (8%)     [...]
       VC      [...] (6%)      1430 (11%)          85 (25%)     9472 (3%)            8 (8%)     [...]
512    Ours    [...] (2%)      491 (4%)            67 (20%)     24721 (6%)           8 (8%)     [...]
       VC      [...] (6%)      1532 (12%)          85 (25%)     18688 (4%)           8 (8%)     [...]
1024   Ours    [...] (2%)      494 (4%)            67 (20%)     49306 (12%)          8 (8%)     [...]
       VC      [...] (6%)      1477 (12%)          85 (25%)     37120 (9%)           8 (8%)     [...]
2048   Ours    [...] (2%)      497 (4%)            67 (20%)     98467 (23%)          8 (8%)     [...]
       VC      [...] (6%)      1578 (13%)          85 (25%)     73184 (18%)          8 (8%)     [...]
4096   Ours    [...] (2%)      500 (4%)            67 (20%)     [...] (49%)          8 (8%)     [...]
       VC      [...] (7%)      1522 (12%)          85 (25%)     [...] (35%)          8 (8%)     [...]

Table 2. Altera results for EP2S15F484C3

Comparing the results obtained for Altera's IP with the results for our design, the following similarities and differences may be observed. In both cases, the number of ALUTs required remains at a similar percentage for all values of N, but the vendor IP demands around three times more resources than ours. The same behavior is observed for the number of registers. The number of pins is constant for all values of N, but our design needs 18 pins fewer than the vendor IP. As expected, the memory demand increases proportionally with N. Memory usage is better in Altera's core than in our design, and in our case it gets worse as N increases. According to the percentages given by the CAD tool, a difference of 1% for N = 256, 3% for N = 1024 and 14% for N = 4096 can be observed. The number of DSPs is the same in both cases.

As expected, the frequency of our design decreases as N increases, but in the case of Altera the behavior seems to be erratic. It can be seen from Table 2 that the frequency increases and decreases as N doubles. At first sight the results seem to lack consistency: for N = 128 samples the value of Fmax is [...] MHz, for N = 256 it increases to [...] MHz, for N = 512 it decreases to [...] MHz and for N = 1024 it increases to [...] MHz, and the same tendency can be observed for the rest of the values shown in Table 2. Grouping the values of N into odd and even powers of two leads to a different interpretation. In both groups the frequency decreases as N increases. For odd powers of two (128, 512 and 2048) the frequency decreases as [...] MHz, [...] MHz and 225 MHz, respectively. For even powers of two (256, 1024 and 4096) it decreases as [...] MHz, [...] MHz and [...], respectively. Comparing our results with these groups, it may be concluded that our results are better for the odd powers of two but worse for the even powers of two.
The same erratic behavior may be noticed in the number of registers. The reason given by the vendor is that, for burst architectures, radix-4 decomposition is normally applied unless N is an odd power of two, in which case the FFT MegaCore automatically implements a radix-2 pass as the last pass to complete the transformation.

N      Impl.   FF (12288)    LUTs (12280)   Pins (240)   Slices (6144)   RAM blocks (48)   DSP (32)   Fmax (MHz)
64     Ours    [...] (3%)    324 (2%)       67 (28%)     283 (4%)        3 (6%)            4 (12%)    [...]
       VC      [...] (8%)    839 (6%)       100 (41%)    749 (12%)       5 (10%)           6 (18%)    [...]
128    Ours    [...] (3%)    340 (2%)       67 (28%)     291 (4%)        3 (6%)            4 (12%)    [...]
       VC      [...] (9%)    888 (7%)       104 (43%)    779 (12%)       5 (10%)           6 (18%)    [...]
256    Ours    [...] (3%)    346 (2%)       67 (28%)     295 (4%)        3 (6%)            4 (12%)    [...]
       VC      [...] (9%)    944 (7%)       108 (45%)    836 (13%)       5 (10%)           6 (18%)    [...]
512    Ours    [...] (3%)    353 (2%)       67 (28%)     301 (4%)        3 (6%)            4 (12%)    [...]
       VC      [...] (10%)   1035 (8%)      112 (46%)    903 (14%)       5 (10%)           6 (18%)    [...]
1024   Ours    [...] (3%)    361 (2%)       67 (28%)     305 (4%)        4 (8%)            4 (12%)    [...]
       VC      [...] (10%)   1098 (8%)      116 (48%)    941 (15%)       6 (10%)           6 (18%)    [...]
2048   Ours    [...] (3%)    368 (2%)       67 (28%)     310 (5%)        6 (12%)           4 (12%)    [...]
       VC      [...] (11%)   1172 (9%)      120 (50%)    1003 (16%)      9 (18%)           6 (18%)    [...]
4096   Ours    [...] (4%)    375 (3%)       67 (28%)     314 (5%)        12 (25%)          4 (12%)    [...]
       VC      [...] (12%)   1229 (10%)     124 (51%)    1043 (16%)      15 (31%)          6 (18%)    [...]

Table 3. Xilinx results for xc4vlx15-12sf363

Concerning the results for Xilinx shown in Table 3, in our solution the demand for slice flip-flops stays at an almost constant percentage for all values of N, whereas in the commercial IP core these resources increase with N, so the difference between both designs grows as N increases. A similar behavior is observed for the number of LUTs and occupied slices. In our case, the number of pins remains the same (67) for all values of N, while the Xilinx IP requires 4 more pins each time N doubles. In both implementations the RAM blocks remain at around the same percentage (6% and 10%) up to 512 samples and increase for the remaining values in the table. The number of DSPs is the same for all values of N, but our implementation uses 2 DSPs fewer than the Xilinx one. The results for all the physical resources discussed above are better in our implementation than in the Xilinx IP, but the latter achieves a much better Fmax, exceeding that of our solution by around 150 MHz.

Concerning latency, our design presents poor results compared with the commercial cores because it needs 4 cycles to calculate one butterfly, while the others need only one cycle. The estimated total number of cycles and the throughput for calculating a complete FFT for 256 and 1024 samples are given in Table 4.

Table 4. Throughput for 256 and 1024 samples (cycles and throughput in μs for our design on Altera and Xilinx and for the Altera and Xilinx IP cores)

The throughput of our core compares better when implemented on Altera than on Xilinx: the commercial core is around 3 times faster in the first case and between 4.5 and 5 times faster in the second. If we compare the Xilinx and Altera IPs, the latency is similar, but Xilinx achieves higher frequencies and better throughput.

5. Summary and conclusions

This paper presents an N-point FFT/IFFT architecture which is portable among different EDA tools and technology independent. The design is oriented to its reusability as a core. The performance of the design has been compared with the commercial cores provided by Altera and Xilinx. Those cores were configured with the characteristics closest to our design in order to make the results comparable.

Our design presents better results in terms of physical resources demanded, but its throughput is poorer than that of the commercial IP implementations. Concerning the commercial IP cores, Xilinx gives better throughput than Altera. The implementation cost of the two is difficult to compare in a fair manner because the internal structures of the FPGAs are different, but as a first approximation, taking the results of our design as a reference, the implementation for Xilinx seems to be more costly.
Along with these performance results come other considerations that need to be evaluated in order to select the best approach depending on system requirements, such as ease of implementation, cost and performance. Generating a design from a commercial IP core is as easy as pressing a button, but the designer has no control over the design because the core is provided as a black box. These cores offer a variety of configurable features and functionalities, and their implementations are supposedly optimized for a subset of the vendor's devices, giving the best performance on them, but they lack portability. Besides the economic cost, the system may require less performance than that offered by commercial IP cores, and this is the case for the present application.

Our FFT design has been integrated as part of a Speech Recognition System for isolated commands and implemented in an FPGA together with the other parts of the system, such as end point detection, MFCC feature extraction and HMM modeling. In this case, low usage of physical resources, so that the full system can be implemented in the same FPGA, is more important than the other criteria, as long as real-time processing is achieved, and this condition is fulfilled by the design described in this paper.

6. Acknowledgments

This work was supported by grants CCG06- UPM/IF28, TEC C02-00 from Plan Nacional de I+D, Ministry of Education and Science, and by Project HESPERIA from the Programme CENIT, Ministry of Industry, Spain.

7. References

[1] J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series, Math. of Comp., 1965, vol. 19, pp. [...]
[2] ft.pdf
[3]
[4] J.-Y. Oh, M.-S. Lim, A radix-2^4 SDF pipeline FFT processor for OFDM modulation, in: The First IEEE VTS APWCS (Asia Pacific Wireless Communications Symposium), January [...]
[5] Lihong Jia, Yonghong Gao, Jouni Isoaho and Hannu Tenhunen, A new VLSI-oriented FFT algorithm and implementation, IEEE ASIC Conf., 1998, pp. [...]
[6] Saad Bouguezel, M. Omair Ahmad and M.N.S. Swamy, An efficient split-radix FFT algorithm, Int. Symp. Circuits Systems, 2003, pp. [...]
[7] S. G. Johnson and M. Frigo, A modified split-radix FFT with fewer arithmetic operations, IEEE Transactions on Signal Processing, 2007, pp. [...]
[8] B. J. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms and Applications, 2nd ed. New York: Macmillan, 1992
[9] W. B. Jervis and E. C. Ifeachor, Digital Signal Processing: A Practical Approach. Reading, MA: Addison-Wesley, [...]
[10] M. Keating and P. Bricaud, Reuse Methodology Manual: For System-on-a-Chip Designs. Third Edition. Kluwer Academic Publishers, [...]
