Efficient Radix-4 and Radix-8 Butterfly Elements

Weidong Li and Lars Wanhammar
Electronics Systems, Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
Tel.: +46 13 28 {1721, 1344}
Fax: +46 13 139282
Email: {weidongl, larsw}@isy.liu.se

Abstract: In this paper, we present a class of high-radix butterfly elements that use (m, n)-counters to replace the adders in the conventional realization of butterfly elements. With these butterfly elements, we reduce the hardware complexity, the delay, and the power consumption.

1. INTRODUCTION

The FFT/IFFT is one of the most important algorithms in digital signal processing. In recent years, the FFT/IFFT has frequently been applied in modern communication systems because it allows an efficient implementation of OFDM (Orthogonal Frequency Division Multiplex). Many applications, such as xDSL modems, HDTV, and mobile radio terminals, use an FFT/IFFT processor as a key component.

The butterfly element is one of the basic building blocks of an FFT/IFFT processor. Since FFT/IFFT processors using a radix-4 architecture require fewer multiplications than processors using radix-2, radix-4 architectures are often used for FFT/IFFT processors. Higher radices tend to reduce the memory access rate, the arithmetic workload, and, hence, the power consumption [2] [3]. Efficient design of high-radix butterfly elements is therefore important.

In the following section, we give a short review of the conventional implementation of butterfly elements. We introduce the carry-save based butterfly in Section 3, and the results are presented in Section 4.

2. A 4-POINT DFT WITH A CONVENTIONAL BUTTERFLY ELEMENT

2.1 4-point DFT

The 4-point DFT is defined as

$$X(k) = \sum_{n=0}^{3} x(n)\, e^{-i 2\pi nk/4} \qquad (1)$$

where $k = 0, 1, 2, 3$. Since $e^{-i 2\pi nk/4} = \pm 1$ or $\pm i$, the multiplications are trivial, i.e., for two's-complement numbers they can be realized simply with bypass, inversion, and/or swap of the real and imaginary parts.
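Hence, no multiplier is required to construct a butterfly element for a 4-point DFT (radix-4 butterfly).

2.2 Conventional butterfly elements for 4-point DFT

We can rewrite equation (1) in matrix form

$$\begin{bmatrix} X(0) \\ X(1) \\ X(2) \\ X(3) \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \end{bmatrix} \qquad (2)$$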
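To make the "trivial multiplication" argument concrete, the following minimal Python sketch evaluates the 4-point DFT of eq. (2) using only additions, subtractions, and a swap of real and imaginary parts. The function names (`mul_minus_i`, `dft4_trivial`, `dft4_direct`) are illustrative and not taken from the paper; floating-point complex numbers stand in for the fixed-point data of a real butterfly element.

```python
import cmath

def mul_minus_i(z):
    """Multiply by -i without a multiplier: (a + ib)(-i) = b - ia (swap and negate)."""
    return complex(z.imag, -z.real)

def dft4_trivial(x):
    """4-point DFT using only add, subtract, and swap, following the matrix in eq. (2)."""
    x0, x1, x2, x3 = x
    m1 = mul_minus_i(x1)            # -i * x(1)
    m3 = mul_minus_i(x3)            # -i * x(3)
    return [
        x0 + x1 + x2 + x3,          # row (1,  1,  1,  1)
        x0 + m1 - x2 - m3,          # row (1, -i, -1,  i)
        x0 - x1 + x2 - x3,          # row (1, -1,  1, -1)
        x0 - m1 - x2 + m3,          # row (1,  i, -1, -i)
    ]

def dft4_direct(x):
    """Reference: direct evaluation of the DFT definition in eq. (1)."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / 4) for n in range(4))
            for k in range(4)]

if __name__ == "__main__":
    sample = [1 + 2j, 3 - 1j, -2 + 0.5j, 4 + 4j]
    for fast, ref in zip(dft4_trivial(sample), dft4_direct(sample)):
        print(fast, ref)
```

The two functions agree up to floating-point rounding, which is one way to check the sign pattern of the reconstructed matrix.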
Using the numerical strength reduction technique at the word level [4], we obtain the signal-flow graph shown in Fig. 1.

Figure 1. Signal-flow graph for 4-point DFT.

A conventional butterfly element based on an isomorphic mapping of this signal-flow graph therefore requires 8 adders/subtractors and has a delay of 2 additions/subtractions.

3. CARRY-SAVE BASED BUTTERFLY ELEMENTS

3.1 Principle of carry-save based butterfly elements for a 4-point DFT

For the sake of simplicity, we consider only the real part of one output, i.e., $X_{re}(0)$, of the 4-point butterfly operation. From eq. (2), $X_{re}(0)$ is

$$X_{re}(0) = 1 \cdot x_{re}(0) + 1 \cdot x_{re}(1) + 1 \cdot x_{re}(2) + 1 \cdot x_{re}(3) \qquad (3)$$

Consider also a real multiplication $Y = B \cdot 15$. The multiplication can be expressed as

$$Y = 1 \cdot (2^0 B) + 1 \cdot (2^1 B) + 1 \cdot (2^2 B) + 1 \cdot (2^3 B) \qquad (4)$$

Comparing eq. (3) with eq. (4), we find that the two equations are of the same nature, i.e., both the butterfly operation and the multiplication are additions of multiple addends. To speed up the butterfly operation, we can therefore apply the same technique as in carry-save multipliers [1], that is, use an adder tree for the summation of the partial products and a fast adder for the vector-merging addition. In a more general scheme, the inner adders/subtractors are replaced by (m, n)-counters. This reduces the hardware complexity. Since the inputs no longer have to be added/subtracted sequentially, the execution time is reduced as well. A simplified notation for the conventional and the carry-save based butterfly element is shown in Fig. 2.

Figure 2. Simplified notation for the conventional and the carry-save butterfly element: (i) conventional; (ii) carry-save based, with (4,2)-counters followed by a fast adder.
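The analogy between eq. (3) and eq. (4) can be illustrated with a small behavioral model in Python. This is only a sketch under stated assumptions: the (4,2)-counter is modeled arithmetically as two cascaded 3:2 carry-save stages, Python's unbounded integers stand in for sufficiently wide two's-complement words, and the names `csa_3_2`, `counter_4_2`, and `sum4_carry_save` are hypothetical. It does not model the gate-level structure of an actual (4,2)-counter cell.

```python
def csa_3_2(a, b, c):
    """Bit-parallel 3:2 carry-save stage: a + b + c == s + carry, with no carry propagation."""
    s = a ^ b ^ c                                  # per-bit sum
    carry = ((a & b) | (a & c) | (b & c)) << 1     # per-bit majority, weighted by 2
    return s, carry

def counter_4_2(a, b, c, d):
    """Arithmetic model of a row of (4,2)-counters: four addends in, two vectors out."""
    s1, c1 = csa_3_2(a, b, c)
    return csa_3_2(s1, c1, d)

def sum4_carry_save(a, b, c, d):
    """Sum of four addends: carry-save compression, then one vector-merging addition."""
    s, carry = counter_4_2(a, b, c, d)
    return s + carry                               # the only carry-propagating addition

if __name__ == "__main__":
    # Eq. (3): X_re(0) is the plain sum of four real parts.
    print(sum4_carry_save(3, -7, 12, 5))
    # Eq. (4): Y = 15 * B as the sum of four shifted copies of B.
    B = 11
    print(sum4_carry_save(B, B << 1, B << 2, B << 3))
```

Both calls go through the same compression step, which is the point of the comparison: only the final addition needs a carry-propagating adder.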
3.2 Implementation of carry-save based butterfly element for 4-point DFT

The carry-save based butterfly element can be realized with (m, n)-counters and a fast adder. From eq. (2), we find that some outputs require both additions and subtractions. Since we use two's-complement number representation, a subtraction can be realized by adding the negative number. The negative value is obtained by adding 1 at the LSB to the bit-complement. Hence, there are more than four inputs at the LSB position, which would require (m, 2)-counters with m > 4. Since all bit positions except the LSB have only four inputs, the use of (m, 2)-counters with m > 4 is not efficient. Because either zero or two inputs have to be negated simultaneously, we can retain the (4,2)-counters by feeding the corrections as carry inputs to the final merging adder instead of adding two correction terms in the carry-save tree (see Fig. 3).

Figure 3. Solution for subtractions: (i) straightforward realization with a (6,2)-counter; (ii) new realization with a (4,2)-counter.

For the parallel implementation of the radix-4 butterfly element, the subtractions are known in advance, and this can be used to simplify the implementation further. The XOR gates that generate the negative values in a general adder/subtractor can be replaced with inverters. The resulting butterfly element is shown in Fig. 4. The critical path is reduced from a path consisting of two fast adders to a path consisting of an inverter, a (4,2)-counter, and a fast adder.

Figure 4. Parallel radix-4 butterfly element, built from inverters, (4,2)-counters, and fast adders, with inputs x_re(n), x_im(n) and outputs X_re(k), X_im(k) for n, k = 0, 1, 2, 3.

This technique can be applied to other butterfly architectures as well. For a split-radix (SR) butterfly element [5], the odd-indexed terms can use this technique, which results in a simpler structure (see Fig. 5). For a simplified butterfly element [6], it also gives an efficient implementation [7] (see Fig. 5).

Figure 5. Split-radix and simplified butterfly element: (i) split-radix BE; (ii) simplified BE.

3.3 Efficient implementation for radix-8 butterfly elements

High-radix butterfly elements can be realized directly from the signal-flow graph with the same technique as described above. However, the routing cost for a high-radix butterfly element becomes excessive. Such an element is therefore often implemented by cascading lower-radix butterfly elements (with twiddle factor multipliers in between). Hence, we choose the radix-8 butterfly element as the largest building block.

The signal-flow graph for a radix-8 DIF (Decimation In Frequency) butterfly is shown in Fig. 6 [8]. By moving the two complex multipliers forward according to the dashed arrows, the radix-8 butterfly element can be regarded as two radix-4 butterfly elements and four radix-2 butterfly elements with twiddle factor multiplications in between. By combining the carry-save radix-4 butterfly element with a constant multiplication by $\sqrt{2}(1 - i)/2$, we can implement the radix-8 butterfly element more efficiently.

Figure 6. Signal-flow graph for radix-8 DIF butterfly element.

The radix-8 butterfly element can also be implemented with a split-radix structure. The signal-flow graph for the split-radix butterfly element [9] differs slightly from that of Fig. 6. In total, it requires three SR radix-4 butterfly elements and three radix-2 butterfly elements.
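The subtraction handling of Section 3.2, which also underlies the carry-save radix-4 element used as a building block in Section 3.3, can be sketched as a behavioral Python model. Several details here are assumptions rather than statements from the paper: the word length `W`, the modeling of a (4,2)-counter row as two 3:2 stages, and in particular the placement of the two "+1" corrections (one in the free LSB of the shifted carry vector, one as the carry-in of the fast adder). All function names are hypothetical.

```python
W = 18                      # assumed word length: 15-bit data plus guard bits
MASK = (1 << W) - 1

def to_signed(v):
    """Interpret the low W bits of v as a two's-complement number."""
    v &= MASK
    return v - (1 << W) if v >> (W - 1) else v

def csa_3_2(a, b, c):
    """W-bit 3:2 carry-save stage; the carry vector always has a free (zero) LSB."""
    s = (a ^ b ^ c) & MASK
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK
    return s, carry

def counter_4_2(a, b, c, d):
    """Arithmetic model of a row of (4,2)-counters."""
    s1, c1 = csa_3_2(a, b, c)
    return csa_3_2(s1, c1, d)

def signed_sum4(values, negate):
    """Sum four operands of which zero or two are negated (inverter plus deferred +1)."""
    n_neg = sum(negate)
    assert n_neg in (0, 2), "a radix-4 output negates either zero or two inputs"
    operands = [(~v if neg else v) & MASK for v, neg in zip(values, negate)]
    s, carry = counter_4_2(*operands)
    one = 1 if n_neg == 2 else 0
    carry |= one                            # first +1: free LSB of the carry vector
    return to_signed(s + carry + one)       # second +1: carry-in of the fast adder

def radix4_butterfly(x):
    """x is a list of four (re, im) integer pairs; returns four (re, im) pairs."""
    (r0, i0), (r1, i1), (r2, i2), (r3, i3) = x
    P, N = False, True
    return [
        (signed_sum4((r0, r1, r2, r3), (P, P, P, P)), signed_sum4((i0, i1, i2, i3), (P, P, P, P))),
        (signed_sum4((r0, i1, r2, i3), (P, P, N, N)), signed_sum4((i0, r1, i2, r3), (P, N, N, P))),
        (signed_sum4((r0, r1, r2, r3), (P, N, P, N)), signed_sum4((i0, i1, i2, i3), (P, N, P, N))),
        (signed_sum4((r0, i1, r2, i3), (P, N, N, P)), signed_sum4((i0, r1, i2, r3), (P, P, N, N))),
    ]

if __name__ == "__main__":
    print(radix4_butterfly([(100, -50), (-300, 7), (12345, -16384), (-1, 16383)]))
```

Each output passes through exactly one row of inverters, one (4,2)-counter row, and one carry-propagating addition, which mirrors the critical path described for Fig. 4.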
4. RESULTS

In this section, we present synthesis results for the butterfly elements in AMS 0.8 µm technology to demonstrate their efficiency. Table 1 shows the area cost and delay of the main components of the butterfly element, obtained with the synthesis tool AutoLogic II from Mentor Graphics.

Component         Area cost    Delay @ 3.3 V, 25 °C
VMA (15 bits)     584.24       5.708 ns
VMA (16 bits)     643.80       5.747 ns
VMA (17 bits)     689.50       6.015 ns
(4,2)-counter     20.32        3.441 ns
Inverter          1.64         0.405 ns

Table 1: Performance of key components in butterfly elements.

With the results from Table 1, we can calculate the area and delay of the different radix-4 butterfly elements (Table 2). The area saving is up to 21% for the carry-save radix-4 butterfly element and up to 38% for the SR radix-4 butterfly element with carry-save. The delay is reduced by 22%.

Architecture          Area cost    Delay @ 3.3 V, 25 °C
Conventional          10504.16     12.32 ns
Carry-save            8266.48      9.59 ns
SR with carry-save    6494.80      9.59 ns

Table 2: Comparison of different 15-bit radix-4 butterfly elements.

For the radix-8 butterfly elements, the results excluding the cost of the multipliers are summarized in Table 3. These results can be improved further if the carry-save technique is applied wherever possible.

Architecture          Area cost (a)    Delay (a) @ 3.3 V, 25 °C
Conventional          32250.40         18.74 ns
Carry-save            27774.88         16.01 ns
SR with carry-save    27915.84         16.01 ns

(a) Excluding the multipliers.

Table 3: Comparison of different 15-bit radix-8 butterfly elements (excluding the multipliers).

5. CONCLUSION

In this paper, we have presented an efficient method to realize higher-radix butterfly elements with the carry-save technique. The results show that this method has advantages in terms of chip area, execution time, and power consumption.
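The percentages quoted in Section 4 can be rechecked from the tabulated values with a few lines of Python. The sketch below uses only numbers from Tables 1 to 3; the helper name `saving_percent` is illustrative. The last computation rests on an assumption that the paper does not state explicitly, namely that the fast adder on the carry-save critical path is the 16-bit VMA.

```python
def saving_percent(reference, improved):
    """Relative reduction of `improved` with respect to `reference`, in percent."""
    return 100.0 * (reference - improved) / reference

# Table 2: 15-bit radix-4 butterfly elements (area cost, delay in ns).
radix4 = {
    "conventional": (10504.16, 12.32),
    "carry-save": (8266.48, 9.59),
    "sr-carry-save": (6494.80, 9.59),
}

# Table 3: 15-bit radix-8 butterfly elements, excluding the multipliers.
radix8 = {
    "conventional": (32250.40, 18.74),
    "carry-save": (27774.88, 16.01),
    "sr-carry-save": (27915.84, 16.01),
}

def report(table):
    """Print area and delay savings of each architecture relative to the conventional one."""
    ref_area, ref_delay = table["conventional"]
    for name, (area, delay) in table.items():
        if name == "conventional":
            continue
        print(f"{name}: area saving {saving_percent(ref_area, area):.1f} %, "
              f"delay saving {saving_percent(ref_delay, delay):.1f} %")

# Assumed critical path of the carry-save radix-4 element, using Table 1 delays:
# inverter + (4,2)-counter + 16-bit VMA (the adder width is an assumption).
critical_path_ns = 0.405 + 3.441 + 5.747

if __name__ == "__main__":
    report(radix4)
    report(radix8)
    print(f"assumed carry-save critical path: {critical_path_ns:.3f} ns")
```

Under that assumption the component delays add up to roughly the 9.59 ns listed for the carry-save architectures in Table 2.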
ACKNOWLEDGEMENT

The authors would like to thank Thomas Johansson and Dr. Kent Palmkvist for fruitful discussions. This project is financed by SSF, the Foundation for Strategic Research in Sweden, under the program of INTELECT.

REFERENCES

[1] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
[2] J. Melander, Design of SIC FFT Architectures, Linköping Studies in Science and Technology, Thesis No. 618, Linköping University, Sweden, 1997.
[3] T. Widhe, Efficient Implementation of FFT Processing Elements, Linköping Studies in Science and Technology, Thesis No. 619, Linköping University, Sweden, 1997.
[4] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 1999.
[5] P. Duhamel and H. Hollmann, Split-Radix FFT Algorithm, Electronics Letters, Vol. 20, No. 1, pp. 14-16, 1984.
[6] G. Bi and E. V. Jones, A Pipelined FFT Processor for Word-Sequential Data, IEEE Trans. on Acoust., Speech, and Signal Process., Vol. ASSP-37, No. 12, pp. 1982-1985, 1989.
[7] W. Li and L. Wanhammar, A Pipeline FFT Processor, accepted for publication at IEEE Workshop on Signal Processing Systems (SiPS), 1999.
[8] T. Widhe, J. Melander, and L. Wanhammar, Design of Efficient Radix-8 Butterfly PEs for VLSI, IEEE Intern. Symp. on Circuits and Systems (ISCAS), Vol. 3, pp. 2084-2087, 1997.
[9] H. V. Sorensen, M. T. Heideman, and C. S. Burrus, On Computing the Split-Radix FFT, IEEE Trans. on Acoust., Speech, and Signal Process., Vol. ASSP-34, No. 1, pp. 152-156, 1986.