5 FFT DESIGN METHODOLOGY

5.1 General

The fast Fourier transform (FFT) is used to provide a fast approach to processing data in wireless transmission. It is one of the methods of converting time-domain data to frequency-domain data with a low hardware requirement and a short computation time.

5.2 Fast Fourier Transform

Conventional signal and image processing applications based on the FFT require high computational power, together with the ability to choose the algorithm and the architecture. When alternative FFT implementations are compared, the criteria to consider are execution speed, programming effort, hardware design effort, system cost, flexibility and precision. For real-time signal processing, however, the main concern is execution speed. The implementation has been made on a Field Programmable Gate Array (FPGA) as a way of obtaining high performance at an economical price and with a short realization time. It can be used with segmented arithmetic at any level of pipelining in order to raise the operating frequency.

5.3 Fixed-Radix FFT Algorithms

This section introduces several FFT algorithms, including radix-2, radix-4, mixed radix 4-2, R2MDC and the proposed modified R2MDC.

5.3.1 Radix-2 FFT Algorithm

The radix-2 FFT algorithm is obtained by using the divide-and-conquer approach to split the output sequence X(k) into two summations [87], one over the first N/2 data points and the other over the last N/2 data points. With the twiddle factor $W_N = e^{-j2\pi/N}$, we obtain

$$X(k) = \sum_{n=0}^{N/2-1} x(n) W_N^{nk} + \sum_{n=N/2}^{N-1} x(n) W_N^{nk} = \sum_{n=0}^{N/2-1} \left[ x(n) + (-1)^k x(n + N/2) \right] W_N^{nk} \tag{5.1}$$

Now, let us restrict the discussion to N a power of 2 and consider computing the even-numbered and the odd-numbered frequency samples separately. The even-numbered frequency samples are

$$X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x(n + N/2) \right] W_{N/2}^{nk}, \qquad k = 0, 1, \ldots, N/2 - 1 \tag{5.2}$$
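The first decimation-in-frequency step can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper names `dft` and `dif_split` and the test signal are hypothetical and do not come from the thesis. It forms the sum of the two input halves for the even-numbered samples, as in Equation (5.2), and, for completeness, the difference of the halves scaled by $W_N^n$ for the odd-numbered samples, then compares both against a direct DFT.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def dif_split(x):
    """One radix-2 DIF step: returns (even samples, odd samples) of the DFT of x."""
    N = len(x)
    half = N // 2
    # Sum of the two halves feeds the even-numbered outputs X(2k).
    s = [x[n] + x[n + half] for n in range(half)]
    # Difference of the halves, scaled by W_N^n, feeds the odd outputs X(2k+1).
    d = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
         for n in range(half)]
    return dft(s), dft(d)

# Arbitrary 16-point test signal (hypothetical data).
x = [complex(n % 5, -n) for n in range(16)]
X = dft(x)
even, odd = dif_split(x)
max_err = max(max(abs(even[k] - X[2 * k]), abs(odd[k] - X[2 * k + 1]))
              for k in range(8))
print("largest deviation from the direct DFT:", max_err)
```

Applying `dif_split` recursively to each half-length sequence, instead of calling `dft`, would give the full radix-2 DIF FFT.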

Equation (5.2) is the N/2-point DFT of the N/2-point sequence obtained by adding the first half and the second half of the input sequence. Similarly, the odd-numbered samples

$$X(2k+1) = \sum_{n=0}^{N/2-1} \left[ x(n) - x(n + N/2) \right] W_N^{n} W_{N/2}^{nk}$$

are the N/2-point DFT of the sequence obtained by subtracting the second half of the input sequence from the first half and multiplying the resulting sequence by $W_N^n$. We define the N/2-point sequences g(n) and h(n) as

$$g(n) = x(n) + x(n + N/2), \qquad h(n) = x(n) - x(n + N/2), \qquad n = 0, 1, \ldots, N/2 - 1 \tag{5.3}$$

Figure 5.1 Signal flow graph of a typical 2-point DFT

The computation of the sequences g(n) and h(n) according to Equation (5.3), and the subsequent use of these sequences to compute the N/2-point DFTs, are depicted in Figure 5.1. For the 64-point DFT [1], repeating this decomposition reduces the computation to 2-point DFTs. With the computation of Figure 5.1 inserted at every stage, we obtain the complete signal flow graph for the 64-point DFT, as shown in Figure 5.2. In Figure 5.2, proceeding from one stage to the next, the basic computation always has the form of Figure 5.1: it obtains a pair of values in one stage from a pair of values in the preceding stage, where the coefficients are always powers of $W_N$ and the exponents are separated by N/2. Because of the shape of the signal flow graph, this elementary computation is called a butterfly [84]. The number of butterflies is also regular: every stage contains N/2 of them. The basic butterfly of Figure 5.1 can be redrawn as in Figure 5.2, which requires only one complex multiplication and two complex additions.

Figure 5.2 Radix-2 DIF FFT signal flow graph for 64 points (Stage 1 to Stage 6)

In Figure 5.2 the time-domain input data x(n) occurs in natural order, but the frequency-domain output X(k) occurs in bit-reversed order. The computations are also performed in place, meaning that the memory read and the memory write of each butterfly use the same memory locations. In this way the required memory space is minimized. Figure 5.2 also shows the relationship between the output index and the memory location: the output with index $[k_0 k_1 \ldots k_{\log_2 N - 2} k_{\log_2 N - 1}]_2$ is mapped to index $[k_{\log_2 N - 1} k_{\log_2 N - 2} \ldots k_1 k_0]_2$ in a one-dimensional memory array. For example, with 4-bit indices (N = 16), the output index 1011 is mapped to index 1101 of the memory array; in the 64-point signal flow graph the same rule applies to 6-bit indices. For the radix-2 16-point DIF FFT signal flow graph, the relationship between normal order and bit-reversed order is shown in Figure 5.3.

Figure 5.3 Bit-reversed order

However, it is possible to reconfigure the decimation-in-frequency algorithm so that the input sequence occurs in bit-reversed order while the output DFT occurs in normal order. Furthermore, if we abandon the requirement that the computations be done in place, it is also possible to have both the input data and the output DFT in normal order. This case is called out-of-place mode.

5.3.2 Radix 4 FFT

A radix-4 common-factor FFT algorithm can be employed when $N = 4^k$ by recursively reorganizing sequences into 4 × N/4 arrays. The development of a radix-4 algorithm is similar to the development of a radix-2 FFT, and both DIT and DIF versions are possible. Rabiner and Gold (1975) provide more details on radix-4 algorithms [89].

The radix-4 decimation-in-time butterfly is shown in Figure 5.4. As with the radix-2 butterfly, the radix-4 butterfly is formed by merging a 4-point DFT with the associated twiddle factors that normally sit between DFT stages. The four inputs A, B, C and D are on the left side of the butterfly diagram, and the latter three are multiplied by the complex coefficients $W_b$, $W_c$ and $W_d$, respectively. These coefficients all have the same form but are shown with different subscripts to distinguish them, since there is more than one in a single butterfly.

Figure 5.4 Radix-4 DIT butterfly

When the number of data points N in the DFT is a power of 4 (i.e., $N = 4^v$), one can always use a radix-2 algorithm for the computation. However, it is computationally more efficient to employ a radix-4 FFT algorithm. As with the radix-2 FFT algorithm, we use the divide-and-conquer approach to decimate the N-point DFT into four N/4-point DFTs. We have

$$X(k) = \sum_{n=0}^{N/4-1} x(n) W_N^{nk} + W_N^{Nk/4} \sum_{n=0}^{N/4-1} x(n + N/4) W_N^{nk} + W_N^{Nk/2} \sum_{n=0}^{N/4-1} x(n + N/2) W_N^{nk} + W_N^{3Nk/4} \sum_{n=0}^{N/4-1} x(n + 3N/4) W_N^{nk}$$

From the definition of the twiddle factors, we have

$$W_N^{Nk/4} = (-j)^k, \qquad W_N^{Nk/2} = (-1)^k, \qquad W_N^{3Nk/4} = (j)^k \tag{5.4}$$

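Substituting Equation (5.4) into the expression for X(k) gives

$$X(k) = \sum_{n=0}^{N/4-1} \left[ x(n) + (-j)^k x(n + N/4) + (-1)^k x(n + N/2) + (j)^k x(n + 3N/4) \right] W_N^{nk} \tag{5.5}$$

This relation is not an N/4-point DFT because the twiddle factor [90] depends on N and not on N/4. To convert it into an N/4-point DFT, we subdivide the DFT sequence into four N/4-point subsequences, X(4k), X(4k+1), X(4k+2) and X(4k+3), k = 0, 1, ..., N/4 - 1. Thus we obtain the radix-4 decimation-in-frequency DFT as

$$\begin{aligned}
X(4k) &= \sum_{n=0}^{N/4-1} \left[ x(n) + x(n + N/4) + x(n + N/2) + x(n + 3N/4) \right] W_N^{0} W_{N/4}^{kn} \\
X(4k+1) &= \sum_{n=0}^{N/4-1} \left[ x(n) - j x(n + N/4) - x(n + N/2) + j x(n + 3N/4) \right] W_N^{n} W_{N/4}^{kn} \\
X(4k+2) &= \sum_{n=0}^{N/4-1} \left[ x(n) - x(n + N/4) + x(n + N/2) - x(n + 3N/4) \right] W_N^{2n} W_{N/4}^{kn} \\
X(4k+3) &= \sum_{n=0}^{N/4-1} \left[ x(n) + j x(n + N/4) - x(n + N/2) - j x(n + 3N/4) \right] W_N^{3n} W_{N/4}^{kn}
\end{aligned} \tag{5.6}$$

where we have used the property $W_N^{4kn} = W_{N/4}^{kn}$. Note that the input to each N/4-point DFT is a linear combination of four signal samples scaled by a twiddle factor. This procedure is repeated v times, where $v = \log_4 N$.

5.3.3 Radix 8 FFT

A radix-8 common-factor FFT algorithm can be employed, in the same way as radix-4, when $N = 8^k$ by recursively reorganizing sequences into 8 × N/8 arrays. The development of a radix-8 algorithm is also similar to the development of a radix-4 FFT. Since the radix-8 FFT is beyond the scope of this thesis, it is not described further.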
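The radix-4 decimation-in-frequency step of Section 5.3.2 can be illustrated with a short numerical check. This is a minimal sketch, not the thesis implementation: the function names `dft` and `radix4_dif_stage` and the test signal are assumptions introduced here. For each residue p = 0, 1, 2, 3 it combines four input samples spaced N/4 apart with the coefficients $(-j)^{pm}$, scales the result by $W_N^{pn}$, and takes an N/4-point DFT, which should reproduce X(4k+p).

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def radix4_dif_stage(x):
    """One radix-4 DIF step; result[p][k] should equal X(4k + p)."""
    N = len(x)
    q = N // 4
    result = []
    for p in range(4):
        seq = []
        for n in range(q):
            # 4-point DFT across samples spaced N/4 apart: coefficients (-j)^(p*m).
            acc = sum(x[n + m * q] * (-1j) ** (p * m) for m in range(4))
            # Twiddle factor W_N^(p*n) applied before the N/4-point DFT.
            seq.append(acc * cmath.exp(-2j * cmath.pi * p * n / N))
        result.append(dft(seq))
    return result

# Arbitrary 16-point test signal (hypothetical data).
x = [complex(n * n % 7, n % 3) for n in range(16)]
X = dft(x)
parts = radix4_dif_stage(x)
max_err = max(abs(parts[p][k] - X[4 * k + p]) for p in range(4) for k in range(4))
print("largest deviation from the direct DFT:", max_err)
```

Note that the 4-point combination itself needs no true multiplications, since its coefficients are only 1, -1, j and -j; the only general multiplications are the twiddle factors.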

5.3.4 Split Radix FFT

After studying the fixed-radix (radix-2 and radix-4) algorithms, it is interesting to note that for radix-2 the even-numbered points of the DFT [91, 92] and [20] can be computed independently of the odd-numbered points. This suggests the possibility of using different computational methods for independent parts of the algorithm, with the objective of reducing the number of computations. The split-radix FFT (SRFFT) algorithms exploit this idea by using different fixed-radix decompositions in the same FFT algorithm. The split-radix approach was first proposed by Duhamel and Hollmann in 1984 [52]. Such an FFT algorithm can be developed by mixing two or more fixed-radix decomposition methods, giving split-radix 2/4, split-radix 2/8, split-radix 2/4/8 and so on. Only split-radix 2/4 is considered here. When fixed radices are mixed, radix-2 is the basic component because it can compute any DFT whose length is a power of 2.

We illustrate this approach with a DIF SRFFT algorithm. First, we recall that in the radix-2 DIF FFT algorithm, the even-numbered samples of the N-point DFT are given as

$$X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x(n + N/2) \right] W_{N/2}^{nk}, \qquad k = 0, 1, \ldots, N/2 - 1 \tag{5.7}$$

A radix-2 decomposition suffices for this computation. The odd-numbered samples {X(2k+1)} of the DFT require the pre-multiplication of the input sequence by the twiddle factors $W_N^n$. For these samples a radix-4 decomposition produces some computational efficiency because the four-point DFT has the largest multiplication-free butterfly. Indeed, it can be shown that using a radix greater than 4 does not result in a significant reduction in computational complexity.

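If we use a radix-4 decimation-in-frequency FFT algorithm for the odd-numbered samples of the N-point DFT, we obtain the following N/4-point DFTs:

$$\begin{aligned}
X(4k+1) &= \sum_{n=0}^{N/4-1} \left\{ \left[ x(n) - x(n + N/2) \right] - j \left[ x(n + N/4) - x(n + 3N/4) \right] \right\} W_N^{n} W_{N/4}^{kn} \\
X(4k+3) &= \sum_{n=0}^{N/4-1} \left\{ \left[ x(n) - x(n + N/2) \right] + j \left[ x(n + N/4) - x(n + 3N/4) \right] \right\} W_N^{3n} W_{N/4}^{kn}
\end{aligned} \tag{5.8}$$

Thus the N-point DFT is decomposed into one N/2-point DFT [93] without additional twiddle factors and two N/4-point DFTs with twiddle factors. The N-point DFT is obtained by successive use of these decompositions up to the last stage. Thus we obtain a DIF split-radix 2/4 algorithm [6]. The signal flow graph of the basic butterfly cell of the split-radix 2/4 DIF FFT algorithm is shown in Figure 5.5.

Figure 5.5 Signal flow graph of basic butterfly cell of split-radix 2/4 DIF FFT

We have,

(5.9)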
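The split-radix 2/4 decomposition described above, one N/2-point DFT for the even-numbered samples and two N/4-point DFTs for the odd-numbered samples, can be sketched as a single stage and checked against a direct DFT. The names `dft` and `split_radix_stage` and the test data are hypothetical, chosen only for this illustration; the arithmetic follows Equations (5.7) and (5.8).

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def split_radix_stage(x):
    """One DIF split-radix 2/4 step: returns DFTs giving X(2k), X(4k+1), X(4k+3)."""
    N = len(x)
    h, q = N // 2, N // 4

    def w(e):
        # Twiddle factor W_N^e.
        return cmath.exp(-2j * cmath.pi * e / N)

    # Even-numbered samples: radix-2 part, no extra twiddle factors.
    even_in = [x[n] + x[n + h] for n in range(h)]
    # Odd-numbered samples: radix-4 part built from two differences.
    a = [x[n] - x[n + h] for n in range(q)]
    b = [x[n + q] - x[n + 3 * q] for n in range(q)]
    odd1_in = [(a[n] - 1j * b[n]) * w(n) for n in range(q)]
    odd3_in = [(a[n] + 1j * b[n]) * w(3 * n) for n in range(q)]
    return dft(even_in), dft(odd1_in), dft(odd3_in)

# Arbitrary 16-point test signal (hypothetical data).
x = [complex(3 - n, n % 4) for n in range(16)]
X = dft(x)
even, odd1, odd3 = split_radix_stage(x)
err_even = max(abs(even[k] - X[2 * k]) for k in range(8))
err_odd = max(max(abs(odd1[k] - X[4 * k + 1]), abs(odd3[k] - X[4 * k + 3]))
              for k in range(4))
print("even error:", err_even, "odd error:", err_odd)
```

The even branch has length N/2 while the two odd branches have length N/4, so the three outputs of one butterfly cell finish at different depths of the recursion.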

As a result, the even and odd frequency samples of each basic processing block are not produced in the same stage of the complete signal flow graph. This property makes the signal flow graph irregular, because it has an L-shaped topology. Unlike the radix-2 or radix-4 FFT algorithms, the number of butterflies is not regular from stage to stage, and the arrangement of the coefficients is also very irregular, so the split-radix algorithm requires more implementation effort than the other FFT algorithms.

5.3.5 Mixed Radix 4-2 FFT

The mixed-radix 4/2 butterfly unit is shown in Figure 5.6. It uses both the radix-2^2 and the radix-2 algorithms and can process FFTs whose length is not a power of four. The mixed-radix 4/2 butterfly [2], [3], [4] calculates four butterfly outputs based on X(0) to X(3). The proposed butterfly unit has three complex multipliers and eight complex adders. Four multiplexers, represented by the solid box, are used to select either the radix-4 calculation or the radix-2 calculation.

Figure 5.6 The basic butterfly for mixed-radix 4/2 DIF FFT algorithm

In order to verify the proposed scheme, a 64-point FFT based on the proposed Mixed-Radix 4-2 butterfly with simple bit reversing for ordering the output sequences is taken as an example. As shown in Figure 5.7, the block diagram for the 64-point FFT is composed of a total of sixteen Mixed-Radix 4-2 butterflies. In the first stage, the 64-point input sequence is divided into 8 groups, which correspond to $n_3 = 0, 1, 2, 3, 4, 5, 6, 7$, respectively. Each group is the input sequence for one Mixed-Radix 4-2 butterfly. After the input sequences pass the first Mixed-Radix 4-2 butterfly stage, the order of the output values is indicated by the small number below each butterfly output line in Figure 5.7.

The proposed Mixed-Radix 4-2 butterfly is composed of two radix-4 butterflies and four radix-2 butterflies [98], [99]. In the first stage, the input data of the two radix-4 butterflies, which are expressed in Equation (5.9), are grouped as $x(n_2)$, $x(N/2 + n_2)$, $x(N/4 + n_2)$, $x(3N/4 + n_2)$ and $x(N/8 + n_2)$, $x(5N/8 + n_2)$, $x(3N/8 + n_2)$, $x(7N/8 + n_2)$, respectively. After each input group passes the first radix-4 butterflies, the output data is multiplied by the special twiddle factors. These output sequences are then fed into the second stage, which is composed of the radix-2 butterflies. After passing the radix-2 butterflies, the output data is multiplied by the twiddle factors. This twiddle-factor multiplication, WQ(1+k), is the only multiplier unit in the proposed Mixed-Radix 4-2 butterfly [99] with simple bit reversing of the output sequences.

Finally, the order of the output sequences is shown in Figure 5.6. The order of the output sequence is 0, 4, 2, 6, 1, 5, 3 and 7, which is exactly the same as the simple binary bit reversing of the pure radix butterfly structure. Consequently, the proposed Mixed-Radix 4-2 butterfly with simple bit-reversed output sequence includes two radix-4 butterflies, four radix-2 butterflies, one multiplier unit and an additional shift unit for the special twiddle factors [98], [99], [100].
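The bit-reversed ordering referred to here, and in the index mapping of Section 5.3.1, can be reproduced with a few lines of code. The function name `bit_reverse` is a hypothetical helper introduced for this sketch; it simply reverses the binary digits of an index for a given word length.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` binary digits of `index`."""
    result = 0
    for _ in range(bits):
        # Shift the result left and append the current least significant bit.
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

# 3-bit reversal reproduces the output order quoted for the 8-output butterfly.
order8 = [bit_reverse(i, 3) for i in range(8)]
print(order8)  # [0, 4, 2, 6, 1, 5, 3, 7]

# 4-bit example from the radix-2 discussion: 1011 maps to 1101.
mapped = format(bit_reverse(0b1011, 4), "04b")
print(mapped)  # 1101
```

Because reversing the bits twice returns the original index, the same function converts from bit-reversed order back to natural order.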
The Mixed-Radix 4-2 butterfly structure with simple bit reversing for ordering the output sequences, derived by index decomposition techniques, has thus been given. The Mixed-Radix 4-2 butterfly structure uses the same number of multipliers as the Radix-2^3 and the Split-Radix 2/4/8 algorithms. However, the Split-Radix 2/4/8 butterfly [88] does not have a regular shape, so its realization is very complicated.

Figure 5.7 The Mixed-Radix 4-2 butterfly structure

5.3.6 R2MDC FFT Algorithm

This section investigates a new architecture for the pipelined radix-2 FFT used in MIMO-OFDM. The radix-2 multi-path delay commutator (R2MDC) is one of the commutated architectures of the radix-2 FFT algorithm; it commutates (reorders) the FFT inputs so that the values can be processed as fast as possible. It is one of the most straightforward approaches to a pipelined implementation of the radix-2 FFT algorithm [94], and it is the simplest way to rearrange data for the FFT/IFFT algorithm. The input data sequence is broken into two parallel data streams flowing forward, with the correct distance between the data elements entering each butterfly scheduled by proper delays. At each stage of this architecture, half of the data flow is delayed via the memory (Reg) and processed with the second half of the data stream. The delay for the three stages is 4, 2 and 1, respectively. In this R2MDC architecture, both the butterflies (BF) and the multipliers are idle half the time, waiting for new inputs. The 8-point FFT/IFFT processor has one multiplier, three radix-2 butterflies, 10 registers (R) (delay elements) and 2 switches (S).

Figure 5.8 R2MDC architecture

The A input comes from the previous component, the twiddle factor multiplier (TFM). The B output is fed to the next component, normally BFII. In the first cycles, the multiplexers direct the input data to the feedback registers until they are filled (position 0). In the following cycles, the multiplexers select the output of the adders/subtractors (position 1), and the butterfly computes a 2-point DFT with the incoming data and the data stored in the feedback registers [94]. The detailed structure of BFI is shown in Figure 5.9 (a).
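The stage delays of 4, 2 and 1 quoted for the 8-point pipeline correspond to the distance between the two samples that meet in each butterfly of a radix-2 DIF FFT. The sketch below is a software model of that data pairing only; it is an assumption-laden illustration that does not model the registers, switches, timing or utilization of the R2MDC hardware, and the names `dif_fft_by_stage`, `bit_reverse` and `dft` are hypothetical.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def bit_reverse(index, bits):
    """Reverse the lowest `bits` binary digits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def dif_fft_by_stage(x):
    """In-place radix-2 DIF FFT; returns (bit-reversed output, pairing distances)."""
    N = len(x)
    data = list(x)
    spans = []
    span = N // 2
    while span >= 1:
        spans.append(span)  # distance between butterfly partners in this stage
        for start in range(0, N, 2 * span):
            for n in range(span):
                a, b = data[start + n], data[start + n + span]
                data[start + n] = a + b
                # Twiddle factor of the current sub-DFT of length 2*span.
                data[start + n + span] = (a - b) * cmath.exp(
                    -2j * cmath.pi * n / (2 * span))
        span //= 2
    return data, spans

# Arbitrary 8-point test signal (hypothetical data).
x = [complex(n, 8 - n) for n in range(8)]
out, spans = dif_fft_by_stage(x)
X = dft(x)
max_err = max(abs(out[bit_reverse(k, 3)] - X[k]) for k in range(8))
print(spans, max_err)
```

In hardware, the delay elements play the role of the index offset `span`: they hold back one stream so that samples 4, 2 and then 1 positions apart arrive at the butterfly in the same cycle.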

Figure 5.9 (a) BF I structure (b) BF II structure

The B input comes from the previous component, BFI. The Z output is fed to the next component, normally the TFM. In the first cycles, the multiplexers direct the input data to the feedback registers until they are filled (position 0). In the following cycles, the multiplexers select the output of the adders/subtractors (position 1), and the butterfly computes a 2-point DFT with the incoming data and the data stored in the feedback registers. The multiplication by -j involves real-imaginary swapping and sign inversion. The real-imaginary swapping is handled efficiently by the multiplexers, and the sign inversion is handled by switching the adding and subtracting operations by means of a multiplexer. When a multiplication by -j is needed, all multiplexers switch to position 1, the real and imaginary data are swapped, and the adding and subtracting operations are switched. The detailed structures of BFI and BFII are shown in Figure 5.9 (a) and (b). The adders and subtractors in BFI and BFII are fully pipelined and followed by divide-by-2 and rounding [94]. The approach used here is to replace the radix-2 algorithm in the IFFT architecture with the R2MDC architecture in order to obtain a lower area than the existing system.

5.3.7 Proposed Modified R2MDC FFT

The radix-2 butterfly processor consists of a complex adder and a complex subtractor. In addition, a complex multiplier for the twiddle factors $W_N$ is implemented. The complex multiplication with the twiddle factor requires four real multiplications and two add/subtract operations.
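The operation counts mentioned above can be made concrete with a small sketch. It assumes complex values are held as separate real and imaginary parts, as they would be in hardware; the function names `complex_multiply` and `multiply_by_minus_j` are hypothetical and serve only to illustrate the arithmetic.

```python
def complex_multiply(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi) using four real multiplications
    and two add/subtract operations."""
    real = ar * br - ai * bi   # two multiplications, one subtraction
    imag = ar * bi + ai * br   # two multiplications, one addition
    return real, imag

def multiply_by_minus_j(re, im):
    """(re + j*im) * (-j) = im - j*re: swap the parts and invert one sign,
    so no multiplier is needed."""
    return im, -re

product = complex_multiply(1, 2, 3, 4)
print(product)  # (-5, 10)

rotated = multiply_by_minus_j(3, 4)
print(rotated)  # (4, -3)
```

The second function is the software counterpart of the multiplexer-based swap and add/subtract switching described for BFII.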

The A input comes from the previous component, the twiddle factor multiplier (TFM). The B output is fed to the next component, normally BFII. In the first cycles, the multiplexers direct the input data to the feedback registers until they are filled (position 0). In the following cycles, the multiplexers select the output of the adders/subtractors (position 1), and the butterfly computes a 2-point DFT with the incoming data and the data stored in the feedback registers. The architecture of BFI and BFII supporting two receive chains is shown in Figure 5.10 (a) and Figure 5.10 (b). In the BFI structure, the sample-routing MUXs and DEMUXs at the input and output of the BF_RAMs are controlled by the c2 and c3 control signals, while the computation unit is controlled by the c1 control signal. The control signals are issued by the BFI controller. Depending on the programmed number of receive chains, the extra BF_RAMs are enabled. WiMAX supports 1Rx and 2Rx; LTE supports 1Rx, 2Rx and 4Rx. Extra buffers can be added to the existing BF structure as required.

Figure 5.10 (a) BF I structure (b) BF II structure

Since the multiplications by -1, +j and -j are handled inside the BFII structure, two control signals, c1 and c2, are used in the basic computation unit. The MUXs and DEMUXs are controlled by the c3 and c4 control signals. The product with the -j term is implemented by swapping the real and imaginary parts, taking the sign of the sample into account. The approach used here is to commutate the radix-2 algorithm in the IFFT architecture [94].

In order to optimize the processor, the proposed shift-and-add method eliminates the non-trivial complex multiplications with the twiddle factors ($W_8^1$, $W_8^3$) and implements the processor without a complex multiplier. The proposed butterfly processor performs the multiplication by the trivial factor $W_8^2 = -j$ by switching the real part to the imaginary part and the imaginary part to the real part, and the multiplication by the factor $W_8^0$ by a simple wire. For the non-trivial factors $W_8^1 = e^{-j\pi/4}$ and $W_8^3 = e^{-j3\pi/4}$, the processor realizes the multiplication by the factor $1/\sqrt{2}$ using a hardwired shift-and-add operation, as shown in Figure 5.11.

Figure 5.11 MOD-R2MDC butterfly FFT with no complex multiplication

5.4 Summary

This chapter has described the different FFT design methodologies in detail: radix-2, radix-4, radix-8, split-radix, mixed radix 4-2, R2MDC and modified R2MDC FFT.