Feedforward FFT Hardware Architectures Based on Rotator Allocation

Size: px
Start display at page:

Download "Feedforward FFT Hardware Architectures Based on Rotator Allocation"

Transcription

1 Feedforward FFT Hardware Architectures Based on Rotator Allocation Mario Garrido Gálvez, Shen-Jui Huang and Sau-Gee Chen The self-archived postprint version of this journal article is available at Linköping University Institutional Repository (DiVA): N.B.: When citing this work, cite the original publication. Garrido Gálvez, M., Huang, S., Chen, S., (2018), Feedforward FFT Hardware Architectures Based on Rotator Allocation, IEEE Transactions on Circuits and Systems Part 1, 65(2), Original publication available at: Copyright: Institute of Electrical and Electronics Engineers (IEEE) IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

2 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: REGULAR PAPERS 1 Feedforward FFT Hardware Architectures based on Rotator Allocation Mario Garrido, Member, IEEE, Shen-Jui Huang and Sau-Gee Chen Abstract In this paper we present new feedforward FFT hardware architectures based on rotator allocation. The rotator allocation approach consists in distributing the rotations of the FFT in such a way that the number of edges in the FFT that need rotators and the complexity of the rotators are reduced. Radix-2 and radix-2 k feedforward architectures based on rotator allocation are presented in this paper. Experimental results show that the proposed architectures reduce the hardware cost significantly with respect to previous FFT architectures. Index Terms Fast Fourier Transform (FFT), Multi-path delay commutator (MDC), Pipelined architecture, Radix-2, Radix-2 k. I. INTRODUCTION THE fast Fourier transform (FFT) is one of the most important algorithms in the field of digital signal processing. It is used to calculate the discrete Fourier transform (DFT) efficiently. In order to meet the high performance and realtime requirements of modern applications, hardware designers have always tried to implement efficient architectures for the computation of the FFT. In this context, pipelined hardware architectures [1] [27] are widely used, because they provide high throughput and low latency suitable for real time, as well as a reasonably low area and power consumption. There are three main types of pipelined FFT architectures: feedback (FB), feedforward (FF) and serial commutator (SC). First, feedback architectures [1] [14] are characterized by their feedback loops, i.e., some outputs of the butterflies are fed back to the memories at the same stage. Feedback architectures are divided into single-path delay feedback (SDF) [1] [5], which process a continuous flow of one sample per clock cycle, and multi-path delay feedback (MDF) or parallel feedback [6] [14], which process several samples in parallel. Second, feedforward architectures [3], [4], [15] [21], which mainly consist of multi-path delay commutator (MDC) FFTs [3], do not have feedback loops and each stage passes the processed data to the next stage. MDC architectures process several samples in parallel. A comparison of parallel FFT architectures is provided in [16]. Finally, SC FFT architectures [22] are characterized by the use of circuits for bitdimension permutation of serial data. M. Garrido is with the Department of Electrical Engineering, Linköping University, SE Linköping, Sweden, mario.garrido.galvez@liu.se S.-J. Huang is with Novatek Corp., Hsinchu 300, Taiwan, shenray.ee95g@g2.nctu.edu.tw S.-G. Chen is with the Department of Electronics Engineering, National Chiao Tung University, Hsinchu 300, Taiwan, sgchen@cc.nctu.edu.tw This work was supported by the Swedish ELLIIT Program. Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an to pubs-permissions@ieee.org If the purpose is to improve parallel FFT architectures, MDC FFTs already achieve the minimum number of butterflies with 100% usage and the minimum amount of total memory [16]. However, there is still room for improvement regarding rotators: Different radices lead to different amount of rotators, as shown in [16]. For instance, radix-2 3 and radix- 2 4 architectures have less general rotators than radix-2 and radix-2 2. However, instead of general rotators, they have a larger number of W 8 and W 16 rotators. Therefore, there is a trade-off among rotators. This paper presents a new idea called rotator allocation. The purpose is to reduce the number and the complexity of the rotators even further. The number of rotators is reduced by only having rotations by 0 in some of the FFT edges. The complexity of the rotators is reduced by distributing the rotations of the FFT so that similar rotations are calculated by the same rotator. For instance, a rotation by 45 and a rotation by 135 can be easily calculated by the same rotator as it only consists of a rotation by 45 plus a trivial rotation by 0 or 90. New feedforward FFT architectures based on rotator allocation are proposed in this paper. These architectures use radices that lead to a small amount of hardware resources. The paper is organized as follows. Section II reviews the FFT algorithm. Section III summarizes the types of rotators used in FFT architectures and assigns a cost to them in terms of equivalent adders. Section IV provides general guidelines to design an FFT hardware architecture. Section V explains the design of FFT hardware architectures using rotator allocation. Section VI presents the proposed architectures both for radix-2 and radix-2 k. Section VII compares the proposed architectures to the state-of-the-art in terms of the number of multipliers and equivalent adders. Section VIII provides experimental results and compares them to to previous works in the literature. Finally, Section IX summarizes the main conclusions of the paper. II. THE FFT ALGORITHM The N-point DFT of an input sequence x[n] is defined as: X[k] = N 1 n=0 x [n] W nk N, k = 0, 1,..., N 1 (1) where W nk N = e j 2π N nk. To calculate the DFT, the FFT based on the Cooley-Tukey algorithmm [28] is mostly used. The Cooley-Tukey algorithm reduces the number of operations from O(N 2 ) for the DFT to O(N log 2 N) for the FFT.

3 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: REGULAR PAPERS 2 TABLE I SYMMETRIC ANGLE SETS FOR THE 16-POINT FFT. Set 0 Set 1 Set TABLE II SYMMETRIC ANGLE SETS FOR THE 32-POINT FFT. Set 0 Set 1 Set 2 Set 3 Set Fig. 1. Flow graph of a 16-point DIF FFT. in the explanations throughout the paper. Different radices only differ in the rotations at the different FFT stages [30], whereas the butterflies are the same. Thus, different algorithms lead to different distributions of rotations, which may influence the design of hardware architectures, as shown throughout this paper. (a) Fig. 2. Rotation angles used in the FFT for different values of φ. (a) 16-point FFT. (b) 32-point FFT. Figure 1 shows the flow graph of a 16-point radix-2 FFT according to the Cooley-Tukey algorithm, decomposed using decimation in frequency (DIF) [29]. The FFT consists of n = log 2 N stages. At each stage of the graphs, s {1,..., n}, butterflies and rotations are calculated. The lower edges of the butterflies are always multiplied by 1. These 1 are not depicted in order to simplify the graphs. The numbers at the input represent the index of the input sequence, whereas those at the output are the frequencies, k, of the output signal X[k]. Finally, each number, φ, in between the stages indicates a rotation by: (b) W φ N = e j 2π N φ (2) As a consequence, samples for which φ = 0 do not need to be rotated. Likewise, if φ [0, N/4, N/2, 3N/4] the samples must be rotated by 0, 270, 180 or 90, which correspond to complex multiplications by 1, j, 1 and j, respectively. These rotations are considered trivial, because they can be carried out by interchanging the real and imaginary components and/or changing the sign of the data. An index I b n 1... b 0 is also added to the graph, where b n 1... b 0 is the binary representation of I. This index is used III. HARDWARE ROTATORS FOR THE FFT This section defines some terms used later in the paper. A general rotator is a rotator that can carry out a rotation by any angle, which is provided as an input. A single constant rotator (SCR) is a rotator that carries out a rotation by a single constant angle [31]. A multiple constant rotator (MCR) is a rotator that rotates by an angle selected among several constant angles. A symmetric angle set (SAS) is a set of angles nπ/2 ± α, where n = 0,..., 3 and α [0, π/4]. Any rotation in a symmetric angle set can be calculated as a rotation by α [0, π/4], a trivial rotation and a exchange of the real and imaginary part of the rotation coefficient. Figure 2 shows the angles used in a 16-point FFT (φ = 0,..., 15) and in a 32- point FFT (φ = 0,..., 31) which correspond to dividing the circumference into 16 and 32 equal parts, respectively. These angles form several symmetric angle sets, which are shown in Tables I and II. An M-rotator or M-rot is a rotator that can rotate any number of angles in M different symmetric angle sets. For instance, a rotator that rotates by 0, 45 and 135 is a 2- rot, as it rotates angles in the symmetric angle sets nπ/2 and nπ/2 ± π/4. Likewise, the FFT twiddle factor W 8 is a 2-rot, the twiddle factor W 16 is a 3-rot (note the 3 SAS in Table I) and the twiddle factor W 32 is a 5-rot (note the 5 SAS in Table II). The symbols used in this paper for the different types of rotators are shown in Fig. 3. Table III shows an estimation of the hardware cost of different types of rotators in terms of adders. The table is divided into rotators for arbitrary scaling and unity scaling. Arbitrary scaling means that any scaling is accepted and unity scaling

4 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: REGULAR PAPERS 3 TABLE III ESTIMATION OF EQUIVALENT NUMBER OF ADDERS IN FFT ROTATORS. Rotator Hardware Cost Type Approach Add. Rot. Multiplexers Add. Mux. Scaling Compensation Total Equivalent Adders ARBITRARY SCALING 1-rot CCSSI [31]: SCR arbitrary scaling rot CCSSI [31]: MCR uniform scaling rot CCSSI [31]: MCR uniform scaling General CORDIC II [32] General Memoryless CORDIC [33] General Complex Multiplier: Booth encoding UNITY SCALING 1-rot CCSSI [31]: SCR with unity scaling rot CCSSI [31]: MCR with unity scaling rot CCSSI [31]: MCR with unity scaling General CORDIC II [32] General Memoryless CORDIC [33] General Complex Multiplier: Booth encoding Fig. 3. Symbols used in the paper for the different types of rotators. means that the scaling must be a power of two [31]. M-rots are obtained using the CCSSI approach [31], whereas general rotators are obtained either using the CORDIC algorithm [32], [33] or complex multipliers. The table shows the adders (Add. Rot.) and multiplexers involved in the rotation. Considering that a multiplexer is equivalent to 1/4 adders [34], the column Add. Mux. shows the equivalent number of adders of the multiplexers involved in the rotation. The next column is the number of adders needed for scaling compensation. This is not needed in case of arbitrary scaling. The scaling compensation factor in the CORDIC II is K = 1/( ) = = Thus, it needs 6 adders, 3 for the real part and three for the imaginary part. The scaling compensation in the memoryless CORDIC needs 4 adders, 2 for the real part and 2 for the imaginary part [33]. The last column of the table shows the total number of equivalent adders, which is the result of adding the adders involved in the rotation, the equivalent adders of the multiplexers and the adders of the scaling compensation. We use the results in Table III to compute the number of equivalent adders of the proposed FFT architectures. IV. DESIGNING AN FFT HARDWARE ARCHITECTURE In order to design an FFT hardware architecture, we have to be aware of the FFT properties introduced in [16]. The first property, which is general for any FFT architecture and any N, is that at any FFT stage, butterflies operate on data whose index I differ in b n s, where n is the number of FFT stages and s is the specific stage that we are considering. This fact can be observed in the flow graph of Fig. 1. In this flow graph, the index has n = 4 bits, i.e., I b 3 b 2 b 1 b 0. At the first stage, the butterflies operate on samples whose indexes differ in b n s = b 4 1 = b 3. This happens for samples with indexes 0 and 8, 1 and 9, etc. For the second stage, the different bit is b n s = b 4 2 = b 2. Note for instance, that the data with indexes 0 and 4 are operated together in the butterfly at stage 2. For the third and fourth stages, the correponding bits are b 1 and b 0, respectively. Therefore, if we want to design an FFT hardware architecture, we have to assure that at each stage s, the indexes of the inputs to any butterfly at any time instant differ in b n s. Note that the term butterfly refers now to a hardware component of the architecture, not to the mathematical operation of the algorithm in the flow graph. If we consider the example of Fig. 3 in [16], we observe that this property is fulfilled at all the stages. In this figure b n s is at the lowest parallel dimension, which corresponds to the pair of samples that flow into the butterflies at the same clock cycle. The property of b n s is the only requirement set by the butterflies in FFT hardware architecture. As long as this property is met, we can have any data order at the different FFT stages. This allows for exploring a variety of data orders at the FFT stages. This is what is done in the current paper, as explained later in Section V. The second FFT property refers to the rotations at the FFT stages. At each stage, any sample with index I must be rotated according to equation (2) by a value φ that depends on the index I and on the stage, s. We can represent it as φ s (I). The work [30] explains how to calculate φ s (I), which only depends on the FFT algorithm that is used. By combining these ideas, we lead to the following conclusions: On the one hand, any order at any stage of any FFT hardware architecture is possible as long as the property of b n s is met at the input of the butterflies. On the other hand, to any index I at any FFT stage corresponds a specific rotation φ s (I). As a result, we can play with the data order at different FFT stages to look for patterns that allow for a more optimized distribution of rotations. This is the idea behind the proposed rotator allocation approach. V. FFT DESIGN USING ROTATOR ALLOCATION The purpose of the FFT design using rotator allocation is to distribute the rotations of the FFT in such a way that the number and complexity of the required rotators is reduced. Fig. 4(a) shows an example of a layout for the first three stages of a 16-point 4-parallel FFT. The indexes in the figure

5 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: REGULAR PAPERS 4 (a) (b) (c) (d) Fig. 4. Rotator allocation of a 16-point 4-parallel FFT. (a) Layout not aware of the rotator allocation. (b) FFT architecture for the layout in Fig. 4(a). (c) Layout aware of the rotator allocation. (d) FFT architecture for the layout in Fig. 4(c). show how data flows at the different stages. Each element in the matrices of indexes is the index value according to Fig. 1. Data flows from left to right. Thus, values in the same column are data that flow in parallel and values in the same row flow through the same path in consecutive clock cycles. The matrices of rotations show the value φ s (I) that corresponds to each of the indexes according to the flow graph in Fig. 1. For instance, the rotation corresponding to the index 10 at stage 1 is φ s (I) = φ 1 (10) = 2 according to Fig. 1. This case is highlighted in Fig. 4(a). As for the indexes, each column in the matrices of rotations are rotations that are calculated in parallel at the same clock cycle, whereas rotations in the same row are calculated in consecutive clock cycles by the same rotator. According to this, each row of the matrices of rotations are the rotations that must be calculated by a single rotator in consecutive clock cycles. For instance, stage 1 needs a rotator by φ = {0, 1, 2, 3} and a rotator by φ = {4, 5, 6, 7}, which are 3-rots according to Table I. Stage 2 includes 2 2-rots and stage 3 includes 2 trivial rotators (T). The layout in Fig. 4(a) translates into the FFT architecture in Fig. 4(b), which shows the rotators at each stage, as well as the content of the rotation memories. By applying rotator allocation we aim to reduce the number of rotators and their complexity. Rotator allocation simply consists of reorganizing the matrices of indexes and, therefore, the matrices of rotations, in such a way that the matrices of rotations have less rotators (if possible) and the complexity of the rotators is smaller. On the one hand, fewer rotators are achieved when there are more rows in the matrices of rotations whose elements are φ = 0. On the other hand, the complexity of the rotators is reduced when the rotations in the same row are in less SAS. The procedure of rotator allocation consists in distributing the bits b n 1... b 0 of the index I into serial and parallel dimensions. Serial dimensions correspond to data arriving at the same input terminal in series and parallel dimensions refers to data arriving at parallel terminals. Depending on the FFT size N = 2 n and the number of parallel data in the FFT P = 2 p, the number of serial bits is n p = log 2 (N) log 2 (P ) and the number of parallel ones is p = log 2 (P ). For the example in Fig. 4, N = 16 and P = 4, so there are n p = log 2 (16) log 2 (4) = 2 serial dimensions and p = log 2 (P ) = 2 parallel ones. The alternatives to allocate the bits correspond to all the possible permutations of the bits, i.e., Serial Parallel b 3 b 2 b 1 b 0 b 2 b 3 b 1 b 0 b 3 b 2 b 0 b 1 b 2 b 3 b 0 b 1 b 3 b 1 b 2 b 0 b 1 b 3 b 2 b 0 b 3 b 1 b 0 b 2 b 1 b 3 b 0 b 2 b 3 b 0 b 2 b 1 b 0 b 3 b 2 b 1 b 3 b 0 b 1 b 2 b 0 b 3 b 1 b 2 b 0 b 1 b 2 b 3 b 1 b 0 b 2 b 3 b 0 b 1 b 3 b 2 b 1 b 0 b 3 b 2 b 0 b 2 b 1 b 3 b 2 b 0 b 1 b 3 b 0 b 2 b 3 b 1 b 2 b 0 b 3 b 1 b 1 b 2 b 0 b 3 b 2 b 1 b 0 b 3 b 1 b 2 b 3 b 0 b 2 b 1 b 3 b 0 However, neither the order of the serial bits nor the order of the parallel bits affect the complexity of the rotators: A different order of the serial bits changes the order of the rotations, but the same rotations are calculated by the rotators. A different order of the parallel bits changes the edges in which the rotators are placed, but the rotators are the same. Therefore, the alternatives in each row of equation (3) have (3)

6 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: REGULAR PAPERS 5 the same complexity, so only one alternative per row needs to be evaluated. According to this, the number of alternatives that need to be evaluated at each FFT stage is ( n p ) = n! p!(n p)! Note that we can choose any order in equation (3) and design the data management of the FFT to achieve the desired order. However, it is in general a good idea to place the rotators close to the butterflies of the same state. This requires that the data order at the rotator respects the property of b n s explained in previous section. Alternatively, if the property of b n s is not met at the rotator, then additional permutation circuits are needed between the butterfly and the rotator. The architectures in this paper correspond to the case of placing the rotator just after the butterfly. Figure. 4(c) shows a layout of indexes and rotations that is aware of the rotator allocation, and the corresponding FFT architecture is shown in Fig. 4(d). The bits of the serial and parallel dimensions at each stage are shown in Fig. 4(c). In this layout, stage 1 requires a 2-rot and a 1-rot, stage 2 a trivial rotator and a 1-rot, and stage 3 a trivial rotator. The trivial rotator in stage 3 can be hard wired due to the fact that all the rotations have the same value. Comparing this example with the example in Figs. 4(a) and 4(b), it can be observed that the number of rotators and their complexity have been reduced. Instead of 2 3-rots, 2 2-rots and 2 trivial rotators, the design aware of rotator allocation only needs a 2-rot, 2 1-rot and 1 trivial rotator. As a conclusion, the rotator allocation can simplify significantly the complexity of the rotators at the different stages of the FFT, while achieving the same complexity in terms of butterflies and shuffling circuits. VI. PROPOSED FEEDFORWARD FFT ARCHITECTURES BASED ON ROTATOR ALLOCATION A. Radix-2 feedforward FFTs based on rotator allocation Fig. 5 shows the proposed 32-point 4-parallel radix-2 DIF FFT architecture based on rotator allocation. The architecture consists of five stages that include radix-2 butterflies (R2), rotators and circuits for data management. In total, the architecture requires one 3-rot, two 2-rot, two 1-rot and a trivial rotator. As for the 16-point FFT architecture in Fig. 4(d), the trivial rotator in the last stage can be hard wired. Note also that the 32-point architecture is equal to the 16-point architecture plus an extra first stage. The twiddle factor values (φ) for the different stages of the architecture are shown in Table IV. Fig. 6 shows the proposed 32-point 8-parallel radix-2 DIF FFT architecture based on rotator allocation. This architecture can be derived from the 4-parallel radix-2 FFT in Fig. 5 in the following way: Samples that arrive in even clock cycles in the 4-parallel architecture are processed by the upper part of the 8-parallel architecture, whereas samples that arrive in odd clock cycles in the 4-parallel architecture are processed by the lower part of the 8-parallel architecture. In the 8-parallel architecture, the rotators are simplified compared to the 4- parallel architecture. Therefore, the FFT hardware does not (4) TABLE IV TWIDDLE FACTORS VALUES (φ) FOR THE 32-POINT 4-PARALLEL RADIX-2 MDC DIF FFT ARCHITECTURE IN FIG. 5 Stage 1 Stage 2 Stage 3 Stage rot rot rot rot T rot duplicate when duplicating the number of samples in parallel. Only the number of butterflies is duplicated. The memory is reduced and the rotators are almost duplicated in number, but simplified in complexity. Figure 7 shows the proposed 4-parallel radix-2 FFTs for decimation in time (DIT) [29]. It can be noted that this architecture is symmetric with respect to the DIF one: Whereas for the DIF case the length of the buffers and the complexity of the rotators decrease with the number of the stage, for the DIT case it is the opposite. As a result, the amount of hardware resources for the DIF and DIT versions is the same. The 8- parallel radix-2 DIT FFT is obtained in a similar way. For larger N, any N-point P -parallel radix-2 DIF FFT architectures based on rotator allocation is obtained by adding one extra stage at the beginning of the N/2-point P -parallel radix-2 DIF architecture, as can be observed when comparing the 32-point 4-parallel radix-2 FFT in Fig. 5 and the 16-point 4-parallel radix-2 FFT in Fig 4(d). Analogously, for the DIT case any N-point P -parallel DIT FFT is obtained by adding one extra stage at the end of the N/2-point P -parallel radix-2 DIT architecture. In both DIF and DIT cases, further stages added to the 32-point FFT require general rotators. Thus, the 64-point radix-2 FFT requires the same rotators as the 32- point radix-2 FFT plus two general rotators. For any radix-2 4-parallel (DIF or DIT) FFT based on rotator allocation, the type and number of rotators are shown in Table V. For the 8-parallel case, the type and number of rotators are shown in Table VI. Both in the 4-parallel and 8-parallel architectures, the number of general rotators increases with N. It would be desired that the additional rotators are simpler than a general rotator. This motivates the use of radix-2 k instead of radix-2, as shown in the next section. B. Radix-2 k feedforward FFTs based on rotator allocation Radix-2 k feedforward FFTs are obtained by combining the radix-2 FFTs in previous section. For instance, the architecture in Fig. 8 is a 256-point 4-parallel radix-2 4 DIF FFT. It consists of two 16-point FFTs connected by the rotators and the shuffling circuits in stage 4. The FFT algorithm used in this architecture is represented by the binary tree in Fig. 9(i). The 16-point FFTs are the same as that in Fig. 4(d), with the only

7 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: REGULAR PAPERS 6 Fig. 5. Proposed 32-point 4-parallel radix-2 MDC DIF FFT architecture. Fig. 6. Proposed 32-point 8-parallel radix-2 MDC DIF FFT architecture. difference in the length of the buffers of the first 16-point FFT. The connection stage has 4 general rotators. This is the same for all radix-2 k FFTs presented next. As a result, the radix-2 k feedforward FFTs have P/2 log 2 N butterflies, a total memory size N P, and the rotators of the radix-2 sub-ffts plus P general rotators per connection stage. Table VII presents the proposed radix-2 k feedforward FFTs based on rotator allocation for 4-parallel samples. The table is divided into Selected architectures and Other alternatives with higher hardware cost. The selected architectures are the ones that achieve smallest hardware cost. The first two columns of the table show the number of points, N, and the radix. The third column indicates the FFT algorithms used in these architectures, which are represented by the binary trees in Fig. 9. Note that the binary tree provides the values of the rotation angles at each FFT stage [35]. The fourth to the seventh columns show the number of rotators that the architecture requires. The general rotators are separated into two groups: The general rotators in the radix-2 sub-ffts and the general rotators used to interconnect sub-ffts. Finally, the last column indicates the total number of equivalent real adders of the entire architecture, including the adders of the butterflies and rotators, and the equivalent number of real adders of the multiplexers: Equiv. Adders = 2P log 2 N +AddersRot+2P/4 log 2 (N/P ), (5) where the adders for the rotators (AddersRot) are computed according to the estimations in Table III.

8 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: REGULAR PAPERS 7 Fig. 7. Proposed 32-point 4-parallel radix-2 MDC DIT FFT architecture. Fig point 4-parallel radix-2 4 MDC DIF FFT architecture. TABLE V ROTATORS IN THE PROPOSED 4-PARALLEL RADIX-2 MDC FFT ARCHITECTURES. FFT Algorithm Rotators N Radix General 3-rot 2-rot 1-rot TABLE VI ROTATORS IN THE PROPOSED 8-PARALLEL RADIX-2 MDC FFT ARCHITECTURES. FFT Algorithm Rotators N Radix General 3-rot 2-rot 1-rot In Table VII, radix-2 k starts to get better results than radix- 2 for N = 64, whereas radix-2 is preferable for sizes smaller than or equal to 32. Analogously, Table VIII shows the proposed radix-2 k feedforward FFTs based on rotator allocation for 8-parallel samples. In this case, radix-2 k starts to obtain better results than radix-2 for N = 128. VII. COMPARISON Table IX compares the proposed architectures to previous ones in terms of complex adders and rotators. The table includes the rotators used in previous architectures as a function of n = log 2 (N). For the proposed architectures there is no generalized expression. Instead, the number of rotators is provided by the Selected architectures in Tables VII and VIII. For a better comparison of the number of rotators, Table IX includes the examples of N = 1024 and N = The architectures are sorted out by the exponent k = 0,..., 4 of the radix-2 k. It can be observed that radix-2 architectures lead to the largest number of complex rotators, and this number decreases as we increase k until radix-2 4. For N = 1024 the proposed architectures reduce this number even further, as they require half the general rotators of previous radix-2 4 FFT architectures, both for 4-parallel and 8-parallel designs. Note also that when not using rotator allocation, a 1024-point FFT that uses radix-2 5 requires 8 general rotators (4 W 1024 and 4 W 32 ), 4 3-rot and 4 2-rot, and a 1024-point FFT that

9 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: REGULAR PAPERS 8 Fig. 9. Binary trees of the proposed feedforward FFTs. TABLE VII PROPOSED 4-PARALLEL RADIX-2 k MDC FFT ARCHITECTURES. FFT Algorithm Rotators Eqiv. N Radix Tree Gen. 3-rot 2-rot 1-rot Adders SELECTED ARCHITECTURES 4 2 (a) 0 / (b) 0 / (c) 0 / (d) 0 / (f) 0 / ,2 3 (h) 0 / (i) 0 / ,2 4 (k) 0 / (m) 0 / ,2 4,2 3 (o) 0 / (q) 0 / OTHER ALTERNATIVES WITH HIGHER HW COST 64 2 (e) 2 / (g) 4 / ,2 3 (j) 0 / (l) 0 / ,2 5 (n) 2 / (p) 4 / (r) 0 / TABLE VIII PROPOSED 8-PARALLEL RADIX-2 k MDC FFT ARCHITECTURES. FFT Algorithm Rotators Eqiv. N Radix Tree Gen. 3-rot 2-rot 1-rot Adders SELECTED ARCHITECTURES 8 2 (b) 0 / (c) 0 / (d) 0 / (e) 2 / ,2 3 (h) 0 / (i) 0 / ,2 4 (k) 0 / (m) 0 / ,2 5 (n) 2 / (q) 0 / OTHER ALTERNATIVES WITH HIGHER HW COST (f) 0 / (g) 6 / ,2 3 (j) 0 / (l) 0 / ,2 4,2 3 (o) 0 / (p) 4 / (r) 0 / uses radix-2 4 [16] requires 8 general rotators (4 W 1024 and 4 W 32 ) and 6 3-rot, as shown in Table IX. Thus, radix-2 4 and a radix-2 5 FFTs without rotator allocation are comparable in area. However, when using rotator allocation, these numbers are improved noticeably. For the case of N = 4096 in Table IX, the number of general rotators of the proposed architectures is the same as those in previous radix-2 4 FFTs. However, the number of nongeneral non-trivial rotators and their complexity is reduced significantly by using the proposed approach. The total area occupied by the rotators is compared in Figs. 10 and 11 for 4-parallel and 8-parallel FFTs, respectively. The area is measured in terms of equivalent adders, considering the adder cost of each type of rotator according to Table III. Figs. 10 and 11 show that the rotator area of the proposed architectures is smaller than that in previous approaches for all FFT sizes. For 4-parallel architectures, the rotator area reduction ranges from 17% to 36%. For 8-parallel architectures, it ranges from 23% for to 50%. This highlights the significant improvements of using the rotator allocation approach.

10 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: REGULAR PAPERS 9 TABLE IX COMPARISON OF PIPELINE PARALLEL FFT ARCHITECTURES IN TERMS OF COMPLEX ADDERS AND ROTATORS AREA PIPELINED Complex Rotators Rotators (N=1024) Rotators (N=4096) ARCHITECTURE Adders General 3-rot 2-rot Gen. 3-rot 2-rot 1-rot Gen. 3-rot 2-rot 1-rot 4-PARALLEL ARCHITECTURES R2 MDC, [19] 4n + 4 2n R4 MDC, [4] 4n 3 n/ R4 MDC, [15] 4n 3 n/ R2 2 MDC, [16] 4n 3 n/ R2 3 MDC, [16] 4n 4 n/ n/ R2 4 MDF, [6] 4n 4 (n 2)/4 1 4 (n 2)/ R2 4 MDF, [8] 8n 4 n/4 4 4 n/ R2 4 MDF, [12] 8n 4 n/4 4 4 n/ R2 4 MDC, [16] 4n 4 n/4 4 3 n/ Proposed, MDC 4n Table VII (Selected architectures) PARALLEL ARCHITECTURES R2 MDC, [36] 8n 4n R2 MDF, [37] 16n 4n R2 2 MDC, [16] 8n 6 n/ R8 MDC, [4] 8n 7 n/ n/ R2 3 MDC, [16] 8n 7 n/ n/ R2 4 MDF, [6] 8n 8 (n 3)/4 1 8 (n 3)/ R2 4 MDF, [7] 16n 8 n/4 8 8 n/ R2 4 MDC, [16] 8n 8 n/4 8 6 n/ Proposed, MDC 8n Table VIII (Selected architectures) : It requires 2 additional 2-rot for 4-parallel if mod(n 2, 4) = 3, and 4 2-rot for 8-parallel if mod(n 3, 4) = 3, to calculate the W 8 rotations. : If mod(n, 4) = 3, it requires 1 2-rot for 4-parallel and 2 2-rot for 8-parallel to calculate the W 8 rotations. : If mod(n, 4) = 3, it requires 2 2-rot for 4-parallel and 4 2-rot for 8-parallel to calculate the W 8 rotations. Rotator area (Equivalent adders) Proposed MDC 2 2 [16, 17] MDC 2 3 [16] MDC 2 4 [16] MDF 2 4 [6] MDF 2 4,2 3 [13] Rotator area (Equivalent adders) Proposed MDC 2 2 [16, 17] MDC 2 3 [16] MDC 2 4 [16] MDF 2 4 [6] MDC 2 5,2 2 [18] MDC 2 6,2 2 [20] MDC 2 3 [21] MDF 2 4,2 4,2 3 [7] N N Fig. 10. Rotator area of 4-parallel feedforward FFT architectures in terms of equivalent adders. Fig. 11. Rotator area of 8-parallel feedforward FFT architectures in terms of equivalent adders. For the area of the entire FFT architecture, Figs. 12 and 13 compare the proposed feedforward FFTs based on rotator allocation with previous feedforward FFT architectures for 4- parallel and 8-parallel, respectively. In all the architectures the total memory is the same, N P. Therefore, the comparison is done in terms of equivalent adders of rotators, butterflies and shuffling circuits. For the proposed architectures, the number of equivalent adders comes from Tables VII and VIII, and equation (5). For the previous architectures, we calculate the number of equivalent adders in the same way as for the proposed architectures, i.e., considering the equivalent adders in butterflies and shuffling circuits, and an optimized implementation of the rotators according to Table III. Figs. 12 and 13 show that the proposed architectures reduce

11 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: REGULAR PAPERS 10 FFT area (Equivalent adders) Proposed MDC 2 2 [16, 17] MDC 2 3 [16] MDC 2 4 [16] MDF 2 4 [6] MDF 2 4,2 3 [13] N Fig. 12. FFT area of 4-parallel feedforward FFT architectures in terms of equivalent adders (including butterflies, rotators and multiplexers and excluding memories). TABLE X COMPARISON OF 4-PARALLEL 1024-POINT FFTS ON FPGAS. FFT Ours14 Glittas16 Wang16 Spiral17 Proposed Parametes [16], [17] [19] [6] [26] - N P Radix Word lenth FPGA Virtex6 Virtex5 Virtex6 Virtex6 Virtex6 Slices BRAM DSP slices Clk (MHz) Th (MS/s) Slices (Hundreds) BRAM DSP 40 FFT area (Equivalent adders) 1200 Proposed MDC 2 2 [16, 17] MDC 2 3 [16] 1000 MDC 2 4 [16] MDF 2 4 [6] 800 MDC 2 5,2 2 [18] MDC 2 6,2 2 [20] MDC 2 3 [21] 600 MDF 2 4,2 4,2 3 [7] N Fig. 13. FFT area of 8-parallel feedforward FFT architectures in terms of equivalent adders (including butterflies, rotators and multiplexers and excluding memories). the number of equivalent adders with respect to previous approaches. For 4-parallel and N = 4096 points, the savings of using rotator allocation are around 15% with respect to radix-2 4 [16], and 40% with respect to radix-2 2 [16], [17]. For 8-parallel and N = 4096, the savings are around 22% with respect to radix-2 4 architectures [16], and 48% with respect to radix-2 2 [16], [17]. VIII. IMPLEMENTATION In order to verify the advantages of the proposed architectures, we have implemented the proposed 1024-point 4- parallel radix-2 5 DIF FFT based on rotator allocation, shown in Table VII. The input data word length is chosen to be 16 bits. Similar to the case of the 256-point 4-parallel radix-2 4 MDC DIF FFT in Fig. 8, the 1024-point 4-parallel radix-2 5 DIF FFT can be derived by cascading two 32-point MDC DIF Ours14 [16, 17] Glittas16 [19] Wang16 [6] Spiral17 [26] Proposed Fig. 14. Area comparison of 4-parallel 1024-point FFTs on FPGAs. FFTs as that in Fig. 5, changing the length of the buffers of one of them, and including rotators and shuffling circuits between both of them. A. FPGA results Table X compares 4-parallel 1024-point FFT architectures on FPGAs. The references to previous works include the first author and the year of the work. For instance, Wang16 means Wang, The table includes the area in terms of slices, BRAM and DSP slices, and the performance in terms of clock frequency (Clk) and throughput (Th). The proposed results are obtained for a Virtex-6 XC6VSX475T- 1-FF1156 FPGA. The comparison shows that the proposed architecture reduces by 64% or more the number of DSP slices with respect to previous approaches. This is due to the fact that the proposed architecture has only 4 general rotators, which are implemented in the DSP slices. The rest of rotators are implemented as distributed logic (slices). This, however, does not increase significantly the amount of distributed logic: The proposed architecture requires approximately the same distributed logic as [16], [17], which also require 12 BRAM. Figure 14 shows the area comparison as a bar diagram. It can be observed that our approach saves a significant amount of DSP slices used for rotators with respect to previous approaches, while keeping a small number of slices and BRAM.

12 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: REGULAR PAPERS 11 Throughput (MSamples/s) Ours16 (1P) [2] Proposed Ours14 (4P) [16,17] Spiral17 (2P) [26] Ours14 (8P) [16,17] Glittas16 (4P) [19] Wang16 (4P) [6] Spiral17 (4P) [26] Xilinx LogiCORE v8.0 (1P) [27] Spiral17 (8P) [26] FPGA utilization (%) of a Virtex-6 XC6VSX475T-1-FF1156 Fig. 15. Throughput vs. FPGA utilization of 1024-point FFT hardware architectures. TABLE XI PROPOSED 4-PARALLEL 1024-POINT FFTS ON ASICS. FFT Proposed Proposed Parametes (low power) (high performance) N P 4 4 Radix Word length Technology (nm) Voltage (V) Clk (MHz) Th (MS/s) Latency (clycles) Latency (µs) Area (mm 2 ) SQNR (db) Power (mw) MHz The difference between these two versions is that the high performance approach includes additional pipelining. The experimental results show that the low power version only requires 8.88 mw at 180 Mhz. The high performance version calculates the 1024-point FFT in 0.83 µs at 1.28 GS/s. In both versions, the SQNR [38], [39] is 40.3 db, which meets the SQNR needed in applications such as UWB [18], Wi-Fi [18] and gigabit WPAN [40]. Finally, the area of the proposed FFTs is small, around 0.2 mm 2. This meets the main purpose of the paper, i.e., achieving a small area for the FFT. IX. CONCLUSIONS In this paper, we have presented the rotation allocation approach. It consists in finding an efficient distribution of FFT rotations that reduces the number of rotators and their complexity. This leads to new radix-2 and radix-2 k MDC FFT architectures. These architectures require the same memory and butterflies as previous MDC FFTs, but fewer and/or simpler rotators. For 4-parallel FFTs, the area of rotators is reduced by a factor from 17% to 36% with respect to previous approaches. For 8-parallel FFTs, the rotator area reduction ranges from 23% to 50%. These savings are confirmed with the implementation of a 4-parallel 1024-point radix-2 5 MDC FFT based on rotation allocation. Experimental results on FPGAs show significant reduction in area with respect to previous 4-parallel 1024-point FFTs, with 64% less DSP slices than previous FFT architectures. This leads to 33% less total FFT area. Experimental results on ASICs lead to a very small area, together with low power consumption or high performance. X. ACKNOWLEDGMENT The authors would like to thank Dr. Martin Kumm for his help with the calculation of the adder cost of the rotators. Figure 15 compares the throughput versus the FPGA utilization for 1024-point FFTs on FPGAs. The FPGA utilization is calculated as the percentage of FPGA hardware resources used of a a Virtex-6 XC6VSX475T-1-FF1156 FPGA, according to the definition in [17]. Fig. 15 includes both serial FFTs (1P) and parallel FFTs (2P, 4P, 8P). The architectures improve as they increase in throughput and/or decrease in FPGA utilization, towards the upper left corner. Both the throughput and the FPGA utilization increase with the parallelization of the FFT. Serial FFTs [2], [27] appear in the lower left corner and 8-parallel FFTs appear towards the upper right one. The proposed FFT provides a good trade-off between throughput and resources. Its FPGA utilization is 1.03% and the throughput is 1012 MS/s. This is 33% less area and 11% more throughput than the closest FFT [16], [17]. Furthermore, it has significantly smaller FPGA utilization than previous parallel FFTs, while keeping a high throughput. B. ASIC results Table XI shows the ASIC results of the proposed FFT in two different versions: low power and high performance. REFERENCES [1] L. Yang, K. Zhang, H. Liu, J. Huang, and S. Huang, An efficient locally pipelined FFT processor, IEEE Trans. Circuits Syst. II, vol. 53, no. 7, pp , Jul [2] M. Garrido, R. Andersson, F. Qureshi, and O. Gustafsson, Multiplierless unity-gain SDF FFTs, IEEE Trans. VLSI Syst., vol. 24, no. 9, pp , Sep [3] S. He and M. Torkelson, Design and implementation of a 1024-point pipeline FFT processor, in Proc. IEEE Custom Integrated Circuits Conf., May 1998, pp [4] M. Sánchez, M. Garrido, M. López, and J. Grajal, Implementing FFTbased digital channelized receivers on FPGA platforms, IEEE Trans. Aerosp. Electron. Syst., vol. 44, no. 4, pp , Oct [5] A. Cortés, I. Vélez, and J. F. Sevillano, Radix r k FFTs: Matricial representation and SDC/SDF pipeline implementation, IEEE Trans. Signal Process., vol. 57, no. 7, pp , Jul [6] J. Wang, C. Xiong, K. Zhang, and J. Wei, A mixed-decimation MDF architecture for radix- 2 k parallel FFT, IEEE Trans. VLSI Syst., vol. 24, no. 1, pp , Jan [7] S.-N. Tang, J.-W. Tsai, and T.-Y. Chang, A 2.4-GS/s FFT processor for OFDM-based WPAN applications, IEEE Trans. Circuits Syst. II, vol. 57, no. 6, pp , Jun [8] H. Liu and H. Lee, A high performance four-parallel 128/64-point radix-2 4 FFT/IFFT processor for MIMO-OFDM systems, in Proc. IEEE Asia Pacific Conf. Circuits Syst., Nov. 2008, pp [9] L. Liu, J. Ren, X. Wang, and F. Ye, Design of low-power, 1GS/s throughput FFT processor for MIMO-OFDM UWB communication system, in Proc. IEEE Int. Symp. Circuits Syst., May 2007, pp

13 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I: REGULAR PAPERS 12 [10] J. Lee, H. Lee, S. in Cho, and S.-S. Choi, A high-speed, low-complexity radix-2 4 FFT processor for MB-OFDM UWB systems, in Proc. IEEE Int. Symp. Circuits Syst., May 2006, pp [11] N. Li and N. van der Meijs, A Radix 2 2 based parallel pipeline FFT processor for MB-OFDM UWB system, in Proc. IEEE Int. SoC Conf., Sep. 2009, pp [12] S.-I. Cho, K.-M. Kang, and S.-S. Choi, Implemention of 128-point fast Fourier transform processor for UWB systems, in Proc. Int. Wireless Comm. Mobile Comp. Conf., Aug. 2008, pp [13] S.-I. Cho and K.-M. Kang, A low-complexity 128-point mixed-radix FFT processor for MB-OFDM UWB systems, ETRI J., vol. 32, no. 1, pp. 1 10, Feb [14] W. Xudong and L. Yu, Special-purpose computer for 64-point FFT based on FPGA, in Proc. Int. Conf. Wireless Comm. Signal Process., Nov. 2009, pp [15] K.-J. Yang, S.-H. Tsai, and G. Chuang, MDC FFT/IFFT processor with variable length for MIMO-OFDM systems, IEEE Trans. VLSI Syst., vol. 21, no. 4, pp , Apr [16] M. Garrido, J. Grajal, M. A. Sánchez, and O. Gustafsson, Pipelined radix-2 k feedforward FFT architectures, IEEE Trans. VLSI Syst., vol. 21, no. 1, pp , Jan [17] M. Garrido, M. Acevedo, A. Ehliar, and O. Gustafsson, Challenging the limits of FFT performance on FPGAs, in Int. Symp. Integrated Circuits, Dec. 2014, pp [18] M. G. Kim, S. K. Shin, and M. H. Sunwoo, New parallel MDC FFT processor with efficient scheduling scheme, in Proc. IEEE Asia-Pacific Conf. Circuits Syst., Nov. 2014, pp [19] A. X. Glittas, M. Sellathurai, and G. Lakshminarayanan, A normal I/O order radix-2 FFT architecture to process twin data streams for MIMO, IEEE Trans. VLSI Syst., vol. 24, no. 6, pp , Jun [20] J. K. Jang, M. G. Kim, and M. H. Sunwoo, Efficient scheduling scheme for eight-parallel MDC FFT processor, in Int. SoC Design Conf., Nov. 2015, pp [21] T. Ahmed, M. Garrido, and O. Gustafsson, A 512-point 8-parallel pipelined feedforward FFT for WPAN, in Proc. Asilomar Conf. Signals Syst. Comput., Nov. 2011, pp [22] M. Garrido, S.-J. Huang, S.-G. Chen, and O. Gustafsson, The serial commutator (SC) FFT, IEEE Trans. Circuits Syst. II, vol. 63, no. 10, pp , Oct [23] Y.-W. Lin and C.-Y. Lee, Design of an FFT/IFFT processor for MIMO OFDM systems, IEEE Trans. Circuits Syst. I, vol. 54, no. 4, pp , Apr [24] S. Li, H. Xu, W. Fan, Y. Chen, and X. Zeng, A 128/256-point pipeline FFT/IFFT processor for MIMO OFDM system IEEE e, in Proc. IEEE Int. Symp. Circuits Syst., Jun. 2010, pp [25] M. Garrido, Efficient hardware architectures for the computation of the FFT and other related signal processing algorithms in real time, Ph.D. dissertation, Universidad Politécnica de Madrid, Dec [26] Spiral DFT/FFT IP core generator, May 2017, [27] Xilinx - FFT Logicore 8.0, Jul. 2012, documentation/ip documentation/ds808 xfft.pdf. [28] J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Math. Comput., vol. 19, pp , [29] A. Oppenheim and R. Schafer, Discrete-Time Signal Processing. Prentice Hall, [30] M. Garrido, A new representation of FFT algorithms using triangular matrices, IEEE Trans. Circuits Syst. I, vol. 63, no. 10, pp , Oct [31] M. Garrido, F. Qureshi, and O. Gustafsson, Low-complexity multiplierless constant rotators based on combined coefficient selection and shift-and-add implementation (CCSSI), IEEE Trans. Circuits Syst. I, vol. 61, no. 7, pp , Jul [32] M. Garrido, P. Källström, M. Kumm, and O. Gustafsson, CORDIC II: A new improved CORDIC algorithm, IEEE Trans. Circuits Syst. II, vol. 63, no. 2, pp , Feb [33] M. Garrido and J. Grajal, Efficient memoryless CORDIC for FFT computation, in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., vol. 2, Apr. 2007, pp [34] J. Chen and C.-H. Chang, High-level synthesis algorithm for the design of reconfigurable constant multiplier, IEEE Trans. Comput.- Aided Design Integr. Circuits Syst., vol. 28, no. 12, pp , Dec [35] F. Qureshi and O. Gustafsson, Generation of all radix-2 fast Fourier transform algorithms using binary trees, in Proc. European Conf. Circuit Theory Design, Aug. 2011, pp [36] J. A. Johnston, Parallel pipeline fast Fourier transformer, in IEE Proc. F Comm. Radar Signal Process., vol. 130, no. 6, Oct. 1983, pp [37] E. Wold and A. Despain, Pipeline and parallel-pipeline FFT processors for VLSI implementations, IEEE Trans. Comput., vol. C-33, no. 5, pp , May [38] W. H. Chang and T. Q. Nguyen, On the fixed-point accuracy analysis of fft algorithms, IEEE Trans. Signal Process., vol. 56, no. 10, pp , Oct [39] I. B. Dhaou, Hardware architecture for an anti-traffic noise system, Microelectronics Journal, vol. 46, no. 5, p. 370?376, May [40] Y. Chen, Y. C. Tsao, Y. W. Lin, C. H. Lin, and C. Y. Lee, An indexedscaling pipelined FFT processor for OFDM-based WPAN applications, IEEE Trans. Circuits Syst. II, vol. 55, no. 2, pp , Feb Mario Garrido (M 07) received the M.S. degree in electrical engineering and the Ph.D. degree from the Technical University of Madrid (UPM), Madrid, Spain, in 2004 and 2009, respectively. In 2010 he moved to Sweden to work as a postdoctoral researcher at the Department of Electrical Engineering at Linköping University. Since 2012 he is Associate Professor at the same department. His research focuses on optimized hardware design for signal processing applications. This includes the design of hardware architectures for the calculation of transforms, such as the fast Fourier transform (FFT), circuits for data management, the CORDIC algorithm, and circuits to calculate statistical and mathematical operations. His research covers high-performance circuits for real-time computation, as well as designs for small area and low power consumption. Shen-Jui Huang received his M.S. degree from the Department of Electrical Engineering, National Taiwan University in 1997, and his Ph.D. degree in Electronics from National Chiao Tung University in His research interests include baseband signal processing and circuit implementation. Currently he is a deputy manager in Novatek Corp., Hsinchu, Taiwan. Sau-Gee Chen received his B.S. degree from National Tsing Hua University, Taiwan, in 1978, M.S. degree and Ph.D. degree in electrical engineering, from the State University of New York at Buffalo, NY, in 1984 and 1988, respectively. Currently, he is a Professor at the Department of Electronics Engineering, National Chiao Tung University (NCTU), Taiwan, and a member of Board of Governor, IEEE Taipei Section. He was the department Chair of the Department of Electronics Engineering, NCTU, during He has been appointed as the Coordinator of IEEE VTS Asia-Pacific Chapters, since He was the Chair, IEEE Vehicular Technology Society, Taipei Chapter, during He was Director of Honors Program, College of Electrical & Computer Engineering/College of Computer Science from 2011 to 2012 at NCTU. He also was the Associate Dean, Office of International Affairs, during March- July, 2011, as well as the director of Institute of Electronics from 2003 to 2006, all at the same organization. During , he served as an associate editor of IEEE Transactions on Circuits and Systems I. His research interests include digital communication, multi-media computing, digital signal processing, and VLSI signal processing. He has published more than 100 conference and journal papers, and holds 19 US and Taiwan patents.

The Serial Commutator FFT

The Serial Commutator FFT The Serial Commutator FFT Mario Garrido Gálvez, Shen-Jui Huang, Sau-Gee Chen and Oscar Gustafsson Journal Article N.B.: When citing this work, cite the original article. 2016 IEEE. Personal use of this

More information

Keywords: Fast Fourier Transforms (FFT), Multipath Delay Commutator (MDC), Pipelined Architecture, Radix-2 k, VLSI.

Keywords: Fast Fourier Transforms (FFT), Multipath Delay Commutator (MDC), Pipelined Architecture, Radix-2 k, VLSI. ww.semargroup.org www.ijvdcs.org ISSN 2322-0929 Vol.02, Issue.05, August-2014, Pages:0294-0298 Radix-2 k Feed Forward FFT Architectures K.KIRAN KUMAR 1, M.MADHU BABU 2 1 PG Scholar, Dept of VLSI & ES,

More information

ISSN Vol.02, Issue.11, December-2014, Pages:

ISSN Vol.02, Issue.11, December-2014, Pages: ISSN 2322-0929 Vol.02, Issue.11, December-2014, Pages:1119-1123 www.ijvdcs.org High Speed and Area Efficient Radix-2 2 Feed Forward FFT Architecture ARRA ASHOK 1, S.N.CHANDRASHEKHAR 2 1 PG Scholar, Dept

More information

A 4096-Point Radix-4 Memory-Based FFT Using DSP Slices

A 4096-Point Radix-4 Memory-Based FFT Using DSP Slices A 4096-Point Radix-4 Memory-Based FFT Using DSP Slices Mario Garrido Gálvez, Miguel Angel Sanchez, Maria Luisa Lopez-Vallejo and Jesus Grajal Journal Article N.B.: When citing this work, cite the original

More information

Multiplierless Unity-Gain SDF FFTs

Multiplierless Unity-Gain SDF FFTs Multiplierless Unity-Gain SDF FFTs Mario Garrido Gálvez, Rikard Andersson, Fahad Qureshi and Oscar Gustafsson Journal Article N.B.: When citing this work, cite the original article. 216 IEEE. Personal

More information

Linköping University Post Print. Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs

Linköping University Post Print. Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs Linköping University Post Print Analysis of Twiddle Factor Complexity of Radix-2^i Pipelined FFTs Fahad Qureshi and Oscar Gustafsson N.B.: When citing this work, cite the original article. 200 IEEE. Personal

More information

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO 2402 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 6, JUNE 2016 A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO Antony Xavier Glittas,

More information

RECENTLY, researches on gigabit wireless personal area

RECENTLY, researches on gigabit wireless personal area 146 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 55, NO. 2, FEBRUARY 2008 An Indexed-Scaling Pipelined FFT Processor for OFDM-Based WPAN Applications Yuan Chen, Student Member, IEEE,

More information

World s Fastest FFT Architectures: Breaking the Barrier of 100 GS/s

World s Fastest FFT Architectures: Breaking the Barrier of 100 GS/s World s Fastest FFT Architectures: Breaking the Barrier of 100 GS/s Mario Garrido, K. Möller and M. Kumm The self-archived postprint version of this journal article is available at Linköping University

More information

AN FFT PROCESSOR BASED ON 16-POINT MODULE

AN FFT PROCESSOR BASED ON 16-POINT MODULE AN FFT PROCESSOR BASED ON 6-POINT MODULE Weidong Li, Mark Vesterbacka and Lars Wanhammar Electronics Systems, Dept. of EE., Linköping University SE-58 8 LINKÖPING, SWEDEN E-mail: {weidongl, markv, larsw}@isy.liu.se,

More information

DESIGN & SIMULATION PARALLEL PIPELINED RADIX -2^2 FFT ARCHITECTURE FOR REAL VALUED SIGNALS

DESIGN & SIMULATION PARALLEL PIPELINED RADIX -2^2 FFT ARCHITECTURE FOR REAL VALUED SIGNALS DESIGN & SIMULATION PARALLEL PIPELINED RADIX -2^2 FFT ARCHITECTURE FOR REAL VALUED SIGNALS Madhavi S.Kapale #1, Prof.Nilesh P. Bodne #2 1 Student Mtech Electronics Engineering (Communication) 2 Assistant

More information

An Area Efficient Mixed Decimation MDF Architecture for Radix. Parallel FFT

An Area Efficient Mixed Decimation MDF Architecture for Radix. Parallel FFT An Area Efficient Mixed Decimation MDF Architecture for Radix Parallel FFT Reshma K J 1, Prof. Ebin M Manuel 2 1M-Tech, Dept. of ECE Engineering, Government Engineering College, Idukki, Kerala, India 2Professor,

More information

Twiddle Factor Transformation for Pipelined FFT Processing

Twiddle Factor Transformation for Pipelined FFT Processing Twiddle Factor Transformation for Pipelined FFT Processing In-Cheol Park, WonHee Son, and Ji-Hoon Kim School of EECS, Korea Advanced Institute of Science and Technology, Daejeon, Korea icpark@ee.kaist.ac.kr,

More information

FAST FOURIER TRANSFORM (FFT) and inverse fast

FAST FOURIER TRANSFORM (FFT) and inverse fast IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 39, NO. 11, NOVEMBER 2004 2005 A Dynamic Scaling FFT Processor for DVB-T Applications Yu-Wei Lin, Hsuan-Yu Liu, and Chen-Yi Lee Abstract This paper presents an

More information

An Efficient High Speed VLSI Architecture Based 16-Point Adaptive Split Radix-2 FFT Architecture

An Efficient High Speed VLSI Architecture Based 16-Point Adaptive Split Radix-2 FFT Architecture IJSTE - International Journal of Science Technology & Engineering Volume 2 Issue 10 April 2016 ISSN (online): 2349-784X An Efficient High Speed VLSI Architecture Based 16-Point Adaptive Split Radix-2 FFT

More information

THE orthogonal frequency-division multiplex (OFDM)

THE orthogonal frequency-division multiplex (OFDM) 26 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 57, NO. 1, JANUARY 2010 A Generalized Mixed-Radix Algorithm for Memory-Based FFT Processors Chen-Fong Hsiao, Yuan Chen, Member, IEEE,

More information

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs High Throughput Energy Efficient Parallel FFT Architecture on FPGAs Ren Chen Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, USA 989 Email: renchen@usc.edu

More information

MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION

MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION Maheshwari.U 1, Josephine Sugan Priya. 2, 1 PG Student, Dept Of Communication Systems Engg, Idhaya Engg. College For Women, 2 Asst Prof, Dept Of Communication

More information

Analysis of Radix- SDF Pipeline FFT Architecture in VLSI Using Chip Scope

Analysis of Radix- SDF Pipeline FFT Architecture in VLSI Using Chip Scope Analysis of Radix- SDF Pipeline FFT Architecture in VLSI Using Chip Scope G. Mohana Durga 1, D.V.R. Mohan 2 1 M.Tech Student, 2 Professor, Department of ECE, SRKR Engineering College, Bhimavaram, Andhra

More information

DESIGN OF PARALLEL PIPELINED FEED FORWARD ARCHITECTURE FOR ZERO FREQUENCY & MINIMUM COMPUTATION (ZMC) ALGORITHM OF FFT

DESIGN OF PARALLEL PIPELINED FEED FORWARD ARCHITECTURE FOR ZERO FREQUENCY & MINIMUM COMPUTATION (ZMC) ALGORITHM OF FFT IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 4, Apr 2014, 199-206 Impact Journals DESIGN OF PARALLEL PIPELINED

More information

SFF The Single-Stream FPGA-Optimized Feedforward FFT Hardware Architecture

SFF The Single-Stream FPGA-Optimized Feedforward FFT Hardware Architecture Journal of Signal Processing Systems (2018) 90:1583 1592 https://doi.org/10.1007/s11265-018-1370-y SFF The Single-Stream FPGA-Optimized Feedforward FFT Hardware Architecture Carl Ingemarsson 1 Oscar Gustafsson

More information

Modified Welch Power Spectral Density Computation with Fast Fourier Transform

Modified Welch Power Spectral Density Computation with Fast Fourier Transform Modified Welch Power Spectral Density Computation with Fast Fourier Transform Sreelekha S 1, Sabi S 2 1 Department of Electronics and Communication, Sree Budha College of Engineering, Kerala, India 2 Professor,

More information

Parallel-computing approach for FFT implementation on digital signal processor (DSP)

Parallel-computing approach for FFT implementation on digital signal processor (DSP) Parallel-computing approach for FFT implementation on digital signal processor (DSP) Yi-Pin Hsu and Shin-Yu Lin Abstract An efficient parallel form in digital signal processor can improve the algorithm

More information

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm 1 A.Malashri, 2 C.Paramasivam 1 PG Student, Department of Electronics and Communication K S Rangasamy College Of Technology,

More information

Abstract. Literature Survey. Introduction. A.Radix-2/8 FFT algorithm for length qx2 m DFTs

Abstract. Literature Survey. Introduction. A.Radix-2/8 FFT algorithm for length qx2 m DFTs Implementation of Split Radix algorithm for length 6 m DFT using VLSI J.Nancy, PG Scholar,PSNA College of Engineering and Technology; S.Bharath,Assistant Professor,PSNA College of Engineering and Technology;J.Wilson,Assistant

More information

Low Power Complex Multiplier based FFT Processor

Low Power Complex Multiplier based FFT Processor Low Power Complex Multiplier based FFT Processor V.Sarada, Dr.T.Vigneswaran 2 ECE, SRM University, Chennai,India saradasaran@gmail.com 2 ECE, VIT University, Chennai,India vigneshvlsi@gmail.com Abstract-

More information

Research Article Design of A Novel 8-point Modified R2MDC with Pipelined Technique for High Speed OFDM Applications

Research Article Design of A Novel 8-point Modified R2MDC with Pipelined Technique for High Speed OFDM Applications Research Journal of Applied Sciences, Engineering and Technology 7(23): 5021-5025, 2014 DOI:10.19026/rjaset.7.895 ISSN: 2040-7459; e-issn: 2040-7467 2014 Maxwell Scientific Publication Corp. Submitted:

More information

LOW-POWER SPLIT-RADIX FFT PROCESSORS

LOW-POWER SPLIT-RADIX FFT PROCESSORS LOW-POWER SPLIT-RADIX FFT PROCESSORS Avinash 1, Manjunath Managuli 2, Suresh Babu D 3 ABSTRACT To design a split radix fast Fourier transform is an ideal person for the implementing of a low-power FFT

More information

DESIGN METHODOLOGY. 5.1 General

DESIGN METHODOLOGY. 5.1 General 87 5 FFT DESIGN METHODOLOGY 5.1 General The fast Fourier transform is used to deliver a fast approach for the processing of data in the wireless transmission. The Fast Fourier Transform is one of the methods

More information

IMPLEMENTATION OF FAST FOURIER TRANSFORM USING VERILOG HDL

IMPLEMENTATION OF FAST FOURIER TRANSFORM USING VERILOG HDL IMPLEMENTATION OF FAST FOURIER TRANSFORM USING VERILOG HDL 1 ANUP TIWARI, 2 SAMIR KUMAR PANDEY 1 Department of ECE, Jharkhand Rai University,Ranchi, Jharkhand, India 2 Department of Mathematical Sciences,

More information

DISCRETE COSINE TRANSFORM (DCT) is a widely

DISCRETE COSINE TRANSFORM (DCT) is a widely IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL 20, NO 4, APRIL 2012 655 A High Performance Video Transform Engine by Using Space-Time Scheduling Strategy Yuan-Ho Chen, Student Member,

More information

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE Ren Chen, Hoang Le, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California, Los Angeles, USA 989 Email:

More information

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics

Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Implementation of FFT Processor using Urdhva Tiryakbhyam Sutra of Vedic Mathematics Yojana Jadhav 1, A.P. Hatkar 2 PG Student [VLSI & Embedded system], Dept. of ECE, S.V.I.T Engineering College, Chincholi,

More information

FPGA Implementation of FFT Processor in Xilinx

FPGA Implementation of FFT Processor in Xilinx Volume-6, Issue-2, March-April 2016 International Journal of Engineering and Management Research Page Number: 134-138 FPGA Implementation of FFT Processor in Xilinx Anup Tiwari 1, Dr. Samir Pandey 2 1

More information

A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment

A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment LETTER IEICE Electronics Express, Vol.11, No.2, 1 9 A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment Ting Chen a), Hengzhu Liu, and Botao Zhang College of

More information

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE Ren Chen, Hoang Le, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California, Los Angeles, USA 989 Email:

More information

Fixed Point Streaming Fft Processor For Ofdm

Fixed Point Streaming Fft Processor For Ofdm Fixed Point Streaming Fft Processor For Ofdm Sudhir Kumar Sa Rashmi Panda Aradhana Raju Abstract Fast Fourier Transform (FFT) processors are today one of the most important blocks in communication systems.

More information

DUE to the high computational complexity and real-time

DUE to the high computational complexity and real-time IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 15, NO. 3, MARCH 2005 445 A Memory-Efficient Realization of Cyclic Convolution and Its Application to Discrete Cosine Transform Hun-Chen

More information

FAST Fourier transform (FFT) is an important signal processing

FAST Fourier transform (FFT) is an important signal processing IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 54, NO. 4, APRIL 2007 889 Balanced Binary-Tree Decomposition for Area-Efficient Pipelined FFT Processing Hyun-Yong Lee, Student Member,

More information

FPGA Implementation of 16-Point Radix-4 Complex FFT Core Using NEDA

FPGA Implementation of 16-Point Radix-4 Complex FFT Core Using NEDA FPGA Implementation of 16-Point FFT Core Using NEDA Abhishek Mankar, Ansuman Diptisankar Das and N Prasad Abstract--NEDA is one of the techniques to implement many digital signal processing systems that

More information

An efficient multiplierless approximation of the fast Fourier transform using sum-of-powers-of-two (SOPOT) coefficients

An efficient multiplierless approximation of the fast Fourier transform using sum-of-powers-of-two (SOPOT) coefficients Title An efficient multiplierless approximation of the fast Fourier transm using sum-of-powers-of-two (SOPOT) coefficients Author(s) Chan, SC; Yiu, PM Citation Ieee Signal Processing Letters, 2002, v.

More information

FFT/IFFTProcessor IP Core Datasheet

FFT/IFFTProcessor IP Core Datasheet System-on-Chip engineering FFT/IFFTProcessor IP Core Datasheet - Released - Core:120801 Doc: 130107 This page has been intentionally left blank ii Copyright reminder Copyright c 2012 by System-on-Chip

More information

VLSI Implementation of Low Power Area Efficient FIR Digital Filter Structures Shaila Khan 1 Uma Sharma 2

VLSI Implementation of Low Power Area Efficient FIR Digital Filter Structures Shaila Khan 1 Uma Sharma 2 IJSRD - International Journal for Scientific Research & Development Vol. 3, Issue 05, 2015 ISSN (online): 2321-0613 VLSI Implementation of Low Power Area Efficient FIR Digital Filter Structures Shaila

More information

STUDY OF A CORDIC BASED RADIX-4 FFT PROCESSOR

STUDY OF A CORDIC BASED RADIX-4 FFT PROCESSOR STUDY OF A CORDIC BASED RADIX-4 FFT PROCESSOR 1 AJAY S. PADEKAR, 2 S. S. BELSARE 1 BVDU, College of Engineering, Pune, India 2 Department of E & TC, BVDU, College of Engineering, Pune, India E-mail: ajay.padekar@gmail.com,

More information

FPGA Based Design and Simulation of 32- Point FFT Through Radix-2 DIT Algorith

FPGA Based Design and Simulation of 32- Point FFT Through Radix-2 DIT Algorith FPGA Based Design and Simulation of 32- Point FFT Through Radix-2 DIT Algorith Sudhanshu Mohan Khare M.Tech (perusing), Dept. of ECE Laxmi Naraian College of Technology, Bhopal, India M. Zahid Alam Associate

More information

Novel design of multiplier-less FFT processors

Novel design of multiplier-less FFT processors Signal Processing 8 (00) 140 140 www.elsevier.com/locate/sigpro Novel design of multiplier-less FFT processors Yuan Zhou, J.M. Noras, S.J. Shepherd School of EDT, University of Bradford, Bradford, West

More information

TOPICS PIPELINE IMPLEMENTATIONS OF THE FAST FOURIER TRANSFORM (FFT) DISCRETE FOURIER TRANSFORM (DFT) INVERSE DFT (IDFT) Consulted work:

TOPICS PIPELINE IMPLEMENTATIONS OF THE FAST FOURIER TRANSFORM (FFT) DISCRETE FOURIER TRANSFORM (DFT) INVERSE DFT (IDFT) Consulted work: 1 PIPELINE IMPLEMENTATIONS OF THE FAST FOURIER TRANSFORM (FFT) Consulted work: Chiueh, T.D. and P.Y. Tsai, OFDM Baseband Receiver Design for Wireless Communications, John Wiley and Sons Asia, (2007). Second

More information

2 Assoc Prof, Dept of ECE, RGM College of Engineering & Technology, Nandyal, AP-India,

2 Assoc Prof, Dept of ECE, RGM College of Engineering & Technology, Nandyal, AP-India, ISSN 2319-8885 Vol.03,Issue.27 September-2014, Pages:5486-5491 www.ijsetr.com MDC FFT/IFFT Processor with 64-Point using Radix-4 Algorithm for MIMO-OFDM System VARUN REDDY.P 1, M.RAMANA REDDY 2 1 PG Scholar,

More information

Computing the Discrete Fourier Transform on FPGA Based Systolic Arrays

Computing the Discrete Fourier Transform on FPGA Based Systolic Arrays Computing the Discrete Fourier Transform on FPGA Based Systolic Arrays Chris Dick School of Electronic Engineering La Trobe University Melbourne 3083, Australia Abstract Reconfigurable logic arrays allow

More information

VLSI IMPLEMENTATION AND PERFORMANCE ANALYSIS OF EFFICIENT MIXED-RADIX 8-2 FFT ALGORITHM WITH BIT REVERSAL FOR THE OUTPUT SEQUENCES.

VLSI IMPLEMENTATION AND PERFORMANCE ANALYSIS OF EFFICIENT MIXED-RADIX 8-2 FFT ALGORITHM WITH BIT REVERSAL FOR THE OUTPUT SEQUENCES. VLSI IMPLEMENTATION AND PERFORMANCE ANALYSIS OF EFFICIENT MIXED-RADIX 8-2 ALGORITHM WITH BIT REVERSAL FOR THE OUTPUT SEQUENCES. M. MOHAMED ISMAIL Dr. M.J.S RANGACHAR Dr.Ch. D. V. PARADESI RAO (Research

More information

Power Spectral Density Computation using Modified Welch Method

Power Spectral Density Computation using Modified Welch Method IJSTE - International Journal of Science Technology & Engineering Volume 2 Issue 4 October 2015 ISSN (online): 2349-784X Power Spectral Density Computation using Modified Welch Method Betsy Elina Thomas

More information

Fast Fourier Transform Architectures: A Survey and State of the Art

Fast Fourier Transform Architectures: A Survey and State of the Art Fast Fourier Transform Architectures: A Survey and State of the Art 1 Anwar Bhasha Pattan, 2 Dr. M. Madhavi Latha 1 Research Scholar, Dept. of ECE, JNTUH, Hyderabad, India 2 Professor, Dept. of ECE, JNTUH,

More information

IMPLEMENTATION OF OPTIMIZED 128-POINT PIPELINE FFT PROCESSOR USING MIXED RADIX 4-2 FOR OFDM APPLICATIONS

IMPLEMENTATION OF OPTIMIZED 128-POINT PIPELINE FFT PROCESSOR USING MIXED RADIX 4-2 FOR OFDM APPLICATIONS IMPLEMENTATION OF OPTIMIZED 128-POINT PIPELINE FFT PROCESSOR USING MIXED RADIX 4-2 FOR OFDM APPLICATIONS K. UMAPATHY, Research scholar, Department of ECE, Jawaharlal Nehru Technological University, Anantapur,

More information

Research Article Radix-2 α /4 β Building Blocks for Efficient VLSI s Higher Radices Butterflies Implementation

Research Article Radix-2 α /4 β Building Blocks for Efficient VLSI s Higher Radices Butterflies Implementation VLSI Design Volume 24, Article ID 69594, 3 pages http://dxdoiorg/55/24/69594 Research Article Radix-2 α /4 β Building Blocks for Efficient VLSI s Higher Radices Butterflies Implementation Marwan A Jaber

More information

High-Speed and Low-Power Split-Radix FFT

High-Speed and Low-Power Split-Radix FFT 864 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 3, MARCH 2003 High-Speed and Low-Power Split-Radix FFT Wen-Chang Yeh and Chein-Wei Jen Abstract This paper presents a novel split-radix fast Fourier

More information

AREA-DELAY EFFICIENT FFT ARCHITECTURE USING PARALLEL PROCESSING AND NEW MEMORY SHARING TECHNIQUE

AREA-DELAY EFFICIENT FFT ARCHITECTURE USING PARALLEL PROCESSING AND NEW MEMORY SHARING TECHNIQUE AREA-DELAY EFFICIENT FFT ARCHITECTURE USING PARALLEL PROCESSING AND NEW MEMORY SHARING TECHNIQUE Yousri Ouerhani, Maher Jridi, Ayman Alfalou To cite this version: Yousri Ouerhani, Maher Jridi, Ayman Alfalou.

More information

Efficient Methods for FFT calculations Using Memory Reduction Techniques.

Efficient Methods for FFT calculations Using Memory Reduction Techniques. Efficient Methods for FFT calculations Using Memory Reduction Techniques. N. Kalaiarasi Assistant professor SRM University Kattankulathur, chennai A.Rathinam Assistant professor SRM University Kattankulathur,chennai

More information

Fused Floating Point Arithmetic Unit for Radix 2 FFT Implementation

Fused Floating Point Arithmetic Unit for Radix 2 FFT Implementation IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 6, Issue 2, Ver. I (Mar. -Apr. 2016), PP 58-65 e-issn: 2319 4200, p-issn No. : 2319 4197 www.iosrjournals.org Fused Floating Point Arithmetic

More information

A BSD ADDER REPRESENTATION USING FFT BUTTERFLY ARCHITECHTURE

A BSD ADDER REPRESENTATION USING FFT BUTTERFLY ARCHITECHTURE A BSD ADDER REPRESENTATION USING FFT BUTTERFLY ARCHITECHTURE MD.Hameed Pasha 1, J.Radhika 2 1. Associate Professor, Jayamukhi Institute of Technogical Sciences, Narsampet, India 2. Department of ECE, Jayamukhi

More information

Design of Delay Efficient Distributed Arithmetic Based Split Radix FFT

Design of Delay Efficient Distributed Arithmetic Based Split Radix FFT Design of Delay Efficient Arithmetic Based Split Radix FFT Nisha Laguri #1, K. Anusudha *2 #1 M.Tech Student, Electronics, Department of Electronics Engineering, Pondicherry University, Puducherry, India

More information

FPGA Implementation of Discrete Fourier Transform Using CORDIC Algorithm

FPGA Implementation of Discrete Fourier Transform Using CORDIC Algorithm AMSE JOURNALS-AMSE IIETA publication-2017-series: Advances B; Vol. 60; N 2; pp 332-337 Submitted Apr. 04, 2017; Revised Sept. 25, 2017; Accepted Sept. 30, 2017 FPGA Implementation of Discrete Fourier Transform

More information

Variable Length Floating Point FFT Processor Using Radix-2 2 Butterfly Elements P.Augusta Sophy

Variable Length Floating Point FFT Processor Using Radix-2 2 Butterfly Elements P.Augusta Sophy Variable Length Floating Point FFT Processor Using Radix- Butterfly Elements P.Augusta Sophy #, R.Srinivasan *, J.Raja $3, S.Anand Ganesh #4 # School of Electronics, VIT University, Chennai, India * Department

More information

ON CONFIGURATION OF RESIDUE SCALING PROCESS IN PIPELINED RADIX-4 MQRNS FFT PROCESSOR

ON CONFIGURATION OF RESIDUE SCALING PROCESS IN PIPELINED RADIX-4 MQRNS FFT PROCESSOR POZNAN UNIVE RSITY OF TE CHNOLOGY ACADE MIC JOURNALS No 80 Electrical Engineering 2014 Robert SMYK* Maciej CZYŻAK* ON CONFIGURATION OF RESIDUE SCALING PROCESS IN PIPELINED RADIX-4 MQRNS FFT PROCESSOR Residue

More information

An Area-Efficient BIRA With 1-D Spare Segments

An Area-Efficient BIRA With 1-D Spare Segments 206 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 26, NO. 1, JANUARY 2018 An Area-Efficient BIRA With 1-D Spare Segments Donghyun Kim, Hayoung Lee, and Sungho Kang Abstract The

More information

Research Article International Journal of Emerging Research in Management &Technology ISSN: (Volume-6, Issue-8) Abstract:

Research Article International Journal of Emerging Research in Management &Technology ISSN: (Volume-6, Issue-8) Abstract: International Journal of Emerging Research in Management &Technology Research Article August 27 Design and Implementation of Fast Fourier Transform (FFT) using VHDL Code Akarshika Singhal, Anjana Goen,

More information

FIR Filter Architecture for Fixed and Reconfigurable Applications

FIR Filter Architecture for Fixed and Reconfigurable Applications FIR Filter Architecture for Fixed and Reconfigurable Applications Nagajyothi 1,P.Sayannna 2 1 M.Tech student, Dept. of ECE, Sudheer reddy college of Engineering & technology (w), Telangana, India 2 Assosciate

More information

International Journal of Multidisciplinary Research and Modern Education (IJMRME) ISSN (Online): ( Volume I, Issue

International Journal of Multidisciplinary Research and Modern Education (IJMRME) ISSN (Online): (  Volume I, Issue MDF ARCHITECTURE DESIGN FOR RADIX 2K FFT D. Sathyakala* & S. Nithya** * Assistant Professor, Department of ECE, Dhanalakshmi Srinivasan Engineering College, Perambalur, Tamilnadu ** Assistant Professor,

More information

Efficient Radix-4 and Radix-8 Butterfly Elements

Efficient Radix-4 and Radix-8 Butterfly Elements Efficient Radix4 and Radix8 Butterfly Elements Weidong Li and Lars Wanhammar Electronics Systems, Department of Electrical Engineering Linköping University, SE581 83 Linköping, Sweden Tel.: +46 13 28 {1721,

More information

Speed Optimised CORDIC Based Fast Algorithm for DCT

Speed Optimised CORDIC Based Fast Algorithm for DCT GRD Journals Global Research and Development Journal for Engineering International Conference on Innovations in Engineering and Technology (ICIET) - 2016 July 2016 e-issn: 2455-5703 Speed Optimised CORDIC

More information

Architectures for Dynamic Data Scaling in 2/4/8K Pipeline FFT Cores

Architectures for Dynamic Data Scaling in 2/4/8K Pipeline FFT Cores Architectures for Dynamic Data Scaling in 2/4/8K Pipeline FFT Cores Lenart, Thomas; Öwall, Viktor Published in: IEEE Transactions on Very Large Scale Integration (VLSI) Systems DOI: 10.1109/TVLSI.2006.886407

More information

Implementation of Low-Memory Reference FFT on Digital Signal Processor

Implementation of Low-Memory Reference FFT on Digital Signal Processor Journal of Computer Science 4 (7): 547-551, 2008 ISSN 1549-3636 2008 Science Publications Implementation of Low-Memory Reference FFT on Digital Signal Processor Yi-Pin Hsu and Shin-Yu Lin Department of

More information

ISSN: [Kavitha* et al., (6): 3 March-2017] Impact Factor: 4.116

ISSN: [Kavitha* et al., (6): 3 March-2017] Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY REVIEW PAPER ON EFFICIENT VLSI AND FAST FOURIER TRANSFORM ARCHITECTURES Kavitha MV, S.Ranjitha, Dr Suresh H N *Research scholar,

More information

COPY RIGHT. To Secure Your Paper As Per UGC Guidelines We Are Providing A Electronic Bar Code

COPY RIGHT. To Secure Your Paper As Per UGC Guidelines We Are Providing A Electronic Bar Code COPY RIGHT 2018IJIEMR.Personal use of this material is permitted. Permission from IJIEMR must be obtained for all other uses, in any current or future media, including reprinting/republishing this material

More information

II. MOTIVATION AND IMPLEMENTATION

II. MOTIVATION AND IMPLEMENTATION An Efficient Design of Modified Booth Recoder for Fused Add-Multiply operator Dhanalakshmi.G Applied Electronics PSN College of Engineering and Technology Tirunelveli dhanamgovind20@gmail.com Prof.V.Gopi

More information

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter A.S. Sneka Priyaa PG Scholar Government College of Technology Coimbatore ABSTRACT The Least Mean Square Adaptive Filter is frequently

More information

A Comparison of Two Algorithms Involving Montgomery Modular Multiplication

A Comparison of Two Algorithms Involving Montgomery Modular Multiplication ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology An ISO 3297: 2007 Certified Organization Volume 6, Special Issue 5,

More information

Design and Implementation of 3-D DWT for Video Processing Applications

Design and Implementation of 3-D DWT for Video Processing Applications Design and Implementation of 3-D DWT for Video Processing Applications P. Mohaniah 1, P. Sathyanarayana 2, A. S. Ram Kumar Reddy 3 & A. Vijayalakshmi 4 1 E.C.E, N.B.K.R.IST, Vidyanagar, 2 E.C.E, S.V University

More information

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN Xiaoying Li 1 Fuming Sun 2 Enhua Wu 1, 3 1 University of Macau, Macao, China 2 University of Science and Technology Beijing, Beijing, China

More information

Multiplierless Processing Element for Non-Power-of-Two FFTs

Multiplierless Processing Element for Non-Power-of-Two FFTs Multiplierless Processing Element for Non-Power-of-Two FFTs Fahad Qureshi, Anastasia Volkova, Thibault Hilaire, Jarmo Takala To cite this version: Fahad Qureshi, Anastasia Volkova, Thibault Hilaire, Jarmo

More information

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator A.Sindhu 1, K.PriyaMeenakshi 2 PG Student [VLSI], Dept. of ECE, Muthayammal Engineering College, Rasipuram, Tamil Nadu,

More information

A DCT Architecture based on Complex Residue Number Systems

A DCT Architecture based on Complex Residue Number Systems A DCT Architecture based on Complex Residue Number Systems J. RAMÍREZ (), A. GARCÍA (), P. G. FERNÁNDEZ (3), L. PARRILLA (), A. LLORIS () () Dept. of Electronics and () Dept. of Computer Sciences (3) Dept.

More information

OPTIMIZING THE POWER USING FUSED ADD MULTIPLIER

OPTIMIZING THE POWER USING FUSED ADD MULTIPLIER Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,

More information

Gated-Demultiplexer Tree Buffer for Low Power Using Clock Tree Based Gated Driver

Gated-Demultiplexer Tree Buffer for Low Power Using Clock Tree Based Gated Driver Gated-Demultiplexer Tree Buffer for Low Power Using Clock Tree Based Gated Driver E.Kanniga 1, N. Imocha Singh 2,K.Selva Rama Rathnam 3 Professor Department of Electronics and Telecommunication, Bharath

More information

We are IntechOpen, the world s leading publisher of Open Access books Built by scientists, for scientists. International authors and editors

We are IntechOpen, the world s leading publisher of Open Access books Built by scientists, for scientists. International authors and editors We are IntechOpen, the world s leading publisher of Open Access books Built by scientists, for scientists 3,800 116,000 120M Open access books available International authors and editors Downloads Our

More information

LOW-DENSITY parity-check (LDPC) codes, which are defined

LOW-DENSITY parity-check (LDPC) codes, which are defined 734 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 56, NO. 9, SEPTEMBER 2009 Design of a Multimode QC-LDPC Decoder Based on Shift-Routing Network Chih-Hao Liu, Chien-Ching Lin, Shau-Wei

More information

Design and Implementation of CVNS Based Low Power 64-Bit Adder

Design and Implementation of CVNS Based Low Power 64-Bit Adder Design and Implementation of CVNS Based Low Power 64-Bit Adder Ch.Vijay Kumar Department of ECE Embedded Systems & VLSI Design Vishakhapatnam, India Sri.Sagara Pandu Department of ECE Embedded Systems

More information

DESIGN OF AN FFT PROCESSOR

DESIGN OF AN FFT PROCESSOR 1 DESIGN OF AN FFT PROCESSOR Erik Nordhamn, Björn Sikström and Lars Wanhammar Department of Electrical Engineering Linköping University S-581 83 Linköping, Sweden Abstract In this paper we present a structured

More information

PERFORMANCE ANALYSIS OF HIGH EFFICIENCY LOW DENSITY PARITY-CHECK CODE DECODER FOR LOW POWER APPLICATIONS

PERFORMANCE ANALYSIS OF HIGH EFFICIENCY LOW DENSITY PARITY-CHECK CODE DECODER FOR LOW POWER APPLICATIONS American Journal of Applied Sciences 11 (4): 558-563, 2014 ISSN: 1546-9239 2014 Science Publication doi:10.3844/ajassp.2014.558.563 Published Online 11 (4) 2014 (http://www.thescipub.com/ajas.toc) PERFORMANCE

More information

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression Divakara.S.S, Research Scholar, J.S.S. Research Foundation, Mysore Cyril Prasanna Raj P Dean(R&D), MSEC, Bangalore Thejas

More information

Low Power Floating-Point Multiplier Based On Vedic Mathematics

Low Power Floating-Point Multiplier Based On Vedic Mathematics Low Power Floating-Point Multiplier Based On Vedic Mathematics K.Prashant Gokul, M.E(VLSI Design), Sri Ramanujar Engineering College, Chennai Prof.S.Murugeswari., Supervisor,Prof.&Head,ECE.,SREC.,Chennai-600

More information

A Pipelined Fused Processing Unit for DSP Applications

A Pipelined Fused Processing Unit for DSP Applications A Pipelined Fused Processing Unit for DSP Applications Vinay Reddy N PG student Dept of ECE, PSG College of Technology, Coimbatore, Abstract Hema Chitra S Assistant professor Dept of ECE, PSG College of

More information

Vertical-Horizontal Binary Common Sub- Expression Elimination for Reconfigurable Transposed Form FIR Filter

Vertical-Horizontal Binary Common Sub- Expression Elimination for Reconfigurable Transposed Form FIR Filter Vertical-Horizontal Binary Common Sub- Expression Elimination for Reconfigurable Transposed Form FIR Filter M. Tirumala 1, Dr. M. Padmaja 2 1 M. Tech in VLSI & ES, Student, 2 Professor, Electronics and

More information

IMPLEMENTATION OF DOUBLE PRECISION FLOATING POINT RADIX-2 FFT USING VHDL

IMPLEMENTATION OF DOUBLE PRECISION FLOATING POINT RADIX-2 FFT USING VHDL IMPLEMENTATION OF DOUBLE PRECISION FLOATING POINT RADIX-2 FFT USING VHDL Tharanidevi.B 1, Jayaprakash.R 2 Assistant Professor, Dept. of ECE, Bharathiyar Institute of Engineering for Woman, Salem, TamilNadu,

More information

Reconfigurable FFT Processor A Broader Perspective Survey

Reconfigurable FFT Processor A Broader Perspective Survey Reconfigurable FFT Processor A Broader Perspective Survey V.Sarada 1, T.Vigneswaran 2 1 ECE, SRM University, Chennai, India. saradasaran@gmail.com 2 ECE, VIT University, Chennai, India. vigneshvlsi@gmail.com

More information

SCALABLE IMPLEMENTATION SCHEME FOR MULTIRATE FIR FILTERS AND ITS APPLICATION IN EFFICIENT DESIGN OF SUBBAND FILTER BANKS

SCALABLE IMPLEMENTATION SCHEME FOR MULTIRATE FIR FILTERS AND ITS APPLICATION IN EFFICIENT DESIGN OF SUBBAND FILTER BANKS SCALABLE IMPLEMENTATION SCHEME FOR MULTIRATE FIR FILTERS AND ITS APPLICATION IN EFFICIENT DESIGN OF SUBBAND FILTER BANKS Liang-Gee Chen Po-Cheng Wu Tzi-Dar Chiueh Department of Electrical Engineering National

More information

User Manual for FC100

User Manual for FC100 Sundance Multiprocessor Technology Limited User Manual Form : QCF42 Date : 6 July 2006 Unit / Module Description: IEEE-754 Floating-point FPGA IP Core Unit / Module Number: FC100 Document Issue Number:

More information

High-Performance 16-Point Complex FFT Features 1 Functional Description 2 Theory of Operation

High-Performance 16-Point Complex FFT Features 1 Functional Description 2 Theory of Operation High-Performance 16-Point Complex FFT April 8, 1999 Application Note This document is (c) Xilinx, Inc. 1999. No part of this file may be modified, transmitted to any third party (other than as intended

More information

REAL TIME DIGITAL SIGNAL PROCESSING

REAL TIME DIGITAL SIGNAL PROCESSING REAL TIME DIGITAL SIGAL PROCESSIG UT-FRBA www.electron.frba.utn.edu.ar/dplab UT-FRBA Frequency Analysis Fast Fourier Transform (FFT) Fast Fourier Transform DFT: complex multiplications (-) complex aditions

More information

High Performance Pipelined Design for FFT Processor based on FPGA

High Performance Pipelined Design for FFT Processor based on FPGA High Performance Pipelined Design for FFT Processor based on FPGA A.A. Raut 1, S. M. Kate 2 1 Sinhgad Institute of Technology, Lonavala, Pune University, India 2 Sinhgad Institute of Technology, Lonavala,

More information

Energy Optimizations for FPGA-based 2-D FFT Architecture

Energy Optimizations for FPGA-based 2-D FFT Architecture Energy Optimizations for FPGA-based 2-D FFT Architecture Ren Chen and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Ganges.usc.edu/wiki/TAPAS Outline

More information