A New Approach to Pipeline FFT Processor

Shousheng He and Mats Torkelson
Department of Applied Electronics, Lund University
S-22100 Lund, SWEDEN
email: he@tde.lth.se; torkel@tde.lth.se

Abstract

A new VLSI architecture for real-time pipeline FFT processors is proposed. A hardware-oriented radix-$2^2$ algorithm is derived by integrating a twiddle factor decomposition technique into the divide and conquer approach. The radix-$2^2$ algorithm has the same multiplicative complexity as the radix-4 algorithm, but retains the butterfly structure of the radix-2 algorithm. The single-path delay-feedback architecture is used to exploit the spatial regularity in the signal flow graph of the algorithm. For length-$N$ DFT computation, the hardware requirement of the proposed architecture is minimal on both dominant components: $\log_4 N - 1$ complex multipliers and $N - 1$ complex data memory. The validity and efficiency of the architecture have been verified by simulation in the hardware description language VHDL.

I. INTRODUCTION

The pipeline FFT processor is a special class of processors for DFT computation utilizing fast algorithms. It is characterized by real-time, non-stop processing as the data sequence passes through the processor. It is an $AT^2$ non-optimal approach with $AT^2 = O(N^3)$, since the area lower bound is $O(N)$. However, it has been speculated [1] whether a new metric should be introduced for real-time processing, since such processing is necessarily non-optimal given its time complexity of $O(N)$. Although asymptotically almost all the feasible architectures have reached the area lower bound [2], the class of pipeline FFT processors probably has the smallest "constant factor" among the approaches that meet the time requirement, owing to its minimal number, $O(\log N)$, of Arithmetic Elements (AEs). The difference comes from the fact that an AE, especially the multiplier, takes a much larger area than a register in digital VLSI implementation. It is also interesting to note that at least $\Omega(\log N)$ AEs are necessary to meet the real-time processing requirement, due to the computational complexity of $\Omega(N \log N)$ for the FFT algorithm.
Thus it has the nature of a "lower bound" for the AE requirement. Any optimal architecture for real-time processing will likely have $\Omega(\log N)$ AEs.

Another major source of area/energy consumption in the FFT processor is the memory required to buffer the input data and the intermediate results of the computation. For large transform sizes, this turns out to be dominant [3, 4]. Although there is no formal proof, the area lower bound indicates that the lower bound for the number of registers is likely to be $\Omega(N)$. This is obviously true for any architecture implementing an FFT-based algorithm, since the butterfly at the first stage has to take data elements separated by a distance of $N/r$ in the input sequence, where $r$ is a small constant integer, the "radix".

Putting the above arguments together, a pipeline FFT processor necessarily has $\Omega(\log_r N)$ AEs and $\Omega(N)$ complex word registers. The optimal architecture has to be the one that reduces the "constant factor", that is, the absolute number of AEs (multipliers and adders) and the memory size, to the minimum.

In this paper a new approach for real-time pipeline FFT processors, the Radix-$2^2$ Single-path Delay Feedback, or R$2^2$SDF, architecture will be presented. We will begin with a brief review of previous approaches. A hardware-oriented radix-$2^2$ algorithm is then developed by integrating a twiddle factor decomposition technique into the divide and conquer approach to form a spatially regular signal flow graph (SFG). Mapping the algorithm to the cascading delay feedback structure leads to the proposed architecture. Finally we conclude with a comparison of the hardware requirements of R$2^2$SDF and several other popular pipeline architectures.

II. PIPELINE FFT PROCESSOR ARCHITECTURES

Before going into details of the new approach, it is beneficial to have a brief review of the various architectures for pipeline FFT processors.
To avoid being influenced by the sequence order, we assume that the real-time processing task only requires the input sequence to be in normal order, and that the output is allowed to be in digit-reversed (radix-2 or radix-4) order, which is permissible in applications such as DFT-based communication systems [5]. We also stick to the Decimation-In-Frequency (DIF) type of decomposition throughout the discussion.

The architecture design for pipeline FFT processors had been the subject of intensive research as early as the 1970s, when
real-time processing was demanded in applications such as radar signal processing [6], well before VLSI technology had advanced to the level of system integration. Several architectures have been proposed over the last two decades since then, along with the increasing interest and the leap forward of the technology. Here the different approaches will be put into functional blocks with unified terminology, where the additive butterfly has been separated from the multiplier to show the hardware requirement distinctively, as in Fig. 1. The control and twiddle factor reading mechanisms have also been omitted for clarity. All data and arithmetic operations are complex, and the constraint that $N$ is a power of 4 applies.

Figure 1: Various schemes for pipeline FFT processor. Panels: (i) R2MDC (N = 16); (ii) R2SDF (N = 16); (iii) R4SDF (N = 256); (iv) R4MDC (N = 256); (v) R4SDC (N = 256).

R2MDC: Radix-2 Multi-path Delay Commutator [6] was probably the most straightforward approach for pipeline implementation of the radix-2 FFT algorithm. The input sequence is broken into two parallel data streams flowing forward, with the correct "distance" between data elements entering the butterfly scheduled by proper delays. Both butterflies and multipliers are at 50% utilization. $\log_2 N - 2$ multipliers, $\log_2 N$ radix-2 butterflies and $\frac{3}{2}N - 2$ registers (delay elements) are required.

R2SDF: Radix-2 Single-path Delay Feedback [7] uses the registers more efficiently by storing the butterfly output in feedback shift registers. A single data stream goes through the multiplier at every stage. It has the same number of butterfly units and multipliers as the R2MDC approach, but with a much reduced memory requirement: $N - 1$ registers. Its memory requirement is minimal.

R4SDF: Radix-4 Single-path Delay Feedback [8] was proposed as a radix-4 version of R2SDF, employing CORDIC (Coordinate Rotational Digital Computer) iterations. The utilization of multipliers is increased to 75% due to the storage of 3 out of 4 radix-4 butterfly outputs. However, the utilization of the radix-4 butterfly, which is fairly complicated and contains at least 8 complex adders, drops to only 25%. It requires $\log_4 N - 1$ multipliers, $\log_4 N$ full radix-4 butterflies and storage of size $N - 1$.

R4MDC: Radix-4 Multi-path Delay Commutator [6] is a radix-4 version of R2MDC. It has been used as the architecture for the initial VLSI implementation of a pipeline FFT processor [3] and for massive wafer scale integration [9]. However, it suffers from low, 25%, utilization of all components, which can be compensated only in some special applications where four FFTs are being processed simultaneously. It requires $3\log_4 N$ multipliers, $\log_4 N$ full radix-4 butterflies and $\frac{5}{2}N - 4$ registers.

R4SDC: Radix-4 Single-path Delay Commutator [10] uses a modified radix-4 algorithm with programmable 1/4 radix-4 butterflies to achieve a higher, 75%, utilization of multipliers. A combined Delay-Commutator also reduces the memory requirement to $2N - 2$ from $\frac{5}{2}N - 4$, that of R4MDC. The butterfly and delay-commutator become relatively complicated due to the programmability requirement. R4SDC has been used recently in building the largest ever single chip pipeline FFT processor, for an HDTV application [4].

A swift skimming through of the architectures listed above reveals the distinctive merits of the different approaches. First, the delay-feedback approaches are always more efficient than the corresponding delay-commutator approaches in terms of memory utilization, since the stored butterfly output can be directly used by the multipliers. Second, radix-4 algorithm based single-path architectures have higher multiplier utilization, whereas radix-2 algorithm based architectures have simpler butterflies which are better utilized. The new approach developed in the following sections is highly motivated by these observations.

III. RADIX-$2^2$ DIF FFT ALGORITHM

By the observations made in the last section, the most desirable hardware-oriented algorithm is one that has the same number of non-trivial multiplications at the same positions in the SFG as the radix-4 algorithm, but has the same butterfly structure as the radix-2 algorithm. Strictly speaking, algorithms with this feature are not completely new. An SFG
with a complex "bias" factor had been obtained implicitly as the result of a constant-rotation/compensation procedure using restricted CORDIC operations [11]. Another algorithm, combining radix-4 and radix-"4 + 2" in DIT form, has been used to decrease the scaling error in the R2MDC architecture without altering the multiplier requirement [12]. A clear derivation of the algorithm in DIF form, with the aim of reducing the hardware requirement in the context of pipeline FFT processors, is, however, yet to be developed.

To avoid confusion with the well-known radix-2/4 split-radix algorithm and the mixed radix-"4 + 2" algorithm, the notion of the radix-$2^2$ algorithm is used to clearly reflect the structural relation with the radix-2 algorithm and the identical computational requirement with the radix-4 algorithm.

The DFT of size $N$ is defined by

$$
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad 0 \le k < N \tag{1}
$$

where $W_N$ denotes the $N$th primitive root of unity, with its exponent evaluated modulo $N$. To make the derivation of the new algorithm clearer, consider the first 2 steps of decomposition in the radix-2 DIF FFT together. Applying a 3-dimensional linear index map,

$$
\begin{aligned}
n &= \left\langle \tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3 \right\rangle_N \\
k &= \left\langle k_1 + 2k_2 + 4k_3 \right\rangle_N
\end{aligned}
\tag{2}
$$

the Common Factor Algorithm (CFA) has the form of

$$
\begin{aligned}
X(k_1 + 2k_2 + 4k_3)
&= \sum_{n_3=0}^{N/4-1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{1}
   x\!\left(\tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3\right)
   W_N^{\left(\frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)} \\
&= \sum_{n_3=0}^{N/4-1} \sum_{n_2=0}^{1}
   \left\{ B_{N/2}^{k_1}\!\left(\tfrac{N}{4} n_2 + n_3\right)
   W_N^{\left(\frac{N}{4} n_2 + n_3\right) k_1} \right\}
   W_N^{\left(\frac{N}{4} n_2 + n_3\right)(2k_2 + 4k_3)}
\end{aligned}
\tag{3}
$$

where the butterfly structure has the form of

$$
B_{N/2}^{k_1}\!\left(\tfrac{N}{4} n_2 + n_3\right)
= x\!\left(\tfrac{N}{4} n_2 + n_3\right) + (-1)^{k_1}\, x\!\left(\tfrac{N}{4} n_2 + n_3 + \tfrac{N}{2}\right)
$$

If the expression within the braces of eqn. (3) is computed before further decomposition, an ordinary radix-2 DIF FFT results. The key idea of the new algorithm is to carry the second step of decomposition over to the remaining DFT coefficients, including the twiddle factor $W_N^{\left(\frac{N}{4} n_2 + n_3\right) k_1}$, in order to exploit the exceptional values in multiplication before the next butterfly is constructed. Decomposing the composite twiddle factor, observe that

$$
\begin{aligned}
W_N^{\left(\frac{N}{4} n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)}
&= W_N^{N n_2 k_3}\, W_N^{\frac{N}{4} n_2 (k_1 + 2k_2)}\, W_N^{n_3 (k_1 + 2k_2)}\, W_N^{4 n_3 k_3} \\
&= (-j)^{n_2 (k_1 + 2k_2)}\, W_N^{n_3 (k_1 + 2k_2)}\, W_N^{4 n_3 k_3}
\end{aligned}
\tag{4}
$$

Now substitute eqn. (4) into eqn. (3) and expand the summation over index $n_2$.
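Spelled out, this intermediate step reads as follows. This is only a sketch using the symbols already defined in eqns. (3) and (4); the last line additionally uses the identity $W_N^{4 n_3 k_3} = W_{N/4}^{n_3 k_3}$.

$$
\begin{aligned}
X(k_1 + 2k_2 + 4k_3)
&= \sum_{n_3=0}^{N/4-1} \sum_{n_2=0}^{1}
   B_{N/2}^{k_1}\!\left(\tfrac{N}{4} n_2 + n_3\right)
   (-j)^{n_2 (k_1 + 2k_2)}\, W_N^{n_3 (k_1 + 2k_2)}\, W_N^{4 n_3 k_3} \\
&= \sum_{n_3=0}^{N/4-1}
   \left[ B_{N/2}^{k_1}(n_3) + (-j)^{(k_1 + 2k_2)}\, B_{N/2}^{k_1}\!\left(n_3 + \tfrac{N}{4}\right) \right]
   W_N^{n_3 (k_1 + 2k_2)}\, W_{N/4}^{n_3 k_3}
\end{aligned}
$$

The bracketed term depends only on $k_1$, $k_2$ and $n_3$, and involves no multiplication other than by $\pm 1$ and $\pm j$.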
After simplification we have a set of 4 DFTs of length $N/4$,

$$
X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \left[ H(k_1, k_2, n_3)\, W_N^{n_3 (k_1 + 2k_2)} \right] W_{N/4}^{n_3 k_3} \tag{5}
$$

where $H(k_1, k_2, n_3)$ is expressed in eqn. (6).

Figure 2: Butterfly with decomposed twiddle factors.

Eqn. (6) represents the first two stages of butterflies with only trivial multiplications in the SFG, as BF I and BF II in Fig. 2. After these two stages, full multipliers are required to compute the product with the decomposed twiddle factor $W_N^{n_3 (k_1 + 2k_2)}$ in eqn. (5), as shown in Fig. 2. Note that the order of the twiddle factors is different from that of the radix-4 algorithm.

Applying this CFA procedure recursively to the remaining DFTs of length $N/4$ in eqn. (5), the complete radix-$2^2$ DIF FFT algorithm is obtained. An $N = 16$ example is shown in Fig. 3, where small diamonds represent trivial multiplication by $W_N^{N/4} = -j$, which involves only real-imaginary swapping and sign inversion.

The radix-$2^2$ algorithm has the same multiplicative complexity as the radix-4 algorithm, but still retains the radix-2 butterfly structure. The multiplicative operations are arranged so that only every other stage has non-trivial multiplications. This is a great structural advantage over other algorithms when a pipeline/cascade FFT architecture is under consideration.

IV. R$2^2$SDF ARCHITECTURE

Mapping the radix-$2^2$ DIF FFT algorithm derived in the last section to the R2SDF architecture discussed in section II, a new architecture, the Radix-$2^2$ Single-path Delay Feedback (R$2^2$SDF) approach, is obtained.
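As a software cross-check of the algorithm derived in section III, the recursion of eqns. (5) and (6) can be sketched as below. This is a minimal sketch, not the authors' implementation: the function names `direct_dft` and `radix22_dif_fft` are hypothetical, $W_N = e^{-j2\pi/N}$ is assumed (consistent with $W_N^{N/4} = -j$), and the result is written back in natural order for easy comparison with eqn. (1).

```python
import cmath

def direct_dft(x):
    """Reference O(N^2) DFT taken directly from the definition, eqn. (1)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def radix22_dif_fft(x):
    """Recursive radix-2^2 DIF FFT sketch; len(x) must be a power of 4.

    Returns X in natural order (a hardware pipeline would emit it bit-reversed).
    """
    N = len(x)
    if N == 1:
        return list(x)
    if N % 4:
        raise ValueError("length must be a power of 4")
    q = N // 4
    X = [0j] * N
    for k1 in (0, 1):
        for k2 in (0, 1):
            trivial = (-1j) ** (k1 + 2 * k2)   # (-j)^(k1+2k2): one of 1, -j, -1, j
            sub = []
            for n3 in range(q):
                # First butterfly stage: the two bracketed terms of eqn. (6)
                upper = x[n3] + (-1) ** k1 * x[n3 + 2 * q]
                lower = x[n3 + q] + (-1) ** k1 * x[n3 + 3 * q]
                # Second butterfly stage, including the trivial rotation
                h = upper + trivial * lower
                # Full twiddle factor W_N^{n3 (k1 + 2 k2)} of eqn. (5)
                w = cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / N)
                sub.append(h * w)
            # Remaining length-N/4 DFT over n3, decomposed the same way
            for k3, value in enumerate(radix22_dif_fft(sub)):
                X[k1 + 2 * k2 + 4 * k3] = value
    return X
```

For instance, calling `radix22_dif_fft` on any length-16 complex sequence should agree with `direct_dft` up to floating-point rounding. Only the multiplication by `w` needs a full complex multiplier, and it occurs once per pair of butterfly stages.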
$$
H(k_1, k_2, n_3) =
\underbrace{
\overbrace{\left[ x(n_3) + (-1)^{k_1}\, x\!\left(n_3 + \tfrac{N}{2}\right) \right]}^{\text{BF I}}
+ (-j)^{(k_1 + 2k_2)}
\overbrace{\left[ x\!\left(n_3 + \tfrac{N}{4}\right) + (-1)^{k_1}\, x\!\left(n_3 + \tfrac{3N}{4}\right) \right]}^{\text{BF I}}
}_{\text{BF II}}
\tag{6}
$$

Figure 3: Radix-$2^2$ DIF FFT flow graph for N = 16.

Fig. 5 outlines an implementation of the R$2^2$SDF architecture for $N = 256$; note the similarity of the data-path to R2SDF and the reduced number of multipliers. The implementation uses two types of butterflies: one is identical to that in R2SDF, the other also contains the logic to implement the trivial twiddle factor multiplication, as shown in Fig. 4-(i) and Fig. 4-(ii) respectively. Due to the spatial regularity of the radix-$2^2$ algorithm, the synchronization control of the processor is very simple. A ($\log_2 N$)-bit binary counter serves two purposes: synchronization controller and address counter for twiddle factor reading in each stage.

With the help of the butterfly structures shown in Fig. 4, the scheduled operation of the R$2^2$SDF processor in Fig. 5 is as follows. On the first $N/2$ cycles, the 2-to-1 multiplexors in the first butterfly module switch to position "0", and the butterfly is idle. The input data from the left is directed to the shift registers until they are filled. On the next $N/2$ cycles, the multiplexors turn to position "1", and the butterfly computes a 2-point DFT with the incoming data and the data stored in the shift registers.

$$
\begin{aligned}
Z(n) &= x(n) + x(n + N/2), \qquad 0 \le n < N/2 \\
Z(n + N/2) &= x(n) - x(n + N/2)
\end{aligned}
\tag{7}
$$

The butterfly output $Z(n)$ is sent on to apply the twiddle factor, and $Z(n + N/2)$ is sent back to the shift registers to be "multiplied" during the next $N/2$ cycles, when the first half of the next frame of the time sequence is loaded in. The operation of the second butterfly is similar to that of the first one, except that the "distance" of the butterfly input sequence is just $N/4$ and the trivial twiddle factor multiplication is implemented by real-imaginary swapping with a commutator and controlled add/subtract operations, as in Fig. 4-(ii), which requires a two-bit control signal from the synchronizing counter.
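The feedback schedule described above can be modelled in a few lines of software. The following is a behavioural sketch under stated assumptions, not a description of the authors' VHDL model. The names `sdf_butterfly`, `r22sdf_fft` and `bit_reverse` are hypothetical. The start-up latency of each stage is stripped so that streams stay frame-aligned, the register is flushed with zeros instead of a following frame, and a complex multiplier applying the twiddle factors of eqn. (5) with $W_N = e^{-j2\pi/N}$ is placed after every second butterfly except the last.

```python
import cmath
from collections import deque

def sdf_butterfly(stream, depth, rotate=False):
    """One delay-feedback butterfly with a `depth`-word feedback shift register.

    The first `depth` start-up outputs are dropped so that output[i] lines up
    with frame position i (a modelling convenience, not a hardware feature).
    """
    fifo = deque([0j] * depth)
    out = []
    for t, sample in enumerate(list(stream) + [0j] * depth):  # extra cycles flush the register
        delayed = fifo.popleft()
        if (t // depth) % 2 == 0:
            # Multiplexor position "0": butterfly idle, register fills and drains
            fifo.append(sample)
            out.append(delayed)
        else:
            # Multiplexor position "1": 2-point DFT as in eqn. (7)
            if rotate and (t // (2 * depth)) % 2 == 1:
                sample *= -1j            # trivial twiddle factor -j (second control bit)
            out.append(delayed + sample)
            fifo.append(delayed - sample)
    return out[depth:]

def r22sdf_fft(x):
    """Functional model of the whole pipeline; output is in bit-reversed order."""
    N = len(x)
    if N < 4 or N & (N - 1) or (N.bit_length() - 1) % 2:
        raise ValueError("length must be a power of 4")
    stream, M = list(x), N
    while M >= 4:
        q = M // 4
        stream = sdf_butterfly(stream, 2 * q)            # first butterfly type
        stream = sdf_butterfly(stream, q, rotate=True)   # second type, with -j logic
        if M > 4:                                        # no multiplier after the last pair
            stream = [v * cmath.exp(-2j * cmath.pi * (u % q)
                                    * ((u // (2 * q)) % 2 + 2 * ((u // q) % 2)) / M)
                      for u, v in enumerate(stream)]
        M = q
    return stream

def bit_reverse(i, bits):
    """Reverse the `bits`-bit binary representation of i."""
    return int(format(i, "0{}b".format(bits))[::-1], 2)
```

With a length-16 input, element `u` of the returned list should match DFT bin `bit_reverse(u, 4)` up to rounding. In this model the decisions `(t // depth) % 2` and `(t // (2 * depth)) % 2` play the role of individual bits of the synchronizing counter.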
The data then goes through a full complex multiplier, working at 75% utilization, which completes the first level of the radix-4 DFT word by word. Further processing repeats this pattern, with the distance of the input data decreasing by half at each consecutive butterfly stage. After $N - 1$ clock cycles, the complete DFT transform result streams out to the right, in bit-reversed order. The next frame of the transform can be computed without pausing, due to the pipelined processing of each stage.

Figure 4: Butterfly structure for R$2^2$SDF FFT processor. Panels: (i) BF2I; (ii) BF2II.

In a practical implementation, pipeline registers should be inserted between each multiplier and butterfly stage to improve the performance. Shimming registers are also needed for the control signals to comply with the thus revised timing. The latency of the output is then increased to $N - 1 + 3(\log_4 N - 1)$ without affecting the throughput rate.

V. CONCLUSION

In this paper, a hardware-oriented radix-$2^2$ algorithm is derived which has the radix-4 multiplicative complexity but retains the radix-2 butterfly structure in the SFG. Based on this algorithm, a new, efficient pipeline FFT architecture, the R$2^2$SDF architecture, is put forward. The hardware requirement of the proposed architecture as compared with various approaches is shown in Table 1, where not only the number of complex
multipliers, adders and memory size but also the control complexity are listed for comparison. For easy reading, the base-4 logarithm is used whenever applicable. It shows that R$2^2$SDF has reached the minimum requirement for both multipliers and storage, and is second only to R4SDC for adders. This makes it an ideal architecture for VLSI implementation of pipeline FFT processors.

Figure 5: R$2^2$SDF pipeline FFT architecture for N = 256.

Table 1: Hardware requirement comparison

              multiplier #         adder #        memory size    control
R2MDC         $2(\log_4 N - 1)$    $4\log_4 N$    $3N/2 - 2$     simple
R2SDF         $2(\log_4 N - 1)$    $4\log_4 N$    $N - 1$        simple
R4SDF         $\log_4 N - 1$       $8\log_4 N$    $N - 1$        medium
R4MDC         $3(\log_4 N - 1)$    $8\log_4 N$    $5N/2 - 4$     simple
R4SDC         $\log_4 N - 1$       $3\log_4 N$    $2N - 2$       complex
R$2^2$SDF     $\log_4 N - 1$       $4\log_4 N$    $N - 1$        simple

The architecture has been modeled in the hardware description language VHDL with generic parameters for transform size and word-length, using fixed point arithmetic and a complex array multiplier implemented with distributed arithmetic. The validity and efficiency of the proposed architecture have been verified by extensive simulation.

REFERENCES

[1] C. D. Thompson. Fourier transform in VLSI. IEEE Trans. Comput., C-32(11):1047-1057, Nov. 1983.
[2] S. He and M. Torkelson. A new expandable 2D systolic array for DFT computation based on symbiosis of 1D arrays. In Proc. ICA3PP'95, pages 12-19, Brisbane, Australia, Apr. 1995.
[3] E. E. Swartzlander, W. K. W. Young, and S. J. Joseph. A radix 4 delay commutator for fast Fourier transform processor implementation. IEEE J. Solid-State Circuits, SC-19(5):702-709, Oct. 1984.
[4] E. Bidet, D. Castelain, C. Joanblanq, and P. Stenn. A fast single-chip implementation of 8192 complex point FFT. IEEE J. Solid-State Circuits, 30(3):300-305, Mar. 1995.
[5] M. Alard and R. Lassalle. Principles of modulation and channel coding for digital broadcasting for mobile receivers. EBU Review, (224):47-69, Aug. 1987.
[6] L. R. Rabiner and B. Gold. Theory and Application of Digital Signal Processing. Prentice-Hall, Inc., 1975.
[7] E. H. Wold and A. M. Despain. Pipeline and parallel-pipeline FFT processors for VLSI implementation. IEEE Trans. Comput., C-33(5):414-426, May 1984.
[8] A. M. Despain. Fourier transform computers using CORDIC iterations. IEEE Trans. Comput., C-23(10):993-1001, Oct. 1974.
[9] E. E. Swartzlander, V. K. Jain, and H. Hikawa. A radix 8 wafer scale FFT processor. J. VLSI Signal Processing, 4(2,3):165-176, May 1992.
[10] G. Bi and E. V. Jones. A pipelined FFT processor for word-sequential data. IEEE Trans. Acoust., Speech, Signal Processing, 37(12):1982-1985, Dec. 1989.
[11] A. M. Despain. Very fast Fourier transform algorithms hardware for implementation. IEEE Trans. Comput., C-28(5):333-341, May 1979.
[12] R. Storn. Radix-2 FFT-pipeline architecture with reduced noise-to-signal ratio. IEE Proc.-Vis. Image Signal Process., 141(2):81-86, Apr. 1994.