Discrete Fourier Transform Compiler: From Mathematical Representation to Efficient Hardware

Peter A. Milder, Franz Franchetti, James C. Hoe, and Markus Püschel
Electrical and Computer Engineering Department
Carnegie Mellon University
Pittsburgh, PA, U.S.A.
{pam, franzf, jhoe,

Abstract

A wide range of hardware implementations are possible for the discrete Fourier transform (DFT), offering different tradeoffs in throughput, latency, and cost. The well-understood structure of DFT algorithms makes possible a fully automatic synthesis framework that can span the viable and interesting design choices. In this paper, we present such a synthesis framework. It starts from formal mathematical formulas of a general class of fast DFT algorithms and produces performance- and cost-efficient sequential hardware implementations, making design decisions and tradeoffs according to user-specified high-level preferences. We present evaluations to demonstrate the variety of supported implementations and the cost/performance tradeoffs they allow.

I. INTRODUCTION

The discrete Fourier transform (DFT) is one of the ubiquitous building blocks in signal processing and other embedded processing applications. Its computation exhibits a high degree of regularity in structure, comprising recurring basic kernels. On one hand, the theories behind efficient hardware implementations have been studied extensively and are very well understood [1]. On the other, creating practical implementations remains challenging because it requires combined sophistication in the mathematics of transforms as well as in digital design.

When a design calls for a discrete Fourier transform, designers most often resort to instantiating pre-designed library implementations. Ready-to-use DFT modules are in the repertoire of nearly every technology vendor library, whether for ASICs or FPGAs. These library modules are designed by specialists and generally attain optimum performance for the amount of resources they consume.

However optimized, these static library modules can fall far short of optimum in a given application context due to a mismatch in objectives. For example, a static library module would offer exactly the same level of performance regardless of how much surplus logic is available. To address this limitation, Nordin et al. [2] have made available a parameterized DFT module generator that allows control over the level of hardware parallelism, so that the designer can make custom tradeoffs between the performance desired and the resources consumed in the generated module.

Given the well-understood regular structure of the DFT (and other linear DSP transforms in general), one should be able to fully capture the available design space in a synthesis system to fully automate the generation of high-quality hardware implementations. The parameterized generation engine in [2] is a step in the right direction. However, this technology is limited in that the engine is hard-coded for a specific DFT algorithm (Pease [3]) and only exploits one particular restructuring to derive the tradeoff in one specific dimension of the overall design space.

In this paper, we present a formula-to-hardware synthesis flow that accepts as input the mathematical representation of a general class of DFT algorithms and is capable of producing a wide range of correct hardware implementations (in synthesizable RTL Verilog), including latency-efficient iterative microarchitectures and throughput-efficient streaming microarchitectures. The input representation is based on a sparse-matrix formula language. Starting from pure mathematical formulas, the synthesis process, which comprises a set of formal formula-level rewrite rules, makes hardware implementation decisions and tradeoffs according to user-specified high-level preferences. The outcome is a set of fully annotated formulas that can be straightforwardly reduced to their corresponding sequential hardware implementations. The wealth of initial DFT formula choices and the rich combinations of structural rewrite rules together yield the large space of implementations attainable by this DFT synthesis framework.

Paper outline. Section II introduces the DFT and the formula language and sketches a rudimentary synthesis algorithm for combinational implementations. Section III presents the flow from formula to hardware synthesis in two parts: first, from formula to annotated hardware formula; second, from hardware formula to sequential datapath. Section IV reviews the synthesis flow by demonstrating two working examples. Section V evaluates a wide range of DFT design instances produced by our framework. Section VI discusses prior work in hardware DFT implementations. Finally, Section VII offers our discussion and conclusions.

II. BACKGROUND

Discrete Fourier transform. The discrete Fourier transform (DFT) of size $n$ is the matrix-vector multiplication
\[ y = \mathrm{DFT}_n\, x, \]
where $x$ and $y$ are the input and output vectors of length $n$, and
\[ \mathrm{DFT}_n = [\omega_n^{kl}]_{0 \le k,l < n}, \qquad \omega_n = \exp(-2\pi i/n), \qquad i = \sqrt{-1}. \]
In this paper we only consider two-power sizes $n$.

Fast Fourier transforms. Computing the DFT of $x$ by matrix-vector multiplication requires $O(n^2)$ operations. However, this can be reduced to $O(n \log n)$ using well-known fast algorithms (fast Fourier transforms, or FFTs). An FFT can be viewed as a factorization of $\mathrm{DFT}_n$ into a product of sparse matrices. For example:
\[
\mathrm{DFT}_4 =
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -i \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\]
Computing $\mathrm{DFT}_4\, x$ by multiplying the input $x$ from right to left with the four sparse matrices has a lower arithmetic cost than multiplying by the dense transform matrix.

Each of the occurring sparse matrices has structure, which can be used to express the algorithm using the Kronecker or tensor product formalism [1]. For example, the factorization above becomes
\[ \mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2)\, D_4\, (I_2 \otimes \mathrm{DFT}_2)\, L_{4,2}. \tag{1} \]
Here $D_4 = \mathrm{diag}(1, 1, 1, -i)$ is a diagonal matrix (called the twiddle matrix); $I_n$ is the $n \times n$ identity; and $L_{n,m}$ (for $m$ dividing $n$) is the stride permutation, which can be viewed as transposing an $n/m \times m$ matrix stored in a vector in row-major order. Formally,
\[ i\,(n/m) + j \mapsto j\,m + i, \qquad 0 \le i < m,\; 0 \le j < n/m. \]
Finally, $\otimes$ is the Kronecker or tensor product, defined as
\[ A_n \otimes B_m = [a_{k,l}\, B_m]_{0 \le k,l < n}, \qquad \text{for } A_n = [a_{k,l}]_{0 \le k,l < n}. \]
Generally, we write $A_n$ to denote that $A$ is $n \times n$.

Formula language for FFTs. The above formalism can be captured in a formal language that can be used to represent FFTs using formulas. In Backus-Naur form, the language is defined as follows (non-terminals are bold-faced):
\[
\begin{array}{rcl}
\mathbf{formula}_n & ::= & \mathbf{formula}_n \cdot \mathbf{formula}_n \\
 & \mid & I_k \otimes \mathbf{formula}_m \quad \text{where } n = km \\
 & \mid & \mathbf{formula}_m \otimes I_k \quad \text{where } n = km \\
 & \mid & \mathbf{base}_n \\
\mathbf{base}_n & ::= & D_n = \mathrm{diag}(d_0, \dots, d_{n-1}) \;\mid\; L_{n,m} \;\mid\; I_n \;\mid\; \mathrm{DFT}_2
\end{array}
\]
This language is a subset of the signal processing language (SPL) used in Spiral, a program generator for linear transforms [4]. We will also refer to it as SPL in this paper. Even though the language is small, a large class of different FFTs can be expressed with it.
We provide a few examples, where $n = r^t$ in (3) and (4):
\[ \mathrm{DFT}_{nm} = (\mathrm{DFT}_n \otimes I_m)\, D_{nm,m}\, (I_n \otimes \mathrm{DFT}_m)\, L_{nm,n} \tag{2} \]
\[ \mathrm{DFT}_{r^t} = L_{r^t,r} \left( \prod_{k=0}^{t-2} (I_{r^{t-1}} \otimes \mathrm{DFT}_r)\, D_{n,k}\, (I_{r^k} \otimes L_{r^{t-k},\,r^{t-k-1}})\, (I_{r^{k+1}} \otimes L_{r^{t-k-1},\,r}) \right) (I_{r^{t-1}} \otimes \mathrm{DFT}_r)\, R_{r^t,r} \tag{3} \]
\[ \mathrm{DFT}_{r^t} = \left[ \prod_{k=0}^{t-1} L_{r^t,r}\, (I_{r^{t-1}} \otimes \mathrm{DFT}_r)\, D_{n,k} \right] R_{r^t,r} \tag{4} \]
Equation (2) is the well-known recursive Cooley-Tukey FFT and the standard choice for software implementations. The others are iterative and suitable for hardware implementations. Equation (3) is called the iterative FFT, and (4) is the Pease FFT, which has perfect regularity across stages except for the diagonal matrices $D_{n,k}$, which depend on $k$. Both FFTs are given for an arbitrary radix $r$, where the radix indicates the size of the algorithm's basic block. Lastly, $R_{r^t,r}$ is the radix-$r$ digit reversal permutation, known as the bit reversal when $r = 2$.

Formula to combinational datapath. There exists a natural one-to-one correspondence between an SPL formula and a combinational logic implementation. We demonstrate it for the different formula constructs $M$ supported by the grammar:

$M_n = A_n B_n$: The input vector $x$ first passes through a combinational module corresponding to $B_n$, then another module corresponding to $A_n$ (Fig. 1(a)).

$M_{nm} = I_m \otimes A_n$: The resulting matrix is a block diagonal matrix that is zero everywhere except for $A_n$ repeated $m$ times on the diagonal. In combinational logic, the vector $x$ passes through $m$ parallel copies of $A_n$ such that each copy operates on a flit of $n$ consecutive elements from $x$ (Fig. 1(b)).

$M_{nm} = A_n \otimes I_m$: We can first rewrite this matrix as $L_{nm,n}\, (I_m \otimes A_n)\, L_{nm,m}$ (a known identity [1]) and handle it as the product of three matrices (Fig. 1(d)).

$M_n = L_{n,m}$: The corresponding reordering of the vector $x$ is achieved by reshuffling the busses that carry the elements of $x$ (Fig. 1(c)).

$M_n$ is diagonal: Each element of $x$ is multiplied by a corresponding constant (Fig. 1(e)).

$M_2 = \mathrm{DFT}_2$: This incurs the computations $y_0 = x_0 + x_1$ and $y_1 = x_0 - x_1$ and yields the so-called butterfly structure in Fig. 1(f).

Fig. 1 shows that it would be a straightforward task to generate combinational logic for any formula in the SPL language. For example, for the formula in (1), this compiler would produce the datapath shown in Fig. 1(g). For general $n$, the FFTs (3) and (4) would yield combinational logic that is $O(\log n)$ in depth and $O(n \log n)$ in size. Such combinational implementations are too expensive except for small problem sizes.
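As an illustration of this construct-by-construct correspondence, the following Python sketch evaluates formula (1) on a vector and compares the result with the DFT definition. It is only a functional model, not the compiler described in this paper. The helper names (`compose`, `i_tensor`, `tensor_i`, `stride_perm`, `diag`) are invented for this example, and the twiddle value $-i$ assumes $\omega_4 = \exp(-2\pi i/4)$.

```python
import cmath

def dft_direct(x):
    """DFT by definition: y_k = sum_l w^(k*l) * x_l with w = exp(-2*pi*i/n)."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(w ** (k * l) * x[l] for l in range(n)) for k in range(n)]

def compose(*stages):
    """Matrix product: stages are applied to the vector from right to left."""
    def apply(x):
        for stage in reversed(stages):
            x = stage(x)
        return x
    return apply

def dft2(x):
    """Butterfly: y0 = x0 + x1, y1 = x0 - x1."""
    return [x[0] + x[1], x[0] - x[1]]

def diag(d):
    """Diagonal matrix: scale each element by a constant."""
    return lambda x: [di * xi for di, xi in zip(d, x)]

def stride_perm(n, m):
    """L_{n,m}: output position i*(n/m)+j takes input element j*m+i."""
    def apply(x):
        y = [None] * n
        for i in range(m):
            for j in range(n // m):
                y[i * (n // m) + j] = x[j * m + i]
        return y
    return apply

def i_tensor(m, block, n):
    """I_m (x) A_n: m parallel copies, each on n consecutive elements."""
    return lambda x: [y for b in range(m) for y in block(x[b * n:(b + 1) * n])]

def tensor_i(block, n, m):
    """A_n (x) I_m, rewritten as L_{nm,n} (I_m (x) A_n) L_{nm,m}."""
    return compose(stride_perm(n * m, n), i_tensor(m, block, n), stride_perm(n * m, m))

def formula_1():
    """DFT_4 = (DFT_2 (x) I_2) D_4 (I_2 (x) DFT_2) L_{4,2}."""
    return compose(tensor_i(dft2, 2, 2),
                   diag([1, 1, 1, -1j]),
                   i_tensor(2, dft2, 2),
                   stride_perm(4, 2))

def close(a, b, tol=1e-9):
    return len(a) == len(b) and all(abs(p - q) < tol for p, q in zip(a, b))

x = [1, 2, 3, 4]
print(close(formula_1()(x), dft_direct(x)))
```

Each helper plays the role of one datapath element in Fig. 1, and `compose` wires them in series, so swapping in a different formula only changes the body of `formula_1`.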

Fig. 1. Examples of formulas and associated combinational datapaths: (a) $A_n B_n$; (b) $I_2 \otimes A_2$; (c) $L_{4,2}$; (d) $A_2 \otimes I_2$; (e) $D_4$; (f) $\mathrm{DFT}_2$; (g) $\mathrm{DFT}_4$ (from (1)).

Fig. 2. Block diagram of design flow (DSP transform, formula generation, formula, formula annotation, hardware formula, RTL generation, RTL netlist). The dashed block contains the focus of this paper.

Fig. 3. Examples of streaming reuse: (a) no streaming reuse, $I_m \otimes A_n$; (b) full streaming reuse, $I_m \otimes_{sr} A_n$, taking $m$ cycles; (c) partial streaming reuse, $I_{mn/w} \otimes_{sr} (I_{w/n} \otimes A_n)$, with $w/n$ blocks taking $mn/w$ cycles.

III. FROM FORMULA TO HARDWARE

Our goal is to automatically generate various sequential implementations of the DFT. Our formula language, as explained in Section II, does not correspond to a single sequential hardware implementation. In this section, we explain how we extend this language to express the sequential hardware elements needed for efficient implementations. Then, we introduce a rewriting system that takes a formula with hardware directives and produces a hardware description formula. Lastly, we discuss the process of compiling a hardware description formula to a synthesizable RTL netlist. This flow is illustrated by the diagram in Figure 2.

A. Hardware Signal Processing Language

The datapaths associated with various DFT algorithms exhibit a high degree of regularity. This regular structure gives an opportunity to reuse portions of the datapath in two ways. In this section, we examine both types of reuse and the language extensions needed to support them.

Streaming reuse. As seen in Section II, the tensor product $I_m \otimes A_n$ leads to a datapath with $m$ data-parallel instances of the block associated with $A_n$ (Fig. 3(a)). We can also interpret the tensor product as an indicator of parallelism in time, in a streaming fashion. Rather than having block $A_n$ repeated $m$ times in parallel, we can build one physical instance of it and reuse it over $m$ consecutive clock cycles (Figure 3(b)).
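The two readings of $I_m \otimes A_n$ can be compared with a small functional model. The Python sketch below is an illustration under simplifying assumptions, not the generated hardware: one loop iteration stands in for one clock cycle, pipelining is ignored, and the names `parallel_tensor`, `streaming_tensor`, and `to_stream` are made up for this example.

```python
def dft2(x):
    """Example block A_2: the butterfly y0 = x0 + x1, y1 = x0 - x1."""
    return [x[0] + x[1], x[0] - x[1]]

def parallel_tensor(m, block, n, x):
    """I_m (x) A_n as m side-by-side copies of the block (Fig. 3(a))."""
    out = []
    for b in range(m):
        out.extend(block(x[b * n:(b + 1) * n]))
    return out

def to_stream(x, width):
    """Split a vector into the groups of `width` words that arrive per cycle."""
    return [x[c:c + width] for c in range(0, len(x), width)]

def streaming_tensor(block, n, stream):
    """One physical block reused once per cycle (Fig. 3(b)).

    Returns the concatenated outputs and the number of cycles used.
    """
    out = []
    cycles = 0
    for words in stream:
        if len(words) != n:
            raise ValueError("each cycle must deliver exactly n words")
        out.extend(block(words))
        cycles += 1
    return out, cycles

x = [3, 1, 4, 1, 5, 9, 2, 6]
parallel = parallel_tensor(4, dft2, 2, x)
streamed, cycles = streaming_tensor(dft2, 2, to_stream(x, 2))
print(parallel == streamed, cycles)
```

Both versions compute the same vector; the streamed one uses a single block for four cycles where the parallel one uses four blocks at once.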
We call this streaming reuse. To distinguish between these two meanings of the tensor product, we tag the symbol as $\otimes_{sr}$ to indicate streaming reuse.

Additionally, it is possible to have partial streaming reuse. For example, $(I_4 \otimes A_n)$ can be broken down as $(I_2 \otimes_{sr} (I_2 \otimes A_n))$, meaning that there are two $A_n$ blocks in parallel, and each operates on data over two clock cycles. A generalized version of this situation can be seen in Figure 3(c). We use $w$ to indicate the stream width.

Horizontal reuse. In the previous section, we saw how data-parallel blocks could be vertically collapsed into one block. Additionally, a series of identical blocks (such as $\prod A_n$) can be horizontally collapsed into one block, as seen in Figure 4. We call this horizontal reuse. (Notice that $A_n$ could be streamed as well.) To distinguish between the two meanings of $\prod$, we tag the product term as $\prod^{hr}$ to indicate horizontal reuse.

It is important to note that the terms in a horizontal reuse product term cannot change from iteration to iteration, except in the case of diagonal matrices. For example, (3) would not be eligible, but (4) would. This is explained in detail in Section III-C.

Fig. 4. Example of horizontal reuse: (a) no horizontal reuse, $\prod A_n$; (b) full horizontal reuse, $\prod^{hr} A_n$.

B. Rewriting System: From Transform to Hardware Formula

In this section, we describe a rewriting system that takes a formula plus hardware directives and produces a hardware formula, which is restructured and annotated such that it directly corresponds to a sequential hardware implementation. This process corresponds to the formula annotation segment of Figure 2.

A hardware directive is a tag that indicates a desired feature of the final hardware implementation. To indicate streaming reuse, we define a streaming tag:
\[ \underbrace{A_n}_{\mathrm{stream}(w)} \]
This tag indicates that the contents of $A_n$ should be restructured such that the resulting hardware formula will be implemented in a block that contains $w$ input and output ports, with data streamed at $w$ elements per cycle.

Figure 5 lists the rewriting rules that perform this transformation. Each time the system encounters a stream-tagged formula, it attempts to restructure the formula or propagate the tag downward. If a tagged formula does not match any of these rules, the tag becomes part of the hardware formula. In these cases, the compiler (discussed in the following section) must explicitly know how to build a data structure for the tagged formula.

Fig. 5. Rewriting rules for generating hardware formulas.

base: $\underbrace{A_k}_{\mathrm{stream}(w)} \to A_k$, if $k = w$.
product-select: $\underbrace{\prod A}_{\mathrm{stream}(w)} \to \underbrace{\prod A}_{\mathrm{stream}(w)}$ if streaming, or $\underbrace{\prod^{hr} A}_{\mathrm{stream}(w)}$ if horizontal reuse.
product: $\underbrace{A\, B \cdots Z}_{\mathrm{stream}(w)} \to \underbrace{A}_{\mathrm{stream}(w)}\, \underbrace{B}_{\mathrm{stream}(w)} \cdots \underbrace{Z}_{\mathrm{stream}(w)}$.
product-hr: $\underbrace{\prod^{hr} A}_{\mathrm{stream}(w)} \to \prod^{hr} \underbrace{A}_{\mathrm{stream}(w)}$.
reuse: $\underbrace{I_l \otimes A_k}_{\mathrm{stream}(w)} \to I_{lk/w} \otimes_{sr} (I_{w/k} \otimes A_k)$, if $lk > w$ and $k \le w$.
reuse2: $\underbrace{I_l \otimes A_k}_{\mathrm{stream}(w)} \to I_l \otimes_{sr} \underbrace{A_k}_{\mathrm{stream}(w)}$, if $k > w$.
reverse: $A_k \otimes I_l \to L_{kl,k}\, (I_l \otimes A_k)\, L_{kl,l}$.

Each of the rewrite rules given in Figure 5 has a simple explanation:

base: If the size of a matrix is the same as its stream size, the stream tag is not necessary and can be dropped.

product-select: This rule selects whether to do horizontal reuse or streaming.

product and product-hr: If a group of matrices is tagged as streaming, the tag is propagated inward to all of the individual matrices. This rule applies to both versions of the product term.

reuse: If the size of $A$ is less than or equal to the size of the stream, the inner tensor product unrolls the correct number of $A$ instances such that the inner product is exactly the stream size.

reuse2: If the size of $A$ is larger than the size of the stream, the tag is propagated inward, and another rule must restructure $A$ to the right stream size.

reverse: A property of the tensor product allows us to reverse its order with strided access.

After rewriting, the resulting hardware formula may be made of the following blocks: formulas (without tags), streaming-reuse tensor products, horizontal-reuse product terms, streamed diagonals, and streamed permutations (i.e., stride and bit reversal permutations):
\[ A_n, \qquad I_l \otimes_{sr} A_n, \qquad \textstyle\prod^{hr} A_n, \qquad \underbrace{D_n}_{\mathrm{stream}(w)}, \qquad \underbrace{L_{n,m}}_{\mathrm{stream}(w)}, \qquad \underbrace{R_{n,r}}_{\mathrm{stream}(w)}. \]
In the following section, we discuss how the hardware formula, made of these five types of objects, is built into a Verilog description.

C. Compiler: From Hardware Formula to HDL

The compiler takes in a hardware formula, as defined above, and produces a synthesizable Verilog description of a circuit. In this section, we explain how each of the possible forms of the hardware formula is mapped.

Fig. 6. Structure for permuting streamed vector elements: $w$ input ports, an interconnection network, dual-port RAMs with address generation and control, and a second interconnection network.

Combinational formula. Any portion of the formula without a reuse tag can automatically be mapped into a combinational datapath, as discussed in Section II. When the compiler encounters this type of formula, it constructs a hardware datapath and automatically pipelines the path by inserting staging registers in the appropriate locations.

Specific streaming elements. We implement two elements that are built directly from a stream-tagged matrix. The compiler has specific knowledge of how to generate these blocks:

Streaming diagonal: A diagonal matrix scales each element in the input vector by the corresponding value from the diagonal of the matrix. To convert this into a streaming hardware structure, we first need $w$ multipliers, where $w$ is the stream width. Then, we store the values from the diagonal in $w$ tables, which feed the multipliers with the appropriate data at each cycle.

Streaming permutations: A streaming permutation implementation must reorder data in space and across different clock cycles. Püschel et al. [5] prescribe an architecture and an algorithm whereby an arbitrary permutation can be constructed for an arbitrary streaming width $w$ (where both the vector length $n$ and stream width $w$ are 2-powers). The construction, sketched in Figure 6, uses $w$ dual-ported memory banks and configurable switching networks at the input and output stages. For the relevant permutations, the controls for the permutation block can be computed cheaply from the flit number using only bitwise operators. For example, Figure 7 shows an implementation of $L_{256,2}$ with streaming width $w = 4$. Because this method works with a general class of permutations, it is able to implement products of permutations (e.g., $L_{8,4}\,(I_2 \otimes L_{4,2})$) as one self-contained permutation module.
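To see why a streamed permutation needs storage and switching rather than rewiring, the following Python sketch builds the index map of the product $L_{8,4}\,(I_2 \otimes L_{4,2})$ and lists, for each output cycle, the input cycle and port that every word comes from. It is a purely functional illustration and does not reproduce the RAM-based construction of [5]; the function names are hypothetical, and the convention "output position $i(n/m)+j$ reads input $jm+i$" is the one assumed for $L_{n,m}$ in Section II.

```python
def stride_perm(n, m):
    """Index map of L_{n,m}: output position i*(n/m)+j reads input j*m+i."""
    src = [0] * n
    for i in range(m):
        for j in range(n // m):
            src[i * (n // m) + j] = j * m + i
    return src

def block_diag(copies, src):
    """Index map of I_copies (x) P, where P has index map src."""
    size = len(src)
    return [b * size + s for b in range(copies) for s in src]

def fuse(outer, inner):
    """Index map of the product outer * inner (inner is applied first)."""
    return [inner[o] for o in outer]

def permute(src, x):
    """Apply an index map to a vector."""
    return [x[s] for s in src]

def arrival_schedule(src, w):
    """For each output cycle, the (input cycle, input port) of each word."""
    return [[(s // w, s % w) for s in src[c:c + w]]
            for c in range(0, len(src), w)]

# One fused permutation instead of two separate modules.
fused = fuse(stride_perm(8, 4), block_diag(2, stride_perm(4, 2)))
for cycle, sources in enumerate(arrival_schedule(fused, 2)):
    print(cycle, sources)
```

With $w = 2$, the first output cycle needs words that arrive in input cycles 0 and 2, both on port 0, so the words must be buffered across cycles and routed between ports.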
Finally, the compiler maps formulas tagged for horizontal or streaming reuse to the appropriate structural hardware construct:

Streaming reuse: A streaming reuse tensor product $I_m \otimes_{sr} A_n$ is implemented in hardware as one $A_n$ block intended for an input vector in a streamed format (as seen in Fig. 3(b)).

Horizontal reuse: As shown in Fig. 4(b), a horizontal reuse structure is built with an input multiplexer and feedback loop. When the block contains $D_k$, a diagonal matrix that changes with the iteration, the table must grow to accommodate all values. If a data vector iterates $l$ times over this datapath, the diagonal table must grow by a factor of $l$.

Fig. 7. Example of RAM-based permutation: $L_{256,2}$ with 4 ports (streaming vector of $w = 4$ words, four dual-port RAM banks, each with its own write address and read address).

IV. EXAMPLES: FROM FORMULA TO DATAPATH

We demonstrate the automated synthesis flow for two different DFT formulas.

Streaming reuse. The iterative FFT, given in Equation (3), produces a streaming reuse structure. For size 8 and radix 2, this formula simplifies to [footnote 1]:
\[ \mathrm{DFT}'_{2^3} = L_{8,2}\,(I_4 \otimes \mathrm{DFT}_2)\, D_{8,0}\, L_{8,4}\,(I_2 \otimes L_{4,2})\,(I_4 \otimes \mathrm{DFT}_2)\, D_{8,1}\,(I_2 \otimes L_{4,2})\,(I_4 \otimes \mathrm{DFT}_2). \]
The entire formula is then tagged as streaming with a stream size of 2:
\[ \mathrm{DFT}'_{2^3} = \underbrace{L_{8,2}\,(I_4 \otimes \mathrm{DFT}_2)\, D_{8,0}\, L_{8,4}\,(I_2 \otimes L_{4,2})\,(I_4 \otimes \mathrm{DFT}_2)\, D_{8,1}\,(I_2 \otimes L_{4,2})\,(I_4 \otimes \mathrm{DFT}_2)}_{\mathrm{stream}(2)}. \]
Then, the rewrite system described in Section III-B changes the formula to a hardware description formula. For this example, the rewrite rules produce the following hardware formula:
\[ \mathrm{DFT}'_{2^3} = \underbrace{L_{8,2}}_{\mathrm{stream}(2)}\,(I_4 \otimes_{sr} \mathrm{DFT}_2)\, \underbrace{D_{8,0}}_{\mathrm{stream}(2)}\, \underbrace{L_{8,4}\,(I_2 \otimes L_{4,2})}_{\mathrm{stream}(2)}\,(I_4 \otimes_{sr} \mathrm{DFT}_2)\, \underbrace{D_{8,1}}_{\mathrm{stream}(2)}\,(I_2 \otimes_{sr} \underbrace{L_{4,2}}_{\mathrm{stream}(2)})\,(I_4 \otimes_{sr} \mathrm{DFT}_2). \tag{5} \]
Each term in this equation is directly translated to a hardware datapath according to Section III-C (from the right of the formula to the left), producing the datapath seen in Fig. 8.

Footnote 1. We implement the DFT with the digit reversal permutation $R$ omitted. This is a common interface option in hardware DFT implementations. We indicate this in the formula as $\mathrm{DFT}'$.

Fig. 8. Datapath implementation of streaming $\mathrm{DFT}_8$ with stream size of 2 (butterflies interleaved with the streamed permutations $L_{4,2}$, $L_{8,4}\,(I_2 \otimes L_{4,2})$, and $L_{8,2}$).

Streaming and horizontal reuse. The Pease FFT algorithm [3] given in Equation (4) produces an architecture with both horizontal and streaming reuse. For size 16 and radix 4, this formula simplifies to:
\[ \mathrm{DFT}'_{4^2} = \prod_{k=0}^{1} L_{16,4}\,(I_4 \otimes \mathrm{DFT}_4)\, D_{16,k}. \]
Next, this formula is tagged with a stream size. Additionally, the product can be tagged for horizontal reuse, because the formula inside it only contains the iterator $k$ in the diagonal matrix. The formula is then converted to a hardware formula, as in Section III-B. If the stream size is set to 4, the following hardware formula is obtained:
\[ \mathrm{DFT}'_{4^2} = \prod_{k=0}^{1}{}^{hr}\; \underbrace{L_{16,4}}_{\mathrm{stream}(4)}\,(I_4 \otimes_{sr} \mathrm{DFT}_4)\, \underbrace{D_{16,k}}_{\mathrm{stream}(4)}. \tag{6} \]
Each term of this formula is translated directly to an RTL netlist (reading the formula from right to left), and the resulting datapath is shown in Fig. 9. However, this datapath includes two optimizations that are performed in the compiler:

The first optimization reduces the amount of arithmetic hardware that is built. Due to the structure of the diagonal matrix in the Pease algorithm, one out of every $r$ multipliers will always access a value of 1. By labelling these values as trivial constants and giving the system some additional arithmetic simplification rules, the tool is able to determine that these multipliers will always multiply by 1 and thus remove them. This reduces the number of multipliers by 1 out of every $r$.

The second optimization allows the amount of table data to be reduced by a factor of $\log_r(n)$. A horizontal reuse block with a diagonal matrix $D_{n,k}$ that changes with each iteration requires $n$ elements to be stored for each of the $\log_r(n)$ iterations, leading to a storage requirement of $n \log_r(n)$ words. However, the diagonals in the Pease formula possess a special property: the set of all values of $D_{n,l}$ (the diagonal values at iteration $l$) is a subset of the values of $D_{n,0}$ for $l > 0$. This means that with the right access function, all $n \log_r(n)$ data words can be obtained from a table of $n$ words. When a stream tag is applied to the Pease diagonal matrix, our system recognizes this property.
Then, it applies the correct access function to the representation and stores only the $n$ data words corresponding to $D_{n,0}$. The new access function is very simple, consisting of bit-shifts and bitwise ANDs of indices. So, the Pease storage requirement is reduced by a factor of $\log_r(n)$.

Fig. 9. Example of Pease $\mathrm{DFT}_{16}$ with $w = 4$.

V. EVALUATION

When coupled with a formula generator like Spiral [4], the formula synthesis flow described in this paper enables a large number of DFT designs to be explored quickly in a turnkey fashion. This section evaluates the supported range of implementations and the different cost/performance tradeoffs they provide. As we offer designs over a wide range of tradeoffs between performance and cost, our evaluations include a comparison to the Xilinx LogiCore FFT implementation [6] to establish that the tradeoffs have a sound basis. Specifically, we select as reference three LogiCore FFT implementations: the radix-4 burst I/O implementation, the radix-2 minimal size implementation, and the pipelined streaming I/O implementation, each with a scaled fixed-point (16-bit) data format and natural-in, bit-reverse-out data ordering.

A. Methodology

We implemented the formula-to-hardware synthesis flow as a new hardware backend to Spiral's formula generator, which produced all the starting DFT formulas studied in this paper. The hardware-specific formula rewriting rules discussed in Section III-B are implemented as a part of Spiral's formula manipulation stage to produce the annotated hardware formulas. Finally, an RTL generator, implemented in Java, emits synthesizable Verilog RTL descriptions from the hardware formulas.

When an evaluation in this section reports synthesized results, the Verilog descriptions are synthesized and placed-and-routed for the Xilinx XC2VP-6 FPGA using Xilinx ISE 8. We report implementation cost in units of slices. All synthesized designs use a 16-bit fixed-point data format. To curtail synthesis load, we consistently use 7 ns as the target clock period [footnote 2]. Our RTL generator will use a block-RAM for storage if the usage will utilize more than 5 percent of that block-RAM; otherwise, distributed-RAM is used instead [footnote 3].

B. Throughput Performance vs. Cost

We first evaluate the tradeoff between cost (in slices) and throughput performance (number of transforms completed per second). For $\mathrm{DFT}_{64}$ and $\mathrm{DFT}_{256}$, we evaluate implementations of the fully-streaming iterative FFT and the horizontal-reuse Pease FFT. For each algorithm/architecture combination, we explore radices 2, 4, and 8 [footnote 4]. We include implementations with streaming width from $w = r$ up to the maximum allowed by the FPGA capacity.

Their cost and throughput are reported in Figure 10. Throughput (y-axis) is presented in terms of gap, the time between starts in the steady state. The x-axis indicates cost in slices. The plots show separate trend lines for each combination of algorithm, radix, and architecture. Each trend line begins (left to right) with streaming width $w = r$ and doubles thereafter. In the Pareto-style plot, points closer to the origin represent designs that are smaller and faster. Only points on the Pareto front (points that are not overshadowed by another point that is both faster and smaller) should be used in practice. It is important to note that the Pareto front comprises points arising from different combinations of algorithmic and architectural decisions.

For both $\mathrm{DFT}_{64}$ and $\mathrm{DFT}_{256}$, the fully-streaming implementations based on the iterative FFT algorithm provide the fastest (yet commensurately more expensive) design points. For all radix choices, the results show an increase in throughput as more slices are consumed (by increasing the streaming width $w$). Implementations using larger radices generally have better performance/cost ratios than comparable implementations based on lower radices. This is because, for comparable choices of streaming width, all implementations consistently synthesize to comparable frequencies regardless of radix.
Hence, all streaming implementations of $\mathrm{DFT}_{64}$ with the same stream width should achieve comparable throughput. On the other hand, for the same stream width, higher radix implementations have the advantage of fewer permutation and twiddle stages. However, the difference between radix 8 and 4 is much less noticeable than between radix 4 and 2.

The throughput evaluation includes horizontal-reuse implementations based on the Pease FFT to provide the very cheap but commensurately lower throughput design points. Gap is still measured in terms of time between starts of new DFT computations, but these horizontal-reuse implementations cannot support continuous streaming of vectors.

Footnote 2. This methodology is acceptable because for moderate streaming width $w \le 8$, the cycle times of our DFT implementations are determined by critical paths in the complex arithmetic pipeline stages (which consistently synthesize to between 7 and 8 ns). For the larger and wider designs, the synthesized frequency inherently becomes less predictable (typically 2 to 8 ns) due to routing and placement effects. Overall, our methodology makes our performance results conservative, as our performance could possibly improve by choosing a different frequency target. When reporting synthesis results for the Xilinx LogiCore library, we report the highest performing outcome from synthesizing their designs over a range of target frequencies.

Footnote 3. Block-RAMs are the 16-kilobit memory hard macros in the Xilinx Virtex-II Pro FPGAs. Distributed-RAM consumes 16 bits per slice. In general, our generator lets the user set an arbitrary switch-over point between using block-RAM and distributed-RAM.

Footnote 4. The problem size must be a power of $r$, the radix.

Fig. 10. Gap (1/throughput) versus cost for implementations of $\mathrm{DFT}_{64}$ (top) and $\mathrm{DFT}_{256}$ (bottom). Plot title: Spiral generated FFT IP cores vs. Xilinx LogiCore, inverse throughput (gap) versus area. Axes: gap in microseconds versus area in slices. Series: Xilinx LogiCore, Spiral radix 2, Spiral radix 4, Spiral radix 8; streamed and not-streamed designs are marked separately.
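The Pareto-front selection used to read these plots can be sketched in a few lines of Python. The design names and the (slices, gap) values below are made-up placeholders chosen only to exercise the function; they are not measurements from this evaluation. The dominance test is the common "no worse in both metrics and strictly better in at least one" definition, which is an assumption about how ties should be handled.

```python
def pareto_front(designs):
    """Keep designs that no other design beats in both cost and gap.

    designs: list of (name, cost_in_slices, gap_in_microseconds).
    A design is dropped if another one is no larger and no slower,
    and strictly better in at least one of the two metrics.
    """
    front = []
    for name, cost, gap in designs:
        dominated = any(
            c <= cost and g <= gap and (c < cost or g < gap)
            for _, c, g in designs
        )
        if not dominated:
            front.append((name, cost, gap))
    return sorted(front, key=lambda d: d[1])

# Hypothetical design points (placeholders, not results from this paper).
designs = [
    ("A", 1000, 4.0),
    ("B", 2000, 2.0),
    ("C", 2500, 2.5),  # larger and slower than B
    ("D", 4000, 1.0),
    ("E", 1000, 5.0),  # same size as A but slower
]
for name, cost, gap in pareto_front(designs):
    print(name, cost, gap)
```

Because the front mixes points from different algorithm, radix, and architecture combinations, the function takes one flat list of designs rather than one list per trend line.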
Data points corresponding to the LogiCore FFT implementations are included in Figure 10. They serve as reference points to show that our designs are of good quality and yield a real increase in performance for the extra resources they consume.

C. Latency Performance vs. Cost

Next, we evaluate the tradeoff between cost (in slices) and latency performance (time elapsed for one transform computation). For $\mathrm{DFT}_{64}$, $\mathrm{DFT}_{256}$, and $\mathrm{DFT}_{1024}$, we evaluate implementations of the horizontal-reuse Pease FFT only. (Fully-streaming implementations are always Pareto sub-optimal in this regard because they are optimized for high throughput at the expense of extended latency.) We explore radices $r = \{2, 4, 8\}$ when $n = 64$; radices $r = \{2, 4\}$ for $n = 256$; and radices $r = \{2, 4\}$ for $n = 1024$. We include implementations from the minimum streaming width $w = r$ up to the maximum allowed by the FPGA capacity.

The cost and latency are reported in Figure 11. For each design point, the y-axis indicates latency in microseconds, and the x-axis indicates cost in slices. Again in this Pareto-style plot, points closer to the origin are cheaper and faster. Similar to the previous reporting format, the plots show separate trend lines for each combination of algorithm, radix, and architecture. Each trend line begins (left to right) with streaming width $w = r$ and doubles thereafter.

For all radix choices, the horizontal-reuse implementations show a decrease in latency as more resources are consumed for wider stream width $w$. Again, a large improvement in performance/cost ratio is seen for radix-4 relative to radix-2 implementations [footnote 5], but the difference between radix-8 and radix-4 is less significant.

Employing higher-radix implementations has another advantage that is more subtle. For example, to achieve the same latency, a radix-2 implementation needs approximately twice the streaming width of a radix-4 implementation (to get the same amount of computation per cycle). These performance-comparable radix-2 and radix-4 implementations will also have comparable cost. (The same relationship exists between radix-4 and radix-8.) The subtle but important difference is that a $w = 4$ radix-4 implementation only requires loading and unloading 4 vector elements per cycle at the start and finish of each computation, instead of 8 elements per cycle for a radix-2 implementation of comparable performance. For the same cost and performance, a higher radix implementation is more desirable due to this lower interface bandwidth.

Again, data points corresponding to the LogiCore DFT implementations are shown in Figure 11 to provide a baseline. Our horizontal-reuse implementations allow more direct comparisons against LogiCore's latency- and cost-optimized architectures. Among our different latency/cost implementations, the low-cost, high-latency implementations correspond most closely to LogiCore's tradeoff points.

Footnote 5. The results given for the radix-2 cases agree with our earlier work [2], which dealt specifically with radix-2 horizontal-reuse Pease FFT implementations. Improvements seen in the current results are due to the recently incorporated memory-based permutation blocks.

D. Range of Implementations

Given the multi-variable and multi-objective nature of optimizing FFT implementations, it is impossible to completely explore the full range of designs or to properly compare tradeoffs across all combinations of metrics. In Table I, we highlight some of the most salient design corners attainable using the design choices described in this paper. Columns 1 to 5 specify the corresponding decisions used (problem size, algorithm, radix $r$, architecture, stream width $w$). Columns 6 to 9 report the performance and cost metrics (throughput, latency, slices used, block-RAM used).

VI. RELATED WORK

An extensive base of fundamental work in FFT algorithms and architectures for VLSI and FPGA has laid the foundation for this work. The mathematical framework described in this paper is capable of representing a wide variety of designs, incorporating optimizations at both the algorithmic and architectural levels.

Examples of prior work in fully-streamed (or pipelined) FFT implementations can be seen in [7], [8], and [9]. In some previous pipelined implementations, arithmetic units are not fully utilizable (e.g., [7] and [9]) due to their permutation implementations. Examples of prior work examining horizontal-reuse FFT implementations can be seen in [10] and [11]. Specifically, Pease FFTs with horizontal reuse are discussed in [2] and [12].

On the whole, many prior developments have covered much of the same design space we considered in this paper. However, these implementations were tuned for different objectives and targeted different technologies, preventing a systematic representation of the design space. Our study is distinctive in its extensive coverage of varied implementation parameters, using real RTL designs and real FPGA synthesis.
Below, we highlight examples of some important design choices not examined in this study. We did not consider the impact of fixed-point precision [13] or floating-point arithmetic [14]. We considered neither on-the-fly twiddle generation using CORDIC [15] nor distributed arithmetic to optimize the arithmetic pipeline at the bit level [16]. We concentrated on performance and cost as the primary metrics and did not consider the issues of power or energy [17]. We also did not consider FFT processors designed specifically for executing FFT algorithms [18].

VII. CONCLUSION

This paper presents a DFT synthesis flow that captures an important range of implementation options. The synthesis flow starts from precise mathematical formulas of fast DFT algorithms and applies structural rewrite rules to impart appropriate hardware implementation decisions. The resulting annotated hardware formulas map straightforwardly to RTL netlists of efficient implementations. This synthesis flow can be coupled with the existing Spiral formula generator to fully automate DFT design exploration and synthesis.

The formula language and the synthesis procedure presented in this paper are sufficient for a wider range of transforms in addition to the DFT. The system, as is, can handle the Walsh-Hadamard transform and multidimensional DFTs. The central limitation to supporting a broader class of transforms is in constructing cost-effective streaming implementations of the required permutations. Recent work [5] has produced very efficient solutions to this problem. Thus, we plan to continue this work on other transforms (e.g., the discrete cosine transform or the DFT on real-valued inputs).

ACKNOWLEDGMENT

This work was supported by DARPA under DOI grant NBCH-59 and by NSF awards ACR and ITR/ACI

[Figure: Spiral-generated FFT IP cores vs. Xilinx LogiCore, latency (microseconds) versus area (slices). Panels: n = 64 (left), n = 256 (center), n = 1024 (right). Series: Xilinx LogiCore and Spiral cores at several radices, including radix 2 and radix 8.]

Fig. Latency versus cost for horizontal-reuse implementations of DFT 64, DFT 256, and DFT 1024 (from left to right).

TABLE I
COMPILATION OF SELECT REPRESENTATIVE IMPLEMENTATIONS AND DESIGN CORNERS.
Columns: n, algorithm, r, architecture, w, throughput (/µs), latency (µs), cost (slices), BRAMs, comments.

n      algorithm       r    architecture     comments
64     Pease           2    horiz. reuse     lowest cost
64     Iterative (3)   2    fully-streamed   best throughput
64     Pease                horiz. reuse     lowest latency per slice
256    Pease           2    horiz. reuse     lowest cost
256    Iterative (3)   16   fully-streamed   best throughput
256    Pease                horiz. reuse     balanced cost vs. latency
1024   Pease           2    horiz. reuse     lowest cost
1024   Pease           32   horiz. reuse     lowest latency
1024   Pease                horiz. reuse     balanced cost vs. latency

REFERENCES

[1] C. Van Loan. Computational Frameworks for the Fast Fourier Transform. SIAM, 1992.
[2] G. Nordin, P. Milder, J. Hoe, and M. Püschel. Automatic generation of customized discrete Fourier transform IPs. In Proceedings of the 42nd Annual Conference on Design Automation, 2005.
[3] M. C. Pease. An adaptation of the fast Fourier transform for parallel processing. Journal of the ACM, 15(2), April 1968.
[4] M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gačić, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on Program Generation, Optimization, and Adaptation, 93(2), 2005.
[5] M. Püschel, P. A. Milder, and J. C. Hoe. Permuting streaming data using RAMs. Journal submission under preparation.
[6] Xilinx, Inc. Xilinx LogiCore: Fast Fourier Transform v3.2.
[7] E. H. Wold and A. M. Despain. Pipeline and parallel-pipeline FFT processors for VLSI implementations. IEEE Transactions on Computers, C-33(5):414–426, May 1984.
[8] S. F. Gorman and J. M. Wills. Partial column FFT pipelines. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 42(6):414–423, 1995.
[9] S. He and M. Torkelson. A new approach to pipeline FFT processor. In Proc. International Parallel Processing Symposium, 1996.
[10] D. Cohen. Simplified control of FFT hardware. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(6), 1976.
[11] G. Szedo, V. Yang, and C. Dick. High-performance FFT processing using reconfigurable logic. In Proc. Asilomar Conference on Signals, Systems and Computers, 2001.
[12] M. Serra, P. Martí, and J. Carrabina. IFFT/FFT core architecture with an identical stage structure for wireless LAN communications. In Proc. IEEE Workshop on Signal Processing Advances in Wireless Communications, 2004.
[13] P. Kabal and B. Sayar. Performance of fixed-point FFT's: Rounding and scaling considerations. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 11, pages 221–224, April 1986.
[14] K. S. Hemmert and K. D. Underwood. An analysis of the double-precision floating-point FFT on FPGAs. In Proc. IEEE Symposium on Field-Programmable Custom Computing Machines, 2005.
[15] A. Banerjee, A. Sundar Dhar, and S. Banerjee. FPGA realization of a CORDIC based FFT processor for biomedical signal processing. Microprocessors and Microsystems, 25(3):131–142, May 2001.
[16] M. Shaditalab, G. Bois, and M. Sawan. Self sorting radix 2 FFT on FPGAs using parallel pipelined distributed arithmetic blocks. In Proc. IEEE Symposium on FPGAs for Custom Computing Machines, 1998.
[17] S. Choi, G. Govindu, J. Jang, and V. K. Prasanna. Energy-efficient and parameterized designs for fast Fourier transform on FPGA. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2003.
[18] P. Kumhom, J. Johnson, and P. Nagvajara. Design, optimization, and implementation of a universal FFT processor. In Proc. 13th IEEE ASIC/SOC Conference, 2000.
