Formal Datapath Representation and Manipulation for Implementing DSP Transforms


Peter A. Milder, Franz Franchetti, James C. Hoe, and Markus Püschel
Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA, U.S.A.

ABSTRACT

We present a domain-specific approach to representing datapaths for hardware implementations of linear signal transform algorithms. We extend the tensor structure for describing linear transform algorithms, adding the ability to explicitly characterize two important dimensions of datapath architecture. This representation allows both algorithm and datapath to be specified within a single formula and gives the designer the ability to easily consider a wide space of possible datapaths at a high level of abstraction. We have constructed a formula manipulation system based on this representation and have written a compiler that can translate a formula into a hardware implementation. This enables an automatic "push button" compilation flow that produces a register transfer level hardware description from high-level datapath directives and an algorithm (written as a formula). In our experimental results, we demonstrate that this approach yields efficient designs over a large tradeoff space.

Categories and Subject Descriptors: B.6.3 [Hardware]: Design Aids - Automatic synthesis
General Terms: Algorithms, Design
Keywords: linear transform, discrete Fourier transform, high-level synthesis, streaming

1. INTRODUCTION

Linear signal transforms such as the discrete Fourier transform are ubiquitous in digital signal processing (DSP) and scientific computing. Algorithms for computing these transforms are often highly structured and regular, which makes them well suited for hardware implementation. This regularity allows a wide space of potential datapath structures, each giving a different set of tradeoffs between performance and cost. It is very difficult for a designer to determine the structure that will yield the most efficient datapath for given cost or performance constraints.

Contribution.
In this paper, we take a domain-specific mathematical representation for describing linear DSP algorithms and extend it to include datapath concepts such as parallelism and explicit datapath reuse. The result is a mathematical language that we compile directly into hardware. Using this language, a designer specifies datapath options at the formula level. This leads to easier exploration of the design space by enabling algorithm restructuring through formula manipulation, which is performed automatically based on high-level directives. We have constructed a "push button" synthesis system that takes as input an algorithm (written as a formula) and high-level datapath directives (indicating desired qualities of the resulting design); it outputs a design in register-transfer level (RTL) Verilog.

Organization. We begin by introducing the tensor (or Kronecker) representation for transform algorithms in Section 2. Then, Section 3 discusses the datapath constructs we consider, how we are able to include them within the existing mathematical representation, and the associated performance and cost metrics. Additionally, we give a high-level view of our synthesis system. In Section 4, we evaluate our generated designs. We present experiments that demonstrate: (a) that the cost/performance tradeoffs obtained are competitive with good hand-designed implementations, (b) that this system produces designs across a wide tradeoff space, and (c) that real benefits are obtained by considering a variety of datapath structures. Lastly, we discuss related work in Section 5 and conclude in Section 6.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2008, June 8-13, 2008, Anaheim, California, USA. Copyright 2008 ACM. $5.00.

2. BACKGROUND

Transforms as matrices.
A linear transform may be viewed as a dense matrix; applying the transform is then a matrix-vector multiplication. For example, an $n$-point transform characterized by matrix $A_n$ is given by $y = A_n x$, where $x$ and $y$ are the $n$-point input and output vectors (respectively), and $A_n$ is an $n \times n$ matrix. Direct evaluation of the matrix-vector product requires $O(n^2)$ arithmetic operations.

Algorithms as formulas. Fast algorithms exist for many transforms that reduce the arithmetic cost to $O(n \log n)$. We view an algorithm as a decomposition of the dense matrix into a product of structured sparse matrices. The tensor (or Kronecker) formulation has been shown to be a compact and efficient way to represent fast transform algorithms [5, 11]. Recently, others have shown that a framework based on this formulation can be used to generate optimized software for today's high-performance computer systems [10].

[Figure 1: Examples of translating from formula to combinational datapath. Panels: (a) a product of two blocks; (b) $I_2 \otimes A_2$; (c) $D_4$; (d) $\mathrm{DFT}_4$ written as a product of permutations $P_4$, a diagonal $D_4$, and two tensor stages of the form $I_2 \otimes (\text{2-point block})$.]

Formula language. This algorithmic representation is captured in a formal language that represents algorithms using formulas, with each term in the formula having a corresponding combinational datapath representation. In Backus-Naur form, the language is defined as follows (the non-terminals are matrix and base):

$$
\begin{aligned}
\mathrm{matrix}_n ::=\ & \mathrm{matrix}_n \cdots \mathrm{matrix}_n \\
\mid\ & \textstyle\prod_i \mathrm{matrix}_n \\
\mid\ & I_k \otimes \mathrm{matrix}_m \quad \text{where } n = km \\
\mid\ & \mathrm{base}_n \\
\mathrm{base}_n ::=\ & D_n = \mathrm{diag}(d_0, \ldots, d_{n-1}) \mid P_n \mid I_n \mid A_n
\end{aligned}
$$

A matrix formula can be decomposed into a product or iterative product of matrix formulas (lines 1 and 2, illustrated in Figure 1(a)). Matrix $I_k$ is the $k \times k$ identity matrix, and $I_k \otimes \mathrm{matrix}_m$ is a tensor (or Kronecker) product, where $k$ parallel instances of $\mathrm{matrix}_m$ are applied to the data vector of size $n = km$ (Figure 1(b)). We use $P_n$ to denote a permutation on $n$ points and $D_n$ to represent a diagonal matrix, which has non-zero values along the main diagonal only, causing each value of the input vector to be scaled by a constant (Figure 1(c)). Lastly, we use $A_n$ to denote a generic dense matrix, which corresponds to a computational basic block. This language is a subset of the signal processing language (SPL) used in Spiral, a program generator for software implementations of linear transforms [10]. An algorithm written in this language can be mapped directly to a combinational datapath (Figure 1(d)), but the resulting datapath is infeasibly large for all but the smallest problem sizes.

3. DATAPATH REPRESENTATION

The tensor language described above can represent a wide range of algorithms, but it does not have the capability of representing sequential reuse of datapath components, where one computational block is used many times while solving a single problem. Sequential reuse is necessary for efficient and reasonably sized hardware designs.
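As a concrete illustration of the Section 2 formula language, the sketch below multiplies out a formula of the shape shown in Figure 1(d) and compares it with the dense 4-point DFT matrix. The specific choices here are assumptions, not taken from the paper: `P4` is the standard stride permutation, `D4` holds the usual Cooley-Tukey twiddle factors, and the 2-point block `F2` is the 2-point DFT. All helper names are hypothetical.

```python
import cmath

def kron(a, b):
    """Kronecker (tensor) product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def matmul(a, b):
    """Dense matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def diagonal(values):
    n = len(values)
    return [[values[i] if i == j else 0 for j in range(n)] for i in range(n)]

def permutation(order):
    """Permutation matrix whose output element i is input element order[i]."""
    n = len(order)
    return [[1 if j == order[i] else 0 for j in range(n)] for i in range(n)]

def dft(n):
    """Dense n-point DFT matrix, used only as the reference result."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (r * c) for c in range(n)] for r in range(n)]

F2 = [[1, 1], [1, -1]]            # 2-point basic block (assumed to be DFT_2)
P4 = permutation([0, 2, 1, 3])    # assumed stride permutation on 4 points
D4 = diagonal([1, 1, 1, -1j])     # assumed twiddle factors
STAGE = kron(identity(2), F2)     # I_2 (x) F_2: two parallel 2-point blocks

# A formula is applied right to left: the rightmost factor sees the input first.
factors = [P4, STAGE, P4, D4, STAGE, P4]
DFT4 = factors[0]
for factor in factors[1:]:
    DFT4 = matmul(DFT4, factor)

reference = dft(4)
max_error = max(abs(DFT4[i][j] - reference[i][j])
                for i in range(4) for j in range(4))
```

Each factor is sparse: a permutation is pure wiring, the diagonal is one constant multiplier per element, and `STAGE` is two independent copies of the 2-point block, which is what makes the direct mapping to a combinational datapath possible.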
In this section, we describe extensions to our formula language to represent two types of sequential reuse that are relevant for hardware designs. We show how these extensions enable explicit datapath description at the formula level and discuss how formulas are automatically translated into register-transfer level datapath descriptions.

[Figure 2: Examples of streaming reuse. (a) No streaming reuse: $I_m \otimes A_n$; the whole vector of size $mn$ is processed at once. (b) Full streaming reuse: $I_m \otimes_{sr} A_n$; one vector of size $mn$ is streamed at $n$ words per cycle over $m$ cycles. (c) Partial streaming reuse: $I_{mn/w} \otimes_{sr} (I_{w/n} \otimes A_n)$; $w$ words per cycle over $mn/w$ cycles through $w/n$ blocks.]

3.1 Streaming Reuse

As we saw in Section 2, the tensor product $I_m \otimes A_n$ indicates $m$ data-parallel instantiations of the block $A_n$ (Figure 2(a)). However, the same computation can be performed by other structures. For example, the tensor product can be interpreted as reuse in time (rather than parallelism in space). Then, we build a single instance of block $A_n$ and reuse it over $m$ consecutive cycles (Figure 2(b)). Rather than all $mn$ input points entering the system concurrently, they now stream in and out at a rate of $n$ words per cycle. We call this streaming reuse and represent it as $I_m \otimes_{sr} A_n$. We define streaming width as the number of inputs (or outputs) that enter (or exit) a section of datapath during each cycle. Here, the streaming width is $n$.

We can nest the two interpretations of $\otimes$ in order to build a partially parallel datapath that is reused over multiple cycles (Figure 2(c)). In general, $I_m \otimes A_n$ can be written as $I_{mn/w} \otimes_{sr} (I_{w/n} \otimes A_n)$, which results in a datapath with a streaming width of $w$, consisting of $w/n$ parallel instances of $A_n$, reused over $mn/w$ cycles ($w$ is a multiple of $n$; $w \le mn$). Increasing the streaming width increases the datapath's cost and throughput proportionally.

3.2 Iterative Reuse

The product of $m$ identical blocks $A_n$ can be written as $\prod_m A_n$. A straightforward interpretation of this is a series of $m$ blocks (Figure 3(a)). We can also perform the same computation by reusing the block $m$ times (Figure 3(b)). Now, the datapath must have a feedback mechanism to allow the data to cycle through the proper number of times.
We call this iterative reuse and represent it by adding the letters ir to the product term: $\prod^{ir}_m A_n$. By nesting both kinds of product terms, we specify a number of blocks in sequence to be reused a number of times (Figure 3(c)). In general, $\prod_m A_n$ can be restructured into $\prod^{ir}_{m/d} \left(\prod_d A_n\right)$, resulting in $d$ cascaded instances of $A_n$, iterated over $m/d$ times ($m/d$ is an integer). We define depth as the number of stages built (here, $d$).
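The two kinds of reuse from Sections 3.1 and 3.2 change the schedule of a computation but not its result. The following behavioral sketch (software only, not hardware, with hypothetical function names and an arbitrary example block) models one loop iteration as one cycle for streaming reuse and one pass through the cascade for iterative reuse.

```python
def butterfly(v):
    """Example 2-point block A_2 (sum and difference); any block would do."""
    return [v[0] + v[1], v[0] - v[1]]

def apply_parallel(block, n, x):
    """I_m (x) A_n with no reuse: one block instance per length-n chunk."""
    return [y for i in range(0, len(x), n) for y in block(x[i:i + n])]

def apply_streaming(block, n, w, x):
    """I_{mn/w} (x)_sr (I_{w/n} (x) A_n): w words enter per cycle."""
    if w % n or len(x) % w:
        raise ValueError("w must be a multiple of n and divide the vector size")
    out, cycles = [], 0
    for start in range(0, len(x), w):          # one iteration models one cycle
        out.extend(apply_parallel(block, n, x[start:start + w]))  # w/n blocks
        cycles += 1
    return out, cycles

def apply_cascade(stage, m, x):
    """prod_m A: m stages in series, no reuse."""
    for _ in range(m):
        x = stage(x)
    return x

def apply_iterative(stage, m, d, x):
    """prod^ir_{m/d} (prod_d A): d cascaded stages, reused m/d times."""
    if m % d:
        raise ValueError("d must divide m")
    passes = 0
    for _ in range(m // d):                    # one iteration models one pass
        x = apply_cascade(stage, d, x)
        passes += 1
    return x, passes

def stage(v):
    """Example full-width stage: rotate by one word, then apply butterflies."""
    return apply_parallel(butterfly, 2, v[1:] + v[:1])

x = list(range(8))
reference = apply_parallel(butterfly, 2, x)
streamed = {w: apply_streaming(butterfly, 2, w, x) for w in (2, 4, 8)}
iterated = {d: apply_iterative(stage, 6, d, x) for d in (1, 2, 3, 6)}
```

In this model, doubling `w` halves the cycle count while doubling the number of block instances used per cycle, and increasing `d` trades more stages for fewer passes.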

[Figure 3: Examples of iterative reuse. (a) No iterative reuse: $\prod_m A_n$, built from $m$ blocks. (b) Full iterative reuse: $\prod^{ir}_m A_n$, 1 block reused $m$ times. (c) Partial iterative reuse: $\prod^{ir}_{m/d} \left(\prod_d A_n\right)$, $d$ blocks reused $m/d$ times.]

When an iterative reuse datapath is built, it is important that the reused portion of the datapath buffer the entire vector, so the head of the data maintains sufficient distance from its own tail. This is equivalent to requiring that the latency (in cycles) be at least 1/(its throughput in transforms per cycle). If the datapath does not naturally have this property, it is necessary to add buffers to increase its latency. We will see an example of this in the following section.

3.3 Combining Streaming and Iterative Reuse

Often, transform algorithms contain the form $\prod_k (I_m \otimes A_n)$. This structure can utilize both iterative reuse (due to the $\prod$) and streaming reuse (due to $I_m \otimes A_n$), allowing a wide range of hybrid implementations that exhibit flexibility across two dimensions. We can restructure this formula to have streaming and iterative reuse of parameterized amounts:

$$\prod^{ir}_{k/d} \left( \prod_d \left( I_{mn/w} \otimes_{sr} (I_{w/n} \otimes A_n) \right) \right),$$

where $d$ is the depth of the cascaded stages (ranging from 1 to $k$; $k/d$ must be an integer). Parameter $w$ is the streaming width, a multiple of $n$. This parameterized datapath is illustrated in Figure 4. Each stage consists of $w/n$ parallel instances of $A_n$; $d$ stages are built in series. Let $B_{mn}$ represent this array of $dw/n$ many $A_n$ blocks. Data are loaded into $B_{mn}$ at a rate of $w$ per cycle over $mn/w$ cycles. The vector feeds back through $B_{mn}$ a total of $k/d$ times.

Latency and throughput. Given this combined reuse example, we can analyze the effect of parameters $d$ and $w$ on the datapath. In Table 1, we first describe several general rules for deriving the latency, throughput, and an approximate area cost of basic formula constructs. Below, we present calculations that correspond to evaluating the general rules from Table 1 for the specific parameters of this combined reuse example (Figure 4).
In these calculations, we assume that $B_{mn}$ (the collective block of $A_n$ blocks) is fully pipelined, i.e., its throughput is dictated by the problem size and streaming width only: $T(B_{mn}) = w/mn$. The analysis of latency and throughput for this combined reuse example includes the following two cases:

[Figure 4: Combining iterative and streaming reuse: $\prod^{ir}_{k/d} \left( \prod_d \left( I_{mn/w} \otimes_{sr} (I_{w/n} \otimes A_n) \right) \right)$. Data enter at $w$ words per cycle over $mn/w$ cycles; $d$ stages are reused $k/d$ times; each column holds $w/n$ blocks.]

Case 1: Iterative reuse. This case occurs when $d < k$, meaning the data will iterate over the internal block at least 2 times. As discussed in Section 3.2, the internal block's minimum latency is determined by its throughput. So, if $d \cdot L(A_n) < mn/w$, buffers are added until they are equal. Thus, internal block $B_{mn}$ has latency $L(B_{mn}) = \max(mn/w,\ d \cdot L(A_n))$. The latency of the whole system is $k/d$ times this, giving latency $= \max(mnk/(dw),\ k \cdot L(A_n))$. Because we are utilizing iterative reuse, a new vector cannot enter until the previous vector begins exiting the datapath, so the throughput (in transforms per cycle) is the inverse of the latency, $\min(dw/(mnk),\ 1/(k \cdot L(A_n)))$.

Case 2: No iterative reuse. This case occurs when $d = k$. Now, no iterative reuse is performed; the data only passes through the inner block once. The datapath consists of $d = k$ stages, giving latency $= k \cdot L(A_n)$. Because the data never feeds back, the throughput is limited only by the streaming width, giving throughput $= w/mn$ transforms per cycle.

From these equations, we see that increasing $w$ and $d$ will lead to lower latency and higher throughput in equal weights, until either the data flows so quickly that the latency of the computation dominates ($d \cdot L(A_n) > mn/w$), or $d$ increases until no iterative reuse is performed ($d = k$).

Flexibility. Additionally, there is one important distinction that must be made between parameters $d$ and $w$: as $w$ grows, the datapath requires greater bandwidth at its ports, and the cost of interconnect and multiplexers increases. For this reason, it is preferable to increase $d$ instead of $w$. However, we also note that $d$ must divide $k$ evenly ($k$ is typically the $\log_2$ of the transform size).
In many cases, this becomes an "all or nothing" situation, where the only options are $d = 1$ and $d = k$. In those cases, the added flexibility provided by $w$ is important. Lastly, we note that when the datapath does not employ iterative reuse (i.e., when $d = k$), the designer typically has a wider choice of algorithms because the internal stages are not required to be uniform.

Datapath efficiency and vector interleaving. Assume we have an iterative reuse datapath that reuses block $B$. Here, $B$ can represent any datapath we consider in this paper, including those with further iterative reuse internally. $B$ has an inherent latency $L(B)$ and throughput $T(B)$ (determined by the inverse of the minimum initiation interval of input vectors). With a single vector recirculating through $B$, the effective throughput of $B$ may be further limited to $1/L(B)$ if $L(B)$ is greater than the minimum initiation interval. In this case the head of the vector is still inside $B$ when $B$'s input is ready to accept a new iteration.
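The two cases of the combined reuse analysis in Section 3.3 can be evaluated numerically. The sketch below is a direct transcription of those formulas under the same fully pipelined assumption; the function names and the example block latency of 4 cycles are hypothetical and chosen only for illustration. Exact fractions are used so that throughput values are not rounded.

```python
from fractions import Fraction

def valid_depths(k):
    """Depths d that divide k evenly, as required for iterative reuse."""
    return [d for d in range(1, k + 1) if k % d == 0]

def combined_reuse_metrics(m, n, k, w, d, block_latency):
    """Latency (cycles) and throughput (transforms per cycle) of
    prod^ir_{k/d} ( prod_d ( I_{mn/w} (x)_sr (I_{w/n} (x) A_n) ) ),
    where block_latency stands for L(A_n)."""
    if w % n or (m * n) % w or k % d:
        raise ValueError("need n | w, w | mn, and d | k")
    stream_cycles = Fraction(m * n, w)         # cycles to stream one vector
    if d < k:
        # Case 1: iterative reuse. Buffers pad the inner block so that its
        # latency is at least the time needed to stream in one vector.
        inner_latency = max(stream_cycles, Fraction(d * block_latency))
        latency = Fraction(k, d) * inner_latency
        throughput = 1 / latency
    else:
        # Case 2: no iterative reuse; throughput is set by the streaming width.
        latency = Fraction(k * block_latency)
        throughput = Fraction(w, m * n)
    return latency, throughput

# Example: k = 10 stages of I_512 (x) A_2 (a 1024-point problem), L(A_2) = 4.
table = {d: combined_reuse_metrics(512, 2, 10, 2, d, 4) for d in valid_depths(10)}
```

For a prime number of stages such as `k = 7`, `valid_depths` returns only `[1, 7]`, which is the "all or nothing" situation in which the streaming width becomes the only finely adjustable parameter.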

| Formula $F$ | Latency $L(F)$ | Throughput $T(F)$ | Area cost $C(F)$ |
|---|---|---|---|
| $F_n = A_n^{(0)} A_n^{(1)} \cdots A_n^{(m-1)}$ | $\sum_i L(A_n^{(i)})$ | $\min_i T(A_n^{(i)})$ | $\sum_i C(A_n^{(i)})$ |
| $F_n = \prod^{ir}_k A_n$ | $\max(k/T(A_n),\ k \cdot L(A_n))$ | $\min(T(A_n)/k,\ 1/(k \cdot L(A_n)))$ | $C(A_n) + C(\mathrm{mux})$ |
| $F_{mn} = I_m \otimes A_n$ | $L(A_n)$ | $T(A_n)$ | $m \cdot C(A_n)$ |
| $F_{mn} = I_m \otimes_{sr} A_n$ | $L(A_n)$ | $T(A_n)/m$ | $C(A_n)$ |

Table 1: Given a matrix formula $F$, formulas for latency $L(F)$ (in cycles), throughput $T(F)$ (in transforms per cycle) and approximate area cost $C(F)$ (relative to the area cost of sub-modules).

We can define a utilization ratio $R$ of the effective throughput to the inherent throughput of $B$; this quantifies the portion of $B$'s potential throughput that is utilized in the system. For a single vector, $R = (1/L(B))/T(B)$. When the utilization by a single vector is sufficiently low, we can interleave multiple vectors to make use of the full throughput capacity of $B$. Formally, if $R \le 1/V$ (where $V$ is an integer), we may interleave $V$ computations through the datapath, increasing the effective throughput and thus increasing the utilization ratio to $R = (V/L(B))/T(B)$.

In some cases, a designer may want to increase $L(B)$ artificially for better efficiency. For example, if $R = 0.55$, the designer could insert delay buffers in the datapath (increase $L(B)$) until $R$ is reduced to 0.5 and then interleave two vectors. This increased utilization yields higher throughput at the expense of added latency, so the designer's particular application requirements will determine the suitability of this approach. Our compilation framework, discussed next, can utilize either strategy.

3.4 Compilation: From Math to RTL

We have built a compilation framework that takes an algorithm written as a formula, automatically manipulates it to describe a datapath, and translates the resulting design into register-transfer level (RTL) Verilog. A full explanation of this compilation framework is outside of the scope of this paper, but we present a high-level description here. First, the algorithm is expanded into a formula in the language defined in Section 2.
This formula corresponds to the computation that will be performed but does not specify the structure of the design that will perform it. Next, datapath directives are added that give desired characteristics of the final implementation (e.g., streaming width). A formula rewriting system then propagates these directives into the formula and restructures each term to match the desired characteristics. A hardware formula is produced that explicitly specifies the datapath architecture. Lastly, the hardware formula is translated to an RTL netlist that has the desired reuse characteristics. Computational blocks are implemented according to the base matrices and streaming permutations are built with memory and interconnection networks.

4. EVALUATION

In this section, we evaluate designs produced using the proposed method. First, we explain our methodology. Then, we give several examples of transform algorithms that utilize $\otimes$ and $\prod$ and demonstrate that streaming and iterative reuse lead to a wide tradeoff space in the resulting designs. We compare our generated designs with existing benchmarks in order to demonstrate the quality of our cores. We also evaluate a transform with a wider tradeoff space and examine how high-level design decisions affect the resulting design. Lastly, we discuss the generality of this approach and show other transform algorithms that utilize $\otimes$ and $\prod$, i.e., algorithms that can be implemented with streaming and iterative reuse.

4.1 Methodology

We have implemented the compilation framework that is described in Section 3.4 as a new backend to the Spiral formula generation framework [10]. Spiral is used to generate the starting formula for a given transform, and we have modified the tool to perform the formula manipulation associated with our extensions to the tensor formula language. Lastly, we have written a standalone compiler that translates a hardware formula into a register-transfer level (RTL) description. The tools are integrated, resulting in a completely automated flow from problem description to RTL Verilog.
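To make the rewriting step of Section 3.4 more tangible, here is a toy sketch of how two directives (streaming width and depth) could restructure a formula of the form $\prod_k (I_m \otimes A_n)$ into the hardware formula of Section 3.3. It is not the authors' rewriting system or any Spiral API: the classes `Block`, `Tensor`, `Product` and the function `apply_directives` are hypothetical, and only this single formula shape is handled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    name: str
    size: int                      # number of points the block operates on

@dataclass(frozen=True)
class Tensor:
    copies: int                    # I_copies (x) child
    child: object
    streamed: bool = False         # True means streaming reuse (sr)

@dataclass(frozen=True)
class Product:
    count: int                     # prod_count child
    child: object
    iterative: bool = False        # True means iterative reuse (ir)

def apply_directives(formula, width, depth):
    """Rewrite prod_k (I_m (x) A_n) for streaming width w and depth d."""
    if not (isinstance(formula, Product) and isinstance(formula.child, Tensor)
            and isinstance(formula.child.child, Block)):
        raise ValueError("expected a formula of the form prod_k (I_m (x) A_n)")
    k, m, block = formula.count, formula.child.copies, formula.child.child
    n = block.size
    if width % n or (m * n) % width or k % depth:
        raise ValueError("directives do not divide the formula evenly")
    stage = Tensor(m * n // width, Tensor(width // n, block), streamed=True)
    return Product(k // depth, Product(depth, stage), iterative=True)

def show(f):
    """Plain-text rendering of a formula tree."""
    if isinstance(f, Block):
        return f"{f.name}_{f.size}"
    if isinstance(f, Tensor):
        op = "(x)_sr" if f.streamed else "(x)"
        return f"(I_{f.copies} {op} {show(f.child)})"
    mark = "^ir" if f.iterative else ""
    return f"prod{mark}_{f.count} {show(f.child)}"

algorithm = Product(10, Tensor(512, Block("A", 2)))
hardware = apply_directives(algorithm, width=4, depth=2)
```

The sketch does not simplify degenerate cases (for example `depth == k`, where the outer iterative product has a single pass); a real rewriting system would need rules for those as well as for permutations and diagonals.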
In this section, we evaluate various designs produced with our framework. Here, we target the Xilinx Virtex-5 LX 330 FPGA and generate designs that use a 16 bit fixed point data type. (Data type and bit width are parameters of our generation framework. Currently, our tool supports fixed point data types of any bit width and single precision floating point.) We use Xilinx ISE 9.1i to synthesize and place and route the designs. When memory is required, we use an on-chip block RAM (BRAM) if we can utilize 5% of its storage capacity. Otherwise, we use distributed RAM, i.e., memory distributed across the FPGA's logic cells. Although an FPGA platform provides a convenient target for evaluation, the designs we generate are not limited to FPGAs.

4.2 Quality of Generated Designs

In this section, we demonstrate that the designs produced by our framework are competitive with cores that are commercially available or found in recent literature. We choose the discrete Fourier transform (DFT) of size 1024; we evaluate our designs relative to cores from the commercially available Xilinx LogiCore FFT version 4.1 and the designs from our previous work [8]. The algorithm and architectures we considered in [8] are subsets of the space we consider here.

We generate cores based upon two algorithms: the Pease FFT [9] and an iterative version of the Cooley-Tukey FFT [3]. Although they are different algorithms, at the high level, both are of the form:

$$\mathrm{DFT}_n = \left( \prod_{i=0}^{\log_r(n)-1} P_n \left( I_{n/r} \otimes \mathrm{DFT}_r \right) D_n \right) P_n,$$

where $r$ is a power of two. (See Section 2 for an explanation of the terms in this formula.) We generate a variety of designs using these algorithms with various depths, streaming widths, and radices (values of $r$ in the formula above). Here we consider steady state throughput (given in million samples per second) as our performance metric and area (in terms of FPGA slices) as our cost metric.

[Figure 5: Throughput for varying implementations of DFT 1024 (16 bit fixed point) on Xilinx Virtex-5 FPGA. Axes: throughput [million samples per second] versus area [slices]. Series: generated, with iterative reuse; generated, without iterative reuse; Nordin, DAC05; Xilinx LogiCore FFT v4.1.]

Figure 5 shows throughput for varying implementations of DFT 1024. From our data we plot only the Pareto optimal points, i.e., those that are not eclipsed by another design that is both smaller and faster. From these results, we see that the cost and performance values of the LogiCore designs are similar to those of our smallest cores. Furthermore, we see that our larger cores provide a commensurate increase in performance for the extra resources they consume.

Our previous work [8] covers a small subset of the datapath and algorithmic options we consider in this paper. In Figure 5, we see that the added flexibility of our current method leads to significant improvements over [8]; the designs in our Pareto optimal set all provide higher performance at lower cost. Similar trends are obtained if we choose a different value for $n$ and/or measure latency instead of throughput.

4.3 Automatic Design Space Exploration

In this section, we consider an algorithm for the two-dimensional discrete Fourier transform (2D DFT). This algorithm utilizes $\otimes$ and two $\prod$ terms, giving a very wide space of possible datapaths. This algorithm operates on $n^2$ points and has the following form:

$$\mathrm{DFT}_{n \times n} = \prod_{k=0}^{1} \left( \prod_{l=0}^{t-1} P_{n^2} \left( I_{n^2/r} \otimes \mathrm{DFT}_r \right) D_{n^2} \right) P_{n^2},$$

where $t = \log_r(n)$. This gives two iterative product terms, as seen above. Each may utilize iterative reuse, which we characterize with depth parameters $d_1$ and $d_2$ (see Section 3.2).

[Figure 6(a): $d_1 = 1$. The outer product term is iteratively reused.]
This is illustrated in Figure 6. We define $d_2$ to be the depth of the inner product; $d_2$ can be between 1 and $t$, provided that $t/d_2$ is an integer. Each internal block (a shaded region in Figure 6) consists of $d_2$ stages of $P (I \otimes \mathrm{DFT}_r) D$, streamed with $w$ ports. We define $d_1$ to be the depth of the outer product; $d_1$ can be either 1 or 2. When $d_1 = 1$, the outer product term is iteratively reused (Figure 6(a)). When $d_1 = 2$, the outer term is unrolled, giving two cascaded stages as seen in Figure 6(b).

[Figure 6(b): $d_1 = 2$. The outer product term is fully unrolled, i.e., not iteratively reused. Figure 6: Illustrations of $\mathrm{DFT}_{n \times n}$ with outer product term parameterized by $d_1$, inner product term parameterized by $d_2$, and streaming width $w$.]

Exploration. Now, we present results of a datapath exploration for 2D DFT $64 \times 64$. We generate cores across all possible values of $d_1$ and $d_2$ with the streaming width $w$ ranging between $r$ and 16. Parameter $r$, the radix, is 2, 4, or 8. Summing these possibilities for $d_1$, $d_2$, $w$, and $r$, we have a total of 52 different architectures in this design space. We generate each hardware core, synthesize it, and place and route it.

In Figure 7, we show the throughput (in million samples per second) versus area (in FPGA slices) for all 52 data points, with different markers for each value of $w$. The black line passes through the Pareto optimal points. In this data set, the smallest Pareto optimal point is the maximally folded design: $w = 2$, $d_1 = 1$, $d_2 = 1$. From there, we continue along the Pareto optimal set by first increasing the $d$ parameters while keeping $w = 2$ (white diamonds in Figure 7). Increasing $w$ yields designs in the Pareto optimal set only after several values of $d_1$ and $d_2$ have been included. This observation, that it is preferable to first increase $d_1$ and $d_2$ before increasing $w$, is supported by our theoretical understanding of streaming and iterative reuse as outlined in Section 3.3.
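The size of this design space can be reproduced by enumerating the parameter constraints stated above. The sketch below assumes (this is an interpretation, not stated explicitly in the text) that streaming widths are powers of two that are multiples of the radix $r$; under that assumption the enumeration yields the 52 architectures reported. Function names are hypothetical.

```python
def divisors(t):
    """All depths d with t/d an integer."""
    return [d for d in range(1, t + 1) if t % d == 0]

def stages_per_dimension(n, r):
    """t = log_r(n), computed with integers; n must be a power of r."""
    t, size = 0, 1
    while size < n:
        size *= r
        t += 1
    if size != n:
        raise ValueError("n must be a power of the radix r")
    return t

def design_space(n=64, radices=(2, 4, 8), max_width=16):
    """Enumerate (r, w, d1, d2) for the 2D DFT on n x n points."""
    designs = []
    for r in radices:
        t = stages_per_dimension(n, r)
        w = r                                  # assumed: w = r, 2r, 4r, ...
        while w <= max_width:
            for d1 in (1, 2):                  # outer product: reused or unrolled
                for d2 in divisors(t):         # inner product depth divides t
                    designs.append({"r": r, "w": w, "d1": d1, "d2": d2})
            w *= 2
    return designs

designs = design_space()
count = len(designs)
```

Radix 2 contributes 2 x 4 x 4 = 32 designs, radix 4 contributes 2 x 2 x 3 = 12, and radix 8 contributes 2 x 2 x 2 = 8, which sums to 52.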
However, it is not obvious which parameter combinations will yield designs in the Pareto optimal set, and it is difficult to determine the crossover points where one design parameter becomes more important than another. This highlights the importance of an automatic generation system; it would be exceedingly difficult to complete such a design exploration by hand.

[Figure 7: Throughput versus area for 2D DFT $64 \times 64$ (16 bit fixed point) on Xilinx Virtex-5 FPGA. Axes: throughput [million samples per second] versus area [slices]. The streaming width (2, 4, 8, or 16) is indicated by the data marker.]

4.4 Generality

The datapath concepts considered in this paper (streaming reuse of $I_m \otimes A_n$ and iterative reuse of $\prod_k A_n$) apply to algorithms for transforms other than those already discussed. In this section, we present several problems that fit within these structures. For example, the Walsh-Hadamard transform (WHT) can be computed with an algorithm of the form

$$\mathrm{WHT}_{r^t} = \prod_{k=0}^{t-1} \left( \left( I_{r^{t-1}} \otimes \mathrm{WHT}_r \right) P_{r^t} \right),$$

and the real discrete Fourier transform (RDFT) can be computed using an algorithm of the form

$$\mathrm{RDFT}_{4m} = \left( \prod_{k=0}^{\log_2(m)} P_{4m} \left( I_m \otimes_l \mathrm{RDFT}_4(l, k) \right) P_{4m} \right) P_{4m}.$$

The WHT algorithm is completely expressible in the language we consider, and the RDFT requires only a small addition. Both algorithms contain iterative product $\prod$ and tensor product $I \otimes A$, which means that iterative and streaming reuse can be applied to each. We have implemented both of these algorithms in our framework and have generated and evaluated datapaths for both.

Other fast linear transform algorithms can be written using $\prod$ and $I \otimes A$, meaning that streaming and iterative reuse naturally apply. For example, [1] shows algorithms of this sort for discrete sine and cosine transforms (DST and DCT).

Lastly, we point out that streaming and iterative reuse can apply to other numerical problems outside of the domain of linear transforms. For example, Viterbi decoding is performed using a dataflow quite similar to the discrete Fourier transform, and many matrix-matrix multiplication algorithms exhibit parallelism which can be expressed by the tensor product. By extending our framework beyond linear transforms, we may be able to efficiently describe datapaths for these types of problems.

5. RELATED WORK

Although we do not know of any other instances of the tensor formula language being extended to support a general class of hardware implementations in this manner, it has been used in the process of designing special purpose hardware (e.g., an FFT processor in [7] and FFT cores in our previous work [8]). The important distinctions are that neither approach extends the formula language to describe datapath structure and that neither compiles from the formula to hardware; the formula is used to describe the algorithm only.
Many methods have been proposed to compile hardware from a software-like level of abstraction. This work differs from ours in the level of representation (typically C or Matlab code) and in scope.

Lastly, many special purpose FFT implementations have been proposed in the literature that have features that correspond to the datapath structures we are interested in. To name just a few, [6] is an example of a design with streaming reuse, and cores with streaming and iterative reuse are developed in [2, 4, 8].

6. CONCLUSIONS

Linear DSP transforms and their algorithms are well understood and can be formally described in a compact manner with the tensor product formulation. In this work, we extended this framework to allow the representation of the datapath concepts of streaming and iterative reuse. This enables a domain-specific, formula-level view of hardware design and allows datapath manipulation to take place automatically at the mathematical level. We have implemented these ideas in an automatic design flow that manipulates a formula based upon high-level directives and produces a design in RTL Verilog. Lastly, we have presented results that demonstrate the breadth of these techniques and have established the quality of the generated designs.

7. ACKNOWLEDGMENTS

This work was supported by NSF through awards [...] and [...] and by DARPA through Department of Interior grant NBCH159 and ARO grant W911NF.

REFERENCES

[1] J. Astola and D. Akopian. Architecture-oriented regular algorithms for discrete sine and cosine transforms. IEEE Transactions on Signal Processing, 47(4).
[2] D. Cohen. Simplified control of FFT hardware. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(6).
[3] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90).
[4] N. Dave, M. Pellauer, S. Gerding, and Arvind. 802.11a transmitter: a case study in microarchitectural exploration. In MEMOCODE, 2006.
[5] J. Granata, M. Conner, and R. Tolimieri. The tensor product: a mathematical programming language for FFTs and other fast DSP operations. Signal Processing Magazine, IEEE, 9(1):40-48.
[6] S. He and M. Torkelson. A new approach to pipeline FFT processor. In Proc. International Parallel Processing Symposium.
[7] P. Kumhom, J. Johnson, and P. Nagvajara. Design, optimization, and implementation of a universal FFT processor. In Proc. 13th IEEE ASIC/SOC Conference, 2000.
[8] G. Nordin, P. A. Milder, J. C. Hoe, and M. Püschel. Automatic generation of customized discrete Fourier transform IPs. In Design Automation Conference (DAC), 2005.
[9] M. C. Pease. An adaptation of the fast Fourier transform for parallel processing. Journal of the ACM, 15(2), April.
[10] M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gačić, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proc. of the IEEE, 93(2), 2005.
[11] C. Van Loan. Computational Frameworks for the Fast Fourier Transform. SIAM, 1992.
