Asymmetrical Load-Balancing for Incremental Fast Fourier Transform on Multi-Core Processors

By Todor Pandeliev

A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfilment of the degree requirements of Master of Science in Information and Systems Science

Computer Science
Ottawa-Carleton Institute of Computer Science
Department of Computer Science
Carleton University
Ottawa, Ontario, Canada

September, 2009

Copyright 2009, Todor Pandeliev

Abstract

The Fast Fourier Transform (FFT) is a powerful method in contemporary computing, with many practical applications from signal processing to cryptography. On multi-core platforms, symmetrical load distribution prevails but has efficiency issues with locality, creating a gap between theoretical arithmetic complexity and actual performance. Some proposed re-evaluations of Amdahl's law for parallel speed-up favour asymmetric multi-processing. The approach taken in this work is therefore to let the individual processors/cores specialise in parts of the FFT, such as the butterfly operation, permuting/transposing, and calculating complex roots of unity; thus computation is asymmetrical even on SMP/CMP. The inter-thread synchronisation employed is the spin-lock. This research contributes: the notion of incrementality inherent in the application, innovative usage of a shared heap to hold twiddle-factors, and the best known arithmetic complexity of computing these. The solution suits hard-to-predict problem sizes, on the up to 9 cores that the multi-core industry delivers nowadays (9 in IBM's CellBE).

TO THE EUROPEAN COUNTRY THAT RAISED ME ACADEMICALLY, BULGARIA, AND TO CANADA FOR ALLOWING ME TO EXCEL.

Acknowledgements

I would specifically like to express appreciation to my co-supervisors Dr. Michiel Smid and Dr. Richard Dansereau, for their participation in the initial discussion of the topic, as well as for logistical support. Dr. Smid contributed a lot to shaping this research through valuable, concise and to-the-point remarks throughout the work.

My family, comprising my wife Valeria Pandelieva, son Velia and daughter Antonia, deserve thanks for their patience, understanding and encouragement during my studies at Carleton University.

The overall academic atmosphere at the School of Computer Science, and the excellent level of the seminars in particular, set the context that made this possible. Many thanks go to the School's graduate director (until recently) Dr. J.-P. Corriveau for his support in my transition into the department, and to Claire Ryan for her assistance with admin matters. Dr. Kranakis contributed to improving the clarity of the narrative. Dr. Panario from the Department of Mathematics helped to fix and improve the mathematics in this work.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Figures
List of Acronyms

Chapter 1: Introduction
    Terminology Used
    State of the Art in FFT Efficiency on Multi-core
    Amdahl's law
    Gustafson's law
    Input/Output Complexity
    Cache Obliviousness
    Reconciling Space and Time Locality with FFT
    Motivation for This Research
    Incrementality of Important Applications
    Multi-core specifics
    Goals of This Research
    Approach and Methodology
    Specific Solution for Computation of Twiddle-factors
    Specific Solutions for Incrementality

Chapter 2: Introducing the Fast Fourier Transform
    Basic definitions
    Polynomials
    Alternative Representations of Polynomials
    Signals and Frequencies
    Discrete Fourier Transform (DFT)
    Properties of the Complex Roots of Unity
    Fast Fourier Transform (FFT) and the Cooley-Tukey Algorithm
    Recursive Algorithm for Radix-2 DIT FFT
    Towards Iterative Algorithm for Radix-2 DIT FFT
    Applications of the FFT
    FFT on Multi-core Platforms
    Theoretical Arithmetic Complexity of the FFT
    Decimation, Parallelism and Multi-core
    Pipelines and Single Instruction/Multiple Data (SIMD)
    Stockham Autosort and Pease's Algorithm

Chapter 3: Generic Design
    The Solutions
    Introducing the Middle-Heap
    Incremental Computation of Twiddle Factors: the Mean-middle Method
    Load-Balancing for Incrementality
    Core-Monotonic Strategy
    The Algorithms
    Middle-heap Algorithms
    Threading and Synchronisation
    Design of the FFT Plan
    Shuffle for Bit-reversed Input
    DIF Algorithm for Ordered Output
    Correctness of the DIF Algorithm for Ordered Output
    Efficiency of the DIF Algorithms
    Introducing Input Insertion
    Optimise DIF Algorithm by Unrolling Recursion Leaves

Chapter 4: Analysis of the Generic Design
    Properties of the Middle-heap
    Properties Derived from Min-Heap
    Properties Derived from Complex Roots of Unity
    Computational Complexity of Traversal
    Arithmetic Complexity of Computing Twiddle-factors
    Time Locality and Pipelining
    Twiddle-factor Computation
    FFT Plan
    Space Locality
    Cache Misses with Twiddle-factor Computation
    Cache Misses with the FFT Plan
    Avoiding false sharing
    Final Optimisation of Load-balancing

Chapter 5: Conclusions and Future Work

Bibliography

List of Figures

Figure 2.1. Amplitude Modulation
Figure 3.1. Max-Heap
Figure 3.2. Middle-Heap
Figure 3.3. Means of Adjacent Twiddle Factors
Figure 3.4. Sequence of Multi-core Processing
Figure 3.5. DIF Natural vs. Bit-Reversed Input
Figure 4.1. Error Propagation of Mean-middle Method

List of Acronyms

ADC     Analog-to-Digital Converter
AM      Amplitude Modulation
ALU     Arithmetical-Logical Unit
AMP     Asymmetrical Multi-Processing
API     Application Programming Interface
BCE     Base Core Entity
CMP     Chip Multi-Processing (SMP on a single chip)
CPU     Central Processing Unit
DFT     Discrete Fourier Transform
DIF     Decimation In Frequency
DIT     Decimation In Time
DMA     Direct Memory Access, copying without processor participation
DSP     Digital Signal Processing / Digital Signal Processor
FFT     Fast Fourier Transform
FFTW    Fastest Fourier Transform in the West (a planner/executor tool)
FLOP    FLoating-point OPeration
FM      Frequency Modulation
FPGA    Field-Programmable Gate Array
IDCT    Inverse Discrete Cosine Transform
IFFT    Inverse Fast Fourier Transform
MAC     Multiply-and-ACcumulate
MKL     Math Kernel Library by Intel
OFDM    Orthogonal Frequency Division Multiplexing
PPU     PowerPC core on the IBM CellBE processor
PVR     Point-Value Representation (of polynomials)
SIMD    Single Instruction Multiple Data
SMP     Symmetrical Multi-Processing
SPU     Synergistic Processing Unit core on the IBM CellBE processor
WHT     Walsh-Hadamard Transform
WLOG    Without Loss Of Generality

Chapter 1: Introduction

1.1 Terminology Used

Multi-core processor technology is a modern version of parallel multi-processing, in which several but not numerous cores, usually a single-digit number, are laid out by industry on the same processor. One of the aims is to distribute the generated heat more evenly across the integrated circuit, so that higher performance is achieved without increasing the processor's clock rate, which inevitably creates heat dissipation issues.

In the title of this work, asymmetrical load-balancing is used in a sense similar to the well-known meaning in AMP (Asymmetrical Multi-Processing). AMP refers to processor designs with cores consisting of different/specialised architectures, as well as to running different applications on each core, possibly optimised for different hardware features. The term is also used here because each core is assigned its own thread.

Symmetrical, on the other hand, when referring to hardware, means that the processor cores are identical. When used in conjunction with Operating Systems/Software, this usually means that the different cores do not run the same code (on different data) in parallel; that would be a multi-core variant of vectorization and SIMD, which some compiler code-generation optimisers and libraries like Intel's MKL are capable of achieving transparently for the programmer.

The title uses the word incremental to indicate that not all input data may be available at once at the start of the computational process. Indeed most signal-processing applications, FFT being part of these, rely on sampling voltage at regular intervals.

Except for a couple of newly introduced terms, e.g. middle-heap, the rest of the wording in this work is intended to conform completely to the widely-accepted terms and their meanings in contemporary literature on technology. The only peculiarity worth mentioning is that sometimes terms from the area of algorithms and computer science are less known to engineering and electronics people, e.g. heap, and vice-versa.

Among the possible stumbling blocks for people outside electrical and electronics engineering could be DMA (Direct Memory Access), the copying of values in memory without the participation of the processor, and DSP (Digital Signal Processing/Digital Signal Processor), denoting a serially manufactured but specialised processor for computationally-intensive, usually embedded, applications. Along the same lines are FPGA (Field-Programmable Gate Array), an integrated circuit whose logic is programmable rather than hard-wired, and MAC (Multiply-and-Accumulate), a computational pattern in DSP technology.

There are also terms that are known to any professional in the software development industry, but may not be very commonly used by academic researchers. Among these are API (Application Programming Interface), meaning a function or functions to be called, and spin-lock, a computationally efficient but crude inter-thread synchronisation where one core is in an unproductive loop while waiting to be released by another.

Key to this work are also the notion of space locality, meaning the aim to achieve few cache misses (see section 1.2.4), and time locality, explained in detail in section 2.3.1.

1.2 State of the Art in FFT Efficiency on Multi-core

1.2.1 Amdahl's law

Almost forty years ago in [3] Gene Amdahl argued in favour of a single-processor approach for achieving large-scale computing capabilities, as opposed to multiprocessing. In the process, he defined his law for the case of using $n$ processors (cores) in parallel. Assuming that fraction $p$ of a program's execution time is parallelisable (ignoring scheduling overhead), while $1 - p$ is strictly sequential, the speedup on $n$ processors is:

$$S_{parallel} = \frac{1}{(1 - p) + \frac{p}{n}}$$

Amdahl's law has a few corollaries, one of which, namely that when $p$ is small optimisations have little effect, was in support of his argument that high performance computing should rely on single processors. The most important corollary is this: as $n$ approaches infinity, speedup is bound by $1/(1 - p)$.

1.2.2 Gustafson's law

In 1988 [11] was published, and its result later became known as the Gustafson(-Barsis) Law. Here is how Gustafson himself summarises it:

"The model is not a contradiction of Amdahl's law as some have stated, but an observation that Amdahl's assumptions don't match the way people use parallel processors. People scale their problems to match the power available, in contrast to Amdahl's assumption that the problem is always the same no matter how capable the computer."

For $N$ processors, the parallel part $p$ will scale to $pN$:

$$S_{scaled} = \frac{s + pN}{s + p} = s + pN = N + (1 - N)s = N - (N - 1)s$$

where $s = 1 - p$.
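To make the two laws concrete, the following minimal sketch evaluates both formulas for a few core counts. The function names `amdahl_speedup` and `gustafson_speedup` and the sample value p = 0.9 are illustrative assumptions, not taken from the cited papers.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl: speedup on n processors when fraction p of the runtime is parallelisable."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(s: float, n: int) -> float:
    """Gustafson: scaled speedup on n processors with serial fraction s = 1 - p."""
    return n - (n - 1) * s


if __name__ == "__main__":
    p = 0.9  # assumed parallelisable fraction, for illustration only
    for n in (2, 4, 8, 9):
        print(n, round(amdahl_speedup(p, n), 3), round(gustafson_speedup(1.0 - p, n), 3))
    # Amdahl's bound 1/(1 - p) is approached only as n grows without limit.
    print("Amdahl bound:", 1.0 / (1.0 - p))
```

With a fixed problem (Amdahl) the speedup saturates below 1/(1 - p), whereas with a problem scaled to the machine (Gustafson) it keeps growing linearly in the number of processors.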

As a result, few today claim that parallel processing is not viable. In [12] Hill and Marty argue two important results (among others), quoted exactly:

"Result 1. Amdahl's law (still) applies to multicore chips because achieving the best speedup S requires p to be close to 1. Thus, finding parallelism is still critical."

"Result 2. Asymmetric multicore chips can offer potential speedups that are much greater than symmetric multicore chips (and never worse)."

Since by asymmetric they mean fitting a varying number of Base Core Elements (BCEs) of the same, not different, architecture in each core, in this work we adopt the idea of asymmetric loads for CMP. A spin-lock is an efficient idle loop in one core until another core is ready and releases it. While the ad-hoc load-balancing will not be perfect, the simplicity of spin-locks will minimise the inherently serial synchronisation overhead.

This section further focuses on literature covering the particular task of optimising the FFT and/or similar computational problems, with respect to locality and parallelism.

1.2.3 Input/Output Complexity

Efficient algorithms have to consider the use of slower memory and external storage, along with the counts of operations. While today memory has even more layers, taking into account cache at various levels, some earlier results on the efficiency of algorithms still apply: whether cache versus main memory is considered, or main memory versus disk, can be immaterial. In [1] Aggarwal and Vitter consider a model with these parameters:

N = # records to handle;
M = # records that can fit into internal memory;
B = # records that can be transferred in a single block;
P = # blocks that can be transferred concurrently;

where $1 \le B \le M < N$ and $1 \le P \le \lfloor M/B \rfloor$. The parameters N, M, and B are the file size, memory size, and block size, respectively. For FFT, the asymptotically-optimal algorithm, based on Radix-2 DIF recursion, is shown to require on the order of

$$\frac{N}{PB} \cdot \frac{\log(1 + N/B)}{\log(1 + M/B)}$$

I/O operations (no tight lower bound is known). This is achieved through so-called pebbling, and bringing the records into memory in transposition permutations, M at a time in $\log N / \log M$ stages (assume $\log M$ divides $\log N$).

1.2.4 Cache Obliviousness

The ideal cache model is defined as follows: the CPU only uses words that are in cache; if the referenced word is already in cache, a cache hit occurs, and the word is used; else a cache miss causes a fetch from memory, possibly optimally evicting from the cache. Cache normally consists of cache lines, each containing $L$ consecutive words copied together to and from main memory, with $L > 1$ counting on data space locality for efficiency.

Algorithms are cache-oblivious when no parameters dependent on the hardware platform, such as cache size or cache-line length, need tuning to achieve asymptotical optimality. In their landmark paper [9] Frigo et al. prove the following for FFT: with cache of size $Z$ and cache-line length $L$, when $Z = \Omega(L^2)$ (defined as the tall cache assumption), their 6-step version (see section 2.3) of the $n$-point FFT with factorisation $n = n_1 n_2$ is cache-oblivious (if the transpose is cache-oblivious), and incurs

$$O\!\left(1 + \frac{n}{L}\left(1 + \log_Z n\right)\right)$$

cache misses. They also prove that cache-obliviousness is preserved with multiple cache levels.
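The effect of cache lines on space locality can be illustrated with a small simulation. This is only a sketch under stated assumptions: it models a fully associative cache with LRU replacement (a practical stand-in for the optimal eviction of the ideal cache model), and the names `count_misses` and `strided_order` as well as the sizes chosen are hypothetical.

```python
from collections import OrderedDict


def count_misses(addresses, cache_words, line_words):
    """Count misses of a fully associative LRU cache holding cache_words words,
    organised in lines of line_words consecutive words."""
    capacity = cache_words // line_words   # number of lines that fit in cache
    lines = OrderedDict()                  # resident lines, least recently used first
    misses = 0
    for addr in addresses:
        line = addr // line_words
        if line in lines:
            lines.move_to_end(line)        # hit: mark line as most recently used
        else:
            misses += 1                    # miss: the whole line is fetched
            lines[line] = None
            if len(lines) > capacity:
                lines.popitem(last=False)  # evict the least recently used line
    return misses


def strided_order(n, stride):
    """Visit every address 0..n-1 exactly once, jumping by stride (stride divides n)."""
    return [(i * stride) % n + (i * stride) // n for i in range(n)]


n, Z, L = 1024, 64, 8                      # assumed problem, cache and line sizes
sequential = count_misses(range(n), Z, L)
strided = count_misses(strided_order(n, 64), Z, L)
print(sequential, strided)
```

Both traversals touch the same 1024 words, but the sequential one misses once per line while the large-stride one misses on every access, which is the kind of access pattern that plain Cooley-Tukey strides produce.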

1.2.5 Reconciling Space and Time Locality with FFT

Among the planner-oriented solutions, the most unusual one is in [16], which uses a learning strategy and heuristics. The authors Singer and Veloso describe a space of different decimation trees for a given FFT, using Kronecker products with permutation matrices and twiddle-factor/Vandermonde matrices. They maintain that the complexity of modern processors makes it difficult to predict analytically, or to model by hand, the performance of a formula on a particular architecture. They also maintain that the differences between current processors lead to very different optimal formulae from machine to machine. They employ a black-box approach by running the planner software tool described in [7] on different platforms, gathering performance statistics. Their research reveals clear clusters in the histograms of cache miss counts vs. runtimes for each platform, as well as interesting patterns common across the board. Also, they apply clever heuristics to limit the combinatorial explosion of the solution space, and use particular decomposition trees. Their approach seems to focus on, and work slightly better for, the Walsh-Hadamard Transform (WHT), which is similar to the DFT but real-valued and is outside the scope of this work.

Frigo and Johnson have addressed similar goals in the FFTW software (1998): it uses binary dynamic programming to search for the optimal FFT implementation (see [8]).

In [2] Ali et al. address scheduling on $p$ cores by factoring an $n$-sized FFT. Earlier, [2] co-author Johnsson contributed to the development of the popular planner tool UHFFT. Using it, the parallel FFT schedules in OpenMP and PThreads are compared to that of the best sequential FFT plan, and the speedup for various number of

processors is reported. Reasonable speedup is achieved for sizes between 1 and 14, on 2 to 8 cores. However outside those sizes the speedup seems to follow Amdahl's law.

In [7] Franchetti, Voronenko and Püschel discuss parallelising the FFT under the assumption of shared memory among the cores. They maintain: "The major problem with using the standard Cooley-Tukey FFT algorithm on shared memory machines is its memory access pattern: large strides, and consecutive loop iterations touch the same cache lines, which leads to false sharing." Their effort is thus aimed at fighting false sharing. They describe their existing Spiral planner tool, then propose extensions to it that allow for embarrassingly parallel computations of the FFT (i.e. no mutual data dependencies exist between the threads) that also avoid false sharing. Their implementation appears to be the best fit for CMP: while on SMP platforms the performances are comparable, the break-even point of parallelised FFT for Spiral on CMP is at size 7, while the competition (FFTW and Intel MKL) achieves it earliest at 14.

An exotic algorithmic path is presented in [13]: van der Hoeven argues that the stride from $2^n$ to $2^{n+1}$ is too large, and so truncates the FFT to obtain fewer than $2^{n+1}$ entries in the result vector.

In [4] there is a good summary and bibliography of techniques for efficient twiddle-factor computation. The main approaches are CORDIC algorithms, polynomial approximation of trigonometry, and the recursive sine-function generator technique. CORDIC implements fixed-point arithmetic for butterfly rotation (which is what a multiplication by a twiddle-factor is) in fast embedded/FPGA systems, virtually eliminating the need to compute and store twiddle-factors separately; polynomial approximation is common in direct digital frequency synthesis (DDFS); recursive sine-function generation has

accuracy issues, which the authors of [4] attempt to counter. The best result quoted is with the recursive sine-function generator: 2 adds and 2 multiplies, 4 FLOPs per entry.

Finally, attempts are made to design new computing platforms so that they are also optimised for applications similar to the FFT. In [10] Guo et al. elaborate on desirable features of a universal multi-core processor with respect to its memory interface. Although their requirements are not explicitly stated to favour the FFT, 2 of the 7 tests carried out on their simulation are FFT and the essentially computationally equivalent Inverse Discrete Cosine Transform (IDCT). Another two are a Finite Impulse Response filter and an Adaptive Differential Pulse-Code Modulation coder, both signal processing applications too, making these more than half of the tested ones. The most revolutionary idea proposed is the usage of cache for instructions only, while data goes into local memory for each core, similar to the CellBE processor (except the latter uses local memory for instructions too).

1.3 Motivation for This Research

1.3.1 Incrementality of Important Applications

Most signal processing applications involve sampling at regular intervals. Prevailing research so far has concentrated on quickly and otherwise efficiently computing the FFT once all input data is in main memory (all samples have been taken).

Interestingly, the twiddle factors, constant for any given problem size, are almost never kept in storage or a database (except in some embedded/FPGA applications), even when the algorithm only deals with power-of-two sizes. Instead, an API is provided to

calculate them, the idea being that the programmer calls this once, then uses the values in repeated Fourier transforms of the same size.

Herein we assume that there exist signal processing applications where the problem size is not known before the arrival of some samples. While this may not always be the case, it seems a terrible waste anyway to stay computationally idle while sampling, until the last sample arrives. If the latter were to contain a sentinel tag identifying it as the last (a natural assumption), even the twiddle factors would not yet have been computed, which could have happened if the problem size had been known in advance. We present a parallelisable method to work incrementally, as the samples arrive.

1.3.2 Multi-core specifics

What remains to be addressed is parallelisation. Earlier research has exclusively addressed it via decimation of the FFT into smaller sizes, each assigned to a separate core. That may be the only reasonable approach when many parallel processor units are available. However, recently multi-cores of 2 to a maximum of 8 units have become popular. On them, the prevailing approach is to search experimentally a space of (usually six-step) solutions, looking for the best execution time of a specific size FFT on a particular platform.

It would be interesting to explore other load-balancing strategies (without excluding decimation, if enough cores are available). Pease's algorithm ([15]) is a candidate, with the perfect shuffles running in parallel on a separate core, as well as parts of the butterfly operation. Cores could progress in parallel performing asymmetrical computation. This allows for the simplest and most efficient synchronisation mechanism, i.e. the spin-lock.

1.4 Goals of This Research

1.4.1 Approach and Methodology

It is only natural to explore computing the FFT efficiently, in the following sense:

- the overall work complexity is asymptotically optimal, i.e. $O(N \log N)$ (research exists that argues in favour of Horner's rule, $O(N^2)$, for practical low N);
- the arithmetic complexity is at least as good as that of the original Cooley-Tukey algorithm, preferably that of the more efficient Split-radix (no lower bound is known);
- the I/O complexity in cache misses is close to optimal (no lower bound is known), asymptotically; assume the size fits in memory but not cache (not hard up to 3);
- the algorithm is cache-oblivious;
- SIMD, MAC and other pipelining features of modern CPUs are used effectively;
- the work is parallelisable and reasonably load-balanced on available CMP multi-core, for the customary numbers of cores up to the 9 present in IBM's CellBE.

At the current stage of the research area it is unclear whether the shopping list above is achievable in full, and under what assumptions, even if a dedicated chip is designed. Actually Pease ([15]) originally suggested his algorithm for the design of dedicated hardware.

The top two bullets are easiest to achieve: a variety of decimation strategies are known, for highly composite sizes or powers of two, as well as for prime sizes and co-prime factors of the size. The next three have been researched mostly separately. The last bullet looks deceptively easy, but is not trivial if the rest are kept in mind.

A sizable proportion of the recent papers is dedicated to planners that find an optimal solution, for a given problem size, on a particular platform, from a space of possible ones. For this to be practical, the application is assumed to be of known size, and once

optimised will be run multiple times on the same platform. Although these assumptions are fair, other applications are also conceivable in signal processing and measurement.

We take an integrated approach: consider all adjacent areas as well as the expected timing of events around the FFT computation. We also suggest asymmetrical execution of the parts, e.g. shuffles, butterflies and twiddle-factor multiplication, on separate cores. Thus each core will run a simpler algorithm, which is a prerequisite for better pipelining and compiler optimisations.

1.4.2 Specific Solution for Computation of Twiddle-factors

We address the problem of finding even more efficient ways to compute the twiddle factors than the ones already known, specifically for power-of-two sizes. We show a high-accuracy method (data structure and algorithm) to improve on the adjacent area of computing the twiddle factors; our new method also agrees perfectly with incrementality.

1.4.3 Specific Solutions for Incrementality

In the FFT every input value of the algorithm affects every output value. Can we still save time under the assumption that the application is incremental?

Our answer is yes: if we choose to re-order data to improve space locality (and to simplify the algorithm), that re-ordering work is already half done at every power-of-two boundary.

We then come up with the notion of input insertion, whereby an input value arriving after a truncated FFT has already been computed is propagated into the solution, instead of doing the same power-of-two size FFT all over again.

Chapter 2: Introducing the Fast Fourier Transform

2.1 Basic definitions

2.1.1 Polynomials

A polynomial in the variable $x$ over an algebraic field $F$ is

$$A(x) = \sum_{j=0}^{n-1} a_j x^j, \quad a_j \in F.$$

The values $a_0, a_1, \ldots, a_{n-1}$ are called the coefficients of the polynomial, typically drawn from the field $\mathbb{C}$ of the complex numbers.

Any integer that is strictly greater than the degree of a polynomial is a degree-bound of that polynomial. The degree of a polynomial of degree-bound $n$ may be any integer between 0 and $n - 1$, inclusive.

If $A(x)$ and $B(x)$ are polynomials of degree-bound $n$, their sum is defined as a polynomial $C(x)$ of the same degree-bound, such that $C(x) = A(x) + B(x)$ for any $x$. The coefficients of matching degrees are added together; the computational complexity is $O(n)$ for degree-bound $n$.

Similarly, if $A(x)$ and $B(x)$ are polynomials of degree-bound $n$, their product $C(x)$ is a polynomial of degree-bound $2n - 1$ such that $C(x) = A(x) \cdot B(x)$ for any $x$. This means multiplying each term in $A(x)$ by each term in $B(x)$ and adding those with equal powers; this is called the convolution of the input vectors $a$ and $b$, denoted $c = a \otimes b$. The above process is $O(n^2)$, counting all arithmetic operations on the terms' coefficients.
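The two operations on the coefficient representation can be sketched as follows. The function names `poly_add` and `poly_mul` and the sample polynomials are illustrative assumptions; coefficient lists are ordered from the constant term upwards.

```python
def poly_add(a, b):
    """Sum of two polynomials of the same degree-bound n: one addition per coefficient, O(n)."""
    return [x + y for x, y in zip(a, b)]


def poly_mul(a, b):
    """Product by convolution of the coefficient vectors: O(n^2) operations,
    the result has degree-bound len(a) + len(b) - 1 (2n - 1 for equal bounds)."""
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] += x * y          # terms with equal power i + j are accumulated
    return c


a = [1, 2, 3]    # 1 + 2x + 3x^2
b = [4, 0, 5]    # 4 + 5x^2
print(poly_add(a, b))
print(poly_mul(a, b))
```

The nested loop in `poly_mul` is exactly the quadratic cost that the FFT-based approach discussed later in this chapter avoids.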

2.1.2 Alternative Representations of Polynomials

The representation from the definition is called the coefficient representation. It is convenient for some operations on polynomials, e.g. addition as above. Also, the operation of evaluating the polynomial $A(x)$ at a given point $x_0$ consists of computing the value of $A(x_0)$. Evaluation takes time $\Theta(n)$ using Horner's rule:

$$A(x_0) = a_0 + x_0(a_1 + x_0(a_2 + \cdots + x_0(a_{n-2} + x_0(a_{n-1})) \cdots )).$$

A point-value representation of a polynomial $A(x)$ of degree-bound $n$ is a set of point-value pairs $\{(x_0, y_0), (x_1, y_1), \ldots, (x_{n-1}, y_{n-1})\}$ such that all of the $x_k$ are distinct and $y_k = A(x_k)$, for $k = 0, 1, \ldots, n - 1$.

A polynomial has many different point-value representations; evaluation means finding one of these, of size at least the degree-bound.

The point-value representation is as convenient for multiplying polynomials as for adding them. If $C(x) = A(x) B(x)$, then $C(x_k) = A(x_k) B(x_k)$ for any point $x_k$, and we can point-wise multiply a point-value representation of $A$ by a point-value representation of $B$ to obtain a point-value representation of $C$.

The inverse of evaluation, i.e. determining the coefficient form of a polynomial from a point-value representation, is called interpolation.

Theorem (Uniqueness of interpolating polynomial): For any set $\{(x_0, y_0), (x_1, y_1), \ldots, (x_{n-1}, y_{n-1})\}$ of $n$ point-value pairs such that all the $x_k$ values are distinct, there is a unique polynomial $A(x)$ of degree-bound $n$ such that $y_k = A(x_k)$ for $k = 0, 1, \ldots, n - 1$.
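Horner's rule and the point-wise product of point-value representations can be sketched as below. The names `horner` and `point_values`, the sample polynomials and the choice of evaluation points 0..4 are assumptions made for illustration only.

```python
def horner(a, x0):
    """Evaluate A(x0) for coefficients a[0..n-1] (constant term first) in Theta(n)."""
    result = 0
    for coeff in reversed(a):          # start from a[n-1], fold inwards
        result = result * x0 + coeff
    return result


def point_values(a, xs):
    """Point-value representation of A at the distinct points xs."""
    return [(x, horner(a, x)) for x in xs]


a = [1, 2, 3]              # A(x) = 1 + 2x + 3x^2
b = [4, 0, 5]              # B(x) = 4 + 5x^2
c = [4, 8, 17, 10, 15]     # C(x) = A(x) * B(x), degree-bound 5

xs = range(5)              # 5 distinct points suffice for degree-bound 5
pointwise = [ya * yb for (_, ya), (_, yb) in zip(point_values(a, xs), point_values(b, xs))]
print(pointwise == [y for _, y in point_values(c, xs)])
```

Note that the product needs as many points as its own degree-bound, so both factors are evaluated at 5 points here even though each has degree-bound 3.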

The proof is based on the existence of the inverse of the Vandermonde matrix

$$V(x_0, x_1, \ldots, x_{n-1}) = \begin{pmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^{n-1} \\ 1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n-1} & x_{n-1}^2 & \cdots & x_{n-1}^{n-1} \end{pmatrix},$$

which is non-singular as the $x_k$ are distinct by definition.

A fast algorithm for $n$-point interpolation is based on Lagrange's formula:

$$A(x) = \sum_{k=0}^{n-1} y_k \frac{\prod_{j \ne k} (x - x_j)}{\prod_{j \ne k} (x_k - x_j)}.$$

It is possible to compute the coefficients of $A$ using Lagrange's formula in time $\Theta(n^2)$: compute the denominators $\prod_{j \ne k} (x_k - x_j)$, then the coefficient representation of $P(x) = \prod_{j} (x - x_j)$, then divide it by $(x - x_l)$ for each $l = 0, 1, \ldots, n - 1$.

Thus, $n$-point evaluation and interpolation are well-defined inverse operations that transform between the coefficient representation of a polynomial and a point-value representation. The methods described for these take $\Theta(n^2)$ classical arithmetic operations, where $n$ is the degree-bound of the polynomial.

2.1.3 Signals and Frequencies

Electronics mixes signals by adding them or multiplying/dividing them. Common practical problems in signal processing involve analysis of the spectrum: if frequency $f$ is present, how strong are $2f$, $3f$, etc. (called harmonics)?

Note: a perfect square wave of a digital signal, alternating between some voltage to represent a binary 1 and ~0V to represent a binary 0, contains an infinite series of harmonics (only the odd ones, $3f$, $5f$, $7f$, ..., when the wave is symmetric)!
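The harmonic content of a square wave can be checked numerically with a direct summation against complex exponentials. This is a sketch under assumptions: one period of a symmetric wave (half high, half low) sampled at 16 points, a hypothetical helper named `dft`, and the common engineering sign convention in the exponent, which does not affect the magnitudes.

```python
import cmath


def dft(samples):
    """Direct O(n^2) discrete Fourier sum; entry k measures the strength of harmonic k."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * j / n) for j, x in enumerate(samples))
        for k in range(n)
    ]


n = 16
square = [1.0] * (n // 2) + [0.0] * (n // 2)   # one period: binary 1, then binary 0
magnitudes = [abs(y) for y in dft(square)]

for k in range(n // 2):
    print(k, round(magnitudes[k], 3))
```

Entry 0 is the constant (DC) offset; for this symmetric wave the even harmonics vanish and only the odd ones remain, with slowly decaying strength.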

Analysing the spectrum means finding the coefficients of terms of a polynomial (interpolation), since $2f$ is represented by the square of a complex number, $3f$ by the cube, etc. This is based on Euler's identity $\cos x + i \sin x = e^{ix}$, as the equality $e^{ky} = (e^{y})^k$ holds for any complex $y$.

2.1.4 Discrete Fourier Transform (DFT)

The inverse of the particular interpolation when $f = 1/n$ is to evaluate the polynomial

$$A(x) = \sum_{j=0}^{n-1} a_j x^j$$

of degree-bound $n$ at the complex $n$th roots of unity $\omega_n^0, \omega_n^1, \ldots, \omega_n^{n-1}$, where $\omega_n = e^{2\pi i/n}$. Without loss of generality (WLOG), assume that $n$ is a power of 2, since a given degree-bound can always be raised: high-order zero coefficients can always be added as necessary.

Let $A$ be given in coefficient form $a = (a_0, a_1, \ldots, a_{n-1})$. Define the results $y_k$, for $k = 0, 1, \ldots, n - 1$, by

$$y_k = A(\omega_n^k) = \sum_{j=0}^{n-1} a_j \omega_n^{kj}.$$

Definition 1: The vector $y = (y_0, y_1, \ldots, y_{n-1})$ is the Discrete Fourier Transform (DFT) of the coefficient vector $a = (a_0, a_1, \ldots, a_{n-1})$. We also write $y = \mathrm{DFT}_n(a)$.

The DFT is equivalent to the continuous Fourier transform

$$F(s) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i s t}\, dt$$

for periodic $f$, in a band-limited setting made discrete via sampling. $F(s)$ would yield the strength of frequency $s$ in the mix, i.e. its amplitude.

A different way to express the DFT is $y = V_n a$, where $V_n$ is the Vandermonde matrix with the powers of the $n$th complex root of unity. To carry out the inverse, e.g. to interpolate for frequency analysis, $V_n^{-1}$ is needed. In [6] there is the following theorem:

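Theorem: $(V_n^{-1})_{j,k} = \omega_n^{-kj}/n$.

The proof idea is to show, by using the properties of the complex roots of unity, that $V_n^{-1} V_n$ is the identity matrix $I_n$.

The above shows the inverse to be similar to the DFT. The DFT has been proven to be its own inverse, reordered and scaled with a factor of $1/n$.

2.1.5 Properties of the Complex Roots of Unity

The description of these properties follows the presentation in [6]. The following lemma follows directly from the definition of complex roots of unity:

Lemma 1 (Cancellation lemma): For any integers $n \ge 0$, $k \ge 0$, and $d > 0$, $\omega_{dn}^{dk} = \omega_n^k$.

Proof: $\omega_{dn}^{dk} = e^{2\pi i \cdot dk/(dn)} = e^{2\pi i \cdot k/n} = \omega_n^k$.

Corollary: For any even integer $n > 0$, $\omega_n^{n/2} = \omega_2 = -1$.

Lemma 2 (Halving lemma): If $n > 0$ is even, then the squares of the $n$ complex $n$th roots of unity are the $n/2$ complex $(n/2)$th roots of unity (each occurring twice).

Proof: $\omega_n^{n/2} = -1$ implies $\omega_n^{k+n/2} = -\omega_n^k$, hence $(\omega_n^{k+n/2})^2 = (\omega_n^k)^2$. However, per the Cancellation lemma, $(\omega_n^k)^2 = \omega_{n/2}^k$.

Lemma 3 (Summation lemma): For any integer $n \ge 1$ and nonnegative integer $k$ not divisible by $n$,

$$\sum_{j=0}^{n-1} (\omega_n^k)^j = 0.$$

Proof: Summing the geometric series (the denominator is non-zero because $k$ is not divisible by $n$),

$$\sum_{j=0}^{n-1} (\omega_n^k)^j = \frac{(\omega_n^k)^n - 1}{\omega_n^k - 1} = \frac{(\omega_n^n)^k - 1}{\omega_n^k - 1} = \frac{(1)^k - 1}{\omega_n^k - 1} = 0.$$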
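The three lemmas are easy to check numerically. The sketch below does so for n = 8 and d = 3 (arbitrary sample values); the helper names `omega` and `close` and the tolerance are assumptions, and a floating-point check illustrates the lemmas rather than proving them.

```python
import cmath


def omega(n, k=1):
    """k-th power of the principal n-th root of unity, e^(2*pi*i*k/n)."""
    return cmath.exp(2j * cmath.pi * k / n)


def close(a, b, tol=1e-9):
    """Floating-point comparison of two complex numbers."""
    return abs(a - b) < tol


n, d = 8, 3

# Lemma 1 (cancellation): omega_{dn}^{dk} == omega_n^k
cancellation = all(close(omega(d * n, d * k), omega(n, k)) for k in range(n))

# Lemma 2 (halving): squares of n-th roots are (n/2)-th roots, each hit twice
halving = all(
    close(omega(n, k) ** 2, omega(n // 2, k))
    and close(omega(n, k + n // 2) ** 2, omega(n, k) ** 2)
    for k in range(n)
)

# Lemma 3 (summation): powers of omega_n^k sum to zero when n does not divide k
summation = all(close(sum(omega(n, k) ** j for j in range(n)), 0) for k in range(1, n))

print(cancellation, halving, summation)
```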

2.1.6 Fast Fourier Transform (FFT) and the Cooley-Tukey Algorithm

By taking advantage of the properties of the complex roots of unity, $\mathrm{DFT}_n(a)$ can be computed in time $\Theta(n \log n)$, as opposed to $\Theta(n^2)$ for the definition formula. This method is attributed to Gauss, but was rediscovered as the Cooley-Tukey Algorithm in [5]. In the two cases below we use $(\omega_n^k)^2 = \omega_{n/2}^k$ (remember that WLOG $n$ is a power of 2).

Idea 1: Split the even-index from the odd-index coefficients of polynomial $A(x)$. Assuming $n$ even:

$$A^{[0]}(x) = a_0 + a_2 x + a_4 x^2 + \cdots + a_{n-2} x^{n/2-1},$$
$$A^{[1]}(x) = a_1 + a_3 x + a_5 x^2 + \cdots + a_{n-1} x^{n/2-1},$$

and $A(x) = A^{[0]}(x^2) + x A^{[1]}(x^2)$. Thus we need to evaluate two degree-bound $n/2$ polynomials at $(\omega_n^0)^2, (\omega_n^1)^2, \ldots, (\omega_n^{n-1})^2$, i.e. the complex $(n/2)$th roots of unity each occurring twice, then combine the results. The problem decomposes into two of half its size. This method is referred to in the literature as Radix-2 decimation in time (DIT). The inverse of this odd/even split is an operation known as perfect shuffle (like with two half-decks of cards).

Idea 2: Split the low-index half of the coefficients of $A(x)$ from the high-index half:

$$A(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n/2-1} x^{n/2-1} + x^{n/2}\left(a_{n/2} + a_{n/2+1} x + a_{n/2+2} x^2 + \cdots + a_{n-1} x^{n/2-1}\right),$$

i.e.

$$A(x) = \sum_{j=0}^{n/2-1} \left(a_j + x^{n/2} a_{j+n/2}\right) x^j;$$

thus for $r = 0, 1, \ldots, n - 1$,

$$y_r = \sum_{j=0}^{n/2-1} \left(a_j + \omega_n^{rn/2} a_{j+n/2}\right) \omega_n^{jr}.$$

Consider the even $r = 2k$ separately from the odd $r = 2l + 1$; taking into account $\omega_n^{n/2} = -1$:

$$z_k = y_{2k} = \sum_{j=0}^{n/2-1} (a_j + a_{j+n/2})\, \omega_n^{2kj} = \sum_{j=0}^{n/2-1} (a_j + a_{j+n/2})\, \omega_{n/2}^{kj}, \quad k = 0, 1, \ldots, n/2 - 1,$$

and

$$t_l = y_{2l+1} = \sum_{j=0}^{n/2-1} (a_j - a_{j+n/2})\, \omega_n^{(2l+1)j} = \sum_{j=0}^{n/2-1} (a_j - a_{j+n/2})\, \omega_n^{j}\, \omega_{n/2}^{lj}, \quad l = 0, 1, \ldots, n/2 - 1.$$
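One step of the decimation-in-frequency split (Idea 2) can be verified against the definition of the DFT. In this sketch the helper names `dft` and `dif_step` are assumptions, the half-size transforms are computed with the direct definition rather than recursively, and the positive exponent convention of this chapter is used.

```python
import cmath


def dft(a):
    """DFT by definition: y_k = sum_j a_j * omega_n^(k*j), with omega_n = e^(2*pi*i/n)."""
    n = len(a)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(a[j] * w ** (k * j) for j in range(n)) for k in range(n)]


def dif_step(a):
    """One radix-2 DIF step: returns (z, t) with z_k = y_{2k} and t_l = y_{2l+1}."""
    n = len(a)
    h = n // 2
    w = cmath.exp(2j * cmath.pi / n)
    sums = [a[j] + a[j + h] for j in range(h)]               # feeds the even outputs
    diffs = [(a[j] - a[j + h]) * w ** j for j in range(h)]   # twiddled, feeds the odd outputs
    return dft(sums), dft(diffs)                             # two half-size DFTs


a = [1, 2, 3, 4, 5, 6, 7, 8]
y = dft(a)
z, t = dif_step(a)
print(all(abs(z[k] - y[2 * k]) < 1e-9 for k in range(4)))
print(all(abs(t[l] - y[2 * l + 1]) < 1e-9 for l in range(4)))
```

In DIF the twiddle-factor multiplication happens before the half-size transforms, whereas in DIT (Idea 1) it happens when the two half-size results are combined.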

Again, the DFT problem decomposes into two of half its original size. This method is referred to as Radix-2 decimation in frequency (DIF). Radix-2 DIT and DIF are particular cases of the general Cooley-Tukey algorithm, which allows any radix that divides $n$.

2.1.7 Recursive Algorithm for Radix-2 DIT FFT

The pseudo-code below follows DIT literally, hence its correctness is inherent: it begins with a check for the end of the recursion; then an inverse perfect shuffle on the input is performed and the result is assigned to new 0-based vectors a0[] and a1[]. Recursive_FFT() then calls itself for these. Finally, a for-loop iterates, incrementally calculating the powers of the root of unity while at the same time combining y0[] and y1[].

    Recursive_FFT(a[0:n-1], n)      /* a[] is a vector, n is a power of 2 */
      if n = 1 then return a[]; endif;
      a0[0:n/2-1] = {a[0],a[2],...,a[n-2]};    /* vector assignment: an */
      a1[0:n/2-1] = {a[1],a[3],...,a[n-1]};    /* inverse perfect shuffle */
      y0[0:n/2-1] = Recursive_FFT(a0, n/2);
      y1[0:n/2-1] = Recursive_FFT(a1, n/2);
      ω_n = e^(2πi/n);                /* primitive complex root of unity: twiddle-factor */
      ω = 1;
      for k = 0 to n/2-1 do
        y[k]     = y0[k] + ω*y1[k];
        y[k+n/2] = y0[k] - ω*y1[k];
        ω = ω * ω_n;                  /* computation of twiddle-factors */
      endfor;
      return y;
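For readers who want to run the pseudo-code, a direct Python transcription might look as follows. It is a sketch that mirrors the pseudo-code line by line (including computing the product of the twiddle-factor and y1[k] twice); the names `recursive_fft` and `naive_dft` are assumptions, and the input length is assumed to be a power of 2.

```python
import cmath


def recursive_fft(a):
    """Radix-2 DIT FFT, following the pseudo-code literally; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    a0 = a[0::2]                      # even-index coefficients
    a1 = a[1::2]                      # odd-index coefficients (inverse perfect shuffle)
    y0 = recursive_fft(a0)
    y1 = recursive_fft(a1)
    w_n = cmath.exp(2j * cmath.pi / n)    # principal n-th root of unity
    w = 1
    y = [0] * n
    for k in range(n // 2):
        y[k] = y0[k] + w * y1[k]
        y[k + n // 2] = y0[k] - w * y1[k]
        w = w * w_n                   # next twiddle-factor
    return y


def naive_dft(a):
    """Reference O(n^2) evaluation of the definition, used only for checking."""
    n = len(a)
    return [sum(a[j] * cmath.exp(2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]


data = [1, 2, 3, 4, 5, 6, 7, 8]
print(all(abs(u - v) < 1e-9 for u, v in zip(recursive_fft(data), naive_dft(data))))
```

Each recursion level allocates fresh lists through slicing, which corresponds to the local vectors a0[] and a1[] of the pseudo-code.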

In the computational algorithms, the complex roots of unity have become known as twiddle-factors, since they are viewed as coefficient corrections, e.g. in $t_l$ above.

The computational complexity of Recursive_FFT() is O(n log n). Indeed, there are log n recursion levels, with iteration loops of 2 x n/2, then 4 x n/4, etc. We show in section 3.2.7 that the computational complexity of DIF versions of Cooley-Tukey is also O(n log n).

2.1.8 Towards Iterative Algorithm for Radix-2 DIT FFT

The code above is recursive, with overheads for the calls/returns and local vectors. Also, the value ω*y1[k] is computed twice, when added and when subtracted. It could be assigned to a variable and then reused with the sign reversed; this is known as a butterfly operation.

Follow the exact order of the recursive evaluation for n = 8:

    after the first split:   a[0] a[2] a[4] a[6] a[1] a[3] a[5] a[7]
    after the second split:  a[0] a[4] a[2] a[6] a[1] a[5] a[3] a[7]

With the indices expressed in binary, the final order is 000, 100, 010, 110, 001, 101, 011, 111. This is a case of bit-reversal permutation: binary numbers sorted by the least-significant (LS) bit first. The permutation is easy to compute in less than O(n log n) time, thus it does not affect the complexity. It is then straightforward to write an iterative version of the above algorithm; refer to [6] for a suggested implementation.
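A minimal Python sketch of the iterative variant is given here for illustration. It is not the implementation suggested in [6]; the names `bit_reverse_permute` and `iterative_fft` are hypothetical, the size is assumed to be a power of two, and the sign convention $\omega_n = e^{2\pi i/n}$ of this chapter is kept.

```python
import cmath


def bit_reverse_permute(a):
    """Return a copy of a with element i moved to the bit-reversed index of i."""
    n = len(a)
    bits = n.bit_length() - 1            # n is assumed to be a power of two
    out = [0] * n
    for i in range(n):
        rev = 0
        for b in range(bits):
            if (i >> b) & 1:
                rev |= 1 << (bits - 1 - b)
        out[rev] = a[i]
    return out


def iterative_fft(a):
    """Iterative radix-2 DIT FFT built from butterfly operations."""
    n = len(a)
    y = [complex(v) for v in bit_reverse_permute(a)]
    size = 2
    while size <= n:
        omega_m = cmath.exp(2j * cmath.pi / size)
        for start in range(0, n, size):
            omega = 1 + 0j
            for k in range(size // 2):
                u = y[start + k]
                t = omega * y[start + k + size // 2]   # butterfly: one multiplication
                y[start + k] = u + t
                y[start + k + size // 2] = u - t
                omega *= omega_m
        size *= 2
    return y
```

For eight elements, `bit_reverse_permute(list(range(8)))` reproduces the order 0, 4, 2, 6, 1, 5, 3, 7 listed above.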

Applications of the FFT

A possible application is the digital implementation of AM or FM radio. Presented in Figure 2.1 (Arrow Electronics) is Amplitude Modulation (AM) of carrier 14kHz by 1kHz. The 1kHz amplitude over time can be found (demodulated) via spectrum analysis using FFT.

Figure 2.1. Amplitude Modulation

At a DSP training course we were presented with a software traffic radar detector too.

Coming back to polynomials, straightforward multiplication (convolution of coefficient vectors) was shown to be O(n^2). Alternatively (per Schönhage & Strassen):

    Evaluate both polynomials at complex roots of unity, using FFT: O(n log n);
    Point-wise multiplication of the two point-value representations (PVRs): O(n);
    Interpolate the result using inverse FFT: O(n log n);

The overall running time is therefore O(n log n). One important application is the efficient multiplication of large prime numbers needed in cryptography: the decimal representation of an integer can be viewed as a polynomial with its digits as the coefficients.

In modern wireless communication systems, both for voice and wideband data transmission, OFDM (Orthogonal Frequency Division Multiplexing) is used. It also plays an important role in wire-line communication systems. Examples of widely popular standards relying upon it are 802.11a/g, 802.16, DVB, DAB, VDSL, and so on.
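The three-step polynomial multiplication can be sketched in Python as follows. This is an illustrative sketch, not a production routine: `fft` and `poly_multiply` are hypothetical names, coefficients are assumed to be small integers so that rounding the result is safe, and the inverse transform is obtained by flipping the sign of the exponent and dividing by the size.

```python
import cmath


def fft(a, sign=1):
    """Recursive radix-2 FFT; sign=+1 evaluates, sign=-1 is the unscaled inverse."""
    n = len(a)
    if n == 1:
        return list(a)
    y0 = fft(a[0::2], sign)
    y1 = fft(a[1::2], sign)
    y = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * y1[k]
        y[k] = y0[k] + t
        y[k + n // 2] = y0[k] - t
    return y


def poly_multiply(p, q):
    """Multiply integer-coefficient polynomials (lowest degree first)."""
    result_len = len(p) + len(q) - 1
    size = 1
    while size < result_len:             # pad to a power of two
        size *= 2
    fp = fft([complex(c) for c in p] + [0j] * (size - len(p)))   # evaluate
    fq = fft([complex(c) for c in q] + [0j] * (size - len(q)))
    prod = fft([x * y for x, y in zip(fp, fq)], sign=-1)         # interpolate
    return [round((v / size).real) for v in prod[:result_len]]
```

As a usage example, the digits of 123 and 45 (least significant first) are `[3, 2, 1]` and `[5, 4]`; evaluating the product polynomial at 10 gives the integer product, which is how the digit-polynomial view connects to integer multiplication.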

In these systems, DFT/inverse DFT, implemented as FFT/inverse FFT both in software and hardware, is the core component in OFDM transmission and reception. Those systems require FFT/IFFT of lengths ranging from 64 to 8192. Thus the FFT needs to be evaluated on the widest possible variety of computing platforms.

FFT on Multi-core Platforms

The majority of today's computers are multi-core, with more than one processor/arithmetical-logical unit (ALU) on a single chip. Many use truly shared memory that can be accessed from any core; some, like IBM's CellBE, do not: their data needs to be copied by Direct Memory Access (DMA). Some use private cache with shared memory, hence the false sharing problem: data may be present in the wrong core's cache, where it is not needed but takes the space of data that is needed for the computation.

2.2 Theoretical Arithmetic Complexity of the FFT

Asymptotically, FFT for size N takes O(N log N) operations. Extensive research has been carried out on the actual operations count, i.e. on the constant factor before N log N (known to be 5 for Cooley-Tukey), as well as the remaining (non-dominating) terms of the complexity expression. The arithmetic complexity is expressed in the number of floating point operations (FLOPs), additions and multiplications, as a function of the problem size N.

No tight lower limit is known. Until recently (2007), the best known result had been achieved by Yavne in 1968, his split-radix algorithm running in $4N \lg N - 6N + 8$, where $\lg$ means $\log_2$. In [14] Johnson and Frigo give an explanation of that result, and publish an improved count of

$$\frac{34}{9} N \lg N - \frac{124}{27} N - 2 \lg N - \frac{2}{9}(-1)^{\lg N} \lg N + \frac{16}{27}(-1)^{\lg N} + 8,$$

also based on the split-radix method: decimation into 3 sub-problems, one of which consists of the even-indexed terms, the other two respectively indexed 1 and 3 modulo 4.

There is also research available that focuses on minimising the floating-point multiplications, but this is achieved with a lot more additions. Such results could be useful on some platforms, particularly dedicated signal-processing FPGAs or processors (including some DSPs) that do not support floating point arithmetic in hardware. For normal processors, especially more recent ones, the arithmetic complexity in the original sense (multiplications and additions together) is more relevant, since a multiplication takes the same number of processor cycles as an addition.

However, experiments have shown that performance for the same theoretical FLOP count can differ dramatically depending on data space locality, pipelining, and other features of modern processors. This is the topic of the next section.

The twiddle-factors are mostly assumed to be (efficiently) pre-computed and then reused. Efficient computation of these has focused mainly on avoiding repeated calculations and on using some properties of the complex roots of unity, like their periodicity: e.g. it is sufficient to calculate the ones in the 1st Cartesian quadrant, then the others are obtained via multiplying by $i$, $-1$ or $-i$. These multiplications do not involve FLOPs, as they can be expressed as sign inversions and swapping of the real part with the imaginary part.

It is convenient to calculate the twiddle factors once, store the values and reuse them with new FFTs of the same size. Almost all existing software libraries provide an API to populate an array of twiddle-factors given the FFT size, instead of pre-computed tables.
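The quadrant property can be illustrated with a small Python sketch. It assumes the size is divisible by 4; the function name `twiddles_by_quadrant` is hypothetical and does not correspond to any particular library's API.

```python
import cmath


def twiddles_by_quadrant(n):
    """All n-th roots of unity (n divisible by 4), in ascending powers.

    Only the first quadrant is computed with exp(); the other three are
    obtained by swapping real and imaginary parts and flipping signs.
    """
    quarter = n // 4
    first = [cmath.exp(2j * cmath.pi * k / n) for k in range(quarter)]
    table = list(first)                                      # times 1
    table += [complex(-w.imag, w.real) for w in first]       # times i
    table += [complex(-w.real, -w.imag) for w in first]      # times -1
    table += [complex(w.imag, -w.real) for w in first]       # times -i
    return table
```

The three derived quadrants use no floating-point multiplications or additions, only negations and swaps, which is the point made in the paragraph above.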

2.3 Decimation, Parallelism and Multi-core

Both DIF and DIT allow divide-and-conquer parallelisation and also apply to size factors; algorithms exist for prime size (Rader's) or co-prime size factors (Good-Thomas). For large numbers of processors, e.g. on computing arrays or hyper-cubes, strategies have been developed to distribute work equitably (load-balancing). On smaller CMP multicores, efforts have been applied mostly to improve the use of caches and pipelines instead: the bit-reversal permute and the stride from $a_0$ to $a_{n/2}$ create a problem in the presence of cache, as they dramatically decrease speed for FFT sizes that do not fit in it.

One solution is the Six-step approach ([9], [7]): decimate size n = pq into a p x q matrix, each row with coefficients of a size-q FFT, and compute the original FFT in these steps:

    1) Transpose: more cache-efficient due to stride smaller than n/2;
    2) Perform FFT by rows: more cache-efficient due to size much smaller than n;
    3) Combine results with twiddle-factors of much lower degree;
    4) Transpose;
    5) Perform FFT by rows, this combines the parts of the decimation;
    6) Transpose.

Most known modern solutions involve techniques to address both parallelism and locality. Some achieve it through a dedicated planner run to find one efficient solution from among a space of many, for a particular problem size and (multi-core) platform.

2.3.1 Pipelines and Single Instruction/Multiple Data (SIMD)

Space locality is not the only feature to be considered on modern computing platforms. Many, especially the ones dedicated to efficient computations such as DSPs, have been optimised for certain predictable patterns of instructions and data that occur in time.

For example, a common characteristic is the efficient carrying out of a sequence of multiply-and-accumulate (MAC) operations into a sum s: $s \leftarrow s + a_i \cdot b_i$, for a sequence of values i, or the ability to perform with higher efficiency the same operation on a relatively small array/vector of values: vectorization, a.k.a. Single Instruction Multiple Data (SIMD). Another one is instruction pipelining: phase 2 of instruction #i runs in parallel to phase 1 of instruction #i+1, so circuitry parts don't wait idle, similar to a conveyor belt. Sometimes in research the term time locality is used to denote any of the above.

2.3.2 Stockham Autosort and Pease's Algorithm

Certain algorithms have been proven to fit well with space or time locality. One favoured method with SIMD is the Stockham Autosort (1966). Alternatively to the 6-step, it embeds the reverse-bit permute into the butterfly: when computing an FFT decimation stage, the results go into locations of an intermediate working array, determined by transposing (flipping) bit positions in the index's binary representation. Stockham is called auto-sort since no separate re-ordering/transpose step is required.

Pease's algorithm (1968) promises even greater SIMD advantages but requires a separate perfect-shuffle permute stage and O(N log N) auxiliary storage. In [15] he introduces Kronecker product notation to express any permute via matrix transpositions.

Stockham efficiency deteriorates dramatically when the size exceeds 1/3rd of the cache; indeed, apart from the input array, for each phase it needs two alternating working storage arrays of the same size as the input array, plus more for temporary variables etc.
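To make the autosort idea concrete, here is a minimal Python sketch of a radix-2 Stockham-style FFT. It is an illustration under assumptions, not the formulation of the 1966 work: the name `stockham_fft` and the two ping-pong buffers `src` and `dst` are hypothetical, the size is assumed to be a power of two, and the convention $\omega_n = e^{2\pi i/n}$ of this chapter is kept. Each stage writes its results directly to their sorted positions in the other buffer, so no separate bit-reversal pass appears.

```python
import cmath


def stockham_fft(x):
    """Radix-2 Stockham-style autosort FFT (power-of-two length).

    Two working arrays alternate between stages; the output of every
    butterfly is stored where the next stage expects it.
    """
    n = len(x)
    src = [complex(v) for v in x]
    dst = [0j] * n
    half = n // 2        # number of butterfly groups in this stage
    stride = 1           # distance between elements of one sub-transform
    while half >= 1:
        for p in range(half):
            w = cmath.exp(2j * cmath.pi * p / (2 * half))
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src          # ping-pong the working arrays
        half //= 2
        stride *= 2
    return src
```

The extra buffer `dst` is the storage overhead mentioned above: together with the input, the working set is a multiple of the transform size, which is why the method suffers once that no longer fits in cache.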

Chapter 3: Generic Design

This chapter introduces our approach to the solutions, and describes these in detail.

3.1 The Solutions

Introducing the Middle-Heap

Coming back to twiddle factors, we accept that it is unwise to store them permanently, for all possible problem sizes. We argue however that we can compute them incrementally for the most widely used sizes: powers of two. First define array $H_3$: $H_3 = \{-1;\ i, -i\}$; at step $n > 1$, using the array of size $2^n - 1$, compute the entries of the array of size $2^{n+1} - 1$ at indices $2^n$ to $2^{n+1} - 1$ (the odd powers of $\omega_{2^{n+1}}$), which is possible to do in parallel to sampling. So $H_7 = \{-1;\ i, -i;\ \omega_8, \omega_8^3, \omega_8^5, \omega_8^7\}$, etc.

The above requires a matching data structure: consider a tree that grows by doubling its size with each new level added. It is most appropriate to use a heap (see [6]), since the underlying storage construct is an array.

The heap is a binary tree completely filled with elements, except possibly at the level of the leaves, stored in an array. The root is stored at index 1 of the array; if a parent has index $i$, its left child has index $2i$, and its right child has index $2i+1$. Each element is, or

has, a key value from a partially or totally ordered set. When all parents are larger than their children, the heap is a max-heap. Here is an example from [6]:

A (max-) heap viewed as (a) a binary tree and (b) an array. The number within the circle at each node in the tree is the value stored at that node. The number above a node is the corresponding index in the array. Above and below the array are lines showing parent-child relationships; parents are always to the left of their children. The tree has height three; the node at index 4 (with value 8) has height one.

Figure 3.1. Max-Heap

Similarly, a min-heap is the same structure but parents are smaller than their children.

Definition 2: An array $H$ of size $N$ in which for any natural $i \le N$ the following holds: if $2i \le N$ then $H[i] \le H[2i]$, and if $2i < N$ then $H[i] \le H[2i+1]$, is called a min-heap. When either of them exists within the array $H$, $H[2i]$ is defined as the left child of $H[i]$, and $H[2i+1]$ is defined as the right child of $H[i]$ in the tree of min-heap $H$; $H[i]$ is by definition the parent of $H[2i]$ and $H[2i+1]$. $H[1]$ is the tree root of min-heap $H$. A min-heap $H$ is complete when its size $N$ is of the form $N = 2^n - 1$ for some $n > 1$.

Given the index i, the indices of its left child LEFT(i), right child RIGHT(i) and (when the node is non-root) of its parent PARENT(i) are computed by pseudo-code as follows:

    PARENT(i)  return ⌊i/2⌋
    LEFT(i)    return 2i
    RIGHT(i)   return 2i + 1
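The index arithmetic and Definition 2 can be sketched in Python as follows. The helper names `in_order` and `is_min_heap` are hypothetical additions for illustration; slot 0 of the list is left unused so that the indices match the 1-based convention of the text.

```python
def parent(i):
    return i // 2


def left(i):
    return 2 * i


def right(i):
    return 2 * i + 1


def in_order(heap, i=1):
    """Yield the values of a 1-based array heap (heap[0] unused) in in-order."""
    if i < len(heap):
        yield from in_order(heap, left(i))
        yield heap[i]
        yield from in_order(heap, right(i))


def is_min_heap(heap):
    """Check Definition 2 on a 1-based array of mutually comparable keys."""
    size = len(heap) - 1
    for i in range(1, size + 1):
        if left(i) <= size and heap[i] > heap[left(i)]:
            return False
        if right(i) <= size and heap[i] > heap[right(i)]:
            return False
    return True
```

For example, traversing the three-element heap holding -1 at the root with children i and -i in in-order yields i, -1, -i, i.e. the first, second and third powers of the primitive fourth root of unity.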

For the complex roots of unity $\omega_n^p$, let the partial order be ascending by degree $n$, and let the total order additionally be (for two roots with equal degrees $n$) ascending by $p$.

Definition 3: A complete min-heap of size $N = 2^n - 1$ containing all the twiddle-factors smaller than or equal to $\omega_{N+1}^{N}$ is called a middle-heap of size $N$, denoted $H_N$.

We use only middle-heaps sorted per the total order defined above, as in Figure 3.2:

Figure 3.2. Middle-Heap (levels shown: $-1$; then $i, -i$, completing $H_3$; then the odd powers of $\omega_8$, completing $H_7$; then the odd powers of $\omega_{16}$)

3.1.2 Incremental Computation of Twiddle Factors: the Mean-middle Method

When the $H_N$ middle-heap is in-traversed, the twiddle factors are returned in ascending powers of the $(N+1)$-th primitive root of unity, except the trivial one $\omega_{N+1}^{N+1} = 1$. For example in $H_7$ (Figure 3.2), the traversal starts from the left-most leaf $\omega_8 = e^{2\pi i/8}$, then visits its parent $i = \omega_8^2$, followed by its sibling $\omega_8^3$, then goes up the tree to the root $-1 = \omega_8^4$, then down the left side of the right sub-tree of the root to the leaf $\omega_8^5$, etc.

Note that in $H_7$ the leaves have this property: each leaf except the left-most is the mean average of two adjacent traversed values from $H_3$, scaled with coefficient $c_3$:

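$$\omega_8^3 = c_3\,\frac{i + (-1)}{2}, \qquad \omega_8^5 = c_3\,\frac{(-1) + (-i)}{2}, \qquad \omega_8^7 = c_3\,\frac{(-i) + 1}{2}.$$

The left-most leaf is $\omega_8 = c_3\,\frac{1 + i}{2}$. Thus we can work cyclically from 1 and the first in-traversed value, then drop the 1 and combine the first with the second in-traversed value, then continue in the same way, until in the last step 1 is used again; proof follows.

Consider Figure 3.3; $m_8$ and $m_{16}$ are the midpoints of the respective segments.

Figure 3.3. Means of Adjacent Twiddle Factors

The following equalities hold: $\omega_8 = c_3 m_8 = c_3\,\frac{1 + i}{2}$, where $\omega_8 = e^{2\pi i/8}$, $c_3 = 1/|m_8| = \sqrt{2}$ and $i = \omega_4 = e^{2\pi i/4}$; $\omega_{16} = c_4 m_{16} = c_4\,\frac{1 + \omega_8}{2}$ for $c_4 = 1/|m_{16}|$, where $m_{16} = \frac{1}{2}\left(1 + \cos\frac{\pi}{4} + i \sin\frac{\pi}{4}\right)$. Similarly, $\omega_{16}^3 = c_4\,\frac{\omega_8 + \omega_8^2}{2}$ (note that $\omega_{16}^2 = \omega_8$), etc.

Lemma 4: For any natural $k$ and constant $c_{n+1}$ such that $c_{n+1} \left|\frac{1 + \omega_{2^n}}{2}\right| = 1$:

$$\omega_{2^{n+1}}^{2k+1} = c_{n+1}\,\frac{\omega_{2^n}^{k} + \omega_{2^n}^{k+1}}{2}.$$

Proof: Since $|\omega_{2^n}| = |1| = 1$, the sum $1 + \omega_{2^n}$ is collinear with the bisector of the angle given by the argument of $\omega_{2^n}$. So is $\omega_{2^{n+1}}$, since per the Halving lemma $\omega_{2^{n+1}}^2 = \omega_{2^n}$, and multiplication by $\omega_{2^{n+1}}$ is rotation by its argument. Denote $c_{n+1} = 1 \big/ \left|\frac{1 + \omega_{2^n}}{2}\right|$. Consider $c_{n+1}\,\frac{1 + \omega_{2^n}}{2}$: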

its modulus is 1, i.e. equal to that of $\omega_{2^{n+1}}$. Since it is also collinear to $\omega_{2^{n+1}}$, and both are in the first quadrant, they are equal: $\omega_{2^{n+1}} = c_{n+1}\,\frac{1 + \omega_{2^n}}{2}$. Now multiply both sides by $\omega_{2^{n+1}}^{2k} = \omega_{2^n}^{k}$:

$$\omega_{2^{n+1}}^{2k+1} = c_{n+1}\,\omega_{2^n}^{k}\,\frac{1 + \omega_{2^n}}{2} = c_{n+1}\,\frac{\omega_{2^n}^{k} + \omega_{2^n}^{k+1}}{2}.$$

We can compute the leaves for the next middle-heap degree (the odd powers of the primitive root of unity for the next power of two) by finding mean averages of consecutive traversed values from the middle-heap of the previous degree, and multiplying by a real-valued constant c. The constant can be calculated once for each size. This allows us to work out each twiddle factor with roughly one complex addition, i.e. two real-valued additions, and two (real-valued) multiplications; add the O(1) time of working out a constant, for each next power-of-two count.

We ignore the divide by two in computing the mean average: it is a cheap right shift of the mantissa, or (even better for loss of precision) a decrement of the floating point value's (binary) exponent. The factor 1/2 can be hidden in the constant, but below we will note a benefit if c is close to 1.

Note that this calculation can be carried out independently of the middle-heap data structure; we show in a later section that the middle-heap traversal takes linear time with a small constant.

Call this the Mean-middle method. Its arithmetic complexity is better than that of any other known method of computing twiddle factors. This is a major contribution of this work. We also show that, due to its binary-chop pattern, the precision of calculating the twiddle factors in this way is also superior to other methods known so far.
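The Mean-middle step can be illustrated with a short Python sketch. It works on the in-order list of twiddle factors rather than on the heap array, which the text notes is possible. Several details are assumptions made for the illustration only: the function name `mean_middle`, starting from the traversal of $H_3$, and obtaining the constant as $c = 2/|1 + \omega_m|$ from the first traversed value (the text only requires that the constant be calculated once per size).

```python
def mean_middle(n):
    """Return omega_m^1 .. omega_m^(m-1) for m = 2**n (n >= 2), ascending powers.

    Each pass derives the odd powers of the next root as scaled means of
    adjacent values, working cyclically from 1 and back to 1.
    """
    values = [1j, -1 + 0j, -1j]                  # in-order traversal of H_3 (m = 4)
    for _ in range(n - 2):
        c = 2 / abs(1 + values[0])               # c = 1 / |(1 + omega_m) / 2|
        ring = [1 + 0j] + values + [1 + 0j]
        leaves = [c * (ring[k] + ring[k + 1]) / 2 for k in range(len(ring) - 1)]
        merged = []
        for k, leaf in enumerate(leaves):        # interleave: leaf, old value, leaf, ...
            merged.append(leaf)
            if k < len(values):
                merged.append(values[k])
        values = merged
    return values
```

Calling `mean_middle(3)` gives the seven values that an in-order traversal of $H_7$ would return; apart from the per-size constant, each new leaf costs one complex addition and a scaling by a real number.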

Load-Balancing for Incrementality

Our approach is that cores specialise in parts of the FFT computation. This is intuitively expected to improve time locality/SIMD overall. There is work for four cores at least:

    C1. Sampling and Data Conversion;
    C2. Twiddle Factor Computation;
    C3. Shuffle, or Matrix Transpose (for p x q decimation, if six-step is used, see section 2.3);
    C4. Butterfly Operation and any other FLOPs.

Under the premise that the computation is incremental, and progresses in parallel with the sampling process, from among the variants of the classical Cooley-Tukey algorithm, Decimation in Frequency (DIF) is a better candidate than Decimation in Time (DIT): the former allows the computation to proceed with the lower half of samples, while the upper half is not yet available.

All four cores progress in parallel, and any core except the first may have to wait for its previous one to complete a part of its work, in order to make available some data needed for its computation. It appears wise for C3 to wait on C2 before signalling the release of C4, rather than for C4 to wait on both C2 and C3; cascading synchronisation is simpler.

The synchronisation mechanism suggested herein is spin-lock: the overhead is minimal, since there is no context-switching. That said, platform features are welcome, like the register files for fast switching of two threads in the PPE core of the CellBE processor.

Note that the work of C1 is not trivial: samples are normally taken in fixed-point arithmetic from Analog-to-Digital Converters (ADCs) and each needs to be converted to floating-point according to some function, in most cases, but not always, a linear one.
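The cascading synchronisation can be sketched with Python threads, purely as an illustration of the waiting pattern and not of real multi-core performance. Everything named here is hypothetical: `run_cascade`, the progress counters in `done`, and the linear conversion `adc_to_float` with an assumed 16-bit full scale. The sketch simplifies the four cores to a linear chain in which each stage spins only on its immediate predecessor.

```python
import threading
import time


def adc_to_float(raw, full_scale=32768.0):
    """C1-style conversion: assumed linear mapping of a 16-bit sample to [-1, 1)."""
    return raw / full_scale


def run_cascade(samples, stages):
    """Run stages as threads; stage s spin-waits on stage s-1 only."""
    n = len(samples)
    out = [[None] * n for _ in stages]   # out[s][i]: result of stage s for item i
    done = [0] * len(stages)             # done[s]: items published by stage s

    def worker(s):
        for i in range(n):
            if s > 0:
                while done[s - 1] <= i:  # spin until the predecessor published item i
                    time.sleep(0)        # yield so that other Python threads can run
            source = samples[i] if s == 0 else out[s - 1][i]
            out[s][i] = stages[s](source)
            done[s] = i + 1              # publish only after the data is in place

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(len(stages))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out[-1]
```

A call such as `run_cascade([0, 16384, -32768], [adc_to_float, lambda v: 2 * v, lambda v: v + 1])` pushes each sample through the chain as soon as the previous stage has released it, which mirrors the idea of computing while sampling is still in progress.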
