
An edited version of this work was published in IEEE TRANS. ON CIRCUITS AND SYSTEMS II, VOL. 62, NO. 9, SEPT. 2015. DOI: 10.1109/TCSII

High-Throughput FPGA Implementation of QR Decomposition

Sergio D. Muñoz and Javier Hormigo

Abstract: This brief presents a hardware design to achieve high-throughput QR decomposition using the Givens Rotation Method. It utilizes a new two-dimensional systolic array architecture with pipelined processing elements, which are based on the COordinate Rotation DIgital Computer (CORDIC) algorithm. CORDIC computes vector rotations through shifts and additions. This approach allows a continuous computation of QR factorizations with simple hardware. A fixed-point FPGA architecture for 4 × 4 matrices has been optimized by balancing the number of CORDIC iterations against the final error. As a result, compared to previous FPGA proposals, our design achieves at least 50% more throughput and much lower resource utilization.

Index Terms: QR decomposition, systolic array, pipelined, FPGA, high-throughput, CORDIC.

I. INTRODUCTION

MOST of the advanced signal processing algorithms are based on algebraic matrix operations. Many examples are found in wireless communication, such as multiple-input multiple-output (MIMO) systems, beam-forming, multi-user detection and cancellation, etc. [1]. One useful operator for these matrix operations is QR factorization, especially for MIMO technologies [2], [3] and adaptive filtering [4]. Some of these applications require high-throughput QR decomposition, but only for small matrices. Thus, many works have addressed the parallel hardware implementation of this operation for either ASIC or FPGA technologies. In this work, we focus on high-throughput computation for small matrices on FPGAs.

The Givens Rotation Method (and its variations) is probably the most widely used method to implement QR decomposition in hardware, due to its robust numerical properties and its easy parallelization [5]. In the literature, there are several papers in which QR factorization has been implemented on FPGA using this method.
Although serial approaches or linear systolic arrays may be used [6], the most common hardware implementation for achieving high throughput is a two-dimensional (2D) systolic array, as in [7], [8], [], [9], [10], []. A 2D systolic array is a parallel grid structure in which processing elements (PEs) work in parallel and are locally interconnected. This systolic architecture allows the exploitation of the different degrees of parallelism inherent to the Givens Rotation algorithm. Thus, these approaches have high throughput and relatively low latency, at the cost of considerable area consumption.

In this work, by combining several ideas, we have designed a new architecture which improves on previous high-throughput FPGA implementations. It is based on the CORDIC algorithm to simplify the hardware, on pipelining the PEs to obtain better throughput, and on a different schedule for performing the Givens Rotations to reduce latency. As a result, the proposed architecture has very high throughput and low latency, with a relatively small area consumption. It also has very simple control and communication logic.

The next sections of this brief are organized as follows. Section II reviews some important aspects of QR decomposition using Givens Rotations, along with a brief review of previous works proposed in the literature. Section III presents the proposed architecture to achieve high throughput. In Section IV, the results of the FPGA implementation are studied and compared with previous works. Finally, Section V provides the conclusions of this work.

This work was supported in part by the Ministry of Education and Science of Spain and Junta of Andalucía under contracts TIN03-53-P and P07-TIC-0630, respectively. The authors are with the Department of Computer Architecture, Universidad de Málaga, Málaga E-907 Spain ( smuoz@uma.es; fjhormigo@uma.es). Copyright © 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

II. GIVENS ALGORITHM AND PREVIOUS FPGA IMPLEMENTATIONS

A given matrix A of size m × n can be written as the product of two factors, i.e., A = QR, in which the m × m matrix Q is orthogonal and the m × n matrix R is upper triangular [5]. The computation of these two factors is called QR decomposition or factorization. The Givens Method achieves a QR factorization through unitary transformations, called Givens Rotations, each of which selectively introduces one zero element [5]. A Givens rotation matrix is a rank-two correction to the identity matrix, in which the entries at the intersections of rows and columns i and j are replaced by orthogonal values based on sines and cosines.

\[
\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \end{bmatrix}
=
\begin{bmatrix} a'_1 \\ 0 \end{bmatrix}
\tag{1}
\]

As an example, a Givens rotation is represented in Eq. (1) for a two-element column, where the resulting matrix has a newly inserted zero; this can be extrapolated to any other matrix size. The rotation angle θ must be computed beforehand by the formula \(\theta = \arctan(a_2 / a_1)\). Alternatively, the sine and cosine values can be calculated directly by Eq. (2) and Eq. (3).

\[
\cos\theta = \frac{a_{i,k}}{\sqrt{a_{i,k}^2 + a_{j,k}^2}}
\tag{2}
\]

\[
\sin\theta = \frac{a_{j,k}}{\sqrt{a_{i,k}^2 + a_{j,k}^2}}
\tag{3}
\]

Accordingly, the Givens Method starts zeroing the lower elements, from the first column to the last one and, within each column, from the bottommost element up to the diagonal element. The upper triangular matrix R is obtained by accumulating the Givens Rotations on the initial matrix. Similarly, Q is obtained when the same rotations are applied to the identity matrix.

Fig. 1. Usual Givens rotation schedule for 4 × 4 matrices. *Gk(i,j) = Givens Rotation of rows i and j; k is the column where a zero is inserted.

Fig. 2. Column-based 2D systolic array for 4 × 4 matrices (PEs of type V and type R).

Fig. 3. Row-based 2D systolic array for 4 × 4 matrices.

As an example, Fig. 1 illustrates the application of the Givens Method to a 4 × 4 matrix to obtain an upper triangular matrix R in 6 stages. Each arrow represents a Givens Rotation, where G_k(i, j) specifies the rows involved (i, j) and the column k where a zero will be inserted. The circular areas indicate the elements selected to calculate the rotation angle, whereas the squared areas delimit the elements that will be rotated using said angle. Clearly, this algorithm has different levels of parallelism that could be exploited depending on the selected architecture.

Several works, such as [10] and [7], have proposed a 2D systolic array similar to the one shown in Fig. 2. In this architecture, each PE always works on elements of the same column. On each row of the array, the Givens rotations may be performed in parallel using as many PEs as there are non-zero elements in the corresponding matrix row. Besides this parallel computation, this configuration has the advantage of using two different types of PEs: one (V) to compute the rotation angle, which in principle requires much more complicated operations, and another (R) to perform the actual rotations, which is much simpler. Each row of PEs needs only one PE of type V, and the rest are of type R. Thus, although more PEs are needed, the number of type-V PEs (much more complex) is reduced and, consequently, the overall area may also be reduced.
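To make the classic procedure of this section concrete, the following is a minimal floating-point sketch in Python of a Givens QR factorization that zeroes each column from the bottom up, using Eq. (2) and Eq. (3) for the cosine and sine. The function name `givens_qr` and the list-of-lists matrix representation are assumptions made for illustration; the sketch is a software reference only and does not model the hardware arrays discussed here.

```python
import math


def givens_qr(a):
    """Factor `a` (list of rows) by Givens rotations.

    Returns (qt, r) where r is upper triangular and qt is the product of
    all rotations, so that qt * a = r and a = transpose(qt) * r.
    """
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    # Rotations are also applied to the identity to accumulate qt.
    qt = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for k in range(min(n, m - 1)):          # first column to last
        for j in range(m - 1, k, -1):       # bottommost element upwards
            i = j - 1                       # rotate adjacent rows (i, j)
            x, y = r[i][k], r[j][k]
            h = math.hypot(x, y)
            if h == 0.0:
                continue                    # nothing to zero
            c, s = x / h, y / h             # Eq. (2) and Eq. (3)
            for mat in (r, qt):
                for col in range(len(mat[0])):
                    u, v = mat[i][col], mat[j][col]
                    mat[i][col] = c * u + s * v
                    mat[j][col] = -s * u + c * v
            r[j][k] = 0.0                   # remove floating-point residue
    return qt, r


if __name__ == "__main__":
    a = [[4.0, 1.0, 2.0, 3.0],
         [2.0, 5.0, 1.0, 0.0],
         [1.0, 2.0, 6.0, 1.0],
         [3.0, 0.0, 1.0, 7.0]]
    qt, r = givens_qr(a)
    for row in r:
        print(["%8.4f" % v for v in row])
```

For a 4 × 4 input this loop performs six rotations one after another, which corresponds to the six stages of the usual schedule.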
This architecture is used in [7], where standard arithmetic operations are utilized to implement the PEs. In [10], iterative CORDIC circuits are used instead, which reduces area consumption. However, in these approaches, due to data dependencies between consecutive rotations, the PEs of the last rows are idle most of the time, which means a significant waste of resources. Besides this, the same data dependency prevents the use of pipelining inside the PEs, which limits the achievable throughput.

A different and older approach is the one used in [8] and [], which is shown in Fig. 3. In this scheme, a PE performs a complete Givens rotation for all the elements of the two rows. Thus, the two operations involved in a Givens rotation have to be combined in one PE, making it more complex, although far fewer PEs are required. The main advantage of this approach is that the only data dependency which prevents pipelining within the PEs is the one between the computation of the rotation angle and the rotation itself. In [], the authors propose to interleave columns of different input matrices to overcome this dependency, but this is impractical for many applications, especially for deep pipelines. On the other hand, Square Root and Division Free Givens Rotations (SDFG) [13] are utilized in [8], where this dependency is eliminated by means of pre- and post-processing, which allows pipelining of the PEs. Thus, this architecture achieves high throughput, but the complexity of the operations involved also requires a high utilization of resources.

III. PROPOSED ARCHITECTURE

Similarly to the work in [8], we propose to use a 2D-array architecture where each PE works with all the elements of the same row and where these PEs are pipelined to achieve high throughput. At the same time, we propose different connections between the PEs within the 2D array to reduce latency. Moreover, the PEs are designed based on the CORDIC algorithm to implement this pipeline in a simpler way, which produces a system with lower area and higher throughput. Next, we present some details of this architecture.

A. Givens Rotations Schedule

The classic schedule to implement the Givens algorithm, as previously described in Fig. 1, starts by zeroing the bottommost element of the first column and serially continues up the same column until this column is finished. Then, the same procedure is performed on the next column, and so on, until the matrix is triangular. The 2D systolic array improves this schedule by performing several rotations at the same time. Concretely, as indicated in Fig. 3, all PEs in the same diagonal work in parallel, significantly reducing the number of steps required for one matrix computation.

However, this schedule can be performed with a higher level of parallelism if as many rows as possible are rotated simultaneously. The algorithm thereby decreases latency by reducing its number of steps. We should note that the total number of Givens rotations remains the same, but the number of Givens rotations per step is increased. To reduce the number of steps as much as possible, on each step all suitable pairs of rows (i.e., two not-yet-selected rows that contain the same number of zeros to the left) are selected and rotated in parallel. This is repeated until an upper triangular matrix is achieved.

Fig. 4. Givens rotations scheduled for increasing parallelism. *Gk(i,j) = Givens Rotation of rows i and j; k is the column where a zero is inserted.

Fig. 4 illustrates how a 4 × 4 matrix is factorized using this schedule. In this figure, two types of lines are used, dotted and continuous; each one represents one of the Givens rotations performed simultaneously. In the first stage, two Givens rotations are performed concurrently on the adjacent row pairs (1, 2) and (3, 4). This inserts two zeros, in rows 2 and 4, as shown in the second stage. Next, rotations G_1(1, 3) and G_2(2, 4) are calculated, finishing the computation on the first column. From then on, the first row is not used again, restricting the algorithm to only one rotation per stage for the last two stages. Therefore, only four stages are required using this schedule.

B. CORDIC-based processing elements

CORDIC (COordinate Rotation DIgital Computer) is an iterative algorithm based on shifts and additions, which allows many elementary functions to be calculated with very simple hardware [14].
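Before continuing with the processing elements, the pairing rule of Section III-A (rotate together every pair of rows that share the same number of leading zeros) can be sketched in a few lines of Python. The function name `parallel_givens_schedule` and the tie-breaking choice (rows with equal zero counts are paired in index order) are assumptions of this sketch rather than details stated in the text; each tuple `(k, i, j)` stands for a rotation G_k(i, j) with 1-based indices.

```python
def parallel_givens_schedule(n_rows, n_cols):
    """Return a list of stages; each stage is a list of (k, i, j) tuples.

    (k, i, j) means: rotate rows i and j and insert a zero in column k
    of row j. Rows are paired only when they currently hold the same
    number of leading zeros.
    """
    zeros = [0] * n_rows            # leading zeros currently in each row
    stages = []
    while True:
        # Group row indices by their current number of leading zeros.
        groups = {}
        for row, z in enumerate(zeros):
            if z < n_cols:
                groups.setdefault(z, []).append(row)
        stage = []
        for k in sorted(groups):
            rows = groups[k]
            # Pair rows two by two; an odd row out waits for a later stage.
            for idx in range(0, len(rows) - 1, 2):
                upper, lower = rows[idx], rows[idx + 1]
                stage.append((k + 1, upper + 1, lower + 1))
        if not stage:
            return stages           # no pair left: matrix is triangular
        for _, _, lower in stage:
            zeros[lower - 1] += 1   # the lower row gains one leading zero
        stages.append(stage)


if __name__ == "__main__":
    for number, stage in enumerate(parallel_givens_schedule(4, 4), start=1):
        print("stage", number, stage)
```

For a 4 × 4 matrix the sketch yields four stages with two, two, one and one rotations, which is the same stage count as the schedule described above, compared with six for the classic ordering.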
The same circuit may operate in two modes, vectoring or rotation. The former rotates an input vector (X, Y) until its Y coordinate reaches zero, returning the angle θ and the rotated X coordinate. The latter rotates an input vector (X, Y) by a given angle θ. Therefore, a CORDIC unit could be used to compute the angle for a Givens Rotation (vectoring mode) and then to perform the rotation of the rest of the row using said angle (rotation mode).

Many different circuits based on CORDIC have been proposed in the literature to perform Givens Rotations. To achieve our goal, we have selected the one used to implement a linear systolic array in [15]. It is a pipelined architecture which performs both vectoring and rotation modes. Due to the data dependency of the linear array, the circuit presented in [15] needs matrix interleaving to take advantage of the pipelined design. However, within our row-based 2D systolic array, the pipeline is used in a natural way.

Fig. 5. CORDIC-based processing element (pipelined vectoring/rotation CORDIC with a phase control section and an operation section).

This approach replaces the computation of the rotation angle θ by the direction of each micro-rotation. This direction is given by the sign of Y on each stage and is stored in a register (σ) to be used in subsequent rotations. Thus, this circuit allows the computation of the angle to be overlapped with the rotation of the corresponding rows. Fig. 5 shows this circuit divided into two sections. The right section is the operation section, which contains the typical CORDIC x-y data-path. The left section represents the control hardware which, in vectoring mode, selects the rotation direction and updates the σ registers. These registers configure the adders/subtractors in rotation mode. An active control signal indicates a new angle computation (vectoring mode) on this stage. Then, sign(Y) is used to control the adders/subtractors and is stored in the σ register.
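The idea of storing micro-rotation directions instead of an explicit angle can be illustrated with a small floating-point sketch in Python. It is a behavioural model under stated assumptions, not the circuit of Fig. 5: the function names (`cordic_vectoring`, `cordic_rotation`, `cordic_gain`) are hypothetical, arithmetic is floating point rather than fixed point, and an initial sign flip is added here so that inputs with negative X also converge.

```python
import math


def cordic_vectoring(x, y, iterations):
    """Drive y towards zero; return scaled (x, y) and the stored directions."""
    flip = x < 0                       # assumption: pre-rotate by 180 degrees
    if flip:
        x, y = -x, -y
    sigmas = []
    for i in range(iterations):
        sigma = 1 if y < 0 else -1     # direction decided by the sign of y
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
        sigmas.append(sigma)           # plays the role of the sigma register
    return x, y, (flip, sigmas)


def cordic_rotation(x, y, directions):
    """Replay the stored directions on another pair of elements."""
    flip, sigmas = directions
    if flip:
        x, y = -x, -y
    for i, sigma in enumerate(sigmas):
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
    return x, y


def cordic_gain(iterations):
    """Constant scale factor accumulated over the micro-rotations."""
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 4.0 ** -i)
    return gain


if __name__ == "__main__":
    n = 16
    k = cordic_gain(n)
    x0, y0, dirs = cordic_vectoring(3.0, 4.0, n)   # first column of two rows
    x1, y1 = cordic_rotation(1.0, 2.0, dirs)       # next column, same rows
    print(x0 / k, y0 / k, x1 / k, y1 / k)
```

Because the rotation-mode pass only needs the directions recorded at each stage, the remaining elements of the two rows can enter a pipelined version of this loop one stage behind the vectoring element, without waiting for a final angle.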
While the active control signal goes through the pipeline, the rest of the elements of the corresponding rows are introduced into the circuit to be rotated using said stored directions (rotation mode). Therefore, both computations are overlapped and, furthermore, a new Givens rotation may be introduced into the pipeline before the current one has completely finished. The constant scale factor introduced by the CORDIC algorithm [14] is compensated by multiplying the output values by its inverse. As we will see in the next section, constant multipliers or embedded multipliers could be utilized for this operation.

C. Proposed circuit

Using the schedule described in Section III-A, the proposed architecture is derived by assigning a PE to each Givens rotation. Fig. 6 illustrates the architecture for 4 × 4 matrices. There are four stages, the first two with two PEs each, since two Givens rotations are performed in parallel, and the other two with only one. Input and output buses are connected directly from one stage to the next. Only one FIFO register is required, on stage 3, since one of the rows computed in stage 2 is used in stage 4.

Fig. 6. CORDIC-based architecture implemented to factorize 4 × 4 matrices (four stages of pipelined CORDIC PEs, each with a phase control section and an operation section, plus one delay register).

Not much logic is required for the synchronization of this architecture, due to its pipelined structure. The control signals of the PEs on the first stage may be set externally, or by using a counter if the flow of input matrices is regular. In the next stages, the control inputs are connected to the outputs of the previous stage. In some PEs, the control signal has to be delayed one extra cycle to compensate for the zero elements.

In the first stage, all rows are introduced simultaneously, element by element (input matrix followed by identity matrix). Furthermore, thanks to the fully pipelined architecture, a new matrix computation could start right after the last element is introduced. Therefore, a very high throughput is achieved (for this example, one matrix computation every 8 cycles).

IV. PERFORMANCE ANALYSIS AND COMPARISON

Using the proposed architecture, a VHDL fixed-point QR decomposition core for 4 × 4 matrices has been designed. Said core allows us to configure both the bit-width and the number of CORDIC iterations.
This core has been simulated and synthesized using Xilinx ISE.3 software, and implemented and evaluated using a hardware Virtex-6 XV6VLX0T speed - FPGA platform. To confirm the correctness of the proposed core, it was first tested with a wide range of random matrices and the results were checked using Matlab. Secondly, to improve the area and the latency of the proposed circuit, we have experimentally studied how the number of CORDIC iterations influences the error of its results. To do this, different circuits, using three word-lengths (16, 24 and 32 bits), have been implemented on our hardware platform for several numbers of iterations. Using each one of these circuits, the QR decomposition has been calculated for 50,000 random matrices, and the results have been checked by computing Q^T R and comparing it with the original matrix A.

In Table I, the maximum error detected in these comparisons is presented for each tested configuration. It is clearly observed that, at first, the maximum error decreases when the number of iterations increases, due to the better approximation achieved for the rotation angle. However, at a certain point, the error starts to increase slightly due to the accumulated rounding error. Thus, to obtain minimum error while reducing the area and the latency, the best configurations for 16, 24 and 32 bit-widths are 0, 8, 6 iterations, respectively.

TABLE I
MAXIMUM ERROR OF QR FACTORIZATION FOR 16, 24 AND 32 BIT WORD-LENGTHS DEPENDING ON THE NUMBER OF CORDIC ITERATIONS.

Word-Length 16 bits
  CORDIC-Iter:
  Latency (cycles):
  Max. Error:        .e-3     5.8e-    6.9e-    7.3e-    8.9e-
Word-Length 24 bits
  CORDIC-Iter:
  Latency (cycles):
  Max. Error:        8.06e-5  6.0e-6   3.5e-6   3.7e-6   .7e-6
Word-Length 32 bits
  CORDIC-Iter:
  Latency (cycles):
  Max. Error:        3.3e-7   .e-8     9.e-9    .6e-8    .3e-8

Table II shows the implementation results for the three analyzed word-lengths, each one with three different approaches for the scale factor compensation required by the CORDIC algorithm (see Section III-B). All designs use the optimum number of CORDIC iterations previously computed.
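The error measurement used in the study above (rebuild the input from the two outputs and take the largest deviation) can be mimicked in software. The following Python sketch assumes that the core delivers R together with the rotated identity, here called `qt`, so that the original matrix is recovered as the transpose of `qt` times R. Both function names and the `quantize` helper, which models rounding to a fixed number of fractional bits, are hypothetical and only illustrate where rounding error enters.

```python
def max_reconstruction_error(a, qt, r):
    """Largest absolute entry of (transpose(qt) * r - a)."""
    m, n = len(a), len(a[0])
    worst = 0.0
    for i in range(m):
        for j in range(n):
            rebuilt = sum(qt[k][i] * r[k][j] for k in range(m))
            worst = max(worst, abs(rebuilt - a[i][j]))
    return worst


def quantize(value, frac_bits):
    """Round to the nearest multiple of 2**-frac_bits (fixed-point model)."""
    scale = 2 ** frac_bits
    return round(value * scale) / scale


if __name__ == "__main__":
    a = [[3.0, 1.0], [4.0, 2.0]]
    qt = [[0.6, 0.8], [-0.8, 0.6]]          # exact rotation for this input
    r = [[5.0, 2.2], [0.0, 0.4]]
    print(max_reconstruction_error(a, qt, r))
    # Quantizing the outputs to 8 fractional bits introduces rounding error.
    qt_q = [[quantize(v, 8) for v in row] for row in qt]
    r_q = [[quantize(v, 8) for v in row] for row in r]
    print(max_reconstruction_error(a, qt_q, r_q))
```

Running such a check over many random matrices and keeping the maximum gives a figure comparable in spirit to the entries of Table I, although the values printed by this floating-point model are not expected to match the hardware results.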
Two approaches use the embedded multipliers (DSP48E1) which typically exist in FPGAs, either non-pipelined or pipelined (Multiplier A and Multiplier B, respectively), while the third approach uses pipelined constant-coefficient multipliers (Multiplier C) designed with Xilinx Core Generator. There are no great area differences between the approaches using DSP48, but the pipelined one allows a much higher clock frequency and, consequently, much better throughput at the cost of a moderate latency increase. On the other hand, the number of slices used to implement the constant-coefficient multipliers is relatively high compared to the rest of the circuit. Then, although this approach achieves the same throughput as the pipelined-DSP48 one, and even less latency, it should be selected only if the DSP48 blocks must be saved for other computations.

TABLE II
FPGA IMPLEMENTATION RESULTS FOR 16, 24 AND 32 BITS WORD-LENGTHS.

Device: Xilinx Virtex 6 XC6VLX0T -
Word-Length:                16 bits     24 bits     32 bits
CORDIC-Iterations:
Cycles/Matrix:
Multiplier A: Dedicated Non-Pipelined DSP48E1
  Latency (cycles):
  DSP48E1:                  (%)         (3%)        8(6%)
  Max. Freq. (MHz):
  Slice Registers:          ,0(%)       5,888(%)    0,8(3%)
  Slice LUTs:               ,98(%)      6,09(%)     ,337(7%)
Multiplier B: Dedicated Pipelined DSP48E1
  Latency (cycles):
  DSP48E1:                  (%)         (3%)        8(6%)
  Stages Pipe. Mult.:       6
  Max. Freq. (MHz):
  Slice Registers:          ,57(%)      6,06(%)     ,50(3%)
  Slice LUTs:               ,587(%)     6,0(%)      ,5(7%)
Multiplier C: Const. Coef. Pipelined (without DSP48E1)
  Latency (cycles):
  Stages Pipe. Mult.:       3
  Max. Freq. (MHz):
  Slice Registers:          3,6(%)      8,00(%)     6,59(5%)
  Slice LUTs:               3,6(%)      8,596(5%)   6,00(0%)

Finally, to study the effectiveness of our proposal, we have selected from the literature some representative works which provide enough data to perform a reasonable comparison. Table III shows the main results of these works along with the ones for our 16-bit circuit using Multiplier A. To provide a fair comparison, we have synthesized our architecture on FPGAs equivalent to those used in these works, concretely Virtex-4 (XCVFX60- ) and Virtex-5 (XC5VTX50T-).

TABLE III
COMPARISON WITH OTHER FPGA IMPLEMENTATIONS

Work:                        [7]   [8]   [10]   This Work (Multiplier A)
Device:                      Virtex5 Virtex Virtex5 Virtex Virtex5
W-Length:                    6 bits 6 bits 8 bits 6 bits
Latency (cycles):
Latency (µs):
Max. Freq. (MHz):            6 5
Throughput (MMatrices/sec):
Slice Reg.:                  6,99 5,89 7,8,3,085
Slice LUTs:                  0,899 9,80,609,085,67
DSP48:                       8
Max. Error:                  .e-3* 6.3e-3 5.8e-
*Note. Maximum error for [7] is obtained from a factored matrix sample; a complete error study has not been done.

Regarding performance, the only design with a throughput relatively close to ours is the one in [8]. Similarly to our proposal, it uses pipelined PEs and has practically the same latency in clock cycles. However, its lower maximum frequency means that our proposal presents about 35% less latency (in seconds) and 50% more throughput than the design in [8]. The better critical path of our design is explained mainly by the simplicity of the CORDIC architecture. On the other hand, the circuits presented in [10] and [7] have a throughput which is one order of magnitude lower than ours, since their PEs are iterative.

Regarding area, our design clearly requires several times fewer resources than the others, considering all kinds of resources. The closest one is the circuit in [10], which also uses a CORDIC-based architecture. Although it requires practically the same number of LUTs and no multipliers, its number of registers is more than three times greater. This is mainly explained by the much greater number of PEs present in that architecture and by the use of carry-save arithmetic.

Taking all these results into account, we can conclude that the architecture proposed herein presents much better throughput, and much lower resource utilization, than previously proposed works. Moreover, this throughput could be doubled by using the pipelined version of DSP48, at the cost of a moderate latency increase.

V. CONCLUSION

This brief presents a fixed-point systolic architecture to achieve high-throughput QR decomposition for small matrices. This is achieved by performing as many Givens rotations as possible in parallel in an unrolled architecture, and by using a pipelined CORDIC circuit which allows the angle computation and the row rotations to be completely overlapped. Thus, this highly pipelined circuit performs one matrix decomposition every 8 clock cycles in the 4 × 4 case. The FPGA implementation of this architecture for 4 × 4 matrices has been optimized for different word-lengths by selecting the appropriate number of CORDIC iterations. Compared with previous FPGA approaches, our proposal greatly improves both performance and resource utilization.

REFERENCES

[1] K. Sarrigeorgidis and J. Rabaey, "A scalable configurable architecture for advanced wireless communication algorithms," Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 5, no. 3, pp. 7 5, 2006.
[2] Z.-Y. Huang and P.-Y. Tsai, "Efficient implementation of QR decomposition for gigabit MIMO-OFDM systems," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 10, pp. 53 5, Oct. 2011.
[3] Y. Wu, J. McAllister, and P. Wang, "High performance real-time pre-processing for fixed-complexity sphere decoder," in 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec. 2013.
[4] S. Chan and X. Yang, "Improved approximate QR-LS algorithms for adaptive filtering," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 5, no., pp. 9 39, Jan 00.
[5] G. H. Golub and C. F. Van Loan, Matrix Computations (3rd Ed.). Baltimore, MD, USA: Johns Hopkins University Press, 1996.
[6] K. Booyi, J. Tagapaij, and A. Boopooga, "FPGA-based hardware/software implementation for MIMO wireless communications," in Electrical Engineering Congress (iEECON), 0 International, March 0, pp..
[7] S. Aslan, S. Niu, and J. Saniie, "FPGA implementation of fast QR decomposition based on Givens rotation," in 2012 IEEE 55th International Midwest Symposium on Circuits and Systems, 2012.
[8] M. Abels, T. Wiegand, and S. Paul, "Efficient FPGA implementation of a high throughput systolic array QR-decomposition algorithm," in Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), Nov. 2011.
[9] R.-H. Chang, C.-H. Lin, K.-H. Lin, C.-L. Huang, and F.-C. Chen, "Iterative QR decomposition architecture using the modified Gram-Schmidt algorithm for MIMO systems," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 57, no. 5, May 2010.
[10] D. Chen and M. Sima, "Fixed-point CORDIC-based QR decomposition by Givens rotations on FPGA," in Reconfigurable Computing and FPGAs (ReConFig), 0 International Conference on, Nov 0.
[11] G. Prabhu, B. Johnson, and J. Rai, "FPGA based scalable fixed point QRD core using dynamic partial reconfiguration," in 2015 28th International Conference on VLSI Design (VLSID), Jan. 2015.
[12] A. El-Amawy and K. Dharmaraja, "Parallel VLSI algorithm for stable inversion of dense matrices," IEE Proceedings E, Computers and Digital Techniques, vol. 36, no. 6, Nov. 1989.
[13] J. Gotze and U. Schwiegelshohn, "A square root and division free Givens rotation for solving least squares problems on systolic arrays," SIAM Journal on Scientific and Statistical Computing, vol., no., Jul 99.
[14] M. D. Ercegovac and T. Lang, Digital Arithmetic. Elsevier, 2003.
[15] J. Luo and C. Jong, "Scalable linear array architectures for matrix inversion using Bi-z CORDIC," Microelectronics Journal, vol. 3, no., pp. 53, 0.
