A New Memory Reduced Radix-4 CORDIC Processor For FFT Operation

IOSR Journal of VLSI and Sgnal Processng (IOSR-JVSP) Volume, Issue 5 (May. Jun. 013), PP 09-16 e-issn: 319 400, p-issn No. : 319 4197 www.osrjournals.org A New Memory Reduced Radx-4 CORDIC Processor For FFT Operaton Yasoda.A 1, Ramprasad.A.V 1 Department Of ECE, Vckram College Of Engneerng, Enath, Inda-630561 Professor, ECE, K.L.N College Of Engneerng, Pottapalayam, Inda -630611 Abstract: A complex number can be nterpreted as a vector n magnary plane. The vector rotaton n the x/y plane can be realzed by rotatng a vector through a seres of elementary angles. These elementary angles are chosen such that the vector rotaton through each of them may be approxmated easly wth a smple shft and add operaton, and ther algebrac sum approaches the requred rotaton angle. Ths can be exercsed by CORDIC ( CO-ordnate Rotaton Dgtal Computer) algorthm n rotaton mode. In ths paper, we have proposed a ppelned archtecture new memory less z-path elmnated CORDIC algorthm for FFT computaton. Ppelned archtecture by pre computaton of drecton of mcro rotaton, radx-4 number representaton, and the angle generator has been processed n terms of hardware complexty, teraton delay and memory reducton. Comparson of the proposed archtecture wth the conventonal radx-4 archtectures s elaborated. The proposed algorthm also exercses an addressng scheme and the assocated angle generator logc n order to elmnate the ROM usage for bottlng the twddle factors. It ncorporates parallelsm and ppe lne processng. The latency of the system s n/ clock cycles. The throughput rate s one vald result per eght clock cycles. Addtonally VLSI mplementaton on Vrtex -4 FPGA s done. The mplemented desgn operates at 450.654 MHZ of clock rate wth a power consumpton of 175.90mW. Keywords-CORDIC,, FPGA, latency, Radx-4, memory less systems, speed, throughput. I. Introducton FFT s the heart of sgnal & mage processng felds, also VLSI based sgnal processng fnd applcaton n Neural Networks[],bo-medcal sgnal processng[4],wreless communcatons[7],software Defned Rado[3], Radar mage processng, Synthetc Aperture Rado.[6].. Most of the applcatons adopt CORDIC(Co-ordnate Rotaton Dgtal Computer) based sgnal processng because of ts versatlty. Snce then CORDIC algorthm was developed [8 ], t extends ts applcaton to new areas day by day. For FFT processors, butterfly operaton s the most complcated stage. Generally butterfly unt conssts of complex adders & multplers usually multpler s the speed bottleneck n the ppelne FFT processor [11]. CORDIC [8] algorthm s an teratve algorthm to realze the butterfly operaton wthout usng any dedcated multpler hardware. CORDIC s an hardware effcent algorthm to realze transcendal functons, trgonometrc functons, multplcatons usng only smple add & shft operatons. In recent years several CORDIC based FFT was proposed for varous applcatons [1,,3,4,5]. The conventonal radx- CORDIC algorthm was mplemented usng word seral and dgt ppelned archtectures.[9] Later, ths algorthm was modfed as radx-4 to perform each mcro rotaton by two successve radx- mcro rotatons Lakshm et al. n [9], presented a ppelned archtecture usng sgned dgt arthmetc for the VLSI effcent mplementaton of rotatonal radx-4 CORDIC algorthm, elmnatng z-path completely. It acheves latency mprovement wth small hardware overhead. Garrdo [14] proposed a new memory less CORDIC algorthm to realze FFT operaton wth reduced memory requrements But mplementaton s lttle complex. Vachhan. L et al. n [19] presented two area-effcent algorthms and ther archtectures based on CORDIC. Ther frst algorthm elmnated ROM and requred only low-wdth barrel shfters and ther second algorthm elmnated barrel shfters completely. As a consequence, both the algorthms consume approxmately 50% area n comparson wth other CORDIC desgns but hardware complexty s large. In [16], a modfed CORDIC algorthm for FFT processors was ntended whch wpes out the need for bottlng the twddle factor angles. The algorthm generates the angles successvely by an accumulator. Ths approach reduces the memory requrements of an FFT processor by more than 0% [11 ]. Also, memory reducton mproves wth the ncreased radx sze. Moreover, the angle generaton crcut consumed less power consumpton than angle memory accesses. Instead of storng actual twddle factors n a ROM, CORDIC based FFT processor needs a dedcated memory bank to store actual twddle factor angles for butterfly operaton. In ths paper we have proposed a reduced memory z- path elmnated radx -4 CORDIC for FFT operaton It leads to reduced power consumpton, compared to conventonal CORDIC. www.osrjournals.org 9 Page

A New Memory Reduced Radx-4 CORDIC Processor For FFT Operaton The organzaton of the paper s as follows: In secton Fundamentals of CORDIC algorthm and ts operaton n FFT s dscussed. In secton 3 proposed methodology of z-path elmnated memory effcent CORDIC algorthm s dscussed. In secton 4 synthess results are dscussed. II. Radx-4-CORDIC-FFT algorthm N pont Dscrete Fourer Transform s defned as N 1 k X k = x W N =0 (1) where; k = 0,1,...,n 1 and W k N = e jπ N k s the twddle factor. There wll be log r N stages and each stage contans N/r butterfly operatons, n an N-pont FFT. The FFT-butterfly operaton can be realzed by radx-r cordc algorthm. CORDIC algorthm was ntally proposed by J.E.Volder 8[ ]. Because the CORDIC-based butterfly can be fast. The foundaton for CORDIC algorthm s a D vector rotaton n the xy-plane (see Fg.1 ). The vector rotaton can be realzed by rotatng a vector through a seres of elementary angles. These elementary angles are chosen such that the vector rotaton through each of them may be approxmated easly wth a smple shft and add operatons, and ther algebrac sum approaches the requred rotaton angle [8]. The CORDIC algorthm can be exercsed n two dfferent modes, vz. Rotaton mode and Vectorng mode. Y v +1 = (X +1, Y +1 ) Y +1 Y v = (X, Y ) Fg (1) A D vector rotaton usng CORDIC The rotaton mode s commtted to practce general rotaton for the gven angle and to compute elementary operatons such as multplcaton, exponental, trgonometrc functons and hyperbolc functons countng on the coordnate system. The vectorng mode s utlzed to defne the angular argument of the orgnal vector and estmate the logarthmc and dvson functons.the standard CORDIC system n crcular coordnate system practces n teratons for n-bt precson usng radx-r number representaton. For a gven n-bt precson, the ncrease of radx-rcuts down the number of mcro rotatons. For example, the radx-4 CORDIC system acheves latency reducton whch s an extenson of radx- CORDIC algorthm.[ 9 ] Usng radx-4 we acheve half the number of mcro rotatons than that requred n radx-. The teraton equatons of the radx-4 CORDIC algorthm at the(+1) th step [ ] under crcular coordnate system actng n rotaton mode are as follows: X 1 X * Y * 4 (a) Y 1 Y * X * 4 (b) tan 1 Z 1 Z ( *4 ) (c) Where, the drecton of rotaton n each teraton s σ ϵ, 1, 0, 1,, and an elementary angle tan 1 σ 4. The realzaton of Z+1 n Eq. (c) ncreases the length of the vector. To up hold the expected value of the vector, (+1) th coordnates must be multpled by the scale factor. The scale factor depends on the values ofσ and 1 1 must be estmated for each rotaton angle. K θ α X +1 K 1 * 4 (3) 0 0 www.osrjournals.org 1 / III. Proposed Methodology: Radx -4 CORDIC for ppelned FFT operaton, wth σ predcton block s proposed to reduce teraton delay, elmnate the z-path and also to cut down the memory requred to store the twddle factor angle. The number of teratons s reduced by usng radx-4 number system and teraton delay s reduced by radx-4 sgned dgt arthmetc and pre computaton of drecton of mcro rotaton. The radx-4 takes more tme to select from among the fve rotaton drecton values, and to select an approprate angle out of fve elementary angles, whch ncreases the teraton delay compared to the radx-. Nevertheless, the proposed CORDIC archtecture n rotaton mode s desgned to overcome ths ssue by pre complng all the drecton f rotatons for the generated X X 10 Page

A New Memory Reduced Radx-4 CORDIC Processor For FFT Operaton nput angle utlzng the lnear relaton between the rotaton angle and the framed bnary representaton of drectons of mcro rotatons [9]. The proposed archtecture comprses of four basc buldng blocks vz., the angle generator, the x/y path, the σ-predcton block, and the scale factor computaton blocks shown n Fg. The angle generator block produces the sequence of angles n regular and ncreasng order wth the help of an accumulator and control logc. Theσ-predcton block predcts all the σ values n radx-4 nterpretaton for the generated target angleθ, pror to the true CORDIC rotaton starts teratvely n the x/y path. Durng every clock cycle, the σ values are buffered and sent sequentally to the correspondng stages of x/y path. Angle generator Accumulator ɵ X n Y n σ- predcton block Scale factor X/Y path X +1 Y +1 Fg Block schematc of the proposed system 1) θ-(angle) generator block:- The key operaton of FFT s as shown n Eq. (1), X k = Rotate x() byan angle( j π N N 1 k =0 x W N.Ths s equvalent to k), whch can be realzed very well by the CORDIC algorthm wth smple shft and add operatons, CORDIC-based butterfly can be fast. As we know that, an FFT processor needs to store the twddle factors n memory. But, the CORDIC-based FFT doesn t have twddle factors but needs a memory bank to store the rotaton angles. CLK π N L A T C H Generated Angle (θ) Accumulator Control sgnal Fg 3 θ- generator of the proposed system. Usng a addressng scheme presented n [15], the twddle factor angles generated by a smple accumulator, follow a regular and ncreasng order. For example 64 -pont radx-4 FFT, t can be seen that twddle factor angles are sequentally ncreasng, and every angle s a multple of the basc angle( π N ), whch s( π 3 ). For log 4 64 = 3 dfferent FFT stages, the angles ncrease always one step per clock cycle. Ths functon can be realzed by the angle generator crcut whch s composed of an accumulator, and an output latch, as shown n Fgure 3. The accumulator output depends on the Control sgnals those enable or dsable the latch whch s smple to realze, as t s based on the present FFT butterfly stage and RAM address bts www.osrjournals.org 11 Page

A New Memory Reduced Radx-4 CORDIC Processor For FFT Operaton b n b b 1 b 0 (see Table 1). Usually, for an N = n pont FFT, an n 1 bt butterfly counter- B(b n b n 3 b 1 b 0 )wll satsfy the address sequences and the control logc of the angle generator. At each stage s, the RAM address bts are rotated rght by s -bts of butterfly counter B(b s 1 b s b 1 b 0 b n b n 3 b s )and the control logc of the latch s drven by the sequence of the pattern; b n b n 3... b s 0... 0 (s 0 s). )X/Y Path of the proposed Archtecture:- The basc structure of proposed desgn for radx-4 CORDIC processor s shown n Fg.4.The proposed archtecture requres n = 8 mcrorotaton stages to mplement x/y coordnates. The fnal x coordnates (see y Fg.4 ) are scaled usng the scale factor computed n parallel wth the mcro rotatons, whch requres Address Generaton table of the proposed desgn for radx 8 pont FFT Table-1 Butterfly Stage 0 Stage 1 Stage counter B b1b0 RAM address bts Twddle factor angles RAM address bts Twddle factor angles RAM address bts b1b0 b0b1 00 00 0 01 0 00 0 01 01 /4 10 0 01 0 10 10 /4 01 /4 10 0 11 11 3 /4 11 /4 11 0 b1b0 Twddle factor angles adder/subtractor, : 1-multplexer and an AND gate ( Fg.4 b). Then the RBC converts the sgned dgt representaton of the Radx-4 nto bnary form. Each mcro rotaton stage s desgned by realzng the teraton equatons, X X * Y * 4 Y 1 1 Y * X * 4 The Sgn Dgt adder/subtractor s controlled by the sgn of σ value (see Fg. 4b ). Ths σ values are realzed by the predcton block (see fg.3 ). X/Y Hardware realzaton: The x/y block of the proposed system for each stage s composed by combnaton of a par of sgned bnary adders, par of regsters at nput and output and a par of rght-shft regsters(see fg. ). For the generated angle θ, the nput vector v = (X, Y )s rotated to an output v +1 = (X +1, Y +1 )wth σ a predetermned drecton vector, whch assumes the drecton of rotaton. 3) σ Predcton block: The σ -predcton block requres ppelned mplementaton of to compute the drecton of mcrorotatons for the gven (generated by the Angle generator) target angle. Realzng, a lnear relaton between the rotaton angle and the bnary representaton of drectons of mcro rotatons [ ] gven by d = 0.5 θ + 0.5 c 1 + δ (4) where; δ = c 1 = n 3 =0 d ϵ =0 tan 1, and ϵ = tan 1 www.osrjournals.org 1 Page

A New Memory Reduced Radx-4 CORDIC Processor For FFT Operaton X n Y n σ 0 X/Y(0) σ, 1,0,1, σ 1 X/Y(1) Scale factor σ 7 X Y X/Y(7) σ 4 σ 4 K 1 +/- σ +/- (K 1 *X &K 1 *Y)->RBC X +1 Y +1 Fg 4. a) X/Y path of the system, b) realzaton of X/Y path s utlsed to perform the drecton of the vector. The drecton of mcro rotatons n bnary form s represented by the varable d (d θ) and θ has a lnear relatonshp wth negatve slope beng observed except for someθvalue, called breakponts [1, 7]. There by computng d from θ usng smple addton operaton as shown n the Fg.. The detals of the varous modules n the σ-precomputaton block are as follows. For 16-bt precson, a(3 * 48) ROM s selected to store these offset values [ ]. The offset value δ s pre-computed for any nput angle n the range( π/, π/) and the angle representng each of theseδ s referred to as θ ref, whch concdes to the summaton of the frst n/3 mcro rotatons. Therefore, the target angle has to be greater thanθ ref. Each locaton of ROM holds a reference angle and two offset values(δ k, δ k 1 ). δ k corresponds to the reference angle and δ k 1 correspondsto the prevous reference angle. For the gven nput angle,the offset value s obtaned from the ROM usng :1 multplexer. Output of ths multplexer s controlled by the comparator, whch generates sgn sgnal 0 f (θ > θ ref for selectng δ k )else 1 f(θ < θ ref for selectng δ k 1 ). The radx-4 arthmetc representaton of 0.5c 1 s added to s complement representaton of 0.5θ and δ rom usng two adders to obtanσ ϵ {, 1, 0, 1, }. The delay of each adder s tfa, where tfa s one full adder delay. These σ values are stored n a buffer to transfer them to the correspondng stages of x/y path. Thus, the drecton of rotatons s predcted before the x/y path ntates the computaton of x/y coordnates and followng nto elmnaton of the z path. 4 ) Scale factor The scale factor of the radx- CORDIC rotator s a constant but n case of radx-4 CORDIC rotator t sn t, ths s observable from the equaton below K 1 = n/4 =0 1 1 + σ 4 where, σ ϵ{, 1, 0, 1, }. Here, we compute K 1 only for frst + 1 = 5 teraton, as k 4 1 = becomes unty thereafter.for the present 16-bt mplementaton,t s adequate to approxmate scale 1+4 X out Y out factor for the frst 5- teratons only. Scale factor approxmaton s accomplshed n parallel wth the mcro n (5) 1 www.osrjournals.org 13 Page

A New Memory Reduced Radx-4 CORDIC Processor For FFT Operaton rotatons by bottlng all possble scale factors n memory [1, 4]. From the expresson above the σ wll have three values {0, 1, 4}.And so σ can choose any one of the three values {0, 1, 4} n each teraton, thereforefor 16-bt precson actually t s requred to store 35 possble scale factors n ROM. But, we can cut down the sze of the ROM to (7*16)usng Taylor seres expanson of the scale factor. For n + 1 wth 16-bt precson, 8 the scale factors correspondng to the frst 3-teratons are obtaned from(7*16) ROM. The 7 possble values for the scale factor K 1 correspondng to the frst three teratons are computed as k 1 1 = (6) 1+σ 4 The scale factor of the radx- CORDIC rotator s a constant but n case of radx-4 CORDIC rotator t sn t, ths s observable from the equaton below K 1 = n/4 1 =0 (7) 1+σ 4 where, σ ϵ{, 1, 0, 1, }. Here, we compute K 1 only for frst + 1 = 5 teraton, as k 4 1 = becomes unty thereafter.for the present 16-bt mplementaton,t s adequate to approxmate scale 1+4 factor for the frst 5- teratons only. Scale factor approxmaton s accomplshed n parallel wth the mcro rotatons by bottlng all possble scale factors n memory [1, 4]. From the expresson above the σ wll have three values {0, 1, 4}.And so σ can choose any one of the three values {0, 1, 4} n each teraton, thereforefor 16-bt precsonactually t s requred to store 35 possble scale factors n ROM. But, we can cut down the sze of the ROM to (7*16) usng Taylor seres expanson of the scale factor. For n + 1 wth 16-bt precson, the 8 scale factors correspondng to the frst 3-teratons are obtaned from (7*16) ROM. The 7 possble values for the scale factor K 1 correspondng to the frst three teratons are computed as k 1 1 = (8) 1+σ 4 n 1 K 1 = 1 =0 k (9) where, k 1 s the scale factor for the th teraton.usng these values, the scale factors for the remanng two teratons are evaluated usng a combnatonal logc as shown n Fg.7. The scale factors for thek 1 3 andk 1 4 (remanng two teratons) are derved by executng the shft and subtracton operatons over the scale factor retreved from ROM. Ths s carred out by realzng the frst two terms of ther Taylor seres expansons. And then, the scale factor s coded usng the canoncal sgned dgt representaton to carry out the fnal multplcaton over the x coordnates. y IV. Results & Dscusson: The algorthmc verfcaton of the CORDIC algorthm s done wth mat lab tool.(verson 7.1 R011a). The new z- path elmnated memory less CORDIC arc htecture has been mplemented on the avalable xc4vfx1-1 sf363 of the Xlnx FPGA (ISE13.4). The proposed archtecture s done n Verlog. The number clock cycles requred for computaton of ths new archtecture s n/ clock cycles wth accuracy of 3 output bt precson. The latency of the system s n/ clock cycles. The throughput rate s one vald result per eght clock cycles. The total power consumpton of the new proposed mplemented archtecture s measured wth X power software (ISE 13.4). The mplemented desgn operates at 450.654 MHZ of clock rate wth a power consumpton of 175.90mW. The resource utlzaton of FPGA mplementaton of new memory reduced Z- path elmnated Radx-4 CORDIC processor s shown below and t s performance comparson s shown n Table[ ] www.osrjournals.org 14 Page

A New Memory Reduced Radx-4 CORDIC Processor For FFT Operaton θ 0.5C1 1 0.5θ R O M θ ref θ Comparator δ k MUX δ k 1 sgn σ Buffer Fg.5 Realzaton of drecton of rotaton for an angle usng σ predcton Selected Devce : xc4vfx1 sf363-1 Number of Slces: 50 out of 547 9% Number of Slce Flp Flops: 936 out of 10944 8% Number of 4 nput LUTs: 899 out of 10944 8% Number used as logc: 494 Number used as Shft regsters: 405 Number of IOs: 65 Number of bonded IOBs: 63 out of 40 6% Number of GCLKs: 1 out of 3 3% Performance comparson wth [17 ] &[ 18] Parameter Ths work [17 ] [18] Target Xc4vfx1- XCV1000-6 xc3s1500 Devce sf363 BG560. 4fg676 Slces 50 797 984 LUT s 899 1554 1907 Power 175.90mW 380mW Not cted www.osrjournals.org 15 Page

A New Memory Reduced Radx-4 CORDIC Processor For FFT Operaton VI. Concluson: A new memory less Z path elmnated Radx-4 CORDIC archtecture s presented n ths paper. X and Y path ppelned and parallelsm computed work was done, also elmnates the need for storng angles to avod bottlng the twddle factor operaton n FFT.A 16 bt new radx-4 memory less Z-path elmnated CORDIC archtecture s done on FPGA platform(vrtex 4 ).The latency of the proposed archtecture s n/ clock cycles wth through put rate of one vald result per eght clock cycles. The entre new proposed archtecture operates at the clock rate of 450.654 MHZ wth power consumpton of 175.90mW.The desgned speed power optmzed 16 bt memory less Z-path elmnated Radx-4 can be applcable for FFT operaton. References: [1] Sathsh Sharma, Implementaton of ParaCORDIC Algorthm and t s Applcatons n Satellte Communcaton,009. Internatonal conference n Advances n Recent n Technologes n communcaton & computng [] Meng Qan, Applcatons of CORDIC Algorthm to Neural Networks n VLSI Desgn,IMACS 006,Bejng, Chna [3] Javer Valls, The use of CORDIC n Software Defned Rados: A Tutoral, IEEE Communcatons Magazne, Sep 006 [4] Ayan Benerjee, Swapna Banerjee, FPGA realzaton of a CORDIC based FFT processor for bomedcal sgnal processng, Mcroprocessors and Mcrosystems, Feb 011. [5].Bu-chn wang Dgtal sgnal processng Technques and Applcatons n Radar Image processng. [6] Bum Skkm, Low power ppelned FFT archtecture for Synthetc Aperture Radar, IEEE 39 th Mdwest Symposum, Crcuts & Systems 1996. [7] Sadat A, FFT for hgh speed OFDM wreless multmeda system, IEEE Crcuts & Systems 001, MWSAS 001 [8] Volder, J. (1959). The CORDIC trgonometrc computng technque., IEEE Transactons on Electronc Computers [9] B. Lakshm, A.S. Dhar VLSI archtecture for low latency radx-4 CORDIC, Elsever Computers and Electrcal Engneerng, July 011 [10] francsco j. jame,enhanced scalng-free cordc, eee transactons on crcuts and systems : regular papers, vol. 57, no. 7, july 010 [11] Erdal Oruklu & Xn Xao & Jafar Sane, Reduced memory low power archtecture for CORDIC based FFT processors. J Sgn Process Syst (01) [1] Ln, C., & Wu, A. (005). Mxed-scalng rotaton CORDIC (MSRCORDIC) algorthm and archtecture for hgh- performance vector rotatonal DSP applcatons. IEE Transactons on Crcuts and Systems I, 5(011). [13] Jang, R. M. (007). An area-effcent FFT archtecture for OFDM dgtal vdeo broadcastng. IEEE Transactons on ConsumerElectroncs, 53(4), 13 136. [14] Garrdo, M., & Grajal, J. (007). Effcent memory-less CORDIC for FFT Computaton. In IEEE Internatonal Conference on Acoustcs, Speech and Sgnal Processng,, 113 116), Apr. [15] Xao, X., Oruklu, E., & Sane, J. (009). Fast memory addressng scheme for radx-4 FFT mplementaton. In IEEE InternatonalConference on Electro/Informaton Technology, EIT 009, [16] Xao, X., Oruklu, E., & Sane, J. (010) Reduced Memory Archtecture for CORDIC- based FFT. In IEEE Internatonal Symposum on Crcuts and Systems, [17] Anndya Sundar Dhar, Swapna Banerjee, Archtectural desgn and FPGA mplementaton of radx-4 CORDIC processor, Elsever, Mcroprocessors and Mcrosystems 34 (010) [18] M.G. Buddka Sumanasena, A scale factor correcton scheme for the CORDIC algorthm, IEEE Transactons on Computers 57 (8) (008) [19] Vachhan, L., Srdharan, K., Meher, P.K., "Effcent CORDIC Algorthms and Archtectures for Low Area and Hgh Throughput Implementaton," Crcuts and Systems II: Express Brefs, IEEE Transactons on Jan. 009,Volume: 56, Issue: 1 Page(s): 61-65 [0] Kuhlmann M, Parh KK. P-CORDIC: A precomputaton based rotaton CORDIC algorthm. EURASIP J Appl Sgnal Process 00;9:936 43. www.osrjournals.org 16 Page