Generation of Distributed Arithmetic Designs for Reconfigurable Applications


Christophe Bobda, Ali Ahmadinia, Jürgen Teich
University of Erlangen-Nuremberg, Department of Computer Science
Am Weichselgarten 3, Erlangen, Germany

Klaus Danne
University of Paderborn, Heinz Nixdorf Institute
Fürstenallee 11, Paderborn, Germany

Abstract: We present a tool for the design and implementation of reconfigurable computing applications based on distributed arithmetic. The tool lets the user investigate different tradeoffs, such as area versus speed, for a design. After simulation of the design, synthesizable HDL code for a reconfigurable platform can be generated. Besides the existing fixed-point solutions for real numbers, we present a new approach to handling real numbers in the IEEE 754 floating-point format. The tool is used in the implementation of two applications. The first is a recursive convolution algorithm for time domain simulation of multimode intrasystem interconnects, and the second is an adaptive mechatronic multi-controller system.

1 Introduction

Distributed arithmetic (DA) [Wh89, Mi00, Xi95, Xi00a] is usually defined as computation using look-up tables. The main application of DA is the dot-product of two vectors where one of the two vectors is constant (i.e., all its elements are constant values). In this case, all the additions involving at least one element of the constant vector are precomputed and stored in a look-up table. At run time, the elements of the variable vector are used to address the look-up table and retrieve partial sums in a bit-serial (1-BAAT, 1 Bit At A Time) manner.

One of the notable contributions to DA was made by White [Wh89], who proposed the use of ROMs to store the precomputed values. The surrounding logic to access the ROM and retrieve the partial sums had to be implemented on a separate chip. Because of this impractical two-chip architecture, the DA method could not be used successfully.
With the appearance of SRAM-based FPGAs, DA became an interesting alternative for implementing signal processing applications in FPGAs [Mi00, Xi95, Xi00a]. Because SRAM is available in those FPGAs, the precomputed values can now be stored on the same chip as the surrounding logic.

Although DA is used for many applications, users still have to design their systems and investigate the different tradeoffs by hand. This process is not always easy and can be time consuming. In addition, fixed-point format is normally used to represent real numbers, which results in a loss of accuracy as well as a limited number range. We have developed a framework to help designers develop signal processing applications using DA. Moreover, we are able to handle real numbers in the IEEE 754 floating-point format.

The rest of the paper is organized as follows: Section 2 provides the basics of distributed arithmetic as well as our method for handling real numbers. In Section 3 our tool, the DA generator, is presented, while Section 4 shows how DA can efficiently benefit from partial reconfiguration. In Section 5 we present the application of our framework to the generation and evaluation of code for two signal processing applications. Section 6 concludes the paper and gives some indications for future work.

2 Distributed Arithmetic

The idea behind distributed arithmetic is to distribute the bits of one operand across the equation to be computed, so as to obtain a new equation that can be computed more efficiently. For the dot-product, we are given the following equation:

$$Z = A \cdot X = \sum_{i=1}^{n} A_i X_i \qquad (1)$$

The vector A is a constant vector of dimension n, while X is a variable vector of the same dimension. With the binary representation $X_i = \sum_{j=0}^{w-1} X_{ij} 2^j$ of the $X_i$ (where w is the width of the variables and $X_{ij} \in \{0, 1\}$ is the j-th bit of $X_i$), equation (1) can be written as:

$$Z = \sum_{i=1}^{n} A_i \sum_{j=0}^{w-1} X_{ij} 2^j = \sum_{j=0}^{w-1} 2^j \sum_{i=1}^{n} X_{ij} A_i \qquad (2)$$

Equation (2) represents the general form of distributed arithmetic, since each bit of each variable operand contributes only once to the sum $\sum_{i=1}^{n} X_{ij} A_i$. Because $X_{ij} \in \{0, 1\}$, the number of possible values that $\sum_{i=1}^{n} X_{ij} A_i$ can take is limited to $2^n$. Therefore, these values can be precomputed and stored in a look-up table. We call such a look-up table a distributed arithmetic look-up table (DALUT). The DALUT has a size of $w \times 2^n$ bits.
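The construction of the DALUT can be sketched in Python as follows. This is a minimal illustration and not the tool's implementation: the function name `build_dalut`, the 0-based indexing (bit i of the address selects constant i) and the example constants are assumptions made for this sketch.

```python
def build_dalut(constants):
    """Return the 2**n partial sums of the constant vector.

    Entry `addr` holds the sum of constants[i] over every i whose bit is
    set in `addr`, so address bit i plays the role of the bit X_ij.
    """
    n = len(constants)
    table = [0] * (1 << n)
    for addr in range(1, 1 << n):
        low = addr & -addr              # lowest set bit of the address
        i = low.bit_length() - 1        # index of the constant it selects
        # Reuse the entry without that bit and add one constant.
        table[addr] = table[addr ^ low] + constants[i]
    return table


# Illustrative constants only (not taken from the paper).
dalut = build_dalut([3, 5, 7])
print(dalut)  # [0, 3, 5, 8, 7, 10, 12, 15]
```

Each entry is derived from a previously computed entry with one addition, so filling the table costs $2^n - 1$ additions.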
The n-tuple $(X_{1j}, X_{2j}, \ldots, X_{nj})$ is then used to address the DALUT and retrieve the value $\sum_{i=1}^{n} X_{ij} A_i$. The complete dot-product Z requires w steps to be computed in a 1-BAAT manner. At step j, the j-th bit of each variable is used to address the DALUT and retrieve the value $\sum_{i=1}^{n} X_{ij} A_i$. This value is then left-shifted by j positions (which corresponds to a multiplication by $2^j$) and accumulated. After w accumulations, the value of the dot-product can be collected. The resulting DA datapath is illustrated in Figure 1.

Figure 1: The DALUT dot-product computation

Many enhancements can be made to a DA implementation. The size of the DALUT, for example, can be halved if only positive values are stored. In this case, the first bit of a number, which represents the sign, is used to decide whether the retrieved value should be added to or subtracted from the accumulated sum. Furthermore, all the bit operations are independent of each other and can therefore be done in parallel. The degree of parallelism depends on the memory available to implement the DALUTs. If w DALUTs can be instantiated in parallel, the complete dot-product can be computed in only one step. In general, if k DALUTs are instantiated in parallel, then w/k steps are required for the complete computation. In this case the computation is done on a k-BAAT basis.

2.1 High Dimensional Distributed Arithmetic

In many areas, for example in mechanical control, computations are not limited to dot-products. Matrix operations are used, as shown in equation (3).

$$\begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_s \end{pmatrix} = \begin{pmatrix} a_{11} & \cdots & a_{1r} \\ \vdots & \ddots & \vdots \\ a_{s1} & \cdots & a_{sr} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_r \end{pmatrix} \qquad (3)$$

Equation (3) can be implemented using s DALUTs. The i-th DALUT ($i \in \{1, \ldots, s\}$) is used for the dot-product $z_i = \sum_{j=1}^{r} x_j a_{ij}$ using the constants $a_{i1}$ to $a_{ir}$. If there is enough space on the chip to hold all the DALUTs and the resulting adder tree, equation (3) can be computed in just one clock cycle. If there is not enough space to hold the adder tree but enough memory to hold the DALUTs, then the computation has to be done sequentially. As we will see later, partial reconfiguration can also be used to implement many dot-products if there is not enough memory on the chip to hold all the DALUTs.
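The shift-and-accumulate scheme and the effect of k parallel DALUTs can be sketched as follows. This is an illustrative software model under stated assumptions: the variables are unsigned w-bit integers (the sign handling mentioned above is left out), the names `build_dalut` and `da_dot` are hypothetical, and the step count is rounded up when k does not divide w.

```python
def build_dalut(constants):
    """Table of all 2**n partial sums; address bit i selects constants[i]."""
    n = len(constants)
    return [sum(c for i, c in enumerate(constants) if (addr >> i) & 1)
            for addr in range(1 << n)]


def da_dot(constants, xs, w, k=1):
    """Dot product of `constants` with unsigned w-bit `xs`, k bits at a time.

    Returns (result, steps). One step models k DALUT copies being read in
    parallel, each copy handling one bit position.
    """
    dalut = build_dalut(constants)
    acc = 0
    steps = 0
    for base in range(0, w, k):
        for j in range(base, min(base + k, w)):
            # Gather bit j of every variable into one DALUT address.
            addr = 0
            for i, x in enumerate(xs):
                addr |= ((x >> j) & 1) << i
            acc += dalut[addr] << j     # shift by j, then accumulate
        steps += 1
    return acc, steps


# Illustrative values: 3*2 + 5*9 + 7*4 = 79
print(da_dot([3, 5, 7], [2, 9, 4], w=4, k=1))  # (79, 4)
print(da_dot([3, 5, 7], [2, 9, 4], w=4, k=4))  # (79, 1)
```

For the matrix case of equation (3), the same function would simply be called once per row of constants, which corresponds to one DALUT per output $z_i$.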

2.2 Floating-point Distributed Arithmetic

The straightforward approach to handling real numbers in distributed arithmetic is to use fixed-point, which does not differ from the integer implementation. However, the range and the precision of a fixed-point representation are small compared to those of a floating-point representation. Therefore we would like to handle real numbers as floating-point. In this section we present our concept for handling real numbers as floating-point numbers in the IEEE 754 format in distributed arithmetic. In IEEE 754 format, a number X is represented as follows:

$$X = (-1)^{S} \cdot 2^{e} \cdot 1.m \qquad (4)$$

where e is the exponent (we assume that the bias has already been subtracted and that the result is e), m is the mantissa and S is the sign of X. Without loss of generality, we consider the sign to be part of the mantissa; the new representation is thus $X = 2^{e} m$. With A, X and Z all being floating-point numbers, the floating-point counterpart of equation (1) is given by:

$$Z = \sum_{i=1}^{n} A_i X_i = \sum_{i=1}^{n} \left(2^{e_{A_i}} m_{A_i}\right) \left(2^{e_{X_i}} m_{X_i}\right) = \sum_{i=1}^{n} 2^{e_{A_i} + e_{X_i}} \, m_{A_i} m_{X_i} \qquad (5)$$

Our goal is to compute and provide Z as a floating-point number. Therefore we would like to read, at each step of the computation, a floating-point value $F_i$ from the floating-point DALUT and add it to an accumulated sum, which is also a floating-point number. Since the adder used at this stage is a floating-point adder, issues like rounding and normalization are handled in its implementation. From the last part of equation (5), it is obvious that the value $2^{e_{A_i} + e_{X_i}} m_{A_i} m_{X_i}$ represents a floating-point number with exponent $e_{A_i} + e_{X_i}$ and mantissa $m_{A_i} m_{X_i}$. By setting $e_{F_i} = e_{A_i} + e_{X_i}$ and $m_{F_i} = m_{A_i} m_{X_i}$, we have the requested values at each computation step.

Instead of computing the exponential part $2^{e_{A_i} + e_{X_i}}$ of $F_i$ as well as its mantissa $m_{A_i} m_{X_i}$ online, our approach consists of using two floating-point DALUTs for each constant $A_i$. We precompute and save the values $e_{A_i} + e_{X_i}$ in the first DALUT and $m_{A_i} m_{X_i}$ in the second one. We call the first DALUT, which stores the exponents, the EDALUT and the second DALUT, which stores the mantissas, the MDALUT.
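The two-table scheme can be sketched in Python with a deliberately simplified number format. Everything concrete here is an assumption for illustration: values are modelled as m * 2**e with small unsigned integer fields, the widths E = 3 and M = 4 are toy values, and the sign, the bias, the hidden bit and normalization are left out. The call to `math.ldexp` followed by a float addition merely stands in for the floating-point adder.

```python
import math

E, M = 3, 4   # toy exponent and mantissa widths (assumed, not the paper's)


def build_fp_daluts(e_a, m_a):
    """Tables for one constant A = m_a * 2**e_a.

    EDALUT is addressed by the exponent of X and returns e_A + e_X.
    MDALUT is addressed by the mantissa of X and returns m_A * m_X.
    """
    edalut = [e_a + e_x for e_x in range(1 << E)]
    mdalut = [m_a * m_x for m_x in range(1 << M)]
    return edalut, mdalut


def fp_da_dot(consts, xs):
    """Dot product where consts and xs are lists of (exponent, mantissa)."""
    tables = [build_fp_daluts(e_a, m_a) for e_a, m_a in consts]  # precomputed
    acc = 0.0
    for (edalut, mdalut), (e_x, m_x) in zip(tables, xs):
        e_f = edalut[e_x]               # exponent of F_i from the EDALUT
        m_f = mdalut[m_x]               # mantissa of F_i from the MDALUT
        acc += math.ldexp(m_f, e_f)     # F_i = m_f * 2**e_f, then accumulate
    return acc


# A = (3*2**1, 5*2**2) = (6, 20), X = (7*2**0, 2*2**3) = (7, 16)
print(fp_da_dot([(1, 3), (2, 5)], [(0, 7), (3, 2)]))  # 362.0
```

Note that no multiplier and no exponent adder appear at run time: both are replaced by table reads, which is the point of the approach.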
The size of the EDALUT, $size_{EDALUT}$, and that of the MDALUT, $size_{MDALUT}$, are defined in equations (6) and (7):

$$size_{EDALUT} = E \times 2^{E} \text{ bits} \qquad (6)$$

$$size_{MDALUT} = 2^{M} \times M \text{ bits} \qquad (7)$$

E is the exponent width and M is the mantissa width of the floating-point representation. The main argument against our approach could be that the DALUTs used for this floating-point DA implementation will be too big and that the method therefore cannot be implemented. But if we consider a DA implementation involving five variables and five coefficients represented in a floating-point format with an 8-bit exponent and a 10-bit mantissa, the total memory requirement for the EDALUTs and the MDALUTs, by equations (6) and (7), is $5 \times (8 \times 2^{8} + 2^{10} \times 10)$ bits = 60 Kbits. Therefore the EDALUTs and the MDALUTs will easily fit into the smallest low-cost FPGA from Xilinx (the Spartan III, 50), which has 72 Kbits of Block RAM [Xi00b]. Our approach is therefore suitable for FPGA implementation.

Given the EDALUTs and the MDALUTs for each coefficient, the datapath is not implemented as in the fixed-point or integer case. The variables are no longer input in a bit-serial way. At step i, the variable $X_i$ is used to address the i-th EDALUT and the i-th MDALUT. The bits of $e_{X_i}$ are used to access the EDALUT while, in parallel, the bits of $m_{X_i}$ are used to access the MDALUT. The values collected from the EDALUT and the MDALUT are used to build the floating-point number $F_i$. After n steps the floating-point dot-product is computed. Figure 2 shows the datapath for the floating-point DA. Since all the DALUTs for the coefficients are available on the device, they can also be accessed in parallel to compute the dot-product in only one step.

Figure 2: The DALUT dot-product computation

3 The DA Generator

We have developed a tool to help the user in the development of DA-based signal processing applications. Because those applications are usually based on the dot-product of one vector of variables with a vector of constants, they are ideal candidates for a DA implementation in reconfigurable hardware. In our tool, the user can investigate the different tradeoffs for an application. For a given device and a given set of constants, we generate different DA implementations, from the full adder tree (w-BAAT), which can be computed in only one step, to the fully sequential DA (1-BAAT). For each possibility, the user is provided with the area and speed of the design. Real numbers can be handled either as fixed-point or as floating-point in the IEEE 754 format with the technique defined above. The width of the mantissa and that of the exponent have to be provided. As hardware description language we use Handel-C.

4 Use of Reconfiguration

The possibility of exchanging mechatronic controllers in a running system using an FPGA has been presented in [DBK03]. That work presents an adaptive mechatronic system consisting of a monitoring part, which remains fixed (FM) all the time, and two reconfigurable controllers (CM1, CM2). At each point in time, one of the two controllers must be active to control the plant. Depending on environmental factors, the monitoring part can use partial reconfiguration to place a new controller in the predefined slot occupied by the inactive module. Control of the plant is then switched from the active controller to the newly loaded one. In this way, the system remains active while adapting itself to its environment. The controller codes are implemented and stored as partial bitstreams in a ROM and can be downloaded into the corresponding slot by the monitoring module.

Substantial effort was required to implement the system on an FPGA. Because the routing tools can route the connection between the fixed module and the reconfigurable controller in different ways across different configurations, the system may not work properly if partial reconfiguration is done to replace an old controller. To avoid this, fixed communication channels had to be defined to allow the fixed monitoring module to communicate with the reconfigurable parts of the design. This process is difficult to automate with the current tools. Therefore most of the work has to be done by hand.

In general, reconfiguration of a DA design can be done either by changing the DALUT contents or by changing the datapath from an x-BAAT to a y-BAAT implementation.
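The first of these two options, changing only the DALUT contents, can be pictured with a small software model. The class `DADatapath` and its methods are hypothetical names introduced for this sketch; it models a fixed 1-BAAT control loop over unsigned variables and says nothing about how the memory is actually rewritten on a device.

```python
class DADatapath:
    """Fixed bit-serial control part with a rewritable DALUT (toy model)."""

    def __init__(self, n, w):
        self.n, self.w = n, w           # number of variables, word width
        self.dalut = [0] * (1 << n)

    def load_constants(self, constants):
        """Model of reconfiguration: only the table contents change."""
        assert len(constants) == self.n
        self.dalut = [
            sum(c for i, c in enumerate(constants) if (addr >> i) & 1)
            for addr in range(1 << self.n)
        ]

    def compute(self, xs):
        """The control loop; identical for every set of constants."""
        acc = 0
        for j in range(self.w):
            addr = 0
            for i, x in enumerate(xs):
                addr |= ((x >> j) & 1) << i
            acc += self.dalut[addr] << j
        return acc


dp = DADatapath(n=3, w=4)
dp.load_constants([1, 2, 3])            # first controller (illustrative)
print(dp.compute([4, 5, 6]))            # 32
dp.load_constants([2, 0, 1])            # swap in a second set of constants
print(dp.compute([4, 5, 6]))            # 14
```

In this model `compute` never changes between the two runs; only the table does, which mirrors the observation that the control part can stay fixed for a given number of variables.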
The adaptive mechatronic system described above can easily be implemented using distributed arithmetic. Moreover, for an x-BAAT datapath, the control part is the same for all DA designs with a given number of variables, so it does not need to be changed on reconfiguration. The only thing we need to do is fill the memory with the correct combinations of the constant values and restart the computation with the fixed control part. This process is easy to automate and can be integrated into CAD tools.

5 Applications

Our tool was used in the design of two signal processing applications with different tradeoffs. The first application is the recursive convolution algorithm for time domain simulation of optical multimode intrasystem interconnects, and the second is the implementation of the adaptive mechatronic controller described above. The two applications are described below.

5.1 Recursive convolution algorithm for time domain simulation of optical multimode intrasystem interconnects

In general, an optical intrasystem interconnect contains several receivers with optical inputs driven by transmitters with optical outputs. The interconnections between transmitters and receivers are made by a passive optical waveguide, which is represented as a multiport (Figure 3) using a ray tracing approach. The transfer of an optical signal along the waveguide can be computed by a multiple convolution process.

Figure 3: An optical multimode waveguide is represented by a multiport with several transfer paths.

Frequency domain simulation methods are not applicable because of the high number of frequencies involved. Pure time domain simulation methods are more efficient if the pulse responses can be represented by exponential functions. Applying a recursive method for the convolution of the optical stimulus signals at the input ports with the corresponding pulse responses enables a time-efficient computation of the optical response signals at the associated output ports. The recursive formula to be implemented in three different intervals is given by equation (8).

$$y(t) = f_0 \, y(t-1) + f_4 \, x_0 + f_5 \, x_1 + f_{24} \, x_2 + f_{53} \, x_3 \qquad (8)$$

$f_0$, $f_4$, $f_5$, $f_{24}$ and $f_{53}$ are constants, while $y(t-1)$, $x_0$, $x_1$, $x_2$ and $x_3$ are variables. Therefore DA can be applied to this computation. For this equation, different tradeoffs were investigated in our framework. Handel-C code was generated and the complete design was implemented on a system consisting of the Celoxica RC100-PP board, equipped with a Xilinx Virtex 2000E FPGA and plugged into a workstation. The workstation is used for sending the variables to the FPGA and collecting the results of the computation. The implementation of equation (8) in three intervals occupies about 14 % of the FPGA area while running at 65 MHz. Because we had no restriction on space, our goal was to implement the maximum k-parallel DALUT and thereby increase the computation speed.
We were therefore able to implement a 6-parallel DALUT, increasing the performance by a factor of 6. The 6-parallel DALUT and the corresponding adder tree occupy about 76 % of the FPGA area. Enough space is left for routing.
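To make the link between the recursion and DA explicit, the following sketch iterates a recurrence of the form of equation (8) as a five-element dot-product between a constant vector and a variable vector whose first element is the previous output. The function name `recursive_response` and all numeric values are illustrative assumptions; the real coefficients are not given in the text.

```python
def recursive_response(f, stimulus, y0=0.0):
    """Iterate y(t) = f0*y(t-1) + f4*x0 + f5*x1 + f24*x2 + f53*x3.

    `f` is the constant vector (f0, f4, f5, f24, f53); every element of
    `stimulus` is one tuple (x0, x1, x2, x3) of input samples.
    """
    y = y0
    out = []
    for xs in stimulus:
        variables = (y, *xs)            # previous output joins the inputs
        # This dot product with constant `f` is what DA would compute.
        y = sum(c * v for c, v in zip(f, variables))
        out.append(y)
    return out


# Hypothetical coefficients and stimulus, chosen only for illustration.
f = (0.5, 1.0, 2.0, 3.0, 4.0)
print(recursive_response(f, [(1, 0, 0, 0), (0, 1, 0, 0)]))  # [1.0, 2.5]
```

Because the constant vector has five elements, a single DALUT for this recurrence would have $2^5 = 32$ entries.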

The same design was also implemented without DA. It did not fit into the same FPGA and its run time was much longer. A comparison of our DA implementation with other implementations is given in Table 1. Our implementation is more efficient than all the other architectures in both performance and area consumption.

Workstation            1 interval     3 intervals
Sun Ultra              ms             ms
Athlon (1.53 GHz)      558 ms         ms

FPGA (time)            1 interval     3 intervals
Pure dot-product       25.6 ms        76.8 ms
Sequential DA          19.4 ms        19.4 ms
3-parallel DA          6.4 ms         6.4 ms

FPGA (area)            1 interval     3 intervals
                       fit            fit
Pure dot-product       no             no
Sequential DA          yes (7 %)      yes (14 %)
3-parallel DA          yes (14 %)     yes (42 %)

Table 1: Results of the recursive convolution equation on different platforms

5.2 Digital Linear Controller

The task of a controller is to influence the dynamic behavior of a system referred to as the plant. If the input values for the plant are calculated on the basis of the plant's outputs, we speak of feedback control. A common basic approach is to model the plant as a linear time-invariant system. Based on this model and the requirements on the desired system behavior, a linear controller is systematically derived using formal design methods. The controller resulting from this synthesis is described as a linear time-invariant system, and a time discretization is performed, which results in equation (9). The input vector of the controller is represented by u (measurements from the sensors of the plant), y is the output vector of the controller (regulating variables for the actuators of the plant) and x is the inner state vector of the controller. The matrices A, B, C and D are used for the calculation of the outputs based on the inputs.

$$x_{k+1} = A x_k + B u_k, \qquad y_k = C x_k + D u_k \qquad (9)$$

where $p = \dim(u)$, $n = \dim(x)$ and $q = \dim(y)$.

The task of the digital system is to calculate equation (9) during one sampling interval. That includes determining the new state $x_{k+1}$ and the output $y_k$ before the next sampling point $k+1$. The state space equations of a digital linear time-invariant controller (equation (9)) can be written as a product of a matrix of fixed coefficients and a vector of variables.

$$z_{(q \times 1)} = M_{(q \times p)} \, v_{(p \times 1)} \qquad (10)$$

In general, the calculation of z involves q computations of a p-dimensional scalar product (each row of M has to be multiplied with v). The matrix M is often sparse, so the dimension of the scalar products can be reduced from p to $e_i$, where $e_i$ is the number of non-zero entries in row i of M. This can be used to reduce the size of the DALUT in a DA implementation of equation (10).

Tradeoffs of Distributed Arithmetic Implementation

Equation (10) was implemented in our framework and the area/time tradeoff was explored on different levels, from the 1-BAAT to the w-BAAT computation. Furthermore, we investigated the different levels of parallelism for the q dimensionality. This ranges from the fully sequential implementation of the involved dot-products using one datapath to the fully parallel implementation using q datapaths. Modern FPGA architectures offer several techniques for implementing the DALUTs. Block RAM can be used for each dot-product. If dual-port RAM is used, the scalar products can be computed in a 2-BAAT fashion. Because the DALUT entries do not change during the computation, higher-order multi-port RAMs can be used. Those RAMs can easily be built by multiple instantiation of one RAM. Finally, the DALUT can be implemented using the FPGA lookup tables.

The controller used in this experiment is that of the inverse pendulum. It has three inputs (p = 3), two inner states (n = 2) and one output (q = 1) and operates with a word width of 20 bits. Therefore the matrix M has a dimension of 3 × 5. Since some of the matrix entries are zero, three scalar products have to be computed, with dimensions 2, 2 and 5 respectively. Table 2 shows the synthesis results for trade-offs using 1-BAAT (fully sequential), 5-BAAT, 10-BAAT and 20-BAAT (fully parallel). It shows the latency as well as the area occupation (as number of slices) of the different implementations on a VirtexE FPGA.

Archit.    Cycles    Latency (s)    Datapath (op. inputs)    Area (slices)    AT-Product (slices*s)
1-BAAT
BAAT
BAAT
BAAT
BAAT

Table 2: Inverse pendulum controller: synthesis results
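The step from equation (9) to a single matrix-vector product, and the saving obtained from sparse rows, can be sketched as follows. The stacking M = [[A, B], [C, D]] with v = (x_k, u_k) and z = (x_{k+1}, y_k) is the reading used here because it yields the 3 × 5 matrix quoted for the pendulum controller; the function names and all matrix entries are hypothetical and chosen only so that the rows have 2, 2 and 5 non-zero entries.

```python
def stack_controller(A, B, C, D):
    """Return M = [[A, B], [C, D]] as a list of rows."""
    top = [ra + rb for ra, rb in zip(A, B)]
    bottom = [rc + rd for rc, rd in zip(C, D)]
    return top + bottom


def sparse_rows(M):
    """Per row, keep only (column, coefficient) pairs that are non-zero."""
    return [[(j, m) for j, m in enumerate(row) if m != 0] for row in M]


def controller_step(M, x, u):
    """One sampling interval: returns (next state, output)."""
    v = x + u                                   # stacked variable vector
    z = [sum(m * v[j] for j, m in row) for row in sparse_rows(M)]
    return z[:len(x)], z[len(x):]


# Hypothetical controller with n = 2 states, p = 3 inputs, q = 1 output.
A = [[1, 0], [0, 1]]
B = [[2, 0, 0], [0, 0, 3]]
C = [[1, 1]]
D = [[1, 1, 1]]
M = stack_controller(A, B, C, D)

sizes = [len(row) for row in sparse_rows(M)]
print(sizes)                                    # [2, 2, 5]
print([2 ** e for e in sizes])                  # DALUT entries: [4, 4, 32]
print(controller_step(M, [1, 2], [3, 4, 5]))    # ([7, 17], [15])
```

With dense rows each of the three DALUTs would need $2^5 = 32$ entries; dropping the zero coefficients reduces two of them to 4 entries each.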

6 Conclusion

We have presented a tool for the design and implementation of signal processing applications based on distributed arithmetic. After a short introduction to distributed arithmetic, we have shown how to handle real numbers in the IEEE 754 floating-point format. We have used the framework to implement two signal processing applications and to investigate different trade-offs. The results were presented and compared with other implementations. Future work includes the introduction of a partial reconfiguration routine into the framework. This will help the user easily mark the parts of an application that will be replaced at run time. The framework will then be used to generate the partial configurations needed to move from one state of the device to the next.

References

[DBK03] Danne, K., Bobda, C., and Kalte, H.: Run-time exchange of mechatronic controllers using partial hardware reconfiguration. In: Proc. of the International Conference on Field Programmable Logic and Applications (FPL2003), Lisbon, Portugal. September 2003.
[Mi00] Mintzer, L.: Programmable silicon for embedded signal processing. Embedded Systems Programming. March 2000.
[Wh89] White, S. A.: Application of distributed arithmetic to digital signal processing: A tutorial review. IEEE ASSP Magazine. July 1989.
[Xi95] Xilinx: A guide to using field programmable gate arrays (FPGAs) for application-specific digital signal processing performance.
[Xi00a] Xilinx: The role of distributed arithmetic design in FPGA-based signal processing.
[Xi00b] Xilinx: Spartan-3 FPGAs.
