FPGA IMPLEMENTATION OF BASE-N LOGARITHM. Salvador E. Tropea

Size: px

Start display at page:

Download "FPGA IMPLEMENTATION OF BASE-N LOGARITHM. Salvador E. Tropea"

Jessie Stephens
5 years ago
Views:

1 FPGA IMPLEMENTATION OF BASE-N LOGARITHM Salvador E. Tropea Electróica e Iformática Istituto Nacioal de Tecología Idustrial Bueos Aires, Argetia salvador@iti.gov.ar ABSTRACT I this work, we preset a area optimized FPGA implemetatio of a IP core to compute the base-n logarithm. Nevertheless, we also discuss the area, speed ad precisio trade-offs. We selected a algorithm that could be implemeted o ay FPGA avoidig vedor specific features like block RAMs, embedded multipliers, etc. We report the implemetatio results of a fixed poit versio of the algorithm usig various commo cofiguratios o Xilix ad Actel devices. This implemetatio achieved the required area goals providig a very good speed-area ratio. Keywords: logarithm, FPGA, area optimized, multiplicative ormalizatio 1. INTRODUCTION The computatio of the logarithm is widely used for may applicatios like digital filters, power computatios (db) ad logarithmic umber systems (LNS), to ame a few. This is also a key compoet of LNS, which offer good advatages over traditioal floatig poit systems [1]. This work is orgaized as follows: sectio two deals with about the algorithm selectio ad how it works, sectio three describes its implemetatio, sectio four briefly discusses the error itroduced by the approximatio, sectio five shows the results of the implemetatio ad sectio six provides the coclusios. 2. ALGORITHM SELECTION AND DESCRIPTION The available methods to compute the logarithm of a umber usig digital circuits ca be divided i two mai groups. O the oe had, we have the look-up table based algorithms ad, o the other, iterative methods. The first approach is faster ad straightforward, but oly useful for low precisio. This is due to the size of the look-up table. The secod group is slower, but suitable for high precisio. May studies explore hybrid implemetatios that take advatages from both groups [2] ad [3], [4], [5] cited by Kostopoulos. Our project required a algorithm that could be implemeted o FPGAs from ay vedor, icludig the Actel's RTSX72SU FPGA. This is a radiatio tolerat device ad lacks the BRAMs eeded to implemet big lookup tables. Therefore, we oly evaluated iterative algorithms that eed small look-up tables. Moreover, we coded the algorithm usig stadard VHDL to esure its portability. Taylor's series expasio is amog the most popular methods to maually compute logarithms, but it has a slow covergece ad requires slow operatios like the divisio [6]. Other algorithms like the oe proposed by Kostopoulos [7] use a multiplier iside the iteratio; therefore, they are slow whe o embedded multipliers are available. A very well studied alterative is the multiplicative ormalizatio. It just eeds adders ad shifters ad provides a fast covergece [6], [8], [9], cosequetly it was suitable for our project Multiplicative Normalizatio Based o the followig logarithms property log a b =log a log b (1) Assumig that x is our variable, the multiplicative ormalizatio uses the followig values for a ad b log x Selectig the b i values to get we obtai b i =log x log b i (2) x b i =1 (3) log x = log b i (4) Storig the log(b i) values i a small look-up table the logarithm computatio is reduced to sums. y i 1 = y i log b i (5)

2 The oly requiremet is (3). To reduce the multiplicatios used i (3) to simple shifts we could use b i =1 d i 2 i (6) I this way the product computed i each iteratio of (3) is ad the logarithm iteratio x i 1 =x i x i d i 2 i (7) y i 1 = y i log 1 d i 2 i (8) Now we must select the d i values to achieve (3). The fuctio to determie d i is called decisio fuctio A Simple Decisio Fuctio The selectio of the decisio fuctio affects the covergece speed ad the algorithm precisio. Although high-radix systems cosume big amouts of area, they provide high speed. However, selectig a simple decisio fuctio we ca achieve a small area footprit. We selected 1 ad 0 as the oly possible values for d i ad applied the followig criterio: if usig 1 makes x i+1 bigger tha or equal to 1, the, we simply use 0. As the b i values are smaller ad smaller this method asymptotically approaches to 1. The hardware implemetatio was very simple because to determie whether the resultig value is bigger tha or equal to 1 we just eed to check oe bit i the result. Whe this bit becomes 1 the iteratio is simply skipped because (7) ad (8) are: x i 1 =x i 0 y i 1 = y i log 1 (9) This fuctio uses oly two values ad is simpler tha the oe described by Kore [6]. Ulike the fuctios explaied i [6] ad [9], this fuctio provides a good covergece of the approximatio for the whole rage of values Covergece Radius Replacig (6) i (3) we obtai x= 1 1 d i 2 i (10) The smaller value we ca obtai is whe the decisio fuctio is always 1 ad the bigger whe the fuctio always selects 0. 1 x i (11) 2.4. Rage Extesio As (11) shows the valid rage for the variable is small. This is ot a problem for floatig poit arithmetic where the variable is ormalized. Whe the values are ot ormalized, as i the fixed poit case, we must reduce them ad, the, compesate the reductio at the output of the approximatio. A simple method to achieve this is to fid a umber that multiplied by the variable results i a umber that satisfy (11). If we call z to this value ad use (1) we ca write log x =log z x log z (12) The product computatio ca be simplified by choosig z=2 j (13) where j is selected, as metioed above, ad the oly thig that we must solve is log z =log 2 j (14) These values ca be stored i a look-up table. Whe the size of the look-up table becomes a issue, we ca solve this as follows: (14) is very simple to compute whe the base of the logarithm is 2 replacig (15) i (12) z = 2 j = j (15) x = 2 j x j (16) Whe we eed to compute aother base, we could use the followig property applied to (16) log a x = log b x log b a (17) log a x = 1 a 2 j x j (18) This method replaces the above metioed look-up table by a multiplicatio usig a costat. 3. ALGORITHM IMPLEMENTATION Our project used usiged fixed poit iput values, whereby, we split (18) i N =2 j x (19)

3 A= N j (20) M = 1 a A (21) where we call (19) ormalizatio, (20) approximatio ad (21) base selectio. Note that whe usig floatig poit values the matissa is already ormalized ad we ca start computig (20) usig the expoet as j ad the ext step (21) is the same. The mai differece is that the result of (21) must be ormalized like i (19) achievig a comparable complexity. Moreover, additioal circuitry could be eeded to detect values out of the logarithm domai Normalizatio This step must fid the value of j that satisfies (11). Whe m=1 ad 3 we ca satisfy (11) usig or i biary otatio 0.5 x 1 (22) 0.1 x 1.0 (23) For a usiged fixed poit value with E iteger bits ad D decimal bits we could fid the value of j as follows: we start with j=e, the we shift the value to the left util the most sigificat bit is 1, decreasig j i each iteratio. The result is a ormalized fixed poit value with zero iteger bits ad E+D decimal bits. Example: E=5, D=3, x= x= j=5 x= j=4 x= j=3 The result is j=3 ad the ormalized value is This algorithm cosumes up to E D 1 steps to achieve the ormalizatio. Noetheless, whe the iput values are chose at radom, half of the values are completed i zero steps ad oly oe value i 2 E D 1 eeds E D 1 steps. This process could be implemeted usig a combiatioal circuit, but it could cosume a much bigger area Approximatio High-radix versios of the multiplicative ormalizatio could be used for speed optimizatio, but for area optimizatio the proposed decisio fuctio is more suitable. The m value i (4) does ot eed to be 0, thaks to the ormalizatio step, ad we ca start from 1. The value is determied by the umber of bits used i the approximatio. As the log(b i) values quickly decrease, icreasig their trucatio error, we observed that for N bits the value of should be N 2, where N is selected to achieve the desired precisio. We implemeted the decisio fuctio (7) usig N+1 bits, thus the MSB of x i+1 is used to determie whether the step is skipped. A barrel shifter was used to implemet the product, while a serial shifter did ot show a importat area savig ad slowed dow the computatio. Fig. 1 (a) shows the flow chart, where Table cotais the log(b i) values ad the y variable is the approximatio result. Fig. 2 shows a simplified block diagram. We also icorporated the j additio to this block ad a error compesatio coefficiet explaied i sectio four (25) Base Selectio This step is a multiplicatio by a costat ad provides a base chage. The more commo bases are 2, e (atural logarithms) ad 10. I the first case the costat is 1 ad this step is skipped, the approximated costats for the other two cases are ad This block ca take advatage of embedded multipliers whe available. Whe optimizig for speed, high speed multipliers (i.e. adder tree) ca be used. I our case we used a traditioal shift-ad-add algorithm. We optimized the algorithm loadig the multiplier with the result obtaied after the first step ad all the subsequet 0s. This ca be illustrated with the followig example: let us suppose that we are computig the base 10 logarithm ad that we determied that the costat must be trucated at the 14 th bit to achieve a desired precisio 1/ The 0s at the right ad at the left ca be skipped ad we get We modified the load of the multiplier to load the result obtaied after the use of the four LSBs. As a result, the multiplier oly eeds to compute the remaiig I this example, a 14-bit multiplicatio is reduced to a 7-bit multiplicatio. Fig. 1 (b) shows the flow chart, where OPT is the umber of bits optimized durig the load ad Kap() represets the th bit of the costat Iterblock Commuicatio The above metioed blocks eeds differet amouts of clock cycles to fiish their tasks ad, which is eve worst, the first step eeds a variable umber of clock pulses. For this reaso, we implemeted a simple hadshake mechaism to commuicate the blocks. The protocol uses a request lie to idicate that a ew value is available for the

4 Fig. 2. Simplified block diagram for the approximatio 4.2. Approximatio Error Fig. 1. Flow charts: approximatio (a) ad multiplier (b) cosumer ad a ackowledge lie to idicate the value was cosumed. The reported results iclude the area ad clock pulses eeded for the above metioed hadshake. The resultig core is composed by three idepedet blocks allowig the computatio of up to three values o the fly. 4. ERROR ANALYSIS I this sectio we briefly discuss the error itroduced by this core. Uless otherwise specified all the errors are expressed i couts of the LSB, kow as ulp (uits i the last place). We oly provide a simplified aalysis to estimate the maximum error, the actual error should be less tha or equal to the estimated oe Normalizatio Error Whe the total umber of iput bits is bigger tha the umber of bits used i the approximatio, this block must trucate the value. I this case, the block itroduces a error i the (-1;0] rage. Addig a rouder, the itroduced error is i the (-0.5;0.5] rage. However, it does ot reduce the total error, as we will see later. This error ca be measured by simulatio; we used a C implemetatio for that purpose. Table 1 shows the errors for various sizes. The ormalizatio error is processed by the approximatio covertig the (-1;0] rage to a approximated (-3;0] rage, ad it should be added to the error itroduced i this step, show betwee paretheses i Table Multiplicatio Error This block itroduces two differet errors. Oe is itroduced whe we trucate the costat. This error is multiplied by the maximum iput value, where this is less tha E, whe the iteger part of x has E bits. If we call K to the costat ad Kap to the approximated value Etv max =E Kap K (24) Note that this error is always egative uless we roud the costat istead of trucate it. Additioally (24) is ot valid whe E is 0, i this case we should use -0.5 istead of E. The other error is the result of the trucatio of the output value ad is i the (-1;0] rage. I additio, the errors from the previous blocks are affected by the multiplicatio; cosequetly, they are reduced Example We will show a example of the error estimatio to clarify the above metioed compoets. Let us suppose the followig setup: base 10, iput values 34.7 (usiged),

5 output decimals 10 ad the costat is trucated at the 12 th bit. From Table 1 we kow that the error at the output of the approximatio step is i the [-7;5] rage. The egative boudary of the error ca be obtaied multiplyig this error by Kap, ad the addig Etv together with the trucatio error. The value of Kap is ad usig (24) Table 1. Approximatio Error Bits Negative Error (ulp) Positive Error (ulp) =1233/ 4096 Etv max = ulp 10 The egative boudary of the error is = The positive boudary is ot affected by the trucatio = resultig i a expected rage of [-3;1] Error Balacig 8 3 (6) (7) (10) (11) (14) (25) 6 As the above aalysis shows, the core has a ubalaced error rage. This is because the core trucates various values, ad we could balace this error addig a costat to (20) A= N j Ecomp (25) I the previous example we could add 3 ulp chagig the error at the output of the approximatio from [-7;5] to [-4;8] obtaiig a fial error i the [-2;2] rage. This simple compesatio reduced the maximum absolute error from 3 to 2 for this example. Table 2. Implemetatio Results I Out Est. Error Meas. Error Clocks ;3-4;3 33/56/ ;3-4;3 33/64/ ;3-3;3 28/42/ ;2-2;2 23/63/ ;5-3;4 50/50/28 5. IMPLEMENTATION RESULTS I this work, we selected some represetative usiged fixed poit values (8.16, 16.16, 0.15); the particular case of our project (34.7 usig 10 bits for the approximatio) ad the equivalet to a sigle precisio floatig poit matissa (0.24). Furthermore, we selected the decimal logarithm. I the 0.24 case we assumed a already ormalized iput value. I the Xilix case we used ISE Webpack 7.1 (7.1.03i H.41) tool for sythesis, ad the Virtex 4 LX (xc4vlx ff668) device to measure the speed. We disabled the use of BRAMs ad DSP48 uits, selected area optimizatio ad set the optimizatio effort to high. We did ot use ay time costrai. Note that the cosumed resources (FFs ad LUTs) for the Virtex 4 are approximately the same for smaller devices like Sparta II. Table 2 shows the error estimated usig the method described i sectio four, the measured error ad the umber of clocks eeded to compute the result. We measured the error usig 100,000 radom values. The first two values for the clocks are the miimum ad maximum umber of clocks eeded to compute the result (latecy), the third value is the average umber of clock cycles betwee output results whe the core is costatly fed with radom values (throughput). Table 3 shows area usage ad maximum operatig frequecy reported by the tool after the place ad route. The values for the area are flip-flops, 4 iputs LUTs, umber of LUTs used as a route-thru ad slices. I the Actel case we used Libero 7.2 (Syplify Pro 8.5F, Build 001R ad Actel Desiger ), ad the RT54SX72S-1-208CQFP device. We selected a maximum faout of 20 to optimize the used area. Table 4 shows area usage ad maximum operatig frequecy reported by the tool after the place ad route. The values for the area are flip-flops, combiatioal cells ad the percetage of area used. Note that Actel devices use a simpler combiatioal uit, but they provide two of them for each flip-flop. Fig. 3 shows the measured error distributio for the 0.24 case, as metioed above, it was obtaied usig 100,000 radom values. The bar for -4 ulp ca ot be observed because oly four values preseted this error.

6 Table 3. Implemetatio Results (Xilix) I Out Area (FF/LUT/RLUT/Slices) Max. freq /281/ 8/ MHz /290/ 8/ MHz /251/ 8/ MHz /217/ 5/ MHz /409/13/ MHz Table 4. Implemetatio Results (Actel) I Out Area (Reg./Comb./Area%) Max. freq /500/10,8 % 33 MHz /505/11,1 % 32 MHz /446/9,5 % 32 MHz /372/8,5 % 39 MHz /747/15,3 % 23 MHz 8. REFERENCES Fig. 3 Error distributio for the 0.24/0.26 case. 6. CONCLUSION The proposed algorithm could be implemeted o FPGAs from ay vedor. As a result, it does ot rely o specific resources like BRAMs, embedded multipliers, etc. The used area was small. For example whe the 8.16 cofiguratio was used for a decimal logarithm, it was aroud 13% of a 100,000 gates equivalet device ad less tha 1/500 of a moder Virtex 4 LX 200. Whe we used the selected Virtex 4, this core computed a average of 9.78 millio values per secod for the 8.16 cofiguratio ad over tha 5 millio values, whe the output eeded 23 exact bits. The last case eeded three extra bits ad provided the precisio eeded for a sigle precisio floatig poit umber. The core showed a very good area-speed relatioship, ad for applicatios where the speed requiremets are bigger more tha oe core could be used i parallel. [1] M. Haselma, M. Beauchamp, A. Wood, et al, A Compariso of Floatig Poit ad Logarithmic Number Systems for FPGAs, Proc. of the 13th Aual IEEE Symp. o Field-Prog. Custom Comp. (FCCM'05) , Ja [2] Y. Wa ad C. L. Wey, Efficiet algorithms for biary logarithmic coversio ad additio, IEE Proc.-Comp. Digit. Tech., vol. 146, o. 3, May [3] T. C. Che, Automatic computatio of expoetial, logarithms ratios ad square roots, IBM J. Res. Develop., pp , Jul [4] H.-Y. Lo ad Y. Aoki, Geeratio of a precise biary logarithm with differece groupig programmable logic array, IEEE Tras. Comput., vol. C-34, pp , Aug [5] S.-Y. Shi, Shortcut to logarithms combie table lookup ad computatio, Comput. Desig., pp , May [6] I. Kore, Computer arithmetic algorithms, 2 d editio, ISBN , pp [7] D. K. Kostopoulos, A algorithm for the computatio of biary logarithms, IEEE Tras. o Comp., vol. 40, o. 11, /91, Nov [8] B. Parhami, Computer Arithmetic Algorithms ad Hardware Desigs, ISBN , pp [9] M. Pascale, Microcotrollers & CORDIC methods,, Dr. Dobb's Joural, Jul ACKNOWLEDGMENTS We would like to thak to Gustavo Sutter (UAM, Spai), he cotributed valuable ideas ad refereces.

Chapter 3 Classification of FFT Processor Algorithms

Chapter 3 Classification of FFT Processor Algorithms Chapter Classificatio of FFT Processor Algorithms The computatioal complexity of the Discrete Fourier trasform (DFT) is very high. It requires () 2 complex multiplicatios ad () complex additios [5]. As