FPGA Implementation of Low-Area Floating Point Multiplier Using Vedic Mathematics

R. Sai Siva Teja 1, A. Madhusudhan 2
1 M.Tech Student, 2 Assistant Professor, Dept of ECE, Anurag Group of Institutions (formerly CVSR College of Engineering), Ghatkesar, R. R. Dist, A.P., India

Abstract -- In this paper we describe an efficient implementation of an IEEE 754 single precision floating point multiplier using Vedic mathematics. Vedic mathematics is used because the conventional multiplication process generates a large number of partial products; with Vedic mathematics the number of partial products is reduced, so that the area and power of the floating point multiplier can be reduced efficiently.

Keywords -- floating point; multiplication; FPGA; Nikhilam sutra; Radix Selection Unit; Vedic mathematics.

I. INTRODUCTION

Floating point numbers are one possible way of representing real numbers in binary format; the IEEE 754 standard [1] presents two different floating point formats, the binary interchange format and the decimal interchange format. Multiplying floating point numbers is a critical requirement for DSP applications involving a large dynamic range. This paper focuses only on the single precision normalized binary interchange format. Fig. 1 shows the IEEE 754 single precision binary format representation; it consists of a one bit sign (S), an eight bit exponent (E), and a twenty three bit fraction (M, or mantissa). An extra bit is added to the fraction to form what is called the significand¹. If the exponent is greater than 0 and smaller than 255, and there is a 1 in the MSB of the significand, then the number is said to be a normalized number; in this case the real number is represented by (1).

Figure 1. IEEE single precision floating point format

Z = (-1)^S * 2^(E - Bias) * (1.M)                                      (1)

where M = m22*2^-1 + m21*2^-2 + m20*2^-3 + ... + m1*2^-22 + m0*2^-23 and Bias = 127.
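As an illustrative software sketch (not part of the hardware design), formula (1) can be checked in a few lines of Python; the function name and test bit pattern below are our own:

```python
def decode_ieee754_single(bits):
    """Decode a 32-bit pattern per formula (1): 1 sign bit, 8 exponent
    bits (Bias = 127), 23 mantissa bits with a hidden 1 restored for
    normalized numbers (0 < E < 255)."""
    s = (bits >> 31) & 1          # sign bit S
    e = (bits >> 23) & 0xFF       # biased exponent E
    m = bits & 0x7FFFFF           # 23-bit fraction M
    assert 0 < e < 255, "this sketch handles normalized numbers only"
    return (-1) ** s * 2.0 ** (e - 127) * (1 + m / 2 ** 23)

# 0x42280000 = 0 10000100 01010000... = +1.3125 * 2^5 = 42.0
print(decode_ieee754_single(0x42280000))  # 42.0
```

The same decoding rule, with 4 mantissa bits instead of 23, underlies the worked example in Section II.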
Multiplying two numbers in floating point format is done by (1) adding the exponents of the two numbers and subtracting the bias from their sum, (2) multiplying the significands of the two numbers, and (3) calculating the sign by XORing the signs of the two numbers. In order to represent the multiplication result as a normalized number there should be a 1 in the MSB of the result (leading one).

Floating point implementation on FPGAs has been the interest of many researchers. In [2], an IEEE 754 single precision pipelined floating point multiplier was implemented on multiple FPGAs (4 Actel A1280). In [3], a custom 16/18 bit three stage pipelined floating point multiplier that doesn't support rounding modes was implemented. In [4], a single precision floating point multiplier that doesn't support rounding modes was implemented using a digit-serial multiplier; using the Altera FLEX 8000 it achieved 2.3 MFlops. In [5], a parameterizable floating point multiplier was implemented using the software-like language Handel-C on the Xilinx XCV1000 FPGA; a five stage pipelined multiplier achieved 28 MFlops. In [6], a latency-optimized floating point unit was implemented using the primitives of the Xilinx Virtex II FPGA, with a latency of 4 clock cycles; the multiplier reached a maximum clock frequency of 100 MHz.

II. FLOATING POINT MULTIPLICATION ALGORITHM

As stated in the introduction, normalized floating point numbers have the form Z = (-1)^S * 2^(E - Bias) * (1.M). To multiply two floating point numbers the following is done:
1. Multiplying the significands; i.e. (1.M1 * 1.M2)
2. Placing the decimal point in the result
3. Adding the exponents; i.e. (E1 + E2 - Bias)
4. Obtaining the sign; i.e. S1 xor S2
5. Normalizing the result; i.e. obtaining a 1 at the MSB of the result's significand
6. Rounding the result to fit in the available bits
7. Checking for underflow/overflow occurrence

¹ The significand is the mantissa with an extra MSB bit. This research has been supported by Mentor Graphics.

Consider a floating point representation similar to the IEEE 754 single precision floating point format, but with a reduced number of mantissa bits (only 4), while still retaining the hidden 1 bit for normalized numbers:

A = 0 10000100 0100 = 40,  B = 1 10000001 1110 = -7.5

To multiply A and B:

1. Multiply the significands: 1.0100 * 1.1110 (the raw product bits are 1001011000)
2. Place the decimal point: 10.01011000
3. Add the exponents: 10000100 + 10000001 = 100000101. The exponents representing the two numbers are already shifted/biased by the bias value (127) and are not the true exponents; i.e. E_A = E_A-true + bias and E_B = E_B-true + bias, so

E_A + E_B = E_A-true + E_B-true + 2*bias

We should therefore subtract the bias from the resultant exponent, otherwise the bias would be added twice:

100000101 - 01111111 = 10000110

4. Obtain the sign bit and put the result together: 1 10000110 10.01011000
5. Normalize the result so that there is a 1 just before the radix point (decimal point). Moving the radix point one place to the left increments the exponent by 1; moving it one place to the right decrements the exponent by 1.

1 10000110 10.01011000 (before normalizing)
1 10000111 1.001011000 (normalized)

The result (without the hidden bit) is: 1 10000111 00101100
6. The mantissa has more than 4 bits (the available mantissa bits), so rounding is needed. If we apply the truncation rounding mode, the stored value is: 1 10000111 0010.

In this paper we present a floating point multiplier in which rounding support isn't implemented. Rounding support can be added as a separate unit that can be accessed by the multiplier or by a floating point adder, thus accommodating more precision if the multiplier is connected directly to an adder in a MAC unit. Fig. 2 shows the multiplier structure; the exponent addition, significand multiplication, and result sign calculation are independent and are done in parallel. The significand multiplication is done on two 24 bit numbers and results in a 48 bit product, which we will call the intermediate product (IP). The IP is represented as (47 downto 0), and the decimal point is located between bits 46 and 45 of the IP. The following sections detail each block of the floating point multiplier.

Figure 2. Floating point multiplier block diagram

III. VEDIC MATHEMATICS

Vedic mathematics is the ancient methodology of Indian mathematics, which has a unique technique of calculation based on 16 Sutras (formulae). A low-area multiplier design using Vedic mathematics is presented in this paper. The idea for designing the multiplier and adder/subtractor units is adopted from the ancient Indian mathematics of the "Vedas". On account of these formulae, the partial products and sums are generated in one step, which reduces the carry propagation from LSB to MSB.
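The floating point multiplication steps of Section II (with truncation rounding, as used in this paper) can be modeled in software. The sketch below is a behavioral illustration with our own function names, not the paper's RTL; it uses the reduced 4-bit-mantissa format of the worked example:

```python
BIAS = 127
MANT_BITS = 4  # reduced format from the worked example

def fp_multiply(a, b):
    """Multiply two (sign, biased_exponent, mantissa) tuples following
    the steps of Section II with truncation rounding."""
    s_a, e_a, m_a = a
    s_b, e_b, m_b = b
    sign = s_a ^ s_b                          # step 4: XOR of the signs
    sig_a = (1 << MANT_BITS) | m_a            # restore hidden 1: 1.M1
    sig_b = (1 << MANT_BITS) | m_b            # restore hidden 1: 1.M2
    product = sig_a * sig_b                   # step 1: significand product
    exponent = e_a + e_b - BIAS               # step 3: bias added twice, subtract once
    # Step 5: the product of two numbers in [1, 2) lies in [1, 4),
    # so at most one right shift normalizes it.
    if product >> (2 * MANT_BITS + 1):
        product >>= 1
        exponent += 1
    # Step 6: truncation rounding -- keep MANT_BITS bits, drop the hidden 1.
    # (Step 7, the underflow/overflow check, is omitted in this sketch.)
    mantissa = (product >> MANT_BITS) & ((1 << MANT_BITS) - 1)
    return sign, exponent, mantissa

# A = +40 = 0 10000100 0100, B = -7.5 = 1 10000001 1110
print(fp_multiply((0, 0b10000100, 0b0100), (1, 0b10000001, 0b1110)))
# (1, 135, 2), i.e. the stored value 1 10000111 0010 of the example
```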
The gifts of the ancient Indian mathematics in the world history of mathematical science are not well recognized. The contributions of the saint and mathematician 'Sri Bharati Krsna Thirthaji Maharaja' in the field of number theory, in the form of Vedic Sutras (formulas), are significant for calculations. He explored the mathematical potential of the Vedic primers and showed that mathematical operations can be carried out mentally to produce fast answers using the Sutras. In this paper we concentrate on the "Nikhilam Navatascaramam Dasatah" formula; the other formulas are beyond the scope of this paper.

IV. PROPOSED MULTIPLIER ARCHITECTURE

The mathematical expression for the proposed algorithm is shown below. Broadly, the algorithm is divided into three parts: (i) Radix Selection Unit, (ii) Exponent Determinant and (iii) Multiplier. Consider two n bit numbers X and Y, and let k1 and k2 be the exponents of X and Y respectively. X and Y can be represented as:

X = 2^k1 ± Z1                                                          (2)
Y = 2^k2 ± Z2                                                          (3)

For fast multiplication using the Nikhilam sutra the bases of the multiplicand and the multiplier would be the same; here we have considered different bases, thus the product can be rewritten as

X * Y = 2^k2 * (X ± Z2 * 2^(k1-k2)) ± Z1 * Z2                          (4)

The hardware implementation of this mathematics is shown in Fig. The architecture can be decomposed into three main subsections: (i) Radix Selection Unit (RSU), (ii) Exponent Determinant (ED) and (iii) Array Multiplier. The RSU is required to select the proper radices corresponding to the input numbers; if the selected radix is near the given number, then the multiplication of the residual parts (Z1 x Z2) is easier to compute. The subtractor blocks are required to extract the residual parts (Z1 and Z2). The second subsection (ED) is used to extract the powers (k1 and k2) of the radices, and it is followed by a subtractor to calculate the value of (k1 - k2). The third subsection, the array multiplier [10], is used to calculate the product (Z1 x Z2).
The output of the subtractor (k1 - k2) and Z2 are fed to the shifter block to calculate the value of Z2 x 2^(k1-k2). The first adder-subtractor block is used to calculate the value of X ± Z2 x 2^(k1-k2). The output of the first adder-subtractor and the output of the second Exponent Determinant (k2) are fed to the second shifter block to compute the value of 2^k2 x (X ± Z2 x 2^(k1-k2)). The output of the multiplier (Z1 x Z2) and the output of the second shifter (2^k2 x (X ± Z2 x 2^(k1-k2))) are fed to the second adder-subtractor block to compute the value of (2^k2 x (X ± Z2 x 2^(k1-k2))) ± Z1 Z2.

Mathematical expression for the RSU

Consider an n bit binary number X; it can be represented as

X = sum_{i=0}^{n-1} x_i * 2^i, where x_i ∈ {0, 1}.

The value of X must then lie in the range 2^(n-1) <= X < 2^n. Consider the mean of this range, equal to (2^(n-1) + 2^n)/2.
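This whole dataflow (radix selection against the range mean, residual extraction, shifts, and the two adder-subtractor stages) can be sketched behaviorally. This is an illustrative model with our own function names, not the paper's RTL:

```python
def radix_select(x):
    """RSU sketch: for 2^(n-1) <= x < 2^n, compare x with the mean
    (2^(n-1) + 2^n)/2 and return the exponent of the nearer radix.
    Comparing 2*x with the sum avoids the division by two."""
    n = x.bit_length()
    return n if 2 * x > (1 << (n - 1)) + (1 << n) else n - 1

def nikhilam_multiply(x, y):
    """Behavioral model of the Nikhilam dataflow:
    X*Y = 2^k2 * (X + Z2*2^(k1-k2)) + Z1*Z2,
    where Z1 = X - 2^k1 and Z2 = Y - 2^k2 are the (possibly
    negative) residuals extracted by the subtractor blocks."""
    k1, k2 = radix_select(x), radix_select(y)
    if k1 < k2:                       # keep k1 >= k2 so (k1-k2) shifts left
        x, y, k1, k2 = y, x, k2, k1
    z1 = x - (1 << k1)                # residual of x
    z2 = y - (1 << k2)                # residual of y
    t = x + (z2 << (k1 - k2))         # first shifter + adder-subtractor
    return (t << k2) + z1 * z2        # second shifter + adder-subtractor

print(nikhilam_multiply(40, 30))  # 1200
```

The identity behind the rearrangement is exact whatever radices are chosen; choosing the nearer power of two only keeps the residual product Z1*Z2 small, which is what shrinks the array multiplier.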
Exponent Determinant

The hardware implementation of the exponent determinant is shown in Fig. 4. The integer part, or exponent, of a binary fixed point number is given by the maximum power of the radix. For a nonzero input, the shifting operation is executed using parallel in parallel out (PIPO) shift registers. The number of select lines of the PIPO shifter (denoted S1, S0 in Fig. 4) is chosen as per the binary representation of the number (N-1)_10. A 'Shift' pin is provided in the PIPO shifter to control whether the number is to be shifted or not (to initialize the operation, the 'Shift' pin is set low). A decrementer [13] has been integrated into this architecture to track the maximum power of the radix. A sequential searching procedure is implemented to search for the first '1' starting from the MSB side, using the shifting technique. For an N bit number, the value (N-1)_10 is fed to the input of the decrementer.

Fig. Hardware implementation of Nikhilam Sutra

The block level architecture of the RSU is shown in Fig. The RSU consists of three main subsections: (i) Exponent Determinant (ED), (ii) Mean Determinant (MD) and (iii) Comparator. The n bit input X is fed to the ED block. The maximum power of X is extracted at the output, which is fed to both the shifter and the adder block. The second input to the shifter is the (n+1) bit representation of decimal '1'. If the maximum power of X from the ED unit is (n-1), then the output of the shifter is 2^(n-1). The adder unit is needed to increment the value of the maximum power of X by '1'. The second shifter is needed to generate the value of 2^n, where n is the incremented value taken from the adder block. The Mean Determinant unit is required to compute the mean (2^(n-1) + 2^n)/2. The comparator compares the actual input with this mean value. If the input is greater than the mean, then 2^n is selected as the required radix.
If the input is less than the mean, then 2^(n-1) is selected as the radix. The select input of the multiplexer block is taken from the output of the comparator.

Fig. Hardware implementation of RSU

The decrementer is controlled by a signal generated from the searched bit. If the searched bit is '0', the control signal goes low and the decrementer decrements the input value (here the decrementer operates in active low logic). The searched bit thus acts as the controller of the decrementer. When the searched bit is '1', the control signal goes high, the decrementer stops decrementing, and the shifter also stops the shifting operation. The output of the decrementer then gives the integer part (exponent) of the number.
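The shift-and-decrement search described above can be modeled as follows. This is an illustrative sketch with our own function name; the RTL realizes the loop with a PIPO shifter and an active-low decrementer preloaded with (N-1):

```python
def exponent_determinant(x, n):
    """ED sketch: find the position of the leading '1' of a nonzero
    n-bit number x by examining bits from the MSB side while a
    decrementer, preloaded with (N-1), counts down."""
    exp = n - 1                    # decrementer preloaded with (N-1)
    for i in range(n - 1, 0, -1):
        if (x >> i) & 1:           # searched bit is '1': stop searching
            break
        exp -= 1                   # searched bit is '0': keep decrementing
    return exp

print(exponent_determinant(40, 8))  # 5 (40 = 00101000, leading 1 at bit 5)
```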
Fig. Hardware implementation of exponent determinant

V. SIMULATION RESULT ANALYSIS

The result analysis is shown below. Because the number of partial products has been reduced, the area is reduced by a fair amount: the number of CLB slices and flip-flops drops considerably. The simulation results of the existing and proposed systems are as follows:

Specification        Existing system   Proposed system
No. of slices        604               356
No. of flip-flops    293               108

VI. CONCLUSION AND FUTURE WORK

From the above results we can clearly see that the area has been reduced, as the utilization of CLBs and flip-flops has dropped by a fair amount. This work can be further improved by using high-speed adders and subtractors.

REFERENCES
[1] IEEE 754-2008, IEEE Standard for Floating-Point Arithmetic, 2008.
[2] B. Fagin and C. Renard, "Field Programmable Gate Arrays and Floating Point Arithmetic," IEEE Transactions on VLSI, vol. 2, no. 3, pp. 365-367, 1994.
[3] A. Radhika, Pavan Kumar UCS, Saiprasad Goud A, "FPGA Implementation of High-Speed 8-bit Vedic Multiplier Using Barrel Shifter."
[4] L. Louca, T. A. Cook, and W. H. Johnson, "Implementation of IEEE Single Precision Floating Point Addition and Multiplication on FPGAs," Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '96), pp. 107-116, 1996.
[5] A. Jaenicke and W. Luk, "Parameterized Floating-Point Arithmetic on FPGAs," Proc. of IEEE ICASSP, 2001, vol. 2, pp. 897-900.
[6] B. Lee and N. Burgess, "Parameterisable Floating-point Operations on FPGA," Conference Record of the Thirty-Sixth Asilomar Conference on Signals, Systems, and Computers, 2002.
[7] DesignChecker User Guide, HDL Designer Series 2010.2a, Mentor Graphics, 2010.
[8] Precision Synthesis User's Manual, Precision RTL Plus 2010a Update 2, Mentor Graphics, 2010.
[9] D. Patterson and J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, 2005.