PROJECT REPORT ON THE IMPLEMENTATION OF A LOGARITHM COMPUTATION DEVICE
(VLSI Tools Course)

Project Guide: Prof. Ravindra Jayanti
By: Mukund, UG3 (ECE), 200630022
Introduction

This project implements the paper "A Fast Hardware Approach for Approximate, Efficient Logarithm Computations" by Suganth Paul, Nikhil Jayakumar, and Sunil P. Khatri, published in the IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 2, February 2009.

The generation of elementary functions such as log() and antilog() finds use in many areas, such as digital signal processing (DSP), 3-D computer graphics, scientific computing, artificial neural networks, logarithmic number systems (LNS), and other multimedia applications. The approach presented in the paper provides a good solution for field-programmable gate array (FPGA)-based applications that require high accuracy at a low cost in terms of required lookup table (LUT) size; such applications include LNS, DSP cores, etc. In fact, the fast generation of these functions is critical to performance in many of these applications. Software algorithms for generating these elementary functions are often not fast enough, so dedicated hardware to compute log() and antilog() is of great value.

The number format used in the paper is similar to the 32-bit IEEE 754 single-precision floating-point format: the leading bit is the sign bit, followed by an 8-bit exponent and a 23-bit mantissa. The value of a number represented in this format is given by

  X = ±1.m × 2^(E−127), 0 ≤ m < 1.   (1)

We use a similar number format representation, but assume the number of bits in the mantissa to be variable. We also assume that the number is positive, since the logarithm of a negative number is not defined over the reals.
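As an illustration of this number format, the following Python sketch (the helper name is our own, not from the paper) unpacks a positive single-precision value into its sign bit, biased exponent, and 23-bit mantissa field:

```python
import struct

def decompose(x: float):
    """Split a float into (sign, exponent, mantissa) per the
    IEEE-754-like format above: x = +/-1.m * 2**(E-127)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF    # biased by 127
    mantissa = bits & 0x7FFFFF        # 23-bit fraction field m
    return sign, exponent, mantissa

# Example: 1.5 = 1.1b * 2^0, so E = 127 and m = 0.1b = 0x400000 / 2**23
sign, e, m = decompose(1.5)
```

The mantissa field is the quantity m of (1): the fractional bits after the implicit leading 1.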
Previous Work

One of the earliest approaches to approximating the binary logarithm of a number was given by Mitchell. In his method, the logarithm of a number is found by attaching the mantissa part of the number to the exponent part. This method is extremely easy to implement but gives an absolute error as high as 0.086, which corresponds to only about 3.5 bits of accuracy. In recent times, most elementary functions are generated using LUTs. This approach, initially proposed by Brubaker, involves the computation of a function using a single LUT; the accuracy of the approach depends solely on the size of the LUT used. Kmetz proposes storing the error that occurs due to the Mitchell approximation in the table and adding it to the mantissa of the number. This results in a further improvement in error (by 3 bits) at the overhead of an add operation. In our approach, we follow the same scheme of storing the error values of the Mitchell approximation in a table. Other methods make use of bigger LUTs to give more accurate results with good computation speed; in our approach, we try to find the optimum table size for a required accuracy. Apart from these simple methods, there are several more complex methods for generating these elementary functions. References [3] and [6] use a LUT combined with a polynomial approximation to interpolate the function over many small intervals. Reference [3] also presents a trade-off between the table size and the degree of the interpolating polynomial. The problem with these methods is that they involve multiplications and divisions.
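To make the 0.086 figure concrete, here is a small Python sketch (our own illustration, not taken from Mitchell's paper) that evaluates Mitchell's approximation and scans the mantissa range for its worst-case error:

```python
import math

def mitchell_log2(x: float) -> float:
    """Mitchell's approximation: for x = 2**e * (1 + m), log2(x) ~ e + m."""
    e = math.floor(math.log2(x))
    m = x / 2**e - 1.0            # mantissa in [0, 1)
    return e + m

# Scan the mantissa range for the worst-case absolute error of the
# approximation log2(1 + m) ~ m; the maximum (~0.086) occurs near m = 0.4427.
worst = max(math.log2(1.0 + i / 100000) - i / 100000 for i in range(100000))
```

The error is zero at m = 0 and m = 1 (where the approximation is exact) and peaks at m = 1/ln 2 − 1 ≈ 0.4427, the same threshold that reappears later in the architecture's compare-with-28 block.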
Idea Presented in the Paper

This work uses a LUT-based approach combined with linear interpolation to generate the logarithm of a number. The multiplication required in this linear interpolation is avoided, resulting in an area and delay reduction.

A. Interpolation Approach

Mitchell's approximation is given by the following equation:

  log2(1 + m) ≈ m, 0 ≤ m < 1.   (2)

The error due to this approximation is given by

  E(m) = log2(1 + m) − m, 0 ≤ m < 1.   (3)

This error curve is sampled at 2^t points (with t depending on the size of the LUT required). The sampled values are rounded to the required word width and stored in the LUT, which is addressed by the first t bits of the mantissa portion of the number. We then investigate the option of interpolating between the values stored in the table. This is done by the following equation:

  log2(1 + m) ≈ m + a + (b − a) · n1 / 2^(k−t), 0 ≤ m < 1.   (4)

Here m is the mantissa part, a is the error value read from the table using the first t bits, and b is the table value adjacent to a. Also, k is the total number of MSBs of the mantissa used for the interpolation, and n1 is the decimal value of the last (k−t) bits of the mantissa. Essentially, we read an error value from the table based on the first t bits and interpolate between this value and the next based on the remaining bits. The third term in (4) requires a multiplication. In order to circumvent the multiplication (which is expensive in terms of area and delay), we investigate the option of interpolating repeatedly between any two adjacent values stored in the table.
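The table-plus-interpolation scheme of (4) can be sketched in Python as follows, assuming t = 6 and k = 13 as in the implementation described later; the function and variable names are our own, and the LUT here holds unrounded floating-point samples rather than the 16-bit fixed-point words of the hardware:

```python
import math

T, K = 6, 13   # t table-index bits, k total mantissa bits used

# LUT of Mitchell-error samples E(m) = log2(1 + m) - m at 2**t points
LUT = [math.log2(1.0 + i / 2**T) - i / 2**T for i in range(2**T)]

def log2_interp(m_bits: int) -> float:
    """Approximate log2(1 + m) for a K-bit mantissa m_bits, per eq. (4)."""
    hi = m_bits >> (K - T)                  # first t bits: table address
    lo = m_bits & ((1 << (K - T)) - 1)      # last (k-t) bits: n1
    a = LUT[hi]                             # error sample at interval start
    b = LUT[hi + 1] if hi + 1 < 2**T else 0.0   # next sample; E(1) = 0
    m = m_bits / 2**K                       # mantissa value in [0, 1)
    return m + a + (b - a) * lo / 2**(K - T)
```

Linear interpolation between 64 samples of the smooth error curve already brings the worst-case error down to a few parts in 10^5, well under the 10-bit accuracy target; the remaining problem is the multiplication by (b − a), which the next sections remove.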
The only problem with this approach is that too many steps are involved, since all the mantissa bits are considered. As an alternative, we investigate the case where a limited number of interpolations are done. We tabulate the maximum error incurred when the previous algorithm is implemented for t, t+1, and so on up to t+8 mantissa bits, ignoring the rest; this is the same as doing different levels of interpolation from 0 to 8. In this case, the LUT has 64 words, each 16 bits wide; the word width is chosen so that the accuracy is not reduced by rounding. We find that 1 or 2 interpolations are not enough to give good error performance. Reasonable accuracy is obtained for 7 or 8 bits, but this requires as many interpolations and therefore results in larger delays in computing the logarithm. To obtain better accuracy, we would need to implement the multiplication of (b−a) and n1. However, implementing multiplication is expensive in terms of area and delay. Therefore, we approximate the multiplication, so as to obtain good error performance as well as low delay and area utilization.

B. More Effective Approach for Multiplication

In this section, we propose a more efficient approach to interpolation that avoids the multiplication of (b−a) and n1 in (4). The essential idea is that the multiplication is replaced by taking the antilogarithm of the sum of the logarithms. In order to perform this operation with a small delay, we consider the following options.

1) log2(b−a) may be approximated by either of the following two options:
   a) Mitchell approximation.
   b) LUT: the constants for each of the intervals are stored in the original LUT. Recall that we stored the error values obtained by using a Mitchell approximation for the log function. The log2(b−a) values are stored alongside the error values and are indexed by the same address lines.
2) The antilog of the expression may be approximated by either of the following two options:
   a) Mitchell approximation.
   b) LUT: to obtain the antilogarithm of a number by this method, we need to construct another LUT. This antilog LUT is used to compute the antilogarithm of a number, since the multiplication of (b−a) and n1 is performed by taking the antilogarithm of the sum of the logarithms.

C. Architecture of Implementation

Fig. 1 shows the block diagram of the log() engine, which is essentially an implementation of (4). The architecture of the interpolator block is shown in Fig. 2. The exponent part of the input number trivially becomes the decimal part of the logarithm, because we assume that 0 ≤ m < 1 in (1). We also assume that the number we operate on (X in (1)) is positive. The implementation is pipelined, with 12 stages in the pipeline. We use a 16-bit mantissa and a 64 × 16-bit LUT; the width of each word in the table is chosen as 16 bits so that the error due to rounding does not dominate the overall error.

Figure 1: Block diagram of the log() engine (MANTISSA m, 16 bits → INTERPOLATOR → ADDER → 17-bit RESULT).
Figure 2: Architecture of the interpolator block. The (t = 6) MSBs of the mantissa address the lookup table holding a and log(b−a); the remaining (k−t) = 7 bits feed the LOD and the log(n1) LUT; a compare-with-28 block controls the final ADD/SUB; a three-input fixed-point adder, the ANTILOG LUT, and an adder complete the 16-bit datapath.
The width of the mantissa bits processed by each block in the architecture is shown in the diagrams. It is found that (b−a) changes sign from positive to negative for m > 0.4427. This is equivalent to comparing the decimal value of the first six bits of the mantissa with 28, as shown. Hence, if the first six bits have a value greater than 28, the comparator block sends a control signal to the ADD/SUB block instructing it to perform a subtraction.

The leading-one detector (LOD) block is used to find log(n1). The LOD detects the first bit that has the value 1, and then uses the bits that follow it as the mantissa portion to address the lookup table. Since 7 bits are given as input to the LOD in this example, the LOD finds the first position with the value 1, and the remaining bits (which can be as wide as 6 bits) are used to access the LUT. The decimal part of log(n1), which is indicated by the position of the first bit with value 1, is sent directly to the three-input fixed-point adder to be added to the decimal part of log(b−a). The output of the adder after the antilog stage has to be shifted right or left depending on the decimal output of the fixed-point adder. Also, the 2^7 term in the denominator of (4) (since k − t = 13 − 6 = 7) is accounted for by a constant right shift of 7 bits at the output of the adder following the antilog LUT.

In Fig. 2, the blocks representing the LUT for log(n1) and the LUT for a are identical and can be implemented using a dual-port RAM. Note also that the address of each of the three LUTs shown in Fig. 2 is 6 bits wide. In each case, the 6-bit address word is obtained by rounding off the value of the next LSB in the word from which the address is derived.

Notes on Implementation

The complete hardware design was modelled in Verilog using Xilinx ISE and ModelSim SE. Figure 1 was modelled as one block, and Figure 2 was modelled as a complete, separate block.
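As a behavioral sanity check of the multiplication-free idea described above, the following Python sketch (our own illustration, using the Mitchell approximation for both the log and the antilog steps, i.e., option (a) in both cases) approximates a product u·v of positive operands as the antilog of the sum of the logs; the sign handling performed by the compare/ADD-SUB blocks is omitted:

```python
import math

def mitchell_log2(x: float) -> float:
    """Mitchell log: for x = 2**e * (1 + m), log2(x) ~ e + m."""
    e = math.floor(math.log2(x))
    return e + x / 2**e - 1.0

def mitchell_pow2(y: float) -> float:
    """Mitchell antilog: 2**y ~ 2**floor(y) * (1 + frac(y))."""
    e = math.floor(y)
    return 2**e * (1.0 + (y - e))

def approx_mult(u: float, v: float) -> float:
    """Approximate u*v (u, v > 0) as antilog(log(u) + log(v)),
    replacing the multiplier with adders, as in Section B."""
    return mitchell_pow2(mitchell_log2(u) + mitchell_log2(v))

# Exact when both operands are powers of two; a few percent off otherwise,
# e.g. approx_mult(3.0, 5.0) gives 14.0 instead of 15.0.
```

In the hardware, the LUT options (b) replace these Mitchell steps with table lookups to recover the lost accuracy, but the structure (log, add, antilog) is the same.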
The blocks representing adders in the figures were implemented using the + (plus) operator. The synthesis environment then searches for the most suitable adder in its libraries and replaces the + operator with that adder, thereby ensuring the best possible performance.
Simulation Results

Note that the binary values shown here are of the fractional part, not the decimal part, so the right-hand bits contribute less to the number and may be erroneous.

1. Input (in decimal): 1.564
   Log value from calculator: 0.64745 = 101001011011
   Log value from simulation: 101000011001

2. Input (in decimal): 1.334
   Log value from calculator: 0.41718 = 011010101100
   Log value from simulation: 011010101001

3. Input (in decimal): 1.791
   Log value from calculator: 0.84365 = 110101111111
   Log value from simulation: 110010111100

Conclusion

The simulated results and the computed results always have at least 10 bits of accuracy, which was the target of the implementation.
References

1. S. Paul, N. Jayakumar, and S. P. Khatri, "A Fast Hardware Approach for Approximate, Efficient Logarithm Computations," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 2, February 2009.
2. J. N. Mitchell, "Computer multiplication and division using binary logarithms," IRE Trans. Electron. Computers, vol. 11, pp. 512-517, Aug. 1962.