Double Precision Floating-Point Arithmetic on FPGAs

MITSUBISHI ELECTRIC ITE VI-Lab

Title: Double Precision Floating-Point Arithmetic on FPGAs
Internal Reference: VIL04-D098
Publication Date: Dec
Rev.: A
Author: S. Paschalakis, P. Lee
Reference: Paschalakis, S., Lee, P., Double Precision Floating-Point Arithmetic on FPGAs, In Proc. 2nd IEEE International Conference on Field Programmable Technology (FPT 03), Tokyo, Japan, Dec. 2003

Double Precision Floating-Point Arithmetic on FPGAs
Stavros Paschalakis, Peter Lee

Abstract

We present low cost FPGA floating-point arithmetic circuits for all the common operations, i.e. addition/subtraction, multiplication, division and square root. Such circuits can be extremely useful in the FPGA implementation of complex systems that benefit from the reprogrammability and parallelism of the FPGA device but also require a general purpose arithmetic unit. While previous work has considered circuits for low precision floating-point formats, we consider the implementation of 64-bit double precision circuits that also provide rounding and exception handling.

Mitsubishi Electric ITE B.V. - Visual Information Laboratory. All rights reserved.

Stavros Paschalakis, Mitsubishi Electric ITE BV, VI-Lab
Peter Lee, University of Kent at Canterbury, P.Lee@kent.ac.uk

1. Introduction

FPGAs have established themselves as invaluable tools in the implementation of high performance systems, combining the reprogrammability advantage of general purpose processors with the speed and parallel processing advantages of custom hardware. However, a problem that is frequently encountered with FPGA-based system-on-a-chip solutions, e.g. in signal processing or computer vision applications, is that the algorithmic frameworks of most real-world problems will, at some point, require general purpose arithmetic processing units which are not standard components of the FPGA device. Therefore, various researchers have examined the FPGA implementation of floating-point operators [1-7] to alleviate this problem. The earliest work considered the implementation of operators in low precision custom formats, e.g. 16 or 18 bits in total, in order to reduce the associated circuit costs and increase their speed. More recently, the increasing size of FPGA devices has allowed researchers to efficiently implement operators in the 32-bit single precision format, the most basic format of the ANSI/IEEE binary floating-point arithmetic standard [8], and also to consider features such as rounding and exception handling.

In this paper we consider the implementation of FPGA floating-point arithmetic circuits for all the common operations, i.e. addition/subtraction, multiplication, division and square root, in the 64-bit double precision format, which is most commonly used in scientific computations. All the operators presented here provide rounding and exception handling. We have used these circuits in the implementation of a high-speed object recognition system which performs the extraction, normalisation and classification of moment descriptors and relies partly on custom parallel processing structures and partly on floating-point processing. A detailed description of the system is not given here but can be found in [9].

2. Floating-Point Numerical Representation

This section examines the double precision floating-point format only briefly. More details and discussions can be found in [8,10]. In a floating-point representation system of radix β, a real number N is represented in terms of a sign s, with s=0 or s=1, an exponent e and a significand S so that N = (-1)^s · β^e · S. The IEEE standard specifies that double precision floating-point numbers comprise 64 bits, i.e. a sign bit (bit 63), 11 bits for the exponent E (bits 62 down to 52) and 52 bits for the fraction f (bits 51 down to 0). E is an unsigned biased number and the true exponent e is obtained as e = E - E_bias with E_bias = 1023. The fraction f represents a number in the range [0,1) and the significand S is given by S=1.f and is in the range [1,2). The leading 1 of the significand is commonly referred to as the hidden bit. It is usually made explicit for operations, a process referred to as unpacking. When the MSB of the significand is 1 and is followed by the radix point, the representation is said to be normalised.
For double precision numbers, the range of the unbiased exponent e is [-1022,1023], which translates to a range of only [1,2046] for the biased exponent E. The values E=0 and E=2047 are reserved for special quantities. The number zero is represented with E=0 and f=0. The hidden significand bit is also 0 and not 1. Zero has a positive or negative sign like normal numbers. When E=0 and f≠0 the number has e=-1022 and a significand S=0.f. The hidden bit is 0 and not 1 and the sign is determined as for normal numbers. Such numbers are referred to as denormalised. Because of the additional complexity and costs, this part of the standard is not commonly implemented in hardware. For the same reason, our circuits do not support denormalised numbers. An exponent E=2047 and a fraction f=0 represent infinity.
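The field layout described above can be illustrated in software. The following is a minimal sketch, not part of the circuits themselves; the function names `unpack` and `significand` and the class labels are assumptions chosen for illustration. It splits a Python float (an IEEE double) into s, E and f, classifies it using the reserved exponent values, and makes the hidden bit explicit.

```python
import struct

E_BIAS = 1023  # exponent bias for double precision


def unpack(x: float):
    """Split a double into (s, E, f) and classify it by its exponent field."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63                  # sign bit (bit 63)
    E = (bits >> 52) & 0x7FF        # biased exponent (bits 62 down to 52)
    f = bits & ((1 << 52) - 1)      # fraction (bits 51 down to 0)
    if E == 0:
        kind = "zero" if f == 0 else "denormalised"
    elif E == 2047:
        kind = "infinity" if f == 0 else "nan"
    else:
        kind = "normal"
    return s, E, f, kind


def significand(E: int, f: int) -> int:
    """53-bit significand with the hidden bit made explicit (finite inputs only)."""
    hidden = 0 if E == 0 else 1     # hidden bit is 0 for zero and denormalised numbers
    return (hidden << 52) | f


s, E, f, kind = unpack(-2.5)
print(s, E - E_BIAS, hex(significand(E, f)), kind)
```

For -2.5 the true exponent is 1 and the significand is 1.25, i.e. the hidden bit followed by the fraction 0.01 in binary.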

Table 1. Double precision floating-point operator statistics on a XILINX XCV1000 Virtex FPGA device*.

                              Adder         Multiplier    Divider       Square Root
Slices                        675 (5.49%)   495 (4.03%)   343 (2.79%)   347 (2.82%)
Slice flip-flops              [missing]     [missing]     [missing]     [missing]
4-input LUTs                  [missing]     [missing]     [missing]     [missing]
Total equivalent gate count   10,334        8,426         6,464         5,366

* Device utilisation figures include I/O flip-flops: 194 for adder, 193 for multiplier, 193 for divider and 129 for square root.

The sign of infinity is determined as for normal numbers. Finally, an exponent E=2047 and a fraction f≠0 represent the symbolic unsigned entity NaN (Not a Number), which is produced by operations like 0/0 and 0×∞. The standard does not specify any NaN values, allowing the implementation of multiple NaNs. Here, only one NaN is provided, with E=2047 and a single fixed non-zero fraction f.

Finally, a note should be made on the issue of rounding. It is clear that arithmetic operations on the significands can result in values which do not fit in the chosen representation and need to be rounded. The IEEE standard specifies four rounding modes. Here, only the default mode is considered, which is the most difficult to implement and is known as round-to-nearest-even (RNE). This is implemented by extending the relevant significands by three bits beyond their LSB (L) [10]. These bits are referred to, from the most significant to the least significant, as guard (G), round (R) and sticky (S). The first two are normal extension bits, while the last one is the OR of all the bits that are lower than the R bit. Rounding up, by adding a 1 to the L bit, is performed when (i) G=1 and (R OR S)=1 for any L, or (ii) G=1 and (R OR S)=0 for L=1, i.e. an exact tie is rounded to the even neighbour. In other cases, truncation takes place.

3. Addition/Subtraction

The main steps in the calculation of the sum or difference R of two floating-point numbers A and B are as follows. First, calculate the absolute value of the difference of the two exponents, i.e. |E_A - E_B|, and set the exponent E_R of the result to the value of the larger of the two exponents.
Then, shift right the significand which corresponds to the smaller exponent by |E_A - E_B| places. Add or subtract the two significands S_A and S_B, according to the effective operation, and make the result positive if it is negative. Normalise S_R, adjusting E_R as appropriate, and round, which may require E_R to be readjusted. Clearly, this procedure is quite generic and various modifications exist.

Because addition is most frequent in scientific computations, our circuit aims at a low implementation cost combined with a low latency. The circuit is not pipelined, so that key components may be reused, and has a fixed latency of three clock cycles. Its overall organisation is shown in Figure 1.

In the first cycle, the operands A and B are unpacked and checks for zero, infinity or NaN are performed. For now we can assume that neither operand is infinity or NaN. Based on the sign bits s_A and s_B and the original operation, the effective operation when both operands are made positive is determined, e.g. (-|A|) - (-|B|) becomes -(|A| - |B|), which results in the same effective operation but with a sign inversion of the result. From this point, it can be assumed that both A and B are positive. The absolute difference |E_A - E_B| is calculated using two cascaded adders and a multiplexer. Both adders are fast ripple-carry adders, using the dedicated carry logic of the device (here, fast ripple-carry will always refer to such adders). Implicit in this is also the identification of the larger of the two exponents, and this provisionally becomes the exponent E_R of the result. The relation between the two operands A and B is determined based on the relation between E_A and E_B and by comparing the significands S_A and S_B, which is required if E_A=E_B. This significand comparison deviates from the generic algorithm given earlier but has certain advantages, as will be seen.
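The procedure can be modelled in software to make the role of the G, R and S bits concrete. The sketch below is a behavioural model under stated assumptions, not a description of the circuit: both operands are taken to be positive normalised doubles, overflow and underflow checks are omitted, and the names `unpack`, `pack` and `add_magnitudes` are illustrative. It swaps the operands so that the larger one is on the A path, aligns with a sticky bit, adds or subtracts, normalises, and applies round-to-nearest-even.

```python
import struct


def unpack(x):
    """Return (E, S): biased exponent and 53-bit significand of a positive normalised double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return (bits >> 52) & 0x7FF, (1 << 52) | (bits & ((1 << 52) - 1))


def pack(s, E, S):
    """Pack sign, biased exponent and significand; the hidden bit is dropped."""
    bits = (s << 63) | (E << 52) | (S & ((1 << 52) - 1))
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def add_magnitudes(a, b, subtract=False):
    """Compute a + b (or a - b) for positive normalised doubles a and b."""
    EA, SA = unpack(a)
    EB, SB = unpack(b)
    sR = 0
    if (EB, SB) > (EA, SA):                # B > A: swap so A holds the larger operand
        EA, SA, EB, SB = EB, SB, EA, SA
        sR = 1 if subtract else 0          # a - b with b > a is negative
    ER = EA                                # provisional exponent: the larger one
    SA <<= 3                               # append the G, R and S positions
    SB <<= 3
    d = EA - EB                            # alignment shift |E_A - E_B|
    lost = SB & ((1 << d) - 1)             # bits shifted out of the 56-bit pattern
    SB = (SB >> d) | (1 if lost else 0)    # OR them into the sticky position
    SR = SA - SB if subtract else SA + SB  # never negative, since A >= B
    if SR == 0:
        return 0.0                         # effective subtraction with A = B
    if SR >> 56:                           # carry-out: supernormal, value in [2, 4)
        SR = (SR >> 1) | (SR & 1)          # right shift by one, keeping the sticky information
        ER += 1
    else:                                  # normalised or subnormal
        shift = 56 - SR.bit_length()       # zeros above the leading 1
        SR <<= shift
        ER -= shift
    L, G, RS = (SR >> 3) & 1, (SR >> 2) & 1, SR & 3
    SR >>= 3                               # keep the top 53 bits
    if G and (RS or L):                    # round to nearest, ties to even
        SR += 1
        if SR >> 53:                       # rounding produced 10.00...0
            ER += 1                        # the fraction bits are already correct (all zero)
    return pack(sR, ER, SR)


print(add_magnitudes(0.1, 0.2), 0.1 + 0.2)
```

For inputs inside the supported range the model should agree bit for bit with the host's own double precision addition, which makes it convenient as a reference when checking a hardware implementation.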
The significand comparator was implemented using seven 8-bit comparators that operate in parallel and an additional 8-bit comparator which processes their outputs. All the comparators employ the fast carry logic of the device. If B>A then the significands S_A and S_B are swapped. Both significands are then extended to 56 bits, i.e. by the G, R and S bits as discussed earlier, and are stored in registers. Swapping S_A and S_B is equivalent to swapping A and B and making an adjustment to the sign s_R. This swapping requires only multiplexers.

In the second cycle, the significand alignment shift is performed and the effective operation is carried out. The advantage of swapping the significands is that it is always S_B which will undergo the alignment shift. Hence, only the S_B path needs a shifter. A modified 6-stage barrel shifter wired for alignment shifts performs the alignment. Each stage in the barrel shifter can clear the bits which rotate back from the MSB to achieve the alignment. Also, each stage calculates the OR of the bits that are shifted out and cleared. This allows the sticky bit S to be calculated as the OR of these six partial sticky bits along with the value that is in the sticky bit position of the output pattern. The barrel shifter is organised so that the 32-bit stage is followed by the 16-bit stage and so on, so that the large ORs do not become a significant factor with respect to the speed of the shifter. The shifter output is the 56-bit significand S_B, aligned and with the correct values in the G, R and S positions.

[Figure 1. Floating-point adder]

[Figure 2. Leading-1 detection]

A fast ripple-carry adder then calculates either S_A+S_B or S_A-S_B according to the effective operation. The advantage of the significand comparison earlier is that the result of this operation will never be negative, since S_A ≥ S_B after alignment. The result of this operation is the provisional significand S_R of the result and is routed back to the S_B path.

It is clear that S_R will not necessarily be normalised. More specifically, setting aside S_R=0, there are three cases: (a) S_R is normalised, (b) S_R is subnormal and requires a left shift by one or more places and (c) S_R is supernormal and requires a right shift by just one place. For the first two cases, a leading-1 detection component examines S_R and calculates the appropriate normalisation shift size, equal to the number of 0 bits that are above the leading 1. Figure 2 shows a straightforward design for a leading-1 detector. The 56-bit leading-1 detector comprises seven 8-bit components and some simple connecting logic. For the final normalisation case, i.e. a supernormal S_R in the range [2,4), the output of the leading-1 detector is overridden. This case is easily detected by monitoring the carry-out of the significand adder. Finally, if S_R=0 then normalisation is, obviously, inapplicable. In this case, the leading-1 detector produces a zero shift size and S_R is treated as a normalised significand. The normalisation of S_R takes place in the third and final cycle of the operation.
S_R is normalised by the alignment barrel shifter, which is wired for right shifts. If S_R is normalised, then it passes straight through the shifter unmodified. If S_R is subnormal, it is shifted right and rotated by the appropriate number of places so that the normalisation left shift can be achieved. If S_R is supernormal, it is right shifted by one place, feeding a 1 at the MSB, and the sticky bit S is recalculated. The output of the shifter is the normalised 56-bit S_R with the correct values in the G, R and S positions. Then, rounding is performed as discussed earlier. The rounding addition is performed by the significand adder, with S_R on the S_B path. The normalised and rounded S_R is given by the top 53 bits of the result, i.e. MSB down to the L bit, of which the MSB will become hidden during packing. A complication that could arise is the rounding addition producing the supernormal significand S_R=10.000…0. However, no actual normalisation takes place because the bits MSB-1 down to L already represent the correct fraction f_R for packing.

Each time S_R requires normalisation, the exponent E_R needs to be adjusted. This relies on the same hardware used for the processing of E_A and E_B in the first cycle. One adder performs the adjustment arising from the normalisation of S_R. If S_R is normalised, E_R passes through unmodified. If S_R is subnormal, E_R is reduced by the left shift amount required for the normalisation. If S_R is supernormal, E_R is incremented by 1. The second cascaded adder increments this result by 1. The two results are multiplexed. If the rounded S_R is supernormal then the result of the second adder is the correct E_R. Otherwise, the result of the first adder is the correct E_R.

The calculation of the sign s_R is performed in the first cycle and is trivial, requiring only a few logic gates. Checks are also performed on the final E_R to detect an overflow or underflow, whereby R is forced to the appropriate patterns before packing.
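The shared shifter and the leading-1 detector can be sketched behaviourally as follows. This is an illustrative model only: the function names are hypothetical, the clamping of shift amounts above 63 is an assumption about how larger exponent differences are handled, and the bit-level organisation of the real components is not reproduced. Each of the six stages rotates right by a power of two, clears the bits that wrapped around from the LSB end to the MSB end, and contributes a partial sticky bit. The same rotate-right wiring yields a left shift by n when rotating by 56-n places.

```python
WIDTH = 56
MASK = (1 << WIDTH) - 1


def rotate_right(x, k):
    """Rotate a 56-bit pattern right by k places."""
    return ((x >> k) | (x << (WIDTH - k))) & MASK


def align_shift(sig, amount):
    """Right shift with sticky, built from rotate stages of 32, 16, 8, 4, 2 and 1 places."""
    amount = min(amount, 63)             # assumption: larger shifts saturate
    sticky = 0
    for k in (32, 16, 8, 4, 2, 1):
        if amount & k:
            wrapped = sig & ((1 << k) - 1)            # bits that would rotate back to the top
            sticky |= int(wrapped != 0)               # partial sticky bit of this stage
            sig = rotate_right(sig, k) & (MASK >> k)  # clear the k wrapped MSBs
    return sig | sticky                  # OR into the sticky position of the output


def leading_zeros(sig):
    """Leading-1 detection from seven 8-bit pieces; a zero input gives a zero shift size."""
    for i in range(7):                   # piece 0 is the most significant byte
        byte = (sig >> (WIDTH - 8 * (i + 1))) & 0xFF
        if byte:
            return 8 * i + (8 - byte.bit_length())
    return 0


def normalise_left(sig):
    """Left shift by the detected amount, expressed as a right rotation."""
    n = leading_zeros(sig)
    return (rotate_right(sig, WIDTH - n) if n else sig), n


print(bin(align_shift(0b1011, 2)), normalise_left(1 << 40))
```

Placing the 32-place stage first means the widest OR operates on the unshifted input, which mirrors the ordering argument given for the hardware shifter.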
Another check that is performed is for an effective subtraction with A=B, whereby R is set to a positive zero, according to the IEEE standard. Finally, infinity or NaN operands result in an infinity or NaN value for R according to a set of rules. These are not included here but can be found in [9].

Table 1 shows the implementation statistics of the double precision floating-point adder on a XILINX XCV1000 Virtex FPGA device of -6 speed grade. At 5.49% usage, the circuit is quite small. These figures also include 194 I/O synchronization registers. The circuit can operate at up to ~25MHz, the critical path lying on the significand processing path and comprising 41.1% logic and 58.9% routing delays. Since the design is not pipelined and has a latency of three cycles, this gives rise to a performance of ~8.33 MFLOPS. Obviously, the circuit is small enough to allow multiple instances to be incorporated in a single FPGA if required.

4. Multiplication

The most significant steps in the calculation of the product R of two numbers A and B are as follows. Calculate the exponent of the result as E_R=E_A+E_B-E_bias. Multiply S_A and S_B to obtain the significand S_R of the result. Normalise S_R, adjusting E_R as appropriate, and round, which may require E_R to be readjusted.

After addition and subtraction, multiplication is the most frequent operation in scientific computing. Thus, our double precision floating-point multiplier aims at a low implementation cost while maintaining a low latency, relative to the scale of the significand multiplication involved. The circuit is not pipelined and has a latency of ten cycles. Unlike the floating-point adder, which operates on a single clock, this circuit operates on two clocks: a primary clock (CLK1), to which the ten cycle latency corresponds, and an internal secondary clock (CLK2), which is twice as fast as the primary clock and is used by the significand multiplier. Figure 3 shows the overall organisation of this circuit.
In the first CLK1 cycle, the operands A and B are unpacked and checks are performed for zero, infinity or NaN operands. For now we can assume that neither operand is zero, infinity or NaN. The sign s_R of the result is easily determined as the XOR of the signs s_A and s_B. From this point, A and B can be considered positive. As the first step in the calculation of E_R, the sum E_A+E_B is calculated using a fast ripple-carry adder.

In the second CLK1 cycle, the excess bias is removed from E_A+E_B using the same adder that was used for the exponent processing of the previous cycle. This completes the calculation of E_R. The significand multiplication also begins in the second CLK1 cycle. Since both S_A and S_B are 53-bit normalised numbers, S_R will initially be 106 bits long and in the range [1,4). The significand multiplier is based on the Modified Booth's 2-bit parallel multiplier recoding method and has been implemented using a serial carry-save adder array and a fast ripple-carry adder for the assimilation of the final carry bits into the final sum bits.

[Figure 3. Floating-point multiplier]

The carry-save array contains two cascaded carry-save adders, which retire four sum and two carry bits in each CLK2 cycle. For the 53-bit S_A and S_B, 14 CLK2 cycles are required to produce all the sum and carry bits, i.e. until the end of the eighth CLK1 cycle. Alongside the carry-save array, a 4-bit fast ripple-carry adder performs a partial assimilation, i.e. processes the four sum and two carry bits produced in the previous carry-save addition cycle, taking a carry-in from the previous partial assimilation. Taking into account the logic levels associated with the generation of the partial products required by Booth's method and the carry-save additions, the latency of this component is so small that it has no effect on the clocking of the carry-save array, while it greatly accelerates the subsequent carry-propagate addition.
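The recoding step can be illustrated with a short software sketch. It shows only the arithmetic of 2-bit (radix-4) Booth recoding for an unsigned multiplier; the carry-save array and the partial assimilation adder are not modelled, and the function names are illustrative. Each digit is taken from an overlapping triplet of multiplier bits and lies in {-2, -1, 0, 1, 2}, so every partial product is 0, ±S_A or ±2S_A.

```python
def booth_digits(multiplier, nbits):
    """Radix-4 Booth digits of an unsigned nbits-wide multiplier, least significant first."""
    digits = []
    prev = 0                                # implicit 0 to the right of the LSB
    for i in range(0, nbits + 1, 2):        # one extra position covers the leading zero
        b0 = (multiplier >> i) & 1
        b1 = (multiplier >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)   # digit value in {-2, -1, 0, 1, 2}
        prev = b1
    return digits


def booth_multiply(a, b, nbits=53):
    """Sum the recoded partial products; digit j carries weight 4**j."""
    acc = 0
    for j, d in enumerate(booth_digits(b, nbits)):
        acc += (d * a) << (2 * j)
    return acc


a = (1 << 52) | 0xF0F0F0F0F0F0F
b = (1 << 53) - 1
print(len(booth_digits(b, 53)), booth_multiply(a, b) == a * b)
```

A 53-bit multiplier recodes into 27 digits, which is consistent with the 14 CLK2 cycles quoted above if two partial products are absorbed per cycle.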
The results of these partial assimilations need not be stored; all that needs to be stored is their OR, since they would eventually have been ORed into the sticky bit S.

In the ninth CLK1 cycle, the final sum and carry bits produced by the carry-save array are added together, taking a carry-in from the last partial assimilation. Since S_R is in the range [1,4), it can be written as y_1y_2.y_3y_4y_5…y_104y_105y_106. Bits y_1 to y_56 are produced by this final carry assimilation. Bits y_57 to y_106 do not exist as such; all we have is their OR, which we can write as y_57+, calculated during the partial assimilations of the previous cycles. Now, if y_1=0, then S_R is normalised and the 56-bit S_R for rounding is given by y_2.y_3y_4y_5…y_54y_55y_56y_57+. If y_1=1, S_R is supernormal and requires a 1-bit right shift for normalisation, and the final 56-bit S_R for rounding is given by y_1.y_2y_3y_4…y_53y_54y_55y_56+, where y_56+ is the OR of y_56 and y_57+. The normalisation of a supernormal S_R is achieved using multiplexers switched by y_1. If S_R is supernormal, E_R is adjusted, i.e. incremented by 1, using the same adder that was previously used for the exponent processing. This adjustment does not take place after it is determined that S_R is supernormal; it is performed at the beginning of this cycle, and the adjusted E_R then either replaces the old E_R or is discarded. The rounding decision is also made in the ninth CLK1 cycle, without waiting for the final carry assimilation to finish. That is, a rounding decision is reached for both a normal and a supernormal S_R. Then, the correct decision is chosen once y_1 has been calculated.

In the tenth and final CLK1 cycle, the rounding of S_R is performed using the same fast ripple-carry adder that is used by the significand multiplier. The result of this addition is the final normalised and rounded 53-bit significand S_R. As for the adder of the previous section, the complication that might arise is a supernormal S_R after rounding. As before, no actual normalisation is needed because it would not change the fraction f_R for packing. The exponent E_R, however, is adjusted, i.e. incremented by 1, using the same exponent processing adder. This adjustment is performed at the beginning of this cycle and then the correct E_R is chosen between the previous and the adjusted E_R based on the final S_R. Checks are also performed on the final E_R to detect an overflow or underflow, whereby R is forced to the correct bit patterns before packing. Finally, zero, infinity or NaN operands result in a zero, infinity or NaN value for R according to a simple set of rules.

Table 1 shows the implementation statistics of the double precision floating-point multiplier. The circuit is quite small, occupying only 4.03% of the device. The figures also include 193 I/O synchronization registers.
The primary clock CLK1 can be set to a frequency of up to ~40MHz, its critical path comprising 36.4% logic and 63.6% routing delays, while the secondary clock CLK2 can be set to a frequency of up to ~75MHz, its critical path comprising 36.8% logic and 63.2% routing delays. Since the circuit is not pipelined and has a fixed latency of ten CLK1 cycles, a frequency of 37MHz for CLK1 and 74MHz for CLK2 gives rise to a performance in the order of 3.7 MFLOPS. Obviously, the circuit is small enough to allow multiple instances to be placed in a single chip.

5. Division

In general, division is much less frequent than the previous operations. The most significant steps in the calculation of the quotient R of two numbers A (the dividend) and B (the divisor) are as follows. Calculate the exponent of the result as E_R=E_A-E_B+E_bias. Divide S_A by S_B to obtain the significand S_R. Normalise S_R, adjusting E_R as appropriate, and round, which may require E_R to be readjusted. Our double precision floating-point divider aims solely at a low implementation cost. A non-pipelined design is adopted, incorporating an economic significand divider, with a fixed latency of 60 clock cycles. Figure 4 shows the overall organisation of this circuit.

[Figure 4. Floating-point divider]

In the first cycle, the operands A and B are unpacked. For now, we can assume that neither operand is zero, infinity or NaN. The sign s_R of the result is the XOR of s_A and s_B. As the first step in the calculation of E_R, the difference E_A-E_B is calculated using a fast ripple-carry adder.

In the second cycle, the bias is added to E_A-E_B, using the same exponent processing adder of the previous cycle. This completes the calculation of E_R. The significand division also begins in the second cycle.
The division algorithm employed here is the simple non-performing sequential algorithm and the division proceeds as follows. First, the remainder of the division is set to the value of the dividend S_A. The divisor S_B is subtracted from the remainder. If the result is positive or zero, the MSB of the quotient S_R is 1 and this result replaces the remainder. Otherwise, the MSB of S_R is 0 and the remainder is not replaced. The remainder is then shifted left by one place. The divisor S_B is subtracted from the remainder for the calculation of MSB-1 and so on. The significand divider calculates one bit of S_R per cycle and its main components are two registers for the remainder and the divisor, a fast ripple-carry adder, and a shift register for S_R. The divider operates for 55 cycles, i.e. during the cycles 2 to 56, and produces a 55-bit S_R, the two least significant bits being the G and R bits. In cycle 57, the sticky bit S is calculated as the OR of all the bits of the final remainder. Since both S_A and S_B are normalised, S_R will be in the range (0.5,2), i.e. if not already normalised, it will require a left shift by just one place. This normalisation is also performed in cycle 57. No additional hardware is required, since S_R is already stored in a shift register. If S_R requires normalisation, the exponent E_R is decremented by 1 in cycle 58, to compensate for the left shift. This exponent adjustment is performed using the same adder that is used for the exponent processing of the previous cycles. Also in cycle 58, S_R is transferred to the divisor register, which is connected to the adder of the significand divider. In cycle 59, the 56-bit S_R is rounded to 53 bits using the significand divider adder. For a supernormal S_R after rounding no normalisation is actually required, but the exponent E_R is incremented by 1. This takes place in cycle 60, using the same adder that is used for the exponent processing of the previous cycles. Checks are also performed on E_R for an overflow or underflow, whereby the result R is appropriately set before packing. As for zero, infinity and NaN operands, R will also be zero, infinity or NaN according to a simple set of rules.

Table 1 shows the implementation statistics of the double precision floating-point divider. The circuit is very small, occupying only 2.79% of the device, which also includes 193 I/O synchronization registers. This circuit can operate at up to ~60MHz, the critical path comprising 42.8% logic and 57.2% routing delays. Since the design is not pipelined and has a fixed latency of 60 clock cycles, this gives a performance in the order of 1 MFLOPS. As for the previous circuits, the implementation considered here is small enough to allow multiple instances to be incorporated in a single FPGA device if needed.

6. Square Root

The square root function is much less frequent than the previous operations. Thus, our floating-point square root circuit aims solely at a low implementation cost. A non-pipelined design is adopted with a fixed latency of 59 cycles.
Figure 5 shows the organisation of this circuit. With the circuit considered here, the calculation of the square root R of the floating-point number A proceeds as follows. In the first cycle, the operand A is unpacked. For now we can assume that A is positive and not zero, infinity or NaN. The biased exponent E_R of the result is calculated directly from the biased exponent E_A using [9]

\[
E_R = \begin{cases} \dfrac{E_A + 1022}{2} & \text{if } E_A \text{ is even (and left shift } S_A \text{ one place)} \\[2ex] \dfrac{E_A + 1023}{2} & \text{if } E_A \text{ is odd} \end{cases} \qquad (1)
\]

The numerator is calculated using a fast ripple-carry adder, while the division by 2 is just a matter of discarding the LSB of the numerator, which will always be even.

[Figure 5. Floating-point square root]

The calculation of the significand S_R, i.e. of the square root of S_A, starts in the second clock cycle. According to (1), S_A will be in the range [1,4). Consequently, its square root S_R will be in the range [1,2), i.e. it will always be normalised. Denoting S_R as y_1.y_2y_3y_4…, each bit y_n is calculated using [9]

\[
y_n = \begin{cases} 1 & \text{if } (X_n - T_n) \ge 0 \\ 0 & \text{otherwise} \end{cases}, \qquad
X_{n+1} = \begin{cases} 2(X_n - T_n) & \text{if } y_n = 1 \\ 2X_n & \text{if } y_n = 0 \end{cases}, \qquad
T_{n+1} = y_1.y_2 y_3 \ldots y_n 01 \qquad (2)
\]

with X_1 = S_A/2, T_1 = 0.1 (in binary) and n=1,2,3,…

From (2) it can be seen that the adopted square root calculation algorithm is quite similar to the division method examined in the previous section. Based on this algorithm, a significand square root circuit calculates one bit of S_R per clock cycle. The main components of this part of the circuit are two registers for X_n and T_n, and a fast ripple-carry adder. The T_n register has been implemented so that each flip-flop has its own enable signal, which allows each individual bit y_n to be stored in the correct position in the register and also controls the reset and set inputs of the two next flip-flops in the register so that the correct T_n is formed for each cycle of the process.
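The similarity between the two bit-serial recurrences can be seen by modelling them side by side. The following is a behavioural sketch with integer significands, not a description of the registers involved; the function names and the fixed-point scaling constant K are assumptions made for illustration. Both loops try a subtraction, keep the result only when it is non-negative, shift the remainder left by one place, and produce 55 result bits plus a sticky bit taken from the final remainder.

```python
def divide_significands(sa, sb, nbits=55):
    """Non-performing division of 53-bit significands; returns (quotient bits, sticky)."""
    rem, q = sa, 0
    for _ in range(nbits):
        diff = rem - sb
        if diff >= 0:               # subtraction succeeds: quotient bit is 1
            q = (q << 1) | 1
            rem = diff
        else:                       # remainder is left untouched: quotient bit is 0
            q <<= 1
        rem <<= 1                   # shift the remainder left by one place
    return q, int(rem != 0)         # sticky = OR of the final remainder bits


def sqrt_significand(sa, nbits=55, K=60):
    """Square root by recurrence (2); sa has 52 fraction bits and a value in [1, 4)."""
    x = sa << (K - 53)              # X_1 = S_A / 2, held with K fraction bits
    q = 0                           # root bits y_1 ... y_n found so far
    for n in range(nbits):          # n = number of root bits already known
        # T_{n+1} = y_1.y_2...y_n01, i.e. the partial root followed by the bits 0 and 1
        t = (q << (K - n + 1)) | (1 << (K - n - 1))
        if x >= t:
            x = (x - t) << 1        # y_{n+1} = 1
            q = (q << 1) | 1
        else:
            x <<= 1                 # y_{n+1} = 0
            q <<= 1
    return q, int(x != 0)           # sticky = OR of the final remainder bits


print(divide_significands(3 << 51, 1 << 52))   # 1.5 / 1.0
print(sqrt_significand(9 << 50))               # sqrt(2.25)
```

In both functions the most significant of the 55 returned bits has weight 1, so the two least significant bits correspond to the G and R positions.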
After the significand square root calculation process is complete, the contents of register T_n form the significand S_R for rounding. Thus, the significand square root circuit operates for 55 cycles, i.e. during the clock cycles 2 to 56, and produces a 55-bit S_R. The last two bits are the guard bit G and the round bit R. In cycle 57, the sticky bit S is calculated as the OR of all the bits of the final remainder of the significand square root calculation. Thus, a 56-bit S_R for rounding is formulated. In cycle 58, S_R is rounded to 53 bits using the same adder that is used by the significand square root circuit. For a supernormal S_R after rounding, no actual normalisation is required but the exponent E_R is incremented by 1. This adjustment takes place in cycle 59 and is performed using the same adder that is used for the exponent processing during the first cycle. From the definition of E_R in (1) it is easy to see that an overflow or underflow will never occur. Finally, for zero, infinity, NaN or negative operands, simple rules apply with regard to the result R.

Table 1 shows the implementation statistics of our double precision floating-point square root function. The circuit is very small, occupying only 2.82% of the device, which includes 129 I/O synchronization registers. The circuit can operate at up to ~80MHz, the critical path comprising 53.0% logic and 47.0% routing delays. Since the circuit is not pipelined and has a fixed latency of 59 clock cycles, this gives rise to a performance in the order of 1.36 MFLOPS.

7. Discussion

We have presented low cost FPGA floating-point arithmetic circuits in the 64-bit double precision format and for all the common operations. Such circuits can be extremely useful in the FPGA-based implementation of complex systems that benefit from the reprogrammability and parallelism of the FPGA device but also require a general purpose arithmetic unit. In our case, these circuits were used in the implementation of a high-speed object recognition system which relies partly on custom parallel processing structures and partly on floating-point processing. The implementation statistics of the operators show that they are very economical in relation to contemporary FPGAs, which also facilitates the incorporation of multiple instances of the desired operators into the same chip.
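The performance figures quoted for the four operators follow directly from the clock frequency and the fixed latency, since a non-pipelined circuit completes one operation every `latency` cycles. A small worked check, using the frequencies and latencies stated in Sections 3 to 6 (the helper name `mflops` is illustrative):

```python
def mflops(clock_mhz: float, latency_cycles: int) -> float:
    """Throughput of a non-pipelined operator: one result per latency period."""
    return clock_mhz / latency_cycles


# (clock in MHz, latency in cycles) as quoted in the text
operators = {
    "adder": (25.0, 3),
    "multiplier": (37.0, 10),    # CLK1 frequency used for the quoted figure
    "divider": (60.0, 60),
    "square root": (80.0, 59),
}

for name, (clock, latency) in operators.items():
    print(f"{name}: {mflops(clock, latency):.2f} MFLOPS")
```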
Although non-pipelined circuits were considered here to achieve low circuit costs, the adder and multiplier, with a latency of three and ten cycles respectively, are suitable for pipelining to increase their throughput. For the divider and square root operators, pipelining the existing designs may not be the most efficient option and different designs and/or algorithms should be considered, e.g. a high radix SRT algorithm for division [6]. Clearly, there are significant speed and circuit size tradeoffs to consider when deciding on the range and precision of FPGA floating-point arithmetic circuits.

A direct comparison with other floating-point unit implementations is very difficult to perform, not only because of floating-point format differences, but also due to other circuit characteristics, e.g. all the circuits presented here incorporate I/O registers, which would eventually be absorbed by the surrounding hardware. As an indication, and with some caution, the double precision floating-point adder presented here occupies approximately the same number of slices as the single precision floating-point adder in [7]. That circuit has a higher latency than the adder presented here, but also a higher maximum clock speed, which results in both circuits having approximately the same performance. The adder of [7], however, is fully pipelined and has a much higher peak throughput.

In conclusion, the circuits presented here provide an indication of the costs of FPGA floating-point operators using a long format. The choice of floating-point format for any given problem ultimately rests with the designer.

References

1. Shirazi, N., Walters, A., Athanas, P., Quantitative Analysis of Floating Point Arithmetic on FPGA Based Custom Computing Machines, In Proc. IEEE Symposium on FPGAs for Custom Computing Machines, 1995.
2. Loucas, L., Cook, T.A., Johnson, W.H., Implementation of IEEE Single Precision Floating Point Addition and Multiplication on FPGAs, In Proc. IEEE Symposium on FPGAs for Custom Computing Machines, 1996.
3. Li, Y., Chu, W., Implementation of Single Precision Floating Point Square Root on FPGAs, In Proc. 5th IEEE Symposium on Field Programmable Custom Computing Machines, 1997.
4. Ligon, W.B., McMillan, S., Monn, G., Schoonover, K., Stivers, F., Underwood, K.D., A Re-Evaluation of the Practicality of Floating Point Operations on FPGAs, In Proc. IEEE Symposium on FPGAs for Custom Computing Machines, 1998.
5. Belanovic, P., Leeser, M., A Library of Parameterized Floating-Point Modules and Their Use, In Proc. Field Programmable Logic and Applications, 2002.
6. Wang, X., Nelson, B.E., Tradeoffs of Designing Floating-Point Division and Square Root on Virtex FPGAs, In Proc. 11th IEEE Symposium on Field-Programmable Custom Computing Machines, 2003.
7. Digital Core Design, Alliance Core Data Sheets for Floating-Point Arithmetic, 2001.
8. ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic.
Goldberg, D., What Every Computer Scientist Should Know About Floating-Point Arithmetic, ACM Computing Surveys, 1991, vol. 23, no. 1.
Paschalakis, S., Moment Methods and Hardware Architectures for High Speed Binary, Greyscale and Colour Pattern Recognition, Ph.D. Thesis, Department of Electronics, University of Kent at Canterbury, UK, 2001.
