
Floating Point Arithmetic

Floating point numbers are used in many applications. Arithmetic units such as adders and multipliers are more complex to implement for floating point numbers than for fixed point numbers. The sections below cover the representation of floating point numbers, the algorithm for floating point multiplication, a VHDL implementation of a floating point multiplier, and the procedures for floating point addition, subtraction and division.

REPRESENTATION OF FLOATING POINT NUMBERS

Data type representations in VHDL: Floating Point Types

A floating point type has a set of values in a given range of real numbers. Examples of floating point type declarations are

    type TTL_VOLTAGE is range -5.5 to -1.4;
    type REAL_DATA is range 0.0 to 31.9;

An example of an object declaration is

    variable LENGTH: REAL_DATA range 0.0 to 15.9;
    ...
    variable L1, L2, L3: REAL_DATA range 0.0 to 15.9;

LENGTH is a variable object of type REAL_DATA that has been constrained to take real values in the range 0.0 through 15.9 only. Notice that in this case the range constraint was specified in the variable declaration itself. Alternatively, it is possible to declare a subtype and then use this subtype in the variable declarations, as shown.

    subtype RD16 is REAL_DATA range 0.0 to 15.9;
    ...
    variable LENGTH: RD16;
    ...
    variable L1, L2, L3: RD16;

The range bounds specified in a floating point type declaration must be constants or locally static expressions.

Floating point literals are values of a floating point type. Examples of floating point literals are

    16.26    0.0    0.002    3_1.4_2

Floating point literals differ from integer literals by the presence of the dot (.) character. Thus 0 is an integer literal while 0.0 is a floating point literal. Floating point literals can also be expressed in exponential form. The exponent represents a power of ten and the exponent value must be an integer. Examples are

    62.3E-2    5.0E+2

Integer and floating point literals can also be written in a base other than 10 (decimal). The base can be any value between 2 and 16. Such literals are called based literals. In this case, the exponent represents a power of the specified base. The syntax for a based literal is

    base # based-value #               -- form 1
    base # based-value # E exponent    -- form 2

Examples are

    2#101_101_000#  represents $(101101000)_2 = 360$ in decimal,
    16#FA#          represents $(FA)_{16} = (11111010)_2 = 250$ in decimal,
    16#E#E1         represents $(E)_{16} \times 16^{1} = 14 \times 16 = 224$ in decimal,
    2#110.01#       represents $(110.01)_2 = 6.25$ in decimal.

The base and the exponent values in a based literal must be in decimal notation.

The only predefined floating point type is REAL. The range of REAL is implementation dependent, but it must at least cover the range -1.0E38 to +1.0E38 and it must allow for at least six decimal digits of precision.
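The based-literal rules above can be checked with a short Python sketch. The helper names `based_literal_value` and `_digit` are hypothetical and belong to no VHDL tool; the sketch only covers the two forms listed above and makes no attempt to reproduce every rule of the VHDL grammar.

```python
import re

# Hypothetical helper for illustration: evaluates literals of the forms
#   base#based-value#   and   base#based-value#Eexponent
_BASED = re.compile(r"(\d+)#([0-9A-Fa-f_.]+)#(?:[Ee]([+-]?\d+))?")


def _digit(ch, base):
    """Return the value of one extended digit, rejecting digits >= base."""
    d = int(ch, 16)
    if d >= base:
        raise ValueError(f"digit {ch!r} is not valid in base {base}")
    return d


def based_literal_value(text):
    match = _BASED.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a based literal: {text!r}")
    base = int(match.group(1))                 # the base is written in decimal
    if not 2 <= base <= 16:
        raise ValueError("base must be between 2 and 16")
    digits = match.group(2).replace("_", "")   # underscores only separate digits
    exponent = int(match.group(3) or 0)        # the exponent is written in decimal
    whole, _, frac = digits.partition(".")
    value = 0
    for ch in whole:                           # integer part, most significant first
        value = value * base + _digit(ch, base)
    weight = 1.0
    for ch in frac:                            # fractional part: base^-1, base^-2, ...
        weight /= base
        value += _digit(ch, base) * weight
    return value * base ** exponent            # exponent is a power of the base


print(based_literal_value("2#101_101_000#"))   # 360
print(based_literal_value("16#FA#"))           # 250
print(based_literal_value("16#E#E1"))          # 224
print(based_literal_value("2#110.01#"))        # 6.25
```

The four calls reproduce the four examples given in the text.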

Computation of floating point values

Consider a floating point number $N = F \times 2^{E}$. Assume that the fraction F and the exponent E are allocated 4 bits each.

Example 1. Compute the value of the floating point number N when F = 0.101 and E = 0101.

Solution: In F the MSB is the sign bit, so 0 indicates a positive number. The magnitude of F is calculated as shown in the table.

    Bit weight     Sign (MSB)   2^-1 = 0.5   2^-2 = 0.25   2^-3 = 0.125
    Digits of F    0.           1            0             1

Inference: positive fraction, $F = 0.5 + 0.125 = 0.625$, i.e., $F = \tfrac{1}{2} + \tfrac{1}{8} = \tfrac{5}{8}$.

Similarly, the magnitude of E is calculated as shown in the table.

    Bit weight     Sign (MSB)   2^2 = 4   2^1 = 2   2^0 = 1
    Digits of E    0            1         0         1

Inference: positive exponent, $E = 4 + 1 = 5$.

Hence the value of the floating point number is $N = F \times 2^{E} = +\tfrac{5}{8} \times 2^{5}$.

Example 2. Repeat the above example for F = 1.011 and E = 1011.

Note: for negative numbers represented in 2's complement form, the value of the number is computed as (value of the sign bit, taken as negative) + (magnitude of the rest of the bits).

The value of F is calculated as shown in the table.

    Bit weight     Sign (MSB) 2^0 = 1   2^-1 = 0.5   2^-2 = 0.25   2^-3 = 0.125
    Digits of F    1.                   0            1             1

Inference: negative fraction, $F = -1 + (0.25 + 0.125) = -1 + 0.375 = -0.625$, i.e., $F = -1 + (\tfrac{1}{4} + \tfrac{1}{8}) = -1 + \tfrac{3}{8} = -\tfrac{5}{8}$.

Similarly, the value of E is calculated as shown in the table below.

    Bit weight     Sign (MSB) 2^3 = 8   2^2 = 4   2^1 = 2   2^0 = 1
    Digits of E    1                    0         1         1

Inference: negative exponent, $E = -8 + (2 + 1) = -5$.

Hence the value of the floating point number is $N = F \times 2^{E} = -\tfrac{5}{8} \times 2^{-5}$.
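Both worked examples can be verified with a minimal Python sketch. The function names are hypothetical; fractions are passed as bit strings of the form `s.fff` and exponents as plain bit strings, and exact rational arithmetic is used so that no rounding hides a mistake.

```python
from fractions import Fraction


def twos_complement(bits):
    """Integer value of a 2's complement bit string, e.g. '1011' -> -5."""
    value = int(bits, 2)
    if bits[0] == "1":                 # sign bit set: subtract 2^(number of bits)
        value -= 1 << len(bits)
    return value


def fraction_value(bits):
    """Value of a 2's complement fraction written as 's.fff'."""
    digits = bits.replace(".", "")
    # The binary point sits right after the sign bit, so divide by 2^(n-1).
    return Fraction(twos_complement(digits), 1 << (len(digits) - 1))


def float_value(f_bits, e_bits):
    """N = F x 2^E for a fraction bit string and an exponent bit string."""
    return fraction_value(f_bits) * Fraction(2) ** twos_complement(e_bits)


print(fraction_value("0.101"), twos_complement("0101"))   # 5/8 5
print(fraction_value("1.011"), twos_complement("1011"))   # -5/8 -5
print(float_value("0.101", "0101"))                       # 20
print(float_value("1.011", "1011"))                       # -5/256
```

The last two lines are $\tfrac{5}{8} \times 2^{5} = 20$ and $-\tfrac{5}{8} \times 2^{-5} = -\tfrac{5}{256}$.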

Normalization of floating point numbers

In order to utilize all the bits in F and have the maximum number of significant figures, F should be normalized so that its magnitude is as large as possible. If F is not normalized, we can normalize it by shifting it left until the sign bit and the next bit are different. Shifting F left is equivalent to multiplying by 2, so every time we shift we must decrement E by 1 to keep N the same. After normalization the magnitude of F will be as large as possible, since any further shifting would change the sign bit. In the following examples, F is unnormalized to start with and then it is normalized by shifting left.
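A minimal Python sketch of the normalization rule follows. It assumes the fraction is held as a signed integer `f` whose value is `f / 8` (a 4-bit 2's complement fraction), and it ignores exponent underflow; the function name and this integer encoding are choices made for illustration only.

```python
def normalize(f, e, bits=4):
    """Shift f left until the sign bit and the next bit differ.

    f is a signed integer encoding a 2's complement fraction with
    bits-1 fraction bits (value = f / 2**(bits-1)); e is the exponent.
    """
    if f == 0:
        return 0, e                    # zero cannot be normalized
    half = 1 << (bits - 2)             # encoding of 0.100 (4 when bits = 4)
    # Sign bit equals the next bit exactly when -1/2 <= value < 1/2.
    while -half <= f < half:
        f *= 2                         # shift left = multiply by 2
        e -= 1                         # decrement E so that N is unchanged
    return f, e


print(normalize(1, 3))     # 0.001 x 2^3 -> 0.100 x 2^1, printed as (4, 1)
print(normalize(-2, 0))    # 1.110 x 2^0 -> 1.000 x 2^-2, printed as (-8, -2)
```

In both calls the value of N is preserved: $\tfrac{1}{8} \times 2^{3} = \tfrac{1}{2} \times 2^{1}$ and $-\tfrac{1}{4} \times 2^{0} = -1 \times 2^{-2}$.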

Representations for number 0 (zero)

Zero can be represented with a 4-bit exponent and a 4-bit fraction as $0.000 \times 2^{-8}$, as shown in the tables below. Zero cannot be normalized, so F = 0.000 when N = 0. Any exponent could then be used; however, it is best to have a uniform representation of 0. We associate the fraction 0 with the negative exponent of largest magnitude. In a 4-bit 2's complement integer number system, the most negative number is 1000, which represents -8. Thus when F and E are 4 bits, 0 is represented by $0.000 \times 2^{-8}$. (A more negative exponent implies a smaller number. For example, $2^{-2} = \tfrac{1}{4} = 0.25$ is smaller than $2^{-1} = \tfrac{1}{2} = 0.5$.)

The smallest magnitude of F is calculated as shown in the table.

    Bit weight     Sign (MSB) 2^0 = 1   2^-1 = 0.5   2^-2 = 0.25   2^-3 = 0.125
    Digits of F    0.                   0            0             0

Inference: positive fraction, $F = 0$.

Similarly, the most negative value of E is calculated as shown in the table below.

    Bit weight     Sign (MSB) 2^3 = 8   2^2 = 4   2^1 = 2   2^0 = 1
    Digits of E    1                    0         0         0

Inference: negative exponent, $E = -8 + 0 = -8$.

The smallest nonzero positive number that can be represented with a 4-bit exponent and a 4-bit fraction is $0.001 \times 2^{-8} = 0.125 \times 2^{-8}$.

Floating point operations

Floating point Addition

Consider the design of an adder for floating point numbers. Two floating point numbers will be added to form a floating point sum:

$$(F_1 \times 2^{E_1}) + (F_2 \times 2^{E_2}) = F \times 2^{E}$$

Assume that the numbers to be added are properly normalized and that the answer should be put in normalized form. In order to add two fractions, the associated exponents must be equal. Thus, if the exponents $E_1$ and $E_2$ are different, we must unnormalize one of the fractions and adjust the exponent accordingly. To illustrate the process, we add

$$F_1 \times 2^{E_1} = 0.111 \times 2^{5} \quad \text{and} \quad F_2 \times 2^{E_2} = 0.101 \times 2^{3}$$

Since $E_2$ and $E_1$ are different, we unnormalize $F_2$ by shifting right two times and adding 2 to the exponent:

$$0.101 \times 2^{3} = 0.0101 \times 2^{4} = 0.00101 \times 2^{5}$$

Note that shifting right one place is equivalent to dividing by 2, so each time we shift we must add 1 to the exponent to compensate. When the exponents are equal, we add the fractions:

$$(0.111 \times 2^{5}) + (0.00101 \times 2^{5}) = 01.00001 \times 2^{5}$$

This addition caused an overflow into the sign bit position, so we shift right and add 1 to the exponent to correct the fraction overflow. The final result is

$$F \times 2^{E} = 0.100001 \times 2^{6}$$

When one of the fractions is negative, the result of adding fractions may be unnormalized, as illustrated in the following example:

    $(1.100 \times 2^{-2}) + (0.100 \times 2^{-1})$
    $= (1.110 \times 2^{-1}) + (0.100 \times 2^{-1})$    (after shifting $F_1$)
    $= 0.010 \times 2^{-1}$                              (result of adding fractions is unnormalized)
    $= 0.100 \times 2^{-2}$                              (normalized by shifting left and subtracting 1 from the exponent)

In summary, the steps required to carry out floating-point addition are as follows:

1. If the exponents are not equal, shift the fraction with the smaller exponent right and add 1 to its exponent; repeat until the exponents are equal.
2. Add the fractions.
3. (a) If fraction overflow occurs, shift right and add 1 to the exponent to correct the overflow.
   (b) If the fraction is unnormalized, shift left and subtract 1 from the exponent until the fraction is normalized.
   (c) If the fraction is 0, set the exponent to the appropriate value.
4. Check for exponent overflow.

Step 4 is necessary, since step 3a or 3b may produce an exponent overflow.

If $E_1 \gg E_2$ and $F_2$ is positive, $F_2$ will become all 0s as we right-shift $F_2$ to equalize the exponents. In this case, the result is $F = F_1$ and $E = E_1$, so it is a waste of time to do the shifting. If $E_1 \gg E_2$ and $F_2$ is negative, $F_2$ will become all 1s (instead of all 0s) as we right-shift $F_2$ to equalize the exponents. When we add the fractions, we will get the wrong answer. To avoid this problem, we can skip the shifting when $E_1 \gg E_2$ and set $F = F_1$ and $E = E_1$.
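The four steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the hardware design: fractions are signed integers with `frac_bits` fraction bits (value = `f / 2**frac_bits`), Python's `>>` stands in for an arithmetic right shift, and "much greater" is taken to mean an exponent difference larger than the number of fraction bits. The function name `fp_add` is hypothetical.

```python
def fp_add(f1, e1, f2, e2, frac_bits=3, exp_bits=4):
    """Return (f, e, exponent_overflow) for f1*2^e1 + f2*2^e2."""
    e_min = -(1 << (exp_bits - 1))
    e_max = (1 << (exp_bits - 1)) - 1
    one = 1 << frac_bits               # integer that would encode +1.0
    half = one >> 1                    # integer that encodes 0.100...

    # Exponents far apart: skip the shifting and keep the larger operand.
    if e1 - e2 > frac_bits:
        return f1, e1, False
    if e2 - e1 > frac_bits:
        return f2, e2, False

    # Step 1: shift the fraction with the smaller exponent right.
    while e1 < e2:
        f1, e1 = f1 >> 1, e1 + 1
    while e2 < e1:
        f2, e2 = f2 >> 1, e2 + 1

    # Step 2: add the fractions.
    f, e = f1 + f2, e1

    # Step 3a: fraction overflow, shift right and increment the exponent.
    if f >= one or f < -one:
        f, e = f >> 1, e + 1
    # Step 3c: a zero fraction gets the most negative exponent.
    if f == 0:
        return 0, e_min, False
    # Step 3b: normalize by shifting left while sign bit == next bit.
    while -half <= f < half:
        f, e = f << 1, e - 1

    # Step 4: check for exponent overflow.
    return f, e, not (e_min <= e <= e_max)


# 0.111 x 2^5 + 0.101 x 2^3, kept to six fraction bits as in the text:
print(fp_add(56, 5, 40, 3, frac_bits=6))   # (33, 6, False) = 0.100001 x 2^6
# 1.100 x 2^-2 + 0.100 x 2^-1 with three fraction bits:
print(fp_add(-4, -2, 4, -1))               # (4, -2, False) = 0.100 x 2^-2
```

With only three fraction bits the first example would lose the low-order bits of the shifted operand, which is why it is run with `frac_bits=6` here.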

Similarly, if $E_2 \gg E_1$, we can skip the shifting and set $F = F_2$ and $E = E_2$. For the 4-bit fractions in our example, if $|E_1 - E_2| > 3$, we can skip the shifting.

Floating-point subtraction

Floating-point subtraction is the same as floating-point addition, except that in step 2 we must subtract the fractions instead of adding them.

Floating-point Division

The quotient of two floating-point numbers is

$$(F_1 \times 2^{E_1}) / (F_2 \times 2^{E_2}) = (F_1 / F_2) \times 2^{E_1 - E_2} = F \times 2^{E}$$

Thus, the basic rule for floating-point division is: divide the fractions and subtract the exponents. In addition to considering the same special cases as for multiplication (explained below), we must test for divide by 0 before dividing. If $F_1$ and $F_2$ are normalized, then the largest positive quotient (F) will be

$$0.1111 / 0.1000 = 01.111$$

which is less than $10_2$, so the fraction overflow is easily corrected. For example (with the quotient truncated to three fraction bits),

$$(0.110101 \times 2^{2}) \div (0.101 \times 2^{-3}) = 01.010 \times 2^{5} = 0.101 \times 2^{6}$$

Alternatively, if $F_1 > F_2$, we can shift $F_1$ right before dividing and avoid fraction overflow in the first place.
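The division rule can be sketched in Python with exact rational arithmetic. The function name `fp_div` is hypothetical, and truncating the quotient to `frac_bits` fraction bits is an assumption chosen so that the sketch reproduces the worked example; normalization of small quotients and the exponent overflow check are left out to keep the sketch short.

```python
from fractions import Fraction
import math


def fp_div(f1, e1, f2, e2, frac_bits=3):
    """Return (F, E) for (f1 * 2^e1) / (f2 * 2^e2), with F as a Fraction."""
    if f2 == 0:
        raise ZeroDivisionError("floating-point divide by 0")
    scale = 1 << frac_bits
    # Divide the fractions; keep frac_bits fraction bits (truncate toward 0).
    q = math.trunc(Fraction(f1) / Fraction(f2) * scale)
    # Subtract the exponents.
    e = e1 - e2
    # Fraction overflow: the quotient does not fit in sign + frac_bits bits.
    if q >= scale or q < -scale:
        q >>= 1                        # shift right ...
        e += 1                         # ... and add 1 to the exponent
    return Fraction(q, scale), e


# (0.110101 x 2^2) / (0.101 x 2^-3)
print(fp_div(Fraction(53, 64), 2, Fraction(5, 8), -3))   # (Fraction(5, 8), 6)
```

The result is $\tfrac{5}{8} \times 2^{6}$, i.e., $0.101 \times 2^{6}$, matching the example.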

FLOATING-POINT MULTIPLICATION

Given two floating point numbers, $N_1 = F_1 \times 2^{E_1}$ and $N_2 = F_2 \times 2^{E_2}$, the product $N_1 \times N_2$ is

$$(F_1 \times 2^{E_1}) \times (F_2 \times 2^{E_2}) = (F_1 \times F_2) \times 2^{(E_1 + E_2)} = F \times 2^{E}$$

The fraction part of the product is the product of the fractions, $F = F_1 \times F_2$, and the exponent part of the product is the sum of the exponents, $E = E_1 + E_2$. In the algorithm below we assume that $F_1$ and $F_2$ are properly normalized to start with, and also that the final result is to be normalized.

Special cases

Although the basic operation only requires multiplying the fractions and adding the exponents, there are several special cases that must be considered.

1. Zero product. If F (the fraction part of the product) is 0, the exponent E must be set to the largest negative value (1000). (Refer to the section on the representation of zero.)

2. Fraction overflow. If multiplication of -1 by -1 ($1.000 \times 1.000$) is carried out, the result should be +1. Since we cannot represent +1 as a 2's complement fraction, we treat this as a special case called fraction overflow. To correct this situation, we set $F = \tfrac{1}{2}$ (0.100) and add 1 to the exponent E. This is justified, since $1 \times 2^{E} = \tfrac{1}{2} \times 2^{E+1}$.

3. Normalization of the product. Consider the multiplication example given below:

   $$(0.1 \times 2^{E_1}) \times (0.1 \times 2^{E_2}) = 0.01 \times 2^{E_1 + E_2} = 0.1 \times 2^{E_1 + E_2 - 1}$$

   In this example, we normalize the result ($0.01 \times 2^{E_1 + E_2}$) by shifting the fraction (0.01) left one place (i.e., F becomes 0.1) and subtracting 1 from the exponent to compensate.

4. Exponent overflow. If the resulting exponent is too large in magnitude to represent in our number system (i.e., with 4 bits in our case), we have an exponent overflow. (Sometimes an overflow in the negative direction is referred to as an underflow.) Since we are using 4-bit exponents, if the exponent is not in the range 1000 to 0111 (-8 to +7), an overflow has occurred. Since an exponent overflow cannot be corrected, an overflow indicator should be turned on.

A flowchart for the floating point multiplier is shown in Figure 1. After the fraction multiply is completed, all the special cases must be tested for. Since we have assumed that $F_1$ and $F_2$ are normalized, the smallest possible magnitude for the product is 0.01 (as indicated in the preceding example). Therefore, only one left shift is required to normalize F.

Figure 1
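The sequence of tests described above can be sketched in Python. This is a behavioral illustration only, not the VHDL model: the function name `fp_multiply` and the integer encoding are assumptions. Input fractions are signed integers with `frac_bits` fraction bits and are assumed to be normalized; the returned fraction keeps all `2 * frac_bits` fraction bits of the product.

```python
def fp_multiply(f1, e1, f2, e2, frac_bits=3, exp_bits=4):
    """Return (f, e, exponent_overflow) for (f1*2^e1) * (f2*2^e2).

    The result fraction f has 2*frac_bits fraction bits.
    """
    e_min = -(1 << (exp_bits - 1))
    e_max = (1 << (exp_bits - 1)) - 1
    one = 1 << (2 * frac_bits)         # encoding of +1.0 at product scale
    half = one >> 1                    # encoding of 0.1 at product scale

    f = f1 * f2                        # multiply the fractions
    e = e1 + e2                        # add the exponents

    if f == 0:                         # special case 1: zero product
        return 0, e_min, False
    if f == one:                       # special case 2: (-1) x (-1) = +1
        f, e = half, e + 1             # set F = 1/2 and increment E
    elif -half <= f < half:            # special case 3: unnormalized product
        f, e = f << 1, e - 1           # one left shift is enough
    # special case 4: exponent overflow cannot be corrected, only flagged
    return f, e, not (e_min <= e <= e_max)


print(fp_multiply(4, 1, 4, 2))     # (0.1 x 2^1)(0.1 x 2^2) -> (32, 2, False)
print(fp_multiply(-8, 3, -8, 2))   # (-1 x 2^3)(-1 x 2^2)   -> (32, 6, False)
print(fp_multiply(0, 3, 5, 2))     # zero product           -> (0, -8, False)
print(fp_multiply(5, 7, 6, 6))     # exponent overflow      -> (60, 12, True)
```

At product scale 32 encodes 0.100000, so the first two results are $0.1 \times 2^{2}$ and $0.1 \times 2^{6}$.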

Hardware for the exponent adder and fraction multiplier

The hardware required to implement the floating point multiplier consists of an exponent adder and a fraction multiplier.

Exponent adder

Since each floating point number has a 4-bit exponent, addition of two 4-bit exponents ($E_1$ and $E_2$) requires a 5-bit adder as shown below.

Special cases: Examples of exponent overflow:

    7 + 6       = 00111 + 00110 = 01101 = 13     (maximum allowable value is 7)
    -7 + (-6)   = 11001 + 11010 = 10011 = -13    (most negative allowable value is -8)

When the exponents are added, an overflow can occur. If $E_1$ and $E_2$ are positive and the sum (E) is negative, or if $E_1$ and $E_2$ are negative and the sum is positive, the result is a 2's complement overflow. However, this overflow might be corrected when 1 is added to or subtracted from E during normalization or correction of fraction overflow. To allow for this case, we have made the X register 5 bits long. When $E_1$ is loaded into X, the sign bit must be extended so that we have a correct 2's complement representation. Since there are two sign bits, if the addition of $E_1$ and $E_2$ produces an overflow, the lower sign bit will get changed, but the high-order sign bit will be unchanged. Each of the above examples has an overflow, since the lower sign bit has the wrong value.

Correction of exponent overflow

The input and output signals required for the exponent adder are as follows:

1. Load: load $E_1$, $E_2$ into the appropriate registers.
2. Adx: add exponents; this signal also starts the fraction multiplier.
3. SM8: set exponent to minus 8.
4. RSF: shift fraction right; also increment E.
5. LSF: shift fraction left; also decrement E.
6. V: overflow indicator.
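The two-sign-bit overflow check can be illustrated with a small Python sketch. The function name `add_exponents` is hypothetical; exponents are passed as ordinary Python integers in the range -8 to +7, and masking with five 1 bits plays the role of the sign-extended 5-bit X register.

```python
def add_exponents(e1, e2):
    """Add two 4-bit exponents in a 5-bit register.

    Returns the 5-bit sum as a bit string and an overflow flag that is
    set when the two high-order (sign) bits of the sum differ.
    """
    x = (e1 + e2) & 0b11111            # keep 5 bits of the 2's complement sum
    high_sign = (x >> 4) & 1           # high-order sign bit, never corrupted
    low_sign = (x >> 3) & 1            # lower sign bit, wrong after overflow
    return format(x, "05b"), high_sign != low_sign


print(add_exponents(7, 6))      # ('01101', True)
print(add_exponents(-7, -6))    # ('10011', True)
print(add_exponents(3, -5))     # ('11110', False)
```

The first two calls reproduce the two overflow examples above; the third is an in-range sum (-2) in which both sign bits agree.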

FRACTION MULTIPLIER

(Refer to the previous chapter for the block diagram and working of the faster multiplier, i.e., the 2's complement multiplier.)

Since we are multiplying 3 bits plus sign by 3 bits plus sign, the result will be 6 bits plus sign. After the fraction multiply, the 7-bit result (F) will be the lower 3 bits of A concatenated with B. The control signals required for the fraction multiplier are:

1. St: start the floating point multiplication.
2. Mdone: fraction multiply is done.
3. Load: load $F_1$, $F_2$ into the appropriate registers (also clear A in preparation for multiplication).
4. Adx: this signal starts the fraction multiplier.
5. RSF: shift fraction (F) right.
6. LSF: shift fraction left.
7. Done: floating point multiplication is complete.

The state graph for the multiplier control (Figure 4) is similar to the state graph of the 2's complement multiplier dealt with in the chapter on design of networks for arithmetic operations, except that the load state is not needed, because the registers are loaded by the main controller. When Adx = 1, the multiplier is started, and the done signal (Mdone) is turned on when the multiplication is completed.

Figure 4: State graph for multiplier control

MAIN CONTROLLER FOR FLOATING POINT MULTIPLICATION

Fig. 2: Block diagram showing the main controller and its I/O signals

The SM chart (Fig. 3) for the main controller of the floating point multiplier shown in Fig. 2 above is based on the flowchart shown in Fig. 1. In the SM chart the controller for the multiplier is a separate state machine, which is linked into the main controller. The SM chart uses the following inputs and control signals:

1. St: start the floating point multiplication.
2. Mdone: fraction multiply is done.
3. FZ: fraction is zero.
4. FV: fraction overflow.
5. Fnorm: F is normalized.
6. EV: exponent overflow.
7. Load: load $F_1$, $E_1$, $F_2$, $E_2$ into the appropriate registers (also clear A in preparation for multiplication).
8. Adx: add exponents; this signal also starts the fraction multiplier.
9. SM8: set exponent to minus 8.
10. RSF: shift fraction right; also increment E.
11. LSF: shift fraction left; also decrement E.
12. V: overflow indicator.
13. Done: floating point multiplication is complete.

The SM chart for the main controller has four states. In S0, the registers are loaded when the start signal is 1. In S1, the exponents are added and the fraction multiply is started. In S2, we wait until the fraction multiply is done and then test for special cases and take appropriate action. It may seem surprising that the tests on FZ, FV, and Fnorm can all be done in the same state, since they are done in sequence on the flowchart. However, FZ, FV, and Fnorm are generated by combinational circuits that operate in parallel and hence can be tested in the same state. On the other hand, we must wait until the exponent has been incremented or decremented at the next clock before we can check for exponent overflow in S3. In S3, the Done signal is turned on and the controller waits for St = 0 before returning to S0.

Figure 3: SM chart for floating-point multiplication

The VHDL behavioral description in the program uses three processes. The main process generates control signals based on the SM chart. A second process generates the control signals for the fraction multiplier. The third process tests the control signals and updates the appropriate registers on the rising edge of the clock. In state S2 of the main process, A = 0000 implies that F = 0 (FZ = 1 on the SM chart). If we multiply $1.000 \times 1.000$, the result is A & B = 01000000, and a fraction overflow has occurred (FV = 1). If A(2) = A(1), i.e., the sign bit of F and the following bit are the same, then F is unnormalized (Fnorm = 0). In state S3, if the two high-order bits of X are different, an exponent overflow has occurred (EV = 1).
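The four register tests described above can be mimicked in Python for quick experimentation. This is only a sketch of the conditions as stated in the text, not the VHDL code itself: the function name `status_flags` is hypothetical, A and B are assumed to be 4-bit strings written from bit 3 down to bit 0, and X is assumed to be a 5-bit string.

```python
def status_flags(a, b, x):
    """Evaluate FZ, FV, Fnorm and EV from register contents.

    a, b: 4-bit strings for registers A and B, written A(3)..A(0)
    x:    5-bit string for the exponent register X
    """
    return {
        "FZ": a == "0000",             # A = 0000 is taken to mean F = 0
        "FV": a + b == "01000000",     # product of 1.000 x 1.000
        "Fnorm": a[1] != a[2],         # a[1] is A(2), a[2] is A(1)
        "EV": x[0] != x[1],            # two high-order bits of X differ
    }


print(status_flags("0100", "0000", "00110"))
# {'FZ': False, 'FV': True, 'Fnorm': True, 'EV': False}
print(status_flags("0001", "0000", "01101"))
# {'FZ': False, 'FV': False, 'Fnorm': False, 'EV': True}
```

In hardware these conditions are produced by combinational circuits operating in parallel, which is why the dictionary computes all four from the same register snapshot.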