FLOATING POINT NUMBERS
Robert P. Webber, Longwood University

We have seen how decimal fractions can be converted to binary. For instance, we can write $6.25_{10}$ as

$4 + 2 + \tfrac{1}{4} = 2^2 + 2^1 + 2^{-2} = 1 \times 2^2 + 1 \times 2^1 + 0 \times 2^0 + 0 \times 2^{-1} + 1 \times 2^{-2} = 110.01_2$.

Teaching a computer how to do arithmetic using such binary fractions would be difficult. One problem is that the binary point is not fixed; it needs to float. If you multiply a two place fraction by another two place fraction, for instance, the result has four fractional places, not two.

Computer scientists realized that it would be easier to do floating point arithmetic if the numbers were written in scientific notation. You may recall this notation from your science classes, where very large numbers and numbers that are very close to zero are written using powers of ten. For example,

$1{,}234{,}000{,}000 = 1.234 \times 10^{9}$,
$0.0000567 = 5.67 \times 10^{-5}$.

Computers use binary notation and powers of two, of course. The resulting format is called floating point notation. A floating point number has three parts: its sign, a fractional part, and an exponent:

$\pm\ \text{fractional part} \times 2^{\text{exponent}}$

For instance, the floating point number $5.16 \times 2^{13}$, with its fractional part written in decimal, has a positive sign, a fractional part of 5.16, and an exponent of 13. It is equivalent to $5.16 \times 2^{13} = 5.16 \times 8192 = 42{,}270.72$ in ordinary signed decimal form.

There are many slightly different floating point formats. In an effort to bring order from chaos, the Institute of Electrical and Electronics Engineers (IEEE) developed a standard form called IEEE 754 single precision. Many computers use this standard, and it is the one we will examine.
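The conversion of a decimal fraction to binary recalled at the start of this section can be sketched in Python. This is only an illustration: the function name to_binary and its max_fraction_bits parameter are assumptions made for this sketch, and it handles non-negative numbers only.

```python
def to_binary(value, max_fraction_bits=16):
    """Return the binary expansion of a non-negative number as a string."""
    whole = int(value)
    fraction = value - whole
    digits = bin(whole)[2:]  # integer part, for example 6 -> "110"
    fraction_bits = []
    # Doubling the fraction moves the next binary digit into the ones place.
    while fraction > 0 and len(fraction_bits) < max_fraction_bits:
        fraction *= 2
        bit = int(fraction)
        fraction_bits.append(str(bit))
        fraction -= bit
    if fraction_bits:
        return digits + "." + "".join(fraction_bits)
    return digits


print(to_binary(6.25))  # 110.01
```

The limit on fraction bits matters because many decimal fractions, such as 0.1, have no finite binary expansion.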
The IEEE standard uses 32 bits for each floating point number, divided into three fields. The leftmost bit is the sign bit. The next eight bits hold the exponent. The final 23 bits contain the fractional part.

Sign (1 bit) | Exponent (8 bits) | Fractional part (23 bits)

The sign bit is 0 for a positive number or zero, and 1 for a negative number.

The fractional part assumes the number is binary and in the form $1.xx\ldots x$, where $xx\ldots x$ denotes binary digits. This is called normalized form. Since a normalized number always begins with 1, there is no need to store that bit. Only the part after the binary point is stored. The assumed 1 is called a hidden bit, and it provides 24 bits of precision in only 23 stored bits.

The exponent must allow for a sign, since exponents can be positive or negative. It must also allow for quick comparison to other exponents, because many comparisons must be done in floating point arithmetic. Two's complement notation would provide the sign, but not quick comparison. To allow that, the IEEE form uses excess 127 notation. In base 10,

excess 127 exponent = signed decimal exponent + 127.

For example, convert the decimal number 50.5 to IEEE format.

Step 1: Convert 50.5 to base 2.

$50.5 = 32 + 16 + 2 + \tfrac{1}{2} = 2^5 + 2^4 + 2^1 + 2^{-1} = 110010.1_2$.

Step 2: Write the number in normalized form. We must move the binary point five places to the left, so the exponent is 5.

$110010.1_2 = 1.100101_2 \times 2^5$.

We drop the leftmost 1, because it is the hidden bit, so the fractional part is 100101. Notice there is no need to write trailing zeros when we write the number by hand. When we write it as the computer will store it, however, we need to add 17 trailing zeros to make 23 bits for the fractional part in all. The complete fractional part is 10010100000000000000000.

Step 3: Find the excess 127 form of the exponent.
$127 + 5 = 132 = 128 + 4 = 2^7 + 2^2 = 10000100_2$

The sign is 0, since the number is positive; the excess 127 exponent is 10000100, and the fractional part is 100101 followed by 17 zeros. The IEEE form is 01000010010010100000000000000000, and this is how the number would be stored in the computer. We could write this as 424A0000 in hexadecimal form for better readability.

The form may be clearer if we break it into fields.

0 | 10000100 | 10010100000000000000000
sign | exponent | fractional part

Here's another example. Write $-121.75_{10}$ in IEEE floating point format.

Step 1: Convert the magnitude 121.75 to binary.

$121.75 = 64 + 32 + 16 + 8 + 1 + \tfrac{1}{2} + \tfrac{1}{4} = 2^6 + 2^5 + 2^4 + 2^3 + 2^0 + 2^{-1} + 2^{-2} = 1111001.11_2$

Step 2: Write the binary number in normalized form.

$1111001.11_2 = 1.11100111_2 \times 2^6$

Step 3: Find the excess 127 exponent.

$127 + 6 = 133 = 128 + 4 + 1 = 2^7 + 2^2 + 2^0 = 10000101_2$

The sign is 1, since the number is negative; the fractional part is 11100111 followed by 15 zeros (for a total of 23 bits), and the exponent is 10000101. The IEEE form is 11000010111100111000000000000000, or C2F38000 in hexadecimal. Broken up into fields, it is

1 | 10000101 | 11100111000000000000000
sign | exponent | fractional part
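The three steps used in both examples can be sketched in Python as a way to check hand calculations. The names ieee_single_bits and to_hex are assumptions made for this sketch. It covers only nonzero numbers whose binary expansion fits in 24 bits and whose exponent is in the normal range, so it ignores rounding, zero, and special values. Python's standard struct module is used at the end as an independent cross-check.

```python
import struct


def ieee_single_bits(value):
    """Return (sign, exponent, fraction) bit strings for a nonzero number."""
    if value == 0:
        raise ValueError("this sketch handles nonzero numbers only")
    sign = "1" if value < 0 else "0"
    magnitude = abs(value)
    exponent = 0
    # Step 2: normalize by shifting the binary point until 1 <= magnitude < 2.
    while magnitude >= 2:
        magnitude /= 2
        exponent += 1
    while magnitude < 1:
        magnitude *= 2
        exponent -= 1
    # Step 3: excess 127 form of the exponent, written in eight bits.
    exponent_bits = format(exponent + 127, "08b")
    # Drop the hidden bit and read off 23 fraction bits by repeated doubling.
    fraction = magnitude - 1
    fraction_bits = ""
    for _ in range(23):
        fraction *= 2
        bit = int(fraction)
        fraction_bits += str(bit)
        fraction -= bit
    return sign, exponent_bits, fraction_bits


def to_hex(fields):
    """Join the three fields and write the 32 bits as eight hex digits."""
    return format(int("".join(fields), 2), "08X")


print(ieee_single_bits(50.5))             # ('0', '10000100', '10010100000000000000000')
print(to_hex(ieee_single_bits(50.5)))     # 424A0000
print(to_hex(ieee_single_bits(-121.75)))  # C2F38000
print(struct.pack(">f", 50.5).hex().upper())  # 424A0000, from the struct module
```

Dividing or multiplying by 2 is exact in binary floating point, which is why the normalization loop does not disturb the digits.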
MAGNITUDE AND PRECISION

Magnitude refers to the raw size of the number; that is, how large or small it can be. Precision refers to the number of digits of accuracy in a number. Magnitude and precision measure different things. Magnitude refers to the possible number of digits, precision to how many of those digits are accurate.

Often we don't care much about the precision in very large numbers. For instance, the 2007 United States population was 302 million people. We don't really mean exactly 302,000,000, of course. The number is accurate to only three digits. Indeed, it would not be possible to be much more accurate, since the exact population is constantly changing.

When a computer stores a number in integer format, every digit is accurate. The largest integer that can be stored in 32 bits using two's complement notation is $2^{31} - 1$, which is 2,147,483,647. Say an integer has value 15,431. We can be sure that each digit is correct. However, an integer such as 3,500,630,119 cannot be stored in 32 bits. It is too large.

Floating point numbers generally do not have this precision property. The magnitude is determined by the exponent, while the precision is determined by the fractional part. Not all digits of a displayed floating point number are necessarily accurate.

In IEEE format, the precision is 24 bits, including the hidden bit. This translates to about seven decimal digits of accuracy. The magnitude, however, is much larger. The biggest exponent available for ordinary numbers in eight-bit excess 127 notation is 127 (the stored pattern of all 1's is reserved for special values), so the largest number that can be represented has all 1's in the fractional part and $127_{10}$ in the exponent: $1.11\ldots1_2 \times 2^{127}$. This number is approximately $3.4 \times 10^{38}$.

A number that is larger than $3.4 \times 10^{38}$ cannot be stored in a computer using IEEE single precision format. This is a huge number, but it is possible to exceed it.
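The limits quoted above can be checked with a short Python sketch. The helper name as_single is an assumption made for this illustration. It uses the standard struct module to round a value to single precision and back, since ordinary Python floats carry more precision than the 32-bit format discussed here.

```python
import struct


def as_single(value):
    """Round a Python float to the nearest IEEE single precision value."""
    return struct.unpack(">f", struct.pack(">f", value))[0]


# Largest 32-bit two's complement integer.
largest_int32 = 2**31 - 1
# All 1's in the fractional part (1.11...1 in binary) times 2 to the 127th.
largest_single = (2 - 2**-23) * 2.0**127

print(largest_int32)            # 2147483647
print(largest_single)           # about 3.4e38
print(as_single(16777216.0))    # 16777216.0, since 2**24 fits in 24 bits
print(as_single(16777217.0))    # 16777216.0, since 2**24 + 1 needs 25 bits
```

The last two lines show precision running out long before magnitude does: the integer 16,777,217 is far below the largest single precision value, yet it cannot be stored exactly.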
Chess gives an example of a number that exceeds this limit. There are about 35 legal choices for a typical chess move, so the number of possible move sequences grows exponentially and quickly passes $10^{50}$, a number too large to store in IEEE single precision.

It is important to realize that while the magnitude allows us to store numbers, the precision may mean that not all digits are accurate. For instance, suppose the budget for a large corporation is $632,785,417.25. This number can be stored in standard floating point, because its magnitude is much less than $3.4 \times 10^{38}$, but only about seven digits of accuracy will be preserved. The stored value will be $632,785,408.00, the nearest value that fits in 24 bits. The last several digits are lost.

Other floating point formats, such as IEEE double precision, are available that increase the precision (and often the magnitude as well). Regardless, you should always remember that a computer generated number
is only accurate to a maximum number of digits (about seven for standard IEEE format). Any digits beyond that maximum will not be reliable.

Exercises

In problems 1 through 8, write the decimal number in normalized form; that is, in the form $1.xx\ldots x \times 2^{\text{exponent}}$.

1. 562
2. 961
3. 1055
4. 2050
5. 69
6. 120
7. 28.125
8. 106.25

In problems 9 through 12, find the excess 127 form of the base 10 number.

9. 7
10. 38
11. 8
12. 19

In problems 13 through 18, write the decimal number in IEEE floating point format.

13. 42.5
14. 105.375
15. 1 26
16. 145.625 4
17. 11/16
18. 15/32

In problems 19 through 28, can the quantity be stored as a 32 bit integer? As a standard IEEE floating point number? In each case, if it can be stored, will the stored number be accurate? Explain your answers.

19. The number of seconds in an hour.
20. The number of seconds in a day.
21. The number of seconds in a week.
22. The number of seconds in the month of March.
23. The number of seconds in a non-leap year.
24. The number of seconds in a century.
25. The number 3.141592653 (the first ten digits of the number π).
26. The number 2.718281828 (the first ten digits of the number e).
27. The distance from the Sun to the Earth, expressed in miles.
28. The United States national debt.

In problems 29 through 32, show how the two numbers would be stored, the first in binary 2's complement integer form, the second in IEEE form. Assume 32 bits are used for each.

29. 35, 35.0
30. 6, 6.0
31. 14, 14.0
32. 161, 161.0

In problems 33 through 36, the bit pattern represents an IEEE floating point format number. Find its ordinary signed decimal form.

33. 01000001110110000000000000000000 (the pattern is in binary)
34. 3EA00000 (the pattern is in hexadecimal)
35. BDE00000 (the pattern is in hexadecimal)
36. 11000010111100110100000000000000 (the pattern is in binary)
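For checking work on exercises like these, the following Python sketch may help. It is illustrated only with values already worked in the text, not with the exercise data. All function names here (hex_to_bits, decode_single, fits_int32, stored_as_single) are assumptions made for this sketch, and decode_single handles normalized numbers only, so the exponent field must not be all 0's or all 1's.

```python
import struct


def hex_to_bits(hex_pattern):
    """Write an eight-digit hexadecimal pattern as a 32-character bit string."""
    return format(int(hex_pattern, 16), "032b")


def decode_single(bits):
    """Decode a 32-character bit string holding a normalized IEEE single."""
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:9], 2) - 127          # undo the excess 127 form
    fraction = 1 + int(bits[9:], 2) / 2**23     # restore the hidden bit
    return sign * fraction * 2.0**exponent


def fits_int32(n):
    """True if the integer n fits in 32-bit two's complement form."""
    return -2**31 <= n <= 2**31 - 1


def stored_as_single(value):
    """Return the value actually kept when a number is stored as a single."""
    return struct.unpack(">f", struct.pack(">f", value))[0]


print(decode_single(hex_to_bits("424A0000")))  # 50.5
print(decode_single(hex_to_bits("C2F38000")))  # -121.75
print(fits_int32(15431), fits_int32(3500630119))  # True False
print(stored_as_single(632785417.25))          # 632785408.0
```

Comparing a quantity with the result of stored_as_single shows directly whether the stored number is accurate.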