# Floating-Point Numbers in Digital Computers

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 POLYTECHNIC UNIVERSITY Department of Computer and Information Science Floating-Point Numbers in Digital Computers K. Ming Leung Abstract: We explain how floating-point numbers are represented and stored in modern digital computers. Directory Table of Contents Begin Article Copyright c 2000 Last Revision Date: February 3, 2005

2 Table of Contents 1. Introduction to Floating-Point Number Systems 1.1. Specification of a Floating-Point System 1.2. Characteristics of a Floating-Point System 1.3. A Simply Example of A Floating-Point System 2. Rounding Rules 2.1. Machine Precision 2.2. Subnormals and Gradual Underflow 3. Exceptional Values 3.1. Floating-Point Arithmetic 3.2. Example: Floating-Point Arithmetic 4. Cancellation 4.1. Example: Quadratic Formula 4.2. Standard Deviation

3 Section 1: Introduction to Floating-Point Number Systems 3 1. Introduction to Floating-Point Number Systems Most quantities encountered in scientific and engineering computing are continuous and are therefore described by real or complex numbers. The number of real numbers even within the smallest interval is uncountably infinite, and therefore real numbers cannot be represented exactly on a digital computer. Instead they are approximated by what are known as floating-point numbers. A floating-point number is similar to the way real numbers are expressed in scientific notation in the decimal system. For example the speed of light in vacuum, c, is defined to have the value c = in meters per second. The part containing is called the mantissa. The dot is the decimal point. The base (or radix) is 10 and the exponent is 8. This value can be rewritten in many alternate ways, such as c = , c = , and c = The decimal point moves or floats as we desire, that is why these numbers are called floating-point numbers. When the decimal point floats one unit to the left, the exponent increases by 1, and when the

4 Section 1: Introduction to Floating-Point Number Systems 4 decimal point floats one unit to the right, the exponent decreases by 1. One can make the floating-point representation of any number, except 0, unique by requiring the leading digit in the mantissa to be nonzero and the decimal point to appear always right after the leading digit. Zero is treated separately. The resulting floating-point system is then referred to as normalized. There are good reasons for normalization: 1. representation of each integer is unique 2. no digits wasted on leading zeros 3. additional advantage in binary system: leading digit is always a 1 and therefore need not be stored The scientific notation to represent a number can be generalized to bases other than 10. Some useful ones are shown in the following table.

5 Section 1: Introduction to Floating-Point Number Systems 5 base name application 2 = 2 1 binary practically all digital computers 8 = 2 3 octal old Digital computers 10 decimal HP calculators 16 = 2 4 hexadecimal PCs and workstations In the above example for the number representing the speed of light, the mantissa has 9 digits and the exponent has 1 digit Specification of a Floating-Point System In general a floating-point system is characterized by 4 positive integers base or radix precision smallest exponent allowed largest exponent allowed β p L U

6 Section 1: Introduction to Floating-Point Number Systems 6 A floating-point number, x, is represented as ( x = ± d 0 + d 1 β + d 2 β d ) p 1 β p 1 β E, where for i = 0, 1,..., p 1, d i is an integer between 0 and β 1, and E is an integer between L and U. The plus or minus signs indicate the sign of the floating-point number. Some terminologies: mantissa exponent fractional part d 0 d 1... d p 1 E d 1 d 2... d p 1 The following table lists some typical floating-point systems. System β p L U IEEE Single-precision IEEE Double-precision Cray HP Calculator IBM Mainframes

7 Section 1: Introduction to Floating-Point Number Systems 7 PCs and workstations typically use the IEEE floating-point systems. Clearly a floating-point system can only contain a finite set of discrete numbers. The total number of normalized floating-point numbers is: 2(β 1)β p 1 (U L + 1) for the 2 possible signs β 1 since d 0 can be any integer between 1 and β 1 β p 1 each of the p 1 fractional digit d 1 d 2... d p 1 can be any integer between 0 and β 1 U L + 1 is the total number of different exponents We add 1 to the product of the above factors since the number 0 is treated separately Characteristics of a Floating-Point System Out of all those numbers, two numbers have important meanings. The smallest positive normalized floating-point number is called the un-

8 Section 1: Introduction to Floating-Point Number Systems 8 derflow level, UFL. UFL must have the smallest normalized mantissa of , and the smallest exponent of L, and so It has a value of UF L = β L. UF L = , in IEEE double-precision. In MATLAB this number is named realmin. The largest positive floating-point number is called the overflow level, OFL. It has a mantissa having the largest value. Therefore the leading digit as well as remaining ones must have the largest value of β 1, i.e. d 0 = d 1 =... = d p 1 = β 1. The mantissa then has a value of β 1 + β 1 + β 1 b b β 1 b p 1. To see how we can sum up this finite series, we consider a particular example where β = 10 and p = 5. We see that the series gives =

9 Section 1: Introduction to Floating-Point Number Systems 9 Notice that by adding to the result, we get 10, which is β. Written in terms of β and p, we see that = 1/β p 1. The sum of the series must therefore be given by β 1 β p 1 = ( 1 1 β p ) β. The OFL must have the largest exponent of U. Therefore we have ( OF L = 1 1 ) ( β p 1 ββ U = 1 1 ) β p β U+1. In IEEE double-precision, OF L = ( 1 1 ) , which is slightly less than An approximate value for the OFL is Ideally a floating-point system should have a large p for high precision, a large and negative L, so that numbers with very small magnitudes can be represented, and a large and positive U so that numbers with large magnitudes can be represented. Toc Back Doc Doc

10 Section 1: Introduction to Floating-Point Number Systems A Simply Example of A Floating-Point System To obtain more insight into the floating-point system, we will now consider in detail a floating-point system, where β = 2, p = 3, L = 1, and U = 1. This is a toy system, so simple that we can write down all of its numbers. First 0 is always present in any floating-point number system. For every positive floating-point number is a corresponding negative number. So we can concentrate on getting all the positive floatingpoint numbers first and then reverse their signs to get all the negative floating point numbers. Immediately to the right of 0 is the UFL, which is given by UF L = β L = 2 1 = 1 2 = (0.5) 10, or equivalently by (1.00) This floating-point number is shown by the red mark in the following diagram. The next floating-point number (also in red) is (1.01) This number has a value (1.00) (0.01) = UF L+ 1 8 = (0.625) 10. The next floatingpoint number (also in red) is (1.10) = UF L = (0.75) 10. Toc Back Doc Doc

11 Section 1: Introduction to Floating-Point Number Systems 11 And the next one is (1.11) = UF L = (0.875) 10, also shown in red. The mantissa reaches its highest possible value. For the next number, the exponent increases by 1 and the mantissa returns to its smallest normalized value of (1.00) 2. The number (in black) is (1.00) = (1) 10. The next number (also in black) is (1.01) = (1) 10 + (0.01) = (1) = The next 2 numbers (also in black) are (1.10) = (1) = (1.5) 10 and (1.11) = (1) = (1.75) 10. Again the mantissa reaches its highest possible value. For the next number, once again the exponent has to increase by 1 and the mantissa has to return to its smallest normalized value of (1.00) 2. The number (in green) is (1.00) = (2) 10. The next 3 numbers (also in green) are (1.01) = (2) 10 + (0.01) = (2) = (2.5) 10, (1.01) = (2) = (3) 10, and (1.11) = (2) = (3.5) 10 = OF L. There is no ordinary floating-point number larger than OFL. Therefore this system has a total of 12 positive floating-point numbers. Putting a negative sign in front of these numbers gives 12 negative floating-point numbers. Together with 0, this system has a total Toc Back Doc Doc

12 Section 1: Introduction to Floating-Point Number Systems 12 of 25 floating-point numbers. Notice that except for 0, and depends on how they are grouped (disjoint groups or as overlapping groups), all the floating-point numbers appear in groups of 4 (= β p 1 ) or 5 (= β p 1 +1). Floating-point numbers are not distributed uniformly, except for numbers within the same group UFL OFL In summary, a normalized floating-point system has the following characteristics. 1. can only represent a total of 2(β 1)β p 1 (U L+1)+1 floatingpoint number exactly. 2. these number are discrete (not like the real numbers which are continuous) 3. these numbers are unevenly spaced except within the same group 4. 0 is always represented (also whole numbers with magnitudes Toc Back Doc Doc

13 Section 2: Rounding Rules 13 = β p ) 5. the smallest positive normalized floating-point number is UFL = β L ( ) 6. the largest positive floating-point number is OFL = 1 1 β β U+1 p 7. most other real numbers are not represented 8. those real numbers that can be represented exactly are called machine numbers 9. Basically all normalized floating-point systems exhibit the same characteristics. System appears less grainy and less uneven as p becomes large, L becomes more negative, and U becomes larger 2. Rounding Rules Most real numbers are non-machine numbers. How can they be treated? Even if we start with machine numbers, most of the time operations involving them will result in non-machine numbers. For example, in the above toy floating-point system, the numbers 2 and

14 Section 2: Rounding Rules are machine numbers. However 2 divided by 2.5 gives a true value of 0.8, but 0.8 is not a machine number. It falls between two machine numbers, 0.75 and If x is a real non-machine number, then it is approximated by a nearby machine number, denoted by f l(x). This process is called rounding. The error introduced by rounding is called rounding or roundoff error. Two commonly used rounding rules are: chop truncate base-β expansion of x after the (p 1)st digit. This rule is called round toward zero. round-to-nearest f l(x) is given by the floating-point number nearest to x. In case of a tie, use the one whose last stored digit, the so-called least significant digit (LSD) is even. This rule is called round-to-even. Of these two rules, round-to-nearest is more accurate (in the sense described in the next section) and is the default rounding rule in IEEE systems. We will consider the round-to-nearest rule here. Again using the toy floating-point system as an example, we see Toc Back Doc Doc

15 Section 2: Rounding Rules 15 that in the calculation = 1.75, all the numbers involved are machine numbers. The calculation is therefore performed exactly. We are extremely lucky. On the other hand, the true sum of the machine numbers and 1.5 is 2.375, which lies between the two consecutive machine numbers, 2 and 2.5. Since lies closer to 2.5 than to 2, it is approximated by 2.5. Therefore gives 2.5 in this floating-point system. As another example, the true sum of machine numbers and 1 is This number lies exactly between the two consecutive machine numbers, 1.5 and 1.75, which have representations (1.10) and (1.11) 2 2 0, respectively. The round-to-nearest rule therefore picks 1.5 as the answer, since its LSD is even Machine Precision Accuracy of a floating-point system is characterized by ɛ mach, referred to as the machine epsilon, machine precision, or unit roundoff. It is defined to be the smallest number ɛ such that fl(1 + ɛ) > 1. Toc Back Doc Doc

16 Section 2: Rounding Rules 16 Since 1 is a machine number represented by ( ) β 0, and the next machine number larger than 1 is ( ) β 0 = 1+ 1 β. This p 1 number is larger than 1 by β 1 p. Therefore in the round-to-nearest rule ɛ mach = β 1 p /2. The machine epsilon is sometimes also defined to be the maximum possible relative error in representing a nonzero real number x in a floating-point system. That means that the maximum relative error in representing any number is bounded by the machine epsilon: fl(x) x x < ɛ mach. For our toy floating-point system, ɛ mach = /2 = For the IEEE floating-point system, in single precision ɛ mach = /2 = , i.e. about 7 digits of precision, and in double precision ɛ mach = /2 = , i.e. about 16 digits of precision. Note that ɛ mach is much larger than the UFL in all practical floating-point number system.

17 Section 2: Rounding Rules 17 MATLAB defines eps as the distance between 1 and the next larger floating-point number. That distance is β 1 p. Thus we see that eps is 2ɛ mach. Thus we see that a real number x that is slightly larger than 1+ɛ mach is represented as fl(x) = 1+eps = 1+2 ɛ mach, the next machine number larger than 1. However if x is either equal to or slightly less than 1 + ɛ mach then it is represented as fl(x) = 1. On the other hand, the next machine number less than 1 is 1 eps/2 = 1 ɛ mach. Thus a real number x that is slightly less than 1 ɛ mach /2 is then represented as fl(x) = 1 eps/2 = 1 ɛ mach, the next machine number smaller than 1. However if x is slightly larger than or equal to 1 ɛ mach /2, then it is represented as fl(x) = Subnormals and Gradual Underflow Normalization causes the floating-point system to have no numbers other than 0 in the interval [ UF L, UF L]. Thus any number with a magnitude less than UF L/2 = β L /2 at any step in a calculation is set to 0. Most of the floating-point systems, including the single- and double-precision IEEE systems, attempt to fill in the above interval

18 Section 2: Rounding Rules 18 by allowing subnormal or denormalized floating-point numbers. These numbers are introduced by relaxing the normalization requirement by allowing the leading digits in the mantissa to be zero if the exponent is at its lowest value L. These subnormal floating-point numbers added to the system all have the form (0.d 1 d 2... d p 1 ) β L, where d 1, d 2... d p 1 are integers between 0 and β 1. (The number for which all the fractional digits are zero is clearly the floating-point number 0, which has been considered before and is not a subnormal number.) Since there are p 1 fractional digits and each digit can take on β possible values, thus by allowing for subnormals we add a total of 2(β p 1 1) floating-points to the system. The factor of 2 comes form the 2 possible signs, and we subtract 1 because 0 was considered already. The smallest subnormal floating-point number is ( ) β β L = 1 β p 1 βl = β L p+1.

19 Section 2: Rounding Rules 19 Now any floating-point number whose magnitude is less than or equal to β L p+1 /2 (rather than UF L/2 = β L /2) is then set to 0. This augmented system is said to exhibit gradual underflow. Note that ɛ mach still has the same value as before. Subnormal numbers extend the range of magnitudes representable but they have less precision than the normalized ones since 1 or more of their leading digits are zero. For our toy floating-point system, since β = 2 and p = 3, the total number of subnormals is 2(β p 1 1) = 6. Thus subnormals add three numbers on each side of the origin. The three positive ones are (0.01) = 1 8 (0.10) = 2 8 = 1 4 (0.11) = 3 8 The floating-point numbers of the entire floating-point system are shown in the following diagram. The subnormals are shown in yellow.

20 Section 2: Rounding Rules UFL OFL We will now consider examples with subnormals in IEEE doubleprecision systems. We define two normalized real numbers a = and b = Clearly both numbers lie between the UFL and the OFL, and therefore have full precision given by ɛ mach. Notice that b is very small, only slightly larger than the UFL ( ). We now compute the ratio r = b/a using for example MATLAB. The result to 16 digit precision is r = , which is smaller than the UFL and is clearly a subnormal. This number cannot have full precision. In fact the correct answer for r is ( ) ( ) = The relative error for r is 0.012, and that is much much larger than ɛ mach. The smallest subnormal is s = UF L/(2 52 ), which has a value of Any number whose magnitude is less than of equal to half of s is rounded to zero. One can check to see

21 Section 3: Exceptional Values 21 that s/ gives s again, and s/2 gives Exceptional Values IEEE floating-point standard provides special values to indicate two exceptional situations: Inf, which stands for infinity, results from dividing a finite number by zero, such as 1/0 NaN, which stands for not-a-number, results from undefined or indeterminate operations such as 0/0, 0, or / Inf and NaN are implemented in IEEE arithmetic through special reserved values of the exponent field Some languages like MATLAB can sensibly handle and propagate these exceptional values through a calculation. For example, Inf - 1 = Inf, 5* NaN +7 = NaN. Data such as x y NaN can be plotted in MAT-

22 Section 3: Exceptional Values 22 LAB. The third data point is omitted in the plot, with a warning about having a NaN in the data Floating-Point Arithmetic Addition or subtraction: The exponents of two floating-point numbers must be made to match by shifting the mantissas before they can be added or subtracted. But shifting the mantissa may cause loss of some and possibly even all the digits of the smaller number. Multiplication: Product of two p-digit mantissas contains up to 2p digits, so result may not be representable Division: Quotient of two p-digit mantissas may contain more than p digits, such as nonterminating binary expansion of 1/10 Result of floating-point arithmetic operation may differ from the result of the corresponding real arithmetic operation on the same operands.

23 Section 3: Exceptional Values Example: Floating-Point Arithmetic Assume we are using a floating-point system where β = 10, p = 6. Let x = , y = , floating-point addition gives the true result = But the floating-point system has only a precision of p = 6, therefore the result is , assuming rounding to nearest. The calculation was done by shifting the decimal point in the mantissa of the smaller number so that its exponent matches with that of the larger number. We can also do the calculation by shifting the decimal point of the larger number so that its exponent matches with that of the smaller number. This gives the true result = Rounding to nearest then gives the same result as the previous calculation Notice that the last two digits of y do not affect the final result, and with even smaller exponent, y could have had no effect on the result at all. Multiplication gives the true result x y = Rounding to nearest then gives the floating-point multiplication value

24 Section 3: Exceptional Values 24 of Notice that half of the digits of the true product are discarded. Division is similar to multiplication except that the result of a division by two machine numbers may not be a machine number. For example, 1 is a machine number represented by ( ) β β 0, and 10 is a machine number represented by ( ) β β 3. But 1 divided by 10 is 0.1 and it is not a machine number. In fact it has a non-terminating (repeating) representation ( ) β β 4, which is the counterpart of repeating decimals in a decimal system. Real result may also fail to be representable because its exponent is beyond available range. Over flow is usually more serious than under flow because there is no good approximation to arbitrarily large magnitudes in a floatingpoint system, whereas zero is often reasonable approximation for arbitrarily small magnitudes. On many computer systems over flow is fatal, but an under flow may be silently set to zero. Ideally, x flop y = (x op y), i.e., floating point arithmetic operations produce correctly rounded results. Computers satisfying IEEE Toc Back Doc Doc

25 Section 4: Cancellation 25 floating-point standard achieve this ideal as long as x op y is within range of floating-point system. But some familiar laws of real arithmetic not necessarily valid in floating-point system. Floating-point addition and multiplication commutative but not associative. Example: if ɛ is positive floating-point number slightly smaller than ɛ mach, (1 + ɛ) + ɛ = 1; but 1 + (ɛ + ɛ) > Cancellation Subtraction between two p-digit numbers having the same sign and magnitudes differing by less than a factor of 2, the leading digit(s) will cancel. The result will have fewer number of significant digits, although it is usually exactly representable. Cancellation loses the most significant (leading) bit(s) and is therefore much worst than rounding, this loses the least significant (trailing) bit(s). Again assume we are using a floating-point system where β = 10, p = 6. Let x = , z = , floatingpoint difference of these numbers gives the true result ,

26 Section 4: Cancellation 26 which has only 3 significant digits! It can therefore can be represented exactly as Despite exactness of result, cancellation often implies serious loss of information Operands often uncertain due to rounding or other previous errors, so relative uncertainty in difference may be large Example: if ɛ is positive floating-point number slightly smaller than ɛ mach, (1 + ɛ) (1 ɛ) = 1 1 = 0, in floating-point arithmetic. The true result of the overall computation, 2ɛ, has been completely lost. Subtraction itself not at fault: it merely signals loss of information that had already occurred Because of cancellation, it is generally a bad idea to compute any small quantity as the difference of large quantities, since rounding error is likely to dominate the result. The total energy, E of helium atom is the sum of kinetic (K.E.) and potential energies (P.E.), which are computed separately and have opposite signs, so suffer cancellation.

27 Section 4: Cancellation 27 The following table gives a sequence of values obtained over 18 years. During this span the computed values for the K.E. range from to 13.0, a 6.3% variation, and the P.E. range from 14.0 to 14.84, a 6.0% variation. However the computed values for E range from 1.0 to 2.44, a 144% variation! Year K.E. P.E. E = K.E. + P.E Example: Quadratic Formula The two roots of the quadratic equation ax 2 + bx + c = 0,

28 Section 4: Cancellation 28 are given by x 1,2 = b ± b 2 4ac. 2a Naive use of the above quadratic formula can suffer over flow, or under flow, or severe cancellation. Rescaling coefficients can help avoid over flow and harmful under flow. Cancellation inside the square root cannot be easily avoided without using higher precision arithmetic. Cancellation between b and the square root can be avoided by computing one root using the following alternative formula: ( b ± ) ( b x 1,2 = 2 4ac b ) b 2 4ac 2a b b 2 4ac = b 2 (b 2 4ac) 2a( b b 2 4ac) = 2c b b 2 4ac. To see how that works, let us be specific and assume that b is negative. If 4ac is positive and very small compared to b 2, then the

29 Section 4: Cancellation 29 square-root is real and is only slightly smaller than b. Large cancellation is expected in computing x 2 using the original formula, but not for x 1. On the other hand, large cancellation is expected in computing x 1 using the alternate formula, but not for x 2. As an example, let us consider a floating-point system where β = 10 and p = 4. We want to compute the roots for a quadratic equation where a = , b = 98.78, and c = First, fl(b 2 ) = fl( ) = 9757, 4a = without rounding, 4ac = fl( ) = 1.005, fl(b 2 4ac) = 9756, and so fl( b 2 4ac) = Also fl(2a) = fl( ) = , and fl(2c) = fl( ) = all without rounding. Using the original formula to compute x 1, the numerator is fl( ) = fl(197.55) = (not 197.5), and so x 1 = fl(197.6/0.1002) = This result for x 1 is correct to all 4 digits. The numerator for x 2 is fl( ) = fl(0.0100) = , and so x 2 = f l(0.0100/0.1002) = But even the leading digit of this result for x 2 is incorrect. If we use the alternate formula to compute x 1, we have x 1 = f l(10.03/0.010) = This result is far from being correct. How- Toc Back Doc Doc

30 Section 4: Cancellation 30 ever for x 2, we have x 2 = fl(10.03/197.6) = fl( ) = The result correct to 4 significant digits is The above example was made up to bring out the problem associated with cancellation. With increasing precision p, b 2 has to be even much larger than 4ac before the problem shows up. In fact in IEEE double precision, using the above values for a, b and c, the roots are correctly computed up to about 14 significant digits even if the wrong formula is used Standard Deviation The mean of a sequence x i, i = 1, 2,... n, is given by x = 1 n x i, n i=1 and the standard deviation by [ ] n σ = (x i x) 2. n 1 i=1 Toc Back Doc Doc

31 Section 4: Cancellation 31 Some people use the mathematically equivalent formula [ ( n )] σ = x 2 i n x 2. n 1 i=1 to avoid making two passes through the data. Unfortunately, the two terms in the one-pass formula are usually large and nearly equal and so the single cancellation error at the end is more damaging numerically than all of cancellation errors in the two-pass formula combined.

### Floating-Point Numbers in Digital Computers

POLYTECHNIC UNIVERSITY Department of Computer and Information Science Floating-Point Numbers in Digital Computers K. Ming Leung Abstract: We explain how floating-point numbers are represented and stored

### Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Chapter 1 Scientific Computing Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

### Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Chapter 1 Scientific Computing Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

### Lecture Notes to Accompany. Scientific Computing An Introductory Survey. What is scientific computing?

Lecture Notes to Accompany Scientific Computing An Introductory Survey Second Edition by Michael T. Heath Scientific Computing What is scientific computing? Design and analysis of algorithms for solving

### Outline. 1 Scientific Computing. 2 Approximations. 3 Computer Arithmetic. Scientific Computing Approximations Computer Arithmetic

Outline 1 2 3 Michael T. Heath 2 / 46 Introduction Computational Problems General Strategy What is scientific computing? Design and analysis of algorithms for numerically solving mathematical problems

### Roundoff Errors and Computer Arithmetic

Jim Lambers Math 105A Summer Session I 2003-04 Lecture 2 Notes These notes correspond to Section 1.2 in the text. Roundoff Errors and Computer Arithmetic In computing the solution to any mathematical problem,

### Review Questions 26 CHAPTER 1. SCIENTIFIC COMPUTING

26 CHAPTER 1. SCIENTIFIC COMPUTING amples. The IEEE floating-point standard can be found in [131]. A useful tutorial on floating-point arithmetic and the IEEE standard is [97]. Although it is no substitute

### 2 Computation with Floating-Point Numbers

2 Computation with Floating-Point Numbers 2.1 Floating-Point Representation The notion of real numbers in mathematics is convenient for hand computations and formula manipulations. However, real numbers

### 2 Computation with Floating-Point Numbers

2 Computation with Floating-Point Numbers 2.1 Floating-Point Representation The notion of real numbers in mathematics is convenient for hand computations and formula manipulations. However, real numbers

### Floating-Point Arithmetic

Floating-Point Arithmetic Raymond J. Spiteri Lecture Notes for CMPT 898: Numerical Software University of Saskatchewan January 9, 2013 Objectives Floating-point numbers Floating-point arithmetic Analysis

### Floating-point representation

Lecture 3-4: Floating-point representation and arithmetic Floating-point representation The notion of real numbers in mathematics is convenient for hand computations and formula manipulations. However,

### Computational Methods. Sources of Errors

Computational Methods Sources of Errors Manfred Huber 2011 1 Numerical Analysis / Scientific Computing Many problems in Science and Engineering can not be solved analytically on a computer Numeric solutions

### Most nonzero floating-point numbers are normalized. This means they can be expressed as. x = ±(1 + f) 2 e. 0 f < 1

Floating-Point Arithmetic Numerical Analysis uses floating-point arithmetic, but it is just one tool in numerical computation. There is an impression that floating point arithmetic is unpredictable and

### Mathematical preliminaries and error analysis

Mathematical preliminaries and error analysis Tsung-Ming Huang Department of Mathematics National Taiwan Normal University, Taiwan August 28, 2011 Outline 1 Round-off errors and computer arithmetic IEEE

### fractional quantities are typically represented in computers using floating point format this approach is very much similar to scientific notation

Floating Point Arithmetic fractional quantities are typically represented in computers using floating point format this approach is very much similar to scientific notation for example, fixed point number

### Computational Mathematics: Models, Methods and Analysis. Zhilin Li

Computational Mathematics: Models, Methods and Analysis Zhilin Li Chapter 1 Introduction Why is this course important (motivations)? What is the role of this class in the problem solving process using

### Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

1 Logistics Notes for 2016-09-07 1. We are still at 50. If you are still waiting and are not interested in knowing if a slot frees up, let me know. 2. There is a correction to HW 1, problem 4; the condition

### CS321 Introduction To Numerical Methods

CS3 Introduction To Numerical Methods Fuhua (Frank) Cheng Department of Computer Science University of Kentucky Lexington KY 456-46 - - Table of Contents Errors and Number Representations 3 Error Types

### 2.1.1 Fixed-Point (or Integer) Arithmetic

x = approximation to true value x error = x x, relative error = x x. x 2.1.1 Fixed-Point (or Integer) Arithmetic A base 2 (base 10) fixed-point number has a fixed number of binary (decimal) places. 1.

### Review of Calculus, cont d

Jim Lambers MAT 460/560 Fall Semester 2009-10 Lecture 4 Notes These notes correspond to Sections 1.1 1.2 in the text. Review of Calculus, cont d Taylor s Theorem, cont d We conclude our discussion of Taylor

### Section 1.4 Mathematics on the Computer: Floating Point Arithmetic

Section 1.4 Mathematics on the Computer: Floating Point Arithmetic Key terms Floating point arithmetic IEE Standard Mantissa Exponent Roundoff error Pitfalls of floating point arithmetic Structuring computations

### Objectives. look at floating point representation in its basic form expose errors of a different form: rounding error highlight IEEE-754 standard

Floating Point Objectives look at floating point representation in its basic form expose errors of a different form: rounding error highlight IEEE-754 standard 1 Why this is important: Errors come in two

### Floating Point Representation in Computers

Floating Point Representation in Computers Floating Point Numbers - What are they? Floating Point Representation Floating Point Operations Where Things can go wrong What are Floating Point Numbers? Any

### Numerical Computing: An Introduction

Numerical Computing: An Introduction Gyula Horváth Horvath@inf.u-szeged.hu Tom Verhoeff T.Verhoeff@TUE.NL University of Szeged Hungary Eindhoven University of Technology The Netherlands Numerical Computing

### Finite arithmetic and error analysis

Finite arithmetic and error analysis Escuela de Ingeniería Informática de Oviedo (Dpto de Matemáticas-UniOvi) Numerical Computation Finite arithmetic and error analysis 1 / 45 Outline 1 Number representation:

### MAT128A: Numerical Analysis Lecture Two: Finite Precision Arithmetic

MAT128A: Numerical Analysis Lecture Two: Finite Precision Arithmetic September 28, 2018 Lecture 1 September 28, 2018 1 / 25 Floating point arithmetic Computers use finite strings of binary digits to represent

### Floating-point numbers. Phys 420/580 Lecture 6

Floating-point numbers Phys 420/580 Lecture 6 Random walk CA Activate a single cell at site i = 0 For all subsequent times steps, let the active site wander to i := i ± 1 with equal probability Random

### Floating Point Representation. CS Summer 2008 Jonathan Kaldor

Floating Point Representation CS3220 - Summer 2008 Jonathan Kaldor Floating Point Numbers Infinite supply of real numbers Requires infinite space to represent certain numbers We need to be able to represent

### Binary floating point encodings

Week 1: Wednesday, Jan 25 Binary floating point encodings Binary floating point arithmetic is essentially scientific notation. Where in decimal scientific notation we write in floating point, we write

### Math 340 Fall 2014, Victor Matveev. Binary system, round-off errors, loss of significance, and double precision accuracy.

Math 340 Fall 2014, Victor Matveev Binary system, round-off errors, loss of significance, and double precision accuracy. 1. Bits and the binary number system A bit is one digit in a binary representation

### What we need to know about error: Class Outline. Computational Methods CMSC/AMSC/MAPL 460. Errors in data and computation

Class Outline Computational Methods CMSC/AMSC/MAPL 460 Errors in data and computation Representing numbers in floating point Ramani Duraiswami, Dept. of Computer Science Computations should be as accurate

### CS321. Introduction to Numerical Methods

CS31 Introduction to Numerical Methods Lecture 1 Number Representations and Errors Professor Jun Zhang Department of Computer Science University of Kentucky Lexington, KY 40506 0633 August 5, 017 Number

### Computational Methods CMSC/AMSC/MAPL 460. Representing numbers in floating point and associated issues. Ramani Duraiswami, Dept. of Computer Science

Computational Methods CMSC/AMSC/MAPL 460 Representing numbers in floating point and associated issues Ramani Duraiswami, Dept. of Computer Science Class Outline Computations should be as accurate and as

### Chapter 2. Data Representation in Computer Systems

Chapter 2 Data Representation in Computer Systems Chapter 2 Objectives Understand the fundamentals of numerical data representation and manipulation in digital computers. Master the skill of converting

### Floating point numbers in Scilab

Floating point numbers in Scilab Michaël Baudin May 2011 Abstract This document is a small introduction to floating point numbers in Scilab. In the first part, we describe the theory of floating point

### Floating Point Considerations

Chapter 6 Floating Point Considerations In the early days of computing, floating point arithmetic capability was found only in mainframes and supercomputers. Although many microprocessors designed in the

### Number Systems CHAPTER Positional Number Systems

CHAPTER 2 Number Systems Inside computers, information is encoded as patterns of bits because it is easy to construct electronic circuits that exhibit the two alternative states, 0 and 1. The meaning of

### COMPUTER ORGANIZATION AND ARCHITECTURE

COMPUTER ORGANIZATION AND ARCHITECTURE For COMPUTER SCIENCE COMPUTER ORGANIZATION. SYLLABUS AND ARCHITECTURE Machine instructions and addressing modes, ALU and data-path, CPU control design, Memory interface,

### EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing

EE878 Special Topics in VLSI Computer Arithmetic for Digital Signal Processing Part 4-B Floating-Point Arithmetic - II Spring 2017 Koren Part.4b.1 The IEEE Floating-Point Standard Four formats for floating-point

### UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer Arithmetic ECE 666 Part 4-B Floating-Point Arithmetic - II Israel Koren ECE666/Koren Part.4b.1 The IEEE Floating-Point

### Computer Representation of Numbers and Computer Arithmetic

Computer Representation of Numbers and Computer Arithmetic January 17, 2017 Contents 1 Binary numbers 6 2 Memory 9 3 Characters 12 1 c A. Sandu, 1998-2015. Distribution outside classroom not permitted.

### 1.2 Round-off Errors and Computer Arithmetic

1.2 Round-off Errors and Computer Arithmetic 1 In a computer model, a memory storage unit word is used to store a number. A word has only a finite number of bits. These facts imply: 1. Only a small set

### ±M R ±E, S M CHARACTERISTIC MANTISSA 1 k j

ENEE 350 c C. B. Silio, Jan., 2010 FLOATING POINT REPRESENTATIONS It is assumed that the student is familiar with the discussion in Appendix B of the text by A. Tanenbaum, Structured Computer Organization,

### Classes of Real Numbers 1/2. The Real Line

Classes of Real Numbers All real numbers can be represented by a line: 1/2 π 1 0 1 2 3 4 real numbers The Real Line { integers rational numbers non-integral fractions irrational numbers Rational numbers

### UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer Arithmetic ECE 666 Part 4-A Floating-Point Arithmetic Israel Koren ECE666/Koren Part.4a.1 Preliminaries - Representation

### Floating-Point Arithmetic

Floating-Point Arithmetic ECS30 Winter 207 January 27, 207 Floating point numbers Floating-point representation of numbers (scientific notation) has four components, for example, 3.46 0 sign significand

### UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer Arithmetic ECE 666 Part 4-C Floating-Point Arithmetic - III Israel Koren ECE666/Koren Part.4c.1 Floating-Point Adders

### Number Systems. Both numbers are positive

Number Systems Range of Numbers and Overflow When arithmetic operation such as Addition, Subtraction, Multiplication and Division are performed on numbers the results generated may exceed the range of

### Computational Methods CMSC/AMSC/MAPL 460. Representing numbers in floating point and associated issues. Ramani Duraiswami, Dept. of Computer Science

Computational Methods CMSC/AMSC/MAPL 460 Representing numbers in floating point and associated issues Ramani Duraiswami, Dept. of Computer Science Class Outline Computations should be as accurate and as

### ME 261: Numerical Analysis. ME 261: Numerical Analysis

ME 261: Numerical Analysis 3. credit hours Prereq.: ME 163/ME 171 Course content Approximations and error types Roots of polynomials and transcendental equations Determinants and matrices Solution of linear

### Floating point. Today! IEEE Floating Point Standard! Rounding! Floating Point Operations! Mathematical properties. Next time. !

Floating point Today! IEEE Floating Point Standard! Rounding! Floating Point Operations! Mathematical properties Next time! The machine model Chris Riesbeck, Fall 2011 Checkpoint IEEE Floating point Floating

### Floating Point Arithmetic

Floating Point Arithmetic Clark N. Taylor Department of Electrical and Computer Engineering Brigham Young University clark.taylor@byu.edu 1 Introduction Numerical operations are something at which digital

### Floating-Point Data Representation and Manipulation 198:231 Introduction to Computer Organization Lecture 3

Floating-Point Data Representation and Manipulation 198:231 Introduction to Computer Organization Instructor: Nicole Hynes nicole.hynes@rutgers.edu 1 Fixed Point Numbers Fixed point number: integer part

### Hani Mehrpouyan, California State University, Bakersfield. Signals and Systems

Hani Mehrpouyan, Department of Electrical and Computer Engineering, California State University, Bakersfield Lecture 3 (Error and Computer Arithmetic) April 8 th, 2013 The material in these lectures is

### Numerical computing. How computers store real numbers and the problems that result

Numerical computing How computers store real numbers and the problems that result The scientific method Theory: Mathematical equations provide a description or model Experiment Inference from data Test

### Computing Basics. 1 Sources of Error LECTURE NOTES ECO 613/614 FALL 2007 KAREN A. KOPECKY

LECTURE NOTES ECO 613/614 FALL 2007 KAREN A. KOPECKY Computing Basics 1 Sources of Error Numerical solutions to problems differ from their analytical counterparts. Why? The reason for the difference is

### Chapter 3. Errors and numerical stability

Chapter 3 Errors and numerical stability 1 Representation of numbers Binary system : micro-transistor in state off 0 on 1 Smallest amount of stored data bit Object in memory chain of 1 and 0 10011000110101001111010010100010

### Computational Economics and Finance

Computational Economics and Finance Part I: Elementary Concepts of Numerical Analysis Spring 2015 Outline Computer arithmetic Error analysis: Sources of error Error propagation Controlling the error Rates

### CHAPTER 2 SENSITIVITY OF LINEAR SYSTEMS; EFFECTS OF ROUNDOFF ERRORS

CHAPTER SENSITIVITY OF LINEAR SYSTEMS; EFFECTS OF ROUNDOFF ERRORS The two main concepts involved here are the condition (of a problem) and the stability (of an algorithm). Both of these concepts deal with

### Lecture Notes: Floating-Point Numbers

Lecture Notes: Floating-Point Numbers CS227-Scientific Computing September 8, 2010 What this Lecture is About How computers represent numbers How this affects the accuracy of computation Positional Number

### COMPUTER ARCHITECTURE AND ORGANIZATION. Operation Add Magnitudes Subtract Magnitudes (+A) + ( B) + (A B) (B A) + (A B)

Computer Arithmetic Data is manipulated by using the arithmetic instructions in digital computers. Data is manipulated to produce results necessary to give solution for the computation problems. The Addition,

### What Every Programmer Should Know About Floating-Point Arithmetic

What Every Programmer Should Know About Floating-Point Arithmetic Last updated: October 15, 2015 Contents 1 Why don t my numbers add up? 3 2 Basic Answers 3 2.1 Why don t my numbers, like 0.1 + 0.2 add

### Number Systems Standard positional representation of numbers: An unsigned number with whole and fraction portions is represented as:

N Number Systems Standard positional representation of numbers: An unsigned number with whole and fraction portions is represented as: a n a a a The value of this number is given by: = a n Ka a a a a a

### Introduction to Numerical Computing

Statistics 580 Introduction to Numerical Computing Number Systems In the decimal system we use the 10 numeric symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 to represent numbers. The relative position of each symbol

### Chapter 03: Computer Arithmetic. Lesson 09: Arithmetic using floating point numbers

Chapter 03: Computer Arithmetic Lesson 09: Arithmetic using floating point numbers Objective To understand arithmetic operations in case of floating point numbers 2 Multiplication of Floating Point Numbers

### Computational Economics and Finance

Computational Economics and Finance Part I: Elementary Concepts of Numerical Analysis Spring 2016 Outline Computer arithmetic Error analysis: Sources of error Error propagation Controlling the error Rates

### CGF Lecture 2 Numbers

CGF Lecture 2 Numbers Numbers A number is an abstract entity used originally to describe quantity. i.e. 80 Students etc The most familiar numbers are the natural numbers {0, 1, 2,...} or {1, 2, 3,...},

### Scientific Computing. Error Analysis

ECE257 Numerical Methods and Scientific Computing Error Analysis Today s s class: Introduction to error analysis Approximations Round-Off Errors Introduction Error is the difference between the exact solution

### MATH 353 Engineering mathematics III

MATH 353 Engineering mathematics III Instructor: Francisco-Javier Pancho Sayas Spring 2014 University of Delaware Instructor: Francisco-Javier Pancho Sayas MATH 353 1 / 20 MEET YOUR COMPUTER Instructor:

### Bits, Words, and Integers

Computer Science 52 Bits, Words, and Integers Spring Semester, 2017 In this document, we look at how bits are organized into meaningful data. In particular, we will see the details of how integers are

### Floating Point January 24, 2008

15-213 The course that gives CMU its Zip! Floating Point January 24, 2008 Topics IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties class04.ppt 15-213, S 08 Floating

### The Sign consists of a single bit. If this bit is '1', then the number is negative. If this bit is '0', then the number is positive.

IEEE 754 Standard - Overview Frozen Content Modified by on 13-Sep-2017 Before discussing the actual WB_FPU - Wishbone Floating Point Unit peripheral in detail, it is worth spending some time to look at

### Computer Arithmetic Floating Point

Computer Arithmetic Floating Point Chapter 3.6 EEC7 FQ 25 About Floating Point Arithmetic Arithmetic basic operations on floating point numbers are: Add, Subtract, Multiply, Divide Transcendental operations

### CS 6210 Fall 2016 Bei Wang. Lecture 4 Floating Point Systems Continued

CS 6210 Fall 2016 Bei Wang Lecture 4 Floating Point Systems Continued Take home message 1. Floating point rounding 2. Rounding unit 3. 64 bit word: double precision (IEEE standard word) 4. Exact rounding

### Floating-Point Arithmetic

ENEE446---Lectures-4/10-15/08 A. Yavuz Oruç Professor, UMD, College Park Copyright 2007 A. Yavuz Oruç. All rights reserved. Floating-Point Arithmetic Integer or fixed-point arithmetic provides a complete

### Table : IEEE Single Format ± a a 2 a 3 :::a 8 b b 2 b 3 :::b 23 If exponent bitstring a :::a 8 is Then numerical value represented is ( ) 2 = (

Floating Point Numbers in Java by Michael L. Overton Virtually all modern computers follow the IEEE 2 floating point standard in their representation of floating point numbers. The Java programming language

### IEEE Standard 754 Floating Point Numbers

IEEE Standard 754 Floating Point Numbers Steve Hollasch / Last update 2005-Feb-24 IEEE Standard 754 floating point is the most common representation today for real numbers on computers, including Intel-based

### CHAPTER 5 Computer Arithmetic and Round-Off Errors

CHAPTER 5 Computer Arithmetic and Round-Off Errors In the two previous chapters we have seen how numbers can be represented in the binary numeral system and how this is the basis for representing numbers

### Floating-point Arithmetic. where you sum up the integer to the left of the decimal point and the fraction to the right.

Floating-point Arithmetic Reading: pp. 312-328 Floating-Point Representation Non-scientific floating point numbers: A non-integer can be represented as: 2 4 2 3 2 2 2 1 2 0.2-1 2-2 2-3 2-4 where you sum

### Lecture Objectives. Structured Programming & an Introduction to Error. Review the basic good habits of programming

Structured Programming & an Introduction to Error Lecture Objectives Review the basic good habits of programming To understand basic concepts of error and error estimation as it applies to Numerical Methods

### Numerical Methods 5633

Numerical Methods 5633 Lecture 2 Marina Krstic Marinkovic mmarina@maths.tcd.ie School of Mathematics Trinity College Dublin Marina Krstic Marinkovic 1 / 15 5633-Numerical Methods Organisational Assignment

### 4.1 QUANTIZATION NOISE

DIGITAL SIGNAL PROCESSING UNIT IV FINITE WORD LENGTH EFFECTS Contents : 4.1 Quantization Noise 4.2 Fixed Point and Floating Point Number Representation 4.3 Truncation and Rounding 4.4 Quantization Noise

### CSCI 402: Computer Architectures. Arithmetic for Computers (3) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures Arithmetic for Computers (3) Fengguang Song Department of Computer & Information Science IUPUI 3.5 Today s Contents Floating point numbers: 2.5, 10.1, 100.2, etc.. How

### Data Representation Floating Point

Data Representation Floating Point CSCI 2400 / ECE 3217: Computer Architecture Instructor: David Ferry Slides adapted from Bryant & O Hallaron s slides via Jason Fritts Today: Floating Point Background:

### AM205: lecture 2. 1 These have been shifted to MD 323 for the rest of the semester.

AM205: lecture 2 Luna and Gary will hold a Python tutorial on Wednesday in 60 Oxford Street, Room 330 Assignment 1 will be posted this week Chris will hold office hours on Thursday (1:30pm 3:30pm, Pierce

### Computer arithmetics: integers, binary floating-point, and decimal floating-point

n!= 0 && -n == n z+1 == z Computer arithmetics: integers, binary floating-point, and decimal floating-point v+w-w!= v x+1 < x Peter Sestoft 2010-02-16 y!= y p == n && 1/p!= 1/n 1 Computer arithmetics Computer

### Data Representation 1

1 Data Representation Outline Binary Numbers Adding Binary Numbers Negative Integers Other Operations with Binary Numbers Floating Point Numbers Character Representation Image Representation Sound Representation

### Data Representation Floating Point

Data Representation Floating Point CSCI 2400 / ECE 3217: Computer Architecture Instructor: David Ferry Slides adapted from Bryant & O Hallaron s slides via Jason Fritts Today: Floating Point Background:

### Divide: Paper & Pencil

Divide: Paper & Pencil 1001 Quotient Divisor 1000 1001010 Dividend -1000 10 101 1010 1000 10 Remainder See how big a number can be subtracted, creating quotient bit on each step Binary => 1 * divisor or

### Systems I. Floating Point. Topics IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties

Systems I Floating Point Topics IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties IEEE Floating Point IEEE Standard 754 Established in 1985 as uniform standard for

### Floating Point Puzzles. Lecture 3B Floating Point. IEEE Floating Point. Fractional Binary Numbers. Topics. IEEE Standard 754

Floating Point Puzzles Topics Lecture 3B Floating Point IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties For each of the following C expressions, either: Argue that

### EE 109 Unit 19. IEEE 754 Floating Point Representation Floating Point Arithmetic

1 EE 109 Unit 19 IEEE 754 Floating Point Representation Floating Point Arithmetic 2 Floating Point Used to represent very small numbers (fractions) and very large numbers Avogadro s Number: +6.0247 * 10

### System Programming CISC 360. Floating Point September 16, 2008

System Programming CISC 360 Floating Point September 16, 2008 Topics IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties Powerpoint Lecture Notes for Computer Systems:

### Real Numbers finite subset real numbers floating point numbers Scientific Notation fixed point numbers

Real Numbers We have been studying integer arithmetic up to this point. We have discovered that a standard computer can represent a finite subset of the infinite set of integers. The range is determined

### CO212 Lecture 10: Arithmetic & Logical Unit

CO212 Lecture 10: Arithmetic & Logical Unit Shobhanjana Kalita, Dept. of CSE, Tezpur University Slides courtesy: Computer Architecture and Organization, 9 th Ed, W. Stallings Integer Representation For

### Characters, Strings, and Floats

Characters, Strings, and Floats CS 350: Computer Organization & Assembler Language Programming 9/6: pp.8,9; 9/28: Activity Q.6 A. Why? We need to represent textual characters in addition to numbers. Floating-point

### Chapter 2 Float Point Arithmetic. Real Numbers in Decimal Notation. Real Numbers in Decimal Notation

Chapter 2 Float Point Arithmetic Topics IEEE Floating Point Standard Fractional Binary Numbers Rounding Floating Point Operations Mathematical properties Real Numbers in Decimal Notation Representation

### Computer Arithmetic. 1. Floating-point representation of numbers (scientific notation) has four components, for example, 3.

ECS231 Handout Computer Arithmetic I: Floating-point numbers and representations 1. Floating-point representation of numbers (scientific notation) has four components, for example, 3.1416 10 1 sign significandbase