Floating-Point Arithmetic

Size: px

Start display at page:

Download "Floating-Point Arithmetic"

Cecilia Edwards
6 years ago
Views:

1 Floating-Point Arithmetic ECS30 Winter 207 January 27, 207

2 Floating point numbers Floating-point representation of numbers (scientific notation) has four components, for example, sign significand base exponent Computers use binary (base 2) representation, each digit is the integer 0 or, e.g. 00 = = 22 in base 0 and 0 = = =

3 Floating point numbers The representation of a floating point number: x = ±b 0.b b 2 b p 2 E where It is normalized, i.e., b0 = (the hidden bit) Precision (= p) is the number of bits in the significand (mantissa) (including the hidden bit). Machine epsilon ɛm = 2 (p ), the gap between the number and the smallest floating point number that is greater than. Special numbers 0, 0, Inf =, Inf =, NaN = Not a Number.

4 IEEE standard All computers designed since 985 use the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std ), represent each number as a binary number and use binary arithmetic. Essentials of the IEEE standard:. consistent representation of floating-point numbers by all machines adopting the standard; 2. correctly rounded floating-point operations, using various rounding modes; 3. consistent treatment of exceptional situation such as division by zero. Citation of 989 Turing Award to William Kahan:... his devotion to providing systems that can be safely and robustly used for numerical computations has impacted the lives of virtually anyone who will use a computer in the future

5 IEEE single precision format Single format takes 32 bits (=4 bytes) long: 8 23 s E f sign exponent binary point fraction It represents the number ( ) s (.f) 2 E 27 The leading in the fraction need not be stored explicitly since it is always (hidden bit) Precision p = 24 and machine epsilon ɛ m = The E 27 in the exponent is to avoid the need for storage of a sign bit. E min = ( ) 2 = () 0, E max = (0) 2 = (254) 0. The range of positive normalized numbers: N min = Emin 27 = N max =. 2 Emax Special repsentations for 0, ± and NaN.

6 IEEE double pecision format IEEE double format takes 64 bits (= 8 bytes) long: 52 s E f sign exponent binary point fraction It represents the numer ( ) s (.f) 2 E 023 Precision p = 53 and machine epsilon ɛ m = The range of positive normalized numbers is from N min = N max = Special repsentations for 0, ± and NaN.

7 Rounding modes Let a positive real number x be in the normalized range, i.e., N min x N max, and write in the normalized form x =.b b 2 b p b p b p E, Then the closest floating-pont number less than or equal to x is i.e., x is obtained by truncating. x =.b b 2 b p 2 E The next floating-point number bigger than x (also the next one that bigger than x) is x + = (.b b 2 b p ) 2 E If x is negative, the situtation is reversed.

8 rounding modes, cont d Four rounding modes:. round down: fl(x) = x 2. round up: fl(x) = x + 3. round towards zero: fl(x) = x of x 0 fl(x) = x + of x 0 4. round to nearest (IEEE default rounding mode): fl(x) = x or x + whichever is nearer to x. Note: except that if x > Nmax, fl(x) =, and if x < Nmax, fl(x) =. In the case of tie, i.e., x and x + are the same distance from x, the one with its least significant bit equal to zero is chosen.

9 Rounding error When the round to nearest is in effect, Therefore, we have relerr = relerr(x) = fl(x) x x 2 ɛ m = , single precision , double precision.

10 Floating-point arithmetic IEEE rules for correctly rounded floating-point operations: if x and y are correctly rounded floating-point numbers, then where for the round to nearest, fl(x + y) = (x + y)( + δ) fl(x y) = (x y)( + δ) fl(x y) = (x y)( + δ) fl(x/y) = (x/y)( + δ) δ 2 ɛ m IEEE standard also requires that correctly rounded remainder and square root operations be provided.

11 Floating-point arithmetic, cont d IEEE standard response to exceptions Event Example Set result to Invalid operation 0/0, 0 NaN Division by zero Finite nonzero/0 ± Overflow x > N max ± or ±N max underflow x 0, x < N min ±0, ±N min or subnormal Inexact whenever fl(x y) x y correctly rounded value

12 Floating-point arithmetic error Let ˆx and ŷ be the floating-point numbers and that ˆx = x( + τ ) and ŷ = y( + τ 2 ), for τ i τ where τ i could be the relative errors in the process of collecting/getting the data from the original source or the previous operations, and Question: how do the four basic arithmetic operations behave?

13 Floating-point arithmetic error: +, Addition and subtraction: fl(ˆx + ŷ) = (ˆx + ŷ)( + δ), δ 2 ɛm = x( + τ )( + δ) + y( + τ 2)( + δ) = x + y + x(τ + δ + O(τɛ m)) + y(τ 2 + δ + O(τɛ m)) ( = (x + y) + x (τ + δ + O(τɛm)) + y ) (τ2 + δ + O(τɛm)) x + y x + y (x + y)( + ˆδ), where ˆδ can be bounded as follows: ˆδ x + y x + y ( τ + ) 2 ɛ m + O(τɛ m ).

14 Floating-point arithmetic error: +, Three possible cases:. If x and y have the same sign, i.e., xy > 0, then x + y = x + y ; this implies ˆδ τ + 2 ɛ m + O(τɛ m ). Thus fl(ˆx + ŷ) approximates x + y well. 2. If x y x + y 0, then ( x + y )/ x + y ; this implies that ˆδ could be nearly or much bigger than. This is so called catastrophic cancellation, it causes relative errors or uncertainties already presented in ˆx and ŷ to be magnified. 3. In general, if ( x + y )/ x + y is not too big, fl(ˆx + ŷ) provides a good approximation to x + y.

15 Catastrophic cancellation: example Computing x + x straightforward causes substantial loss of significant digits for large n x fl( x + ) fl( x) fl(fl( x + ) fl( x).00e e e e-06.00e e e e-06.00e e e e-07.00e e e e-07.00e e e e-08.00e e e e-08.00e e e e+00

16 Catastrophic cancellation: example Catastrophic cancellation can sometimes be avoided if a formula is properly reformulated. For example, one can compute x + x almost to full precision by using the equality x + x = x + + x. Consequently, the computed results are n fl(/( n + + n)).00e e-06.00e e-06.00e e-07.00e e-07.00e e-08.00e e-08.00e e-09

17 Catastrophic cancellation: example 2 Consider the function h(x) = cos x x 2 Note that 0 f(x) < /2 for all x 0. Let x =.2 0 8, then the computed is completely wrong! fl(h(x)) = Alternatively, the function can be re-written as ( ) 2 sin(x/2) h(x) =. x/2 Consequently, for x =.2 0 8, then the computed function fl(h(x)) = < /2 is fine!

18 Floating-point arithmetic error:, / Multiplication and Division: fl(ˆx ŷ) = (ˆx ŷ)( + δ) = xy( + τ )( + τ 2 )( + δ) xy( + ˆδ ), fl(ˆx/ŷ) = (ˆx/ŷ)( + δ) = (x/y)( + τ )( + τ 2 ) ( + δ) xy( + ˆδ ), where ˆδ = τ + τ 2 + δ + O(τɛ m ) ˆδ = τ τ 2 + δ + O(τɛ m ). Thus ˆδ 2τ + 2 ɛ m + O(τɛ m ) and ˆδ 2τ + 2 ɛ m + O(τɛ m ). Multiplication and division are very well-behaved!

19 Reading Section.7 of Numerical Computing with MATLAB by C. Moler Websites discussions of numerical disasters: T. Huckle, Collection of software bugs huckle/bugse.html K. Vuik, Some disasters caused by numerical errors D. Arnold, Some disasters attributable to bad numerical computing arnold/disasters/disasters.html In-depth material: D. Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Computing Survey, Vol.23(), pp.5-48, 99

20 LU factorization revisited: the need of pivoting in LU factorization, numerically LU factorization without pivoting [ ] [ ] [ A = = = LU = l 2 ] [ ] u u 2 u 22 In three decimal-digit floating-point arithmetic, we have [ ] [ ] 0 0 L = fl(/0 4 = ) 0 4, [ ] [ ] Û = fl( 0 4 = ) 0 4, Check: LÛ = [ ] [ ] [ = 0 ] A,

21 LU factorization revisited: the need of pivoting in LU factorization, numerically [ ] Consider solving Ax = for x using this LU factorization. 2 Solving Ly = [ 2 ] = ŷ = fl(/) = ŷ 2 = fl(2 0 4 ) = 0 4. note that the value 2 has been lost by subtracting 0 4 from it. Solving [ ] Ûx = ŷ = 0 4 = x2 = fl(( 04 )/( 0 4 )) = x = fl(( )/0 4 ) = 0, [ ] [ ] x 0 x = = is a completely erroneous solution to the correct x 2 [ ] answer x.

22 LU factorization revisited: the need of pivoting in LU factorization, numerically LU factorization with partial pivoting P A = LU, By exchanging the order of the rows of A, i.e., [ ] 0 P =, 0 Then for the LU factorization of P A is given by [ ] [ ] 0 0 L = fl(0 4 = /) 0 4, [ ] [ ] Û = fl( 0 4 =. ) The computed LU approximates A very accurately. As a result, the computed solution x is also perfect!

Computer Arithmetic. 1. Floating-point representation of numbers (scientific notation) has four components, for example, 3.

Computer Arithmetic. 1. Floating-point representation of numbers (scientific notation) has four components, for example, 3. ECS231 Handout Computer Arithmetic I: Floating-point numbers and representations 1. Floating-point representation of numbers (scientific notation) has four components, for example, 3.1416 10 1 sign significandbase