Floating-Point Arithmetic, ECS30, Winter 2017, January 27, 2017
Floating point numbers

Floating-point representation of numbers (scientific notation) has four components, for example,

    -3.1416 × 10^1
    (sign, significand, base, exponent)

Computers use binary (base 2) representation; each digit is the integer 0 or 1, e.g.,

    (10110)_2 = 1·2^4 + 0·2^3 + 1·2^2 + 1·2^1 + 0·2^0 = 22 in base 10

and

    1/10 = 1/16 + 1/32 + 1/256 + 1/512 + 1/4096 + 1/8192 + ...
         = 2^-4 + 2^-5 + 2^-8 + 2^-9 + 2^-12 + 2^-13 + ...
         = (1.100110011...)_2 × 2^-4
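To see this on a real machine, here is a small C sketch (illustrative, not from the slides): it prints the double nearest to 1/10, which is slightly above 0.1 because the binary expansion above is infinite and must be rounded.

    #include <stdio.h>

    int main(void) {
        double x = 0.1;         /* rounded to the nearest double */
        printf("%.20f\n", x);   /* prints 0.10000000000000000555... */
        return 0;
    }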
Floating point numbers

The representation of a floating point number:

    x = ±(b_0 . b_1 b_2 ... b_{p-1})_2 × 2^E

where
- It is normalized, i.e., b_0 = 1 (the hidden bit).
- Precision (= p) is the number of bits in the significand (mantissa), including the hidden bit.
- Machine epsilon ε_m = 2^-(p-1) is the gap between 1 and the smallest floating point number that is greater than 1.

Special numbers: 0, -0, Inf = 1/0, -Inf = -1/0, NaN = Not a Number.
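As an illustration (a C sketch, not part of the original slides), ε_m for double precision can be found by repeated halving; it should agree with DBL_EPSILON from <float.h>, assuming the compiler evaluates in plain double precision rather than extended registers.

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        double eps = 1.0;
        while (1.0 + eps / 2.0 > 1.0)   /* stop once 1 + eps/2 rounds to 1 */
            eps /= 2.0;
        printf("computed eps = %g\n", eps);         /* 2^-52 ~ 2.22e-16 */
        printf("DBL_EPSILON  = %g\n", DBL_EPSILON);
        return 0;
    }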
IEEE standard

All computers designed since 1985 use the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985): they represent each number as a binary number and use binary arithmetic.

Essentials of the IEEE standard:
1. consistent representation of floating-point numbers by all machines adopting the standard;
2. correctly rounded floating-point operations, using various rounding modes;
3. consistent treatment of exceptional situations such as division by zero.

Citation of the 1989 Turing Award to William Kahan: "... his devotion to providing systems that can be safely and robustly used for numerical computations has impacted the lives of virtually anyone who will use a computer in the future."
IEEE single precision format

Single format takes 32 bits (= 4 bytes):

    | s (1 bit) | E (8 bits) | f (23 bits) |
      sign        exponent     fraction (after the binary point)

It represents the number

    (-1)^s × (1.f)_2 × 2^{E-127}

- The leading 1 in the fraction need not be stored explicitly since it is always 1 (the hidden bit).
- Precision p = 24 and machine epsilon ε_m = 2^-23 ≈ 1.2 × 10^-7.
- The E - 127 in the exponent is to avoid the need for storage of a sign bit.
- E_min = (00000001)_2 = (1)_10, E_max = (11111110)_2 = (254)_10.
- The range of positive normalized numbers:
      N_min = (1.00...0)_2 × 2^{E_min - 127} = 2^-126 ≈ 1.2 × 10^-38
      N_max = (1.11...1)_2 × 2^{E_max - 127} ≈ 2^128 ≈ 3.4 × 10^38
- Special representations for 0, ±∞ and NaN.
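A hedged C sketch of this layout (assuming, as is usual, that float is IEEE single precision): copy the bits of a float into a 32-bit integer and mask out the three fields.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float x = -6.25f;               /* -(1.1001)_2 × 2^2 */
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits); /* well-defined way to read the bits */

        uint32_t s = bits >> 31;            /* 1 sign bit       */
        uint32_t E = (bits >> 23) & 0xFF;   /* 8 exponent bits  */
        uint32_t f = bits & 0x7FFFFF;       /* 23 fraction bits */
        printf("s = %u, E = %u (E - 127 = %d), f = 0x%06X\n",
               s, E, (int)E - 127, f);      /* s = 1, E = 129, f = 0x480000 */
        return 0;
    }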
IEEE double precision format

IEEE double format takes 64 bits (= 8 bytes):

    | s (1 bit) | E (11 bits) | f (52 bits) |
      sign        exponent      fraction (after the binary point)

It represents the number

    (-1)^s × (1.f)_2 × 2^{E-1023}

- Precision p = 53 and machine epsilon ε_m = 2^-52 ≈ 2.2 × 10^-16.
- The range of positive normalized numbers is from
      N_min = 2^-1022 ≈ 2.2 × 10^-308
  to
      N_max = (1.11...1)_2 × 2^1023 ≈ 2^1024 ≈ 1.8 × 10^308.
- Special representations for 0, ±∞ and NaN.
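These constants can be checked directly from <float.h> (a C sketch; the printed values assume IEEE double):

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        printf("N_min = DBL_MIN     = %e\n", DBL_MIN);      /* 2.225074e-308 */
        printf("N_max = DBL_MAX     = %e\n", DBL_MAX);      /* 1.797693e+308 */
        printf("eps_m = DBL_EPSILON = %e\n", DBL_EPSILON);  /* 2.220446e-16  */
        return 0;
    }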
Rounding modes

Let a positive real number x be in the normalized range, i.e., N_min ≤ x ≤ N_max, and write it in the normalized form

    x = (1.b_1 b_2 ... b_{p-1} b_p b_{p+1} ...)_2 × 2^E.

Then the closest floating-point number less than or equal to x is

    x_- = (1.b_1 b_2 ... b_{p-1})_2 × 2^E,

i.e., x_- is obtained by truncating. The next floating-point number bigger than x_- (and hence also the next one bigger than x) is

    x_+ = ( (1.b_1 b_2 ... b_{p-1})_2 + (0.00...01)_2 ) × 2^E.

If x is negative, the situation is reversed.
Rounding modes, cont'd

Four rounding modes:
1. round down: fl(x) = x_-
2. round up: fl(x) = x_+
3. round towards zero: fl(x) = x_- if x ≥ 0, fl(x) = x_+ if x ≤ 0
4. round to nearest (IEEE default rounding mode): fl(x) = x_- or x_+, whichever is nearer to x.

Note: except that if x > N_max, fl(x) = ∞, and if x < -N_max, fl(x) = -∞. In the case of a tie, i.e., x_- and x_+ are the same distance from x, the one with its least significant bit equal to zero is chosen.
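C99 exposes these modes through <fenv.h>. A sketch (assuming the platform honors fesetround at run time) shows 1/10 landing on x_- or x_+ depending on the mode; only those two doubles can ever appear.

    #include <stdio.h>
    #include <fenv.h>

    static void show(int mode, const char *name) {
        fesetround(mode);
        volatile double x = 1.0, y = 10.0;  /* volatile forces a runtime divide */
        printf("%-12s 1/10 = %.17e\n", name, x / y);
    }

    int main(void) {
        show(FE_DOWNWARD,   "down");        /* x_-            */
        show(FE_UPWARD,     "up");          /* x_+            */
        show(FE_TOWARDZERO, "toward zero"); /* x_- (x > 0)    */
        show(FE_TONEAREST,  "to nearest");  /* x_+ for 1/10   */
        return 0;
    }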
Rounding error

When round to nearest is in effect,

    relerr(x) = |fl(x) - x| / |x| ≤ (1/2) ε_m.

Therefore, we have

    relerr ≤ (1/2) × 2^-23 = 2^-24 ≈ 5.96 × 10^-8    (single precision)
    relerr ≤ (1/2) × 2^-52 = 2^-53 ≈ 1.1 × 10^-16    (double precision)
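A quick C check of this bound in single precision (illustrative; here fl(x) is the float nearest the double value x):

    #include <stdio.h>
    #include <math.h>
    #include <float.h>

    int main(void) {
        double x = 1.0 / 3.0;   /* reference value                      */
        float fx = (float)x;    /* fl(x): round x to the nearest float  */
        double relerr = fabs((double)fx - x) / fabs(x);
        printf("relerr = %e <= eps_m/2 = %e\n", relerr, FLT_EPSILON / 2.0);
        return 0;
    }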
Floating-point arithmetic

IEEE rules for correctly rounded floating-point operations: if x and y are correctly rounded floating-point numbers, then

    fl(x + y) = (x + y)(1 + δ)
    fl(x - y) = (x - y)(1 + δ)
    fl(x × y) = (x × y)(1 + δ)
    fl(x / y) = (x / y)(1 + δ)

where, for round to nearest,

    |δ| ≤ (1/2) ε_m.

The IEEE standard also requires that correctly rounded remainder and square root operations be provided.
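Because each operation is rounded separately, familiar algebraic identities can fail. A C sketch (illustrative) showing that addition is not associative:

    #include <stdio.h>

    int main(void) {
        double a = 1e16, b = -1e16, c = 1.0;
        printf("(a + b) + c = %g\n", (a + b) + c);  /* 1                       */
        printf("a + (b + c) = %g\n", a + (b + c));  /* 0: b + c rounds back to b */
        return 0;
    }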
Floating-point arithmetic, cont'd

IEEE standard response to exceptions:

    Event             | Example                     | Set result to
    ------------------|-----------------------------|--------------------------
    Invalid operation | 0/0, 0 × ∞                  | NaN
    Division by zero  | finite nonzero/0            | ±∞
    Overflow          | |x| > N_max                 | ±∞ or ±N_max
    Underflow         | x ≠ 0, |x| < N_min          | ±0, ±N_min or subnormal
    Inexact           | whenever fl(x ∘ y) ≠ x ∘ y  | correctly rounded value
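A C sketch triggering each event under the default IEEE responses (the printed forms of inf and nan vary slightly by platform):

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        double zero = 0.0, one = 1.0;
        printf("1/0       = %g\n", one / zero);           /* inf                 */
        printf("0/0       = %g\n", zero / zero);          /* nan                 */
        printf("0 * inf   = %g\n", zero * (one / zero));  /* nan                 */
        printf("overflow  = %g\n", DBL_MAX * 2.0);        /* inf                 */
        printf("underflow = %g\n", DBL_MIN / 1e10);       /* subnormal ~2.2e-318 */
        return 0;
    }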
Floating-point arithmetic error

Let x̂ and ŷ be floating-point numbers with

    x̂ = x(1 + τ_1)  and  ŷ = y(1 + τ_2),   |τ_i| ≤ τ ≪ 1,

where the τ_i could be the relative errors in the process of collecting/getting the data from the original source, or from previous operations.

Question: how do the four basic arithmetic operations behave?
Floating-point arithmetic error: +, -

Addition and subtraction:

    fl(x̂ + ŷ) = (x̂ + ŷ)(1 + δ),   |δ| ≤ (1/2) ε_m
             = x(1 + τ_1)(1 + δ) + y(1 + τ_2)(1 + δ)
             = x + y + x(τ_1 + δ + O(τ ε_m)) + y(τ_2 + δ + O(τ ε_m))
             = (x + y) ( 1 + x/(x + y) · (τ_1 + δ + O(τ ε_m)) + y/(x + y) · (τ_2 + δ + O(τ ε_m)) )
             ≡ (x + y)(1 + δ̂),

where δ̂ can be bounded as follows:

    |δ̂| ≤ ( |x| + |y| ) / |x + y| · ( τ + (1/2) ε_m + O(τ ε_m) ).
Floating-point arithmetic error: +, -

Three possible cases:
1. If x and y have the same sign, i.e., xy > 0, then |x + y| = |x| + |y|; this implies

       |δ̂| ≤ τ + (1/2) ε_m + O(τ ε_m).

   Thus fl(x̂ + ŷ) approximates x + y well.
2. If x ≈ -y, so that x + y ≈ 0, then (|x| + |y|)/|x + y| ≫ 1; this implies that |δ̂| could be nearly 1 or much bigger. This is the so-called catastrophic cancellation; it causes the relative errors or uncertainties already present in x̂ and ŷ to be magnified (see the C sketch below).
3. In general, if (|x| + |y|)/|x + y| is not too big, fl(x̂ + ŷ) provides a good approximation to x + y.
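A minimal C sketch of case 2 (illustrative): the inputs below are correct to about ε_m, yet the subtraction leaves a relative error of several percent.

    #include <stdio.h>

    int main(void) {
        float x = 1.000001f;   /* already rounded: carries an error ~ eps_m */
        float y = 1.0f;
        float diff = x - y;    /* exact subtraction of the rounded inputs   */
        printf("computed: %.9g\n", diff);   /* 9.53674316e-07, ~5%% off     */
        printf("exact:    %.9g\n", 1.0e-6);
        return 0;
    }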
Catastrophic cancellation: example 1

Computing sqrt(x + 1) - sqrt(x) straightforwardly causes substantial loss of significant digits for large x:

    x        | fl(sqrt(x+1))           | fl(sqrt(x))             | fl( fl(sqrt(x+1)) - fl(sqrt(x)) )
    1.00e+10 | 1.00000000004999994e+05 | 1.00000000000000000e+05 | 4.99999441672658e-06
    1.00e+11 | 3.16227766018419060e+05 | 3.16227766016837908e+05 | 1.58115290053428650e-06
    1.00e+12 | 1.00000000000050000e+06 | 1.00000000000000000e+06 | 5.00003807246685028e-07
    1.00e+13 | 3.16227766016853740e+06 | 3.16227766016837955e+06 | 1.57859763973236084e-07
    1.00e+14 | 1.00000000000000503e+07 | 1.00000000000000000e+07 | 5.02914190292358398e-08
    1.00e+15 | 3.16227766016838041e+07 | 3.16227766016837971e+07 | 1.86264514923095703e-08
    1.00e+16 | 1.00000000000000000e+08 | 1.00000000000000000e+08 | 0.00000000000000000e+00
Catastrophic cancellation: example 1

Catastrophic cancellation can sometimes be avoided if a formula is properly reformulated. For example, one can compute sqrt(x + 1) - sqrt(x) almost to full precision by using the equality

    sqrt(x + 1) - sqrt(x) = 1 / ( sqrt(x + 1) + sqrt(x) ).

Consequently, the computed results are:

    n        | fl( 1/(sqrt(n+1) + sqrt(n)) )
    1.00e+10 | 4.999999999875000e-06
    1.00e+11 | 1.581138830080237e-06
    1.00e+12 | 4.999999999998749e-07
    1.00e+13 | 1.581138830084150e-07
    1.00e+14 | 4.999999999999987e-08
    1.00e+15 | 1.581138830084189e-08
    1.00e+16 | 5.000000000000000e-09
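A C sketch (illustrative) computing both forms in double precision reproduces the two tables above:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        for (double x = 1e10; x <= 1e16; x *= 10.0) {
            double naive = sqrt(x + 1.0) - sqrt(x);          /* cancels badly */
            double safe  = 1.0 / (sqrt(x + 1.0) + sqrt(x));  /* no cancellation */
            printf("x = %.2e  naive = %.17e  safe = %.17e\n", x, naive, safe);
        }
        return 0;
    }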
Catastrophic cancellation: example 2

Consider the function

    h(x) = (1 - cos x) / x^2.

Note that 0 ≤ h(x) < 1/2 for all x ≠ 0. Let x = 1.2 × 10^-8; then the computed value

    fl(h(x)) = 0.770988...

is completely wrong! Alternatively, the function can be re-written as

    h(x) = (1/2) ( sin(x/2) / (x/2) )^2.

Consequently, for x = 1.2 × 10^-8, the computed function value

    fl(h(x)) = 0.499999... < 1/2

is fine!
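The same comparison in C (a sketch, double precision, x = 1.2 × 10^-8):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double x = 1.2e-8;
        double naive = (1.0 - cos(x)) / (x * x);  /* 1 - cos x cancels badly */
        double s = sin(x / 2.0) / (x / 2.0);
        double safe = 0.5 * s * s;                /* rewritten form          */
        printf("naive = %.6f\n", naive);          /* ~0.770988, wrong        */
        printf("safe  = %.6f\n", safe);           /* ~0.500000, correct      */
        return 0;
    }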
Floating-point arithmetic error: ×, /

Multiplication and division:

    fl(x̂ × ŷ) = (x̂ × ŷ)(1 + δ) = xy(1 + τ_1)(1 + τ_2)(1 + δ) ≡ xy(1 + δ̂_×),
    fl(x̂ / ŷ) = (x̂ / ŷ)(1 + δ) = (x/y)(1 + τ_1)(1 + τ_2)^-1 (1 + δ) ≡ (x/y)(1 + δ̂_/),

where

    δ̂_× = τ_1 + τ_2 + δ + O(τ ε_m),
    δ̂_/ = τ_1 - τ_2 + δ + O(τ ε_m).

Thus |δ̂_×| ≤ 2τ + (1/2) ε_m + O(τ ε_m) and |δ̂_/| ≤ 2τ + (1/2) ε_m + O(τ ε_m).

Multiplication and division are very well-behaved!
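A C sketch checking this for multiplication, using long double as a higher-precision reference (an assumption: on platforms where long double equals double, the measured error is simply 0, still within the bound):

    #include <stdio.h>
    #include <math.h>
    #include <float.h>

    int main(void) {
        double x = 1.0 / 3.0, y = 1.0 / 7.0;
        double p = x * y;                                   /* fl(x × y)            */
        long double ref = (long double)x * (long double)y;  /* nearly exact product */
        double relerr = (double)(fabsl((long double)p - ref) / ref);
        printf("relerr = %e <= eps_m/2 = %e\n", relerr, DBL_EPSILON / 2.0);
        return 0;
    }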
Reading Section.7 of Numerical Computing with MATLAB by C. Moler Websites discussions of numerical disasters: T. Huckle, Collection of software bugs http://www5.in.tum.de/ huckle/bugse.html K. Vuik, Some disasters caused by numerical errors http://ta.twi.tudelft.nl/nw/users/vuik/wi2/disasters.html D. Arnold, Some disasters attributable to bad numerical computing http://www.ima.umn.edu/ arnold/disasters/disasters.html In-depth material: D. Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Computing Survey, Vol.23(), pp.5-48, 99
LU factorization revisited: the need of pivoting in LU factorization, numerically

LU factorization without pivoting:

    A = [ 1.000 × 10^-4   1 ]  =  LU  =  [ 1     0 ] [ u_11  u_12 ]
        [ 1               1 ]            [ l_21  1 ] [ 0     u_22 ]

In three-decimal-digit floating-point arithmetic, we have

    L̂ = [ 1             0 ]  =  [ 1     0 ]
        [ fl(1/10^-4)   1 ]     [ 10^4  1 ]

    Û = [ 10^-4   1            ]  =  [ 10^-4   1     ]
        [ 0       fl(1 - 10^4) ]     [ 0       -10^4 ]

Check:

    L̂Û = [ 1     0 ] [ 10^-4   1     ]  =  [ 10^-4   1 ]  ≠  A,
         [ 10^4  1 ] [ 0       -10^4 ]     [ 1       0 ]

since the (2,2) entry of L̂Û is 10^4 × 1 + 1 × (-10^4) = 0 instead of 1.
LU factorization revisited: the need of pivoting in LU factorization, numerically

Consider solving Ax = [ 1 ; 2 ] for x using this LU factorization.

Solving L̂y = [ 1 ; 2 ]:

    ŷ_1 = fl(1/1) = 1
    ŷ_2 = fl(2 - 10^4 × 1) = -10^4

Note that the value 2 has been lost by subtracting 10^4 from it.

Solving Ûx = ŷ:

    x_2 = fl( (-10^4) / (-10^4) ) = 1
    x_1 = fl( (1 - 1) / 10^-4 ) = 0

Hence

    x̂ = [ x_1 ; x_2 ] = [ 0 ; 1 ]

is a completely erroneous solution to the correct answer x ≈ [ 1 ; 1 ].
LU factorization revisited: the need of pivoting in LU factorization, numerically

LU factorization with partial pivoting: PA = LU. Exchange the order of the rows of A, i.e.,

    P = [ 0  1 ]
        [ 1  0 ]

Then the LU factorization of PA is given by

    L̂ = [ 1              0 ]  =  [ 1      0 ]
        [ fl(10^-4 / 1)  1 ]     [ 10^-4  1 ]

    Û = [ 1   1              ]  =  [ 1   1 ]
        [ 0   fl(1 - 10^-4)  ]     [ 0   1 ]

The computed L̂Û approximates PA very accurately. As a result, the computed solution x̂ is also perfect! (A small C sketch follows.)
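The same phenomenon can be reproduced in single precision with a C sketch (illustrative: a tiny pivot 10^-8 plays the role of 10^-4 in three-digit arithmetic, and the 2×2 solver below is a hypothetical helper, not the slides' algorithm):

    #include <stdio.h>

    /* LU without pivoting on [a11 a12; a21 a22], then solve for b = [b1; b2] */
    static void solve2x2(float a11, float a12, float a21, float a22,
                         float b1, float b2, float *x1, float *x2) {
        float l21 = a21 / a11;         /* eliminate a21 using pivot a11 */
        float u22 = a22 - l21 * a12;
        float y2  = b2 - l21 * b1;     /* forward substitution          */
        *x2 = y2 / u22;                /* back substitution             */
        *x1 = (b1 - a12 * *x2) / a11;
    }

    int main(void) {
        float x1, x2;
        /* no pivoting: the tiny pivot 1e-8 destroys the answer */
        solve2x2(1e-8f, 1.0f, 1.0f, 1.0f, 1.0f, 2.0f, &x1, &x2);
        printf("no pivoting:   x = (%g, %g)\n", x1, x2);   /* (0, 1) */
        /* partial pivoting: swap the rows (and right-hand side) first */
        solve2x2(1.0f, 1.0f, 1e-8f, 1.0f, 2.0f, 1.0f, &x1, &x2);
        printf("with pivoting: x = (%g, %g)\n", x1, x2);   /* (1, 1) */
        return 0;
    }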