Introduction to floating point arithmetic

Matthias Petschow and Paolo Bientinesi
AICES, RWTH Aachen
October 24th, 2013, Aachen, Germany

Disclaimer
Muller et al., Handbook of Floating-Point Arithmetic, runs to several hundred pages.
Many topics are not covered in this lecture (e.g., hardware/software implementation of floating point arithmetic, language support, using floating point arithmetic cleverly).
This lecture: the basics of floating point representation and arithmetic.
Goal: make you aware of the issues that arise in finite precision computations.
See the references below for a more thorough treatment of the topic.

References
Article: David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic
Article: William Kahan, Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic
Book: Nick Higham, Accuracy and Stability of Numerical Algorithms
Book: Michael Overton, Numerical Computing with IEEE Floating Point Arithmetic
Standards: IEEE standards for floating-point arithmetic (IEEE 754)

Numerical Representation
Numbers: 123 = … (first 40 digits), π = …
In general: infinite number of digits.
Computers: finite memory, so numbers are approximated.

FP representation
Floating point system $F \setminus \{0\}$:
$$(-1)^s \; d_0.d_1 d_2 d_3 \ldots d_{p-1} \times \beta^{e}$$
with base $\beta$, precision $p$, and
$s \in \{0, 1\}$, $d_i \in \{0, \ldots, \beta - 1\}$, $d_0 \neq 0$, $e_{\min} \le e \le e_{\max}$.
IEEE-754 specifications ($\beta = 2$, $d_0 = 1$):
Single: $p = 24$, $e_{\min} = -126$, $e_{\max} = 127$
Double: $p = 53$, $e_{\min} = -1022$, $e_{\max} = 1023$
Additionally: $\pm 0$, $\pm\infty$, subnormal numbers, NaNs
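To make the definition concrete, the following sketch enumerates every positive normalized number of a very small system. The parameters ($\beta = 2$, $p = 3$, $e_{\min} = -1$, $e_{\max} = 2$) and the function name `normalized_floats` are illustrative choices, not taken from the lecture.

```python
from fractions import Fraction
from itertools import product


def normalized_floats(beta, p, e_min, e_max):
    """Return all positive normalized numbers d0.d1...d(p-1) * beta**e, sorted."""
    values = set()
    for e in range(e_min, e_max + 1):
        for d0 in range(1, beta):  # normalized: leading digit d0 != 0
            for tail in product(range(beta), repeat=p - 1):
                digits = (d0,) + tail
                # significand = sum of d_i * beta**(-i), kept exact with Fraction
                significand = sum(Fraction(d, beta**i) for i, d in enumerate(digits))
                values.add(significand * Fraction(beta) ** e)
    return sorted(values)


toy = normalized_floats(beta=2, p=3, e_min=-1, e_max=2)
gaps = [b - a for a, b in zip(toy, toy[1:])]

print(len(toy), "numbers from", toy[0], "to", toy[-1])
print("smallest gap:", min(gaps), "largest gap:", max(gaps))
```

The spacing between neighbours grows with the exponent, which is why the error of a floating point representation is naturally measured in relative terms.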

[Figure: IEEE single precision]

IEEE single precision
$$(-1)^s \, (1 + f) \times 2^{E - 127}$$
$f = 0.d_1 d_2 \ldots d_{23}$
$e = E - 127$, with biased exponent $1 \le E \le 254$
Special values for $E = 0$ (F denotes the stored fraction bits):
F = 0: $\pm 0$
F ≠ 0: subnormal numbers with $d_0 = 0$ and implicit exponent $e = -126$
Special values for $E = 255$:
F = 0: $\pm\infty$
F ≠ 0: NaNs
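The field layout can be explored directly on bit patterns. The sketch below assumes the usual 1/8/23 split of sign, biased exponent, and fraction bits; the helper names `decode_single` and `single_value` are hypothetical and only serve this illustration.

```python
import struct


def decode_single(bits):
    """Split a 32-bit pattern into (s, E, F) and classify it."""
    s = bits >> 31
    E = (bits >> 23) & 0xFF  # biased exponent, 8 bits
    F = bits & 0x7FFFFF      # fraction field, 23 bits
    if E == 0:
        kind = "zero" if F == 0 else "subnormal"
    elif E == 255:
        kind = "infinity" if F == 0 else "nan"
    else:
        kind = "normalized"
    return s, E, F, kind


def single_value(bits):
    """Value of the pattern as a Python float (a double holds every single exactly)."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


for bits in (0x3F800000, 0x7F7FFFFF, 0x00800000, 0x00000001, 0x7F800000):
    print(f"{bits:#010x}", decode_single(bits), single_value(bits))
```

The five patterns are 1, the largest finite number, the smallest positive normalized number, the smallest positive subnormal number, and +infinity.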

[Figure: IEEE double precision]

Example questions
Question 1: What is the largest finite floating point number in IEEE single precision?
Question 2: What is the smallest positive normalized floating point number in IEEE single precision? What is the smallest positive number?
Question 3: How many normalized IEEE single precision numbers and how many subnormal numbers are there?

Example questions
Question 4: Given IEEE single/double precision ($p = 24, 53$), what is the gap between 1 and the next larger number?
Question 5: Between an adjacent pair of nonzero IEEE single precision numbers, how many IEEE double precision numbers are there?
Question 6: What is the largest integer $u$ such that all integers in the interval $[-u, u]$ are exactly representable in IEEE single precision format? What is the corresponding $u$ for double precision?
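Candidate answers to Questions 4 and 6 can be checked numerically. This is a minimal sketch, assuming that Python floats are IEEE doubles and that `struct` rounds to the nearest single when packing with the `f` format; the helper `to_single` is a hypothetical name.

```python
import struct
import sys


def to_single(x):
    """Round a double to the nearest IEEE single and return it as a double."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


# Question 4: the gap between 1 and the next larger number is 2**(1 - p).
gap_double = sys.float_info.epsilon
next_single = struct.unpack(">f", struct.pack(">I", 0x3F800001))[0]
gap_single = next_single - 1.0
print(gap_double == 2.0**-52, gap_single == 2.0**-23)

# Question 6: 2**p is representable, but 2**p + 1 is rounded back to 2**p.
double_absorbs_one = (2.0**53 + 1.0 == 2.0**53)
single_absorbs_one = (to_single(2.0**24 + 1.0) == 2.0**24)
print(double_absorbs_one, single_absorbs_one)
```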

Representation error
Relative rounding error: let $x \in \mathbb{R}$ with $|x| \in [\omega_{\min}, \omega_{\max}]$. Then
$\tilde{x} = x(1 + \delta)$ where $|\delta| \le u$, and
$\tilde{x} = x/(1 + \delta)$ where $|\delta| \le u$ (for a generally different $\delta$),
with $u$ denoting the unit roundoff.
$\omega_{\min}$ = smallest normalized positive floating point number
$\omega_{\max}$ = largest normalized positive floating point number
$\tilde{x} = [x]$ = floating point representation of $x$
If rounded to the nearest float, $\tilde{x} = \mathrm{RN}(x)$ and $u = 2^{-p}$.
For other rounding modes, $u = 2^{-p+1}$.
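The bound on $\delta$ can be verified exactly for double precision ($p = 53$) with rational arithmetic. The sketch relies on two facts about Python: converting a `Fraction` to `float` rounds to the nearest double, and converting a `float` to `Fraction` is exact. The test values are arbitrary.

```python
from fractions import Fraction

U = Fraction(1, 2**53)  # unit roundoff for double precision, round to nearest


def relative_rounding_error(x):
    """Return delta with RN(x) = x * (1 + delta), computed exactly."""
    rounded = Fraction(float(x))  # float(...) rounds, Fraction(...) is exact
    return (rounded - x) / x


samples = (Fraction(1, 10), Fraction(1, 3), Fraction(2, 3), Fraction(22, 7))
for x in samples:
    delta = relative_rounding_error(x)
    print(x, float(delta), abs(delta) <= U)
```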

Example questions
Task: Given a base 2 floating point format with precision $p$, show that if $x \in \mathbb{R}$ lies in the normalized range, then $\mathrm{RN}(x) = x(1 + \delta)$ with $|\delta| \le 2^{-p}$, where RN() rounds to the nearest floating point number.
Question: What if RN(x) is subnormal? Find an example where the above is not true.
Question: What if we truncate, that is, round toward zero with RZ(), instead of rounding to the nearest number with RN()?
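One possible counterexample for the subnormal case, sketched in single precision ($p = 24$): the value $x = 0.75 \cdot 2^{-149}$ is my own choice, and `to_single` is a hypothetical helper that relies on `struct` rounding a double to the nearest single.

```python
import struct
from fractions import Fraction


def to_single(x):
    """Round a double to the nearest IEEE single and return it as a double."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


u = Fraction(1, 2**24)     # unit roundoff of single precision, round to nearest
x = 0.75 * 2.0**-149       # exact as a double, below the smallest single subnormal
rounded = to_single(x)     # the nearest single is the smallest subnormal, 2**-149

delta = (Fraction(rounded) - Fraction(x)) / Fraction(x)
print(rounded == 2.0**-149, delta, delta > u)
```

Here the relative error is 1/3, far larger than $2^{-24}$, because subnormal numbers are spaced by a fixed absolute amount rather than a relative one.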

Finite precision arithmetic
4-digit representation
Inexact arithmetic: a result that does not fit in four digits is either truncated or rounded.
Associativity? No!
Exact arithmetic: both groupings of a three-term sum give the same result.
Inexact arithmetic: the two groupings can give different results.
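A 4-digit decimal arithmetic is easy to simulate with Python's `decimal` module. The operands below are my own illustrative values, not the ones shown on the slide.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_EVEN

rounded = Context(prec=4, rounding=ROUND_HALF_EVEN)  # round to nearest
truncated = Context(prec=4, rounding=ROUND_DOWN)     # chop extra digits

# Truncation and rounding of the same quotient differ in the last digit.
q_truncated = truncated.divide(Decimal(2), Decimal(3))
q_rounded = rounded.divide(Decimal(2), Decimal(3))
print(q_truncated, q_rounded)

# Addition with 4 significant digits is not associative.
a, b, c = Decimal("1234"), Decimal("0.4"), Decimal("0.4")
left = rounded.add(rounded.add(a, b), c)   # 1234.4 rounds to 1234, twice
right = rounded.add(a, rounded.add(b, c))  # 0.8 is exact, 1234.8 rounds to 1235
print(left, right)
```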

IEEE FP arithmetic
The standard floating point model: given an operation $\circ \in \{+, -, \times, /\}$ and $x, y \in F$,
$$[x \circ y] = \mathrm{RN}(x \circ y) = (x \circ y)(1 + \delta) \quad \text{with } |\delta| \le u,$$
provided no underflow/overflow occurred.
RN() can be replaced by other rounding modes. For RN() we have seen that $u = 2^{-p}$.
Also, $(x \circ y)(1 + \delta)$ can be replaced by $(x \circ y)/(1 + \delta)$.
Similarly, $[x] = \mathrm{RN}(x)$.
What about sin(x), cos(x), exp(x), log(x), ...?
Notation: assuming a left-to-right evaluation, it holds that
$$[x + y + z/w] = \Big[\big[[x] + [y]\big] + \big[[z]/[w]\big]\Big]$$
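The model can be checked exactly for individual double precision operations by comparing the computed result with the rational result. This is a minimal sketch with arbitrary operands; `model_delta` is a hypothetical helper name.

```python
import operator
from fractions import Fraction

U = Fraction(1, 2**53)  # unit roundoff, double precision, round to nearest

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def model_delta(op, x, y):
    """Return delta with [x op y] = (x op y) * (1 + delta), computed exactly."""
    computed = Fraction(OPS[op](x, y))         # result of the floating point operation
    exact = OPS[op](Fraction(x), Fraction(y))  # same operation on exact rationals
    return (computed - exact) / exact          # assumes the exact result is nonzero


x, y = 0.1, 0.3  # already floating point numbers, so only the operation rounds
for op in OPS:
    delta = model_delta(op, x, y)
    print(op, float(delta), abs(delta) <= U)
```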

Example questions
Question 1: If $x$ is a floating point number, is the floating point product $[1 \times x]$ equal to $x$?
Question 2: If $x \neq 0$ is a (finite) floating point number, is the floating point quotient $[x/x]$ equal to 1?
Question 3: If $x$ is a floating point number, is the floating point product $[0.5 \times x]$ equal to the floating point quotient $[x/2]$?

Example questions
Question 4: Is it true that for all $a, b \in \mathbb{R}$ we have $[a + b] = [b + a]$ and $[a \times b] = [b \times a]$?
Question 5: Let $a = 1$, $b = 1$, and $c = …$ In IEEE single precision arithmetic, what are the results of $[a + [b + c]]$ and $[[a + b] + c]$?
Question 6: Let $a = b = 2^{513}$, and $c = …$ In IEEE double precision arithmetic, what are the results of $[a \times [b \times c]]$ and $[[a \times b] \times c]$?
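Effects of the kind these questions ask about can be reproduced in double precision. The operand values in this sketch are my own assumptions, chosen so that one grouping loses a small term or overflows; they are not necessarily the values of the original exercise.

```python
import math
import random

# Addition: the small term c is absorbed when it is added to b first.
a, b, c = 1.0, -1.0, 2.0**-60
sum_right = a + (b + c)  # b + c rounds to -1.0, so the total is 0.0
sum_left = (a + b) + c   # a + b is exactly 0.0, so the total is c
print(sum_right, sum_left)

# Multiplication: one grouping overflows to infinity, the other does not.
x = y = 2.0**513
z = 2.0**-600
prod_left = (x * y) * z   # x * y exceeds the largest double
prod_right = x * (y * z)  # every intermediate result is in range
print(prod_left, prod_right)

# Commutativity holds for each single operation on these random samples.
random.seed(1)
pairs = [(random.uniform(-1e6, 1e6), random.uniform(-1e6, 1e6)) for _ in range(1000)]
commutes = all(p + q == q + p and p * q == q * p for p, q in pairs)
print(commutes)
```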

Example questions
Question 7: What is the result of calling the function fun(2.0, 0.0) defined below (using IEEE floating point values)?

    fun(a, b) {
        res = 1/(1/a + 1/b)
        return(res)
    }

Question 8: What is a potential problem of the following program? How can it be fixed?

    hypot(a, b) {
        c = a^2 + b^2
        c = sqrt(c)
        return(c)
    }
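For Question 8, one common remedy is to scale by the larger argument so that the squares cannot overflow or underflow prematurely. The sketch below shows that idea under the assumption of finite inputs; `naive_hypot` and `scaled_hypot` are hypothetical names, and library routines such as `math.hypot` handle more special cases.

```python
import math


def naive_hypot(a, b):
    """Direct formula: a*a or b*b may overflow or underflow."""
    return math.sqrt(a * a + b * b)


def scaled_hypot(a, b):
    """Factor out the larger magnitude; assumes finite inputs."""
    a, b = abs(a), abs(b)
    big, small = max(a, b), min(a, b)
    if big == 0.0:
        return 0.0
    ratio = small / big  # lies in [0, 1], so ratio * ratio cannot overflow
    return big * math.sqrt(1.0 + ratio * ratio)


print(naive_hypot(1e200, 1e200), scaled_hypot(1e200, 1e200))
print(naive_hypot(1e-200, 1e-200), scaled_hypot(1e-200, 1e-200))
```

Note that Question 7 cannot be reproduced with plain Python floats, because Python raises `ZeroDivisionError` for `1/0.0` instead of returning the IEEE result.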

Example questions
Task: Calculating the roots of a quadratic polynomial with roots() as below can cause inaccurate results (e.g. $a = c = 1$, $b = 10^8$ for double precision). Write a code that avoids cancellation as much as possible, e.g. by using $r_1 r_2 = c/a$. Are there other potential problems, such as overflow?

    roots(a, b, c) {
        r_1 = (-b + sqrt(b^2 - 4ac)) / (2a)
        r_2 = (-b - sqrt(b^2 - 4ac)) / (2a)
        return(r_1, r_2)
    }
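One possible approach to the task, sketched under the assumptions that the roots are real, $a \neq 0$, and the intermediate quantity `q` is nonzero: compute the root without cancellation first, then obtain the other from $r_1 r_2 = c/a$. The function names are hypothetical, and overflow of `b * b` is not addressed here.

```python
import math


def roots_naive(a, b, c):
    """Textbook formula; one of the two roots may suffer from cancellation."""
    s = math.sqrt(b * b - 4.0 * a * c)
    return (-b + s) / (2.0 * a), (-b - s) / (2.0 * a)


def roots_stable(a, b, c):
    """Return (c / q, q / a); the order of the roots depends on the sign of b."""
    s = math.sqrt(b * b - 4.0 * a * c)
    q = -0.5 * (b + math.copysign(s, b))  # b and the added term share a sign
    return c / q, q / a


small_naive, _ = roots_naive(1.0, 1e8, 1.0)
small_stable, _ = roots_stable(1.0, 1e8, 1.0)
reference = -1e-8  # the small root is -1e-8 up to a relative correction of about 1e-16

print(small_naive, small_stable)
print(abs(small_naive - reference) / abs(reference))
print(abs(small_stable - reference) / abs(reference))
```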

Error Analysis
$f : A \to B$, $y = f(x)$
Ex.: $f(x) = x^2 + \sin(2x)$, $x = \pi/123$, $f(x) = {?}$
Exact arithmetic: $\left(\frac{\pi}{123}\right)^2 + \sin\left(2\,\frac{\pi}{123}\right) = \ldots$
Inexact arithmetic: $x \to \hat{x}$, $f \to \hat{f}$; we obtain $\hat{f}(\hat{x})$ instead of $f(x)$.

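Example: Dot Product
$x, y \in \mathbb{R}^n$; $\kappa := x^T y$, evaluated as
$$\kappa := \Big(\big((\chi_0\psi_0 + \chi_1\psi_1) + \cdots\big) + \chi_{n-2}\psi_{n-2}\Big) + \chi_{n-1}\psi_{n-1}$$
In floating point arithmetic:
$$\check{\kappa} = \Big(\big((\chi_0\psi_0(1 + \epsilon_*^{(0)}) + \chi_1\psi_1(1 + \epsilon_*^{(1)}))(1 + \epsilon_+^{(1)}) + \cdots\big)(1 + \epsilon_+^{(n-2)}) + \chi_{n-1}\psi_{n-1}(1 + \epsilon_*^{(n-1)})\Big)(1 + \epsilon_+^{(n-1)})$$
$$\check{\kappa} = \sum_{i=0}^{n-1} \Big(\chi_i\psi_i\,(1 + \epsilon_*^{(i)}) \prod_{j=i}^{n-1} (1 + \epsilon_+^{(j)})\Big)$$
where $\epsilon_+^{(0)} = 0$ and $|\epsilon_*^{(0)}|, |\epsilon_*^{(j)}|, |\epsilon_+^{(j)}| \le u$ for $j = 1, \ldots, n-1$.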
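Since every term of the sum carries at most $n$ factors of the form $(1 + \epsilon)$, the error satisfies $|\check{\kappa} - \kappa| \le \gamma_n \sum_i |\chi_i \psi_i|$ with $\gamma_n = nu/(1 - nu)$. This bound is a standard consequence that is not stated on the slide. The sketch below checks it on random data, assuming that no underflow or overflow occurs; the data and the name `dot_left_to_right` are illustrative.

```python
import random
from fractions import Fraction


def dot_left_to_right(x, y):
    """Accumulate the dot product in double precision, left to right."""
    kappa = 0.0
    for chi, psi in zip(x, y):
        kappa += chi * psi  # one rounded product and one rounded sum per step
    return kappa


random.seed(0)
n = 1000
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
y = [random.uniform(-1.0, 1.0) for _ in range(n)]

computed = Fraction(dot_left_to_right(x, y))
exact = sum(Fraction(chi) * Fraction(psi) for chi, psi in zip(x, y))

u = Fraction(1, 2**53)
gamma_n = n * u / (1 - n * u)
bound = gamma_n * sum(abs(Fraction(chi) * Fraction(psi)) for chi, psi in zip(x, y))

error = abs(computed - exact)
print(float(error), float(bound), error <= bound)
```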

Backward Stability
Let $f : D \to R$ be a map from the domain $D$ to the range $R$.
Let $\hat{f} : D \to R$ represent the execution in floating point arithmetic of a given algorithm A that computes $f$.
A is said to be backward stable if for all $x \in D$ there exists a perturbed input $\tilde{x} \in D$, close to $x$, such that $\hat{f}(x) = f(\tilde{x})$.
That is, the result computed in floating point arithmetic, $\hat{f}(x)$, equals the result obtained when the mathematically exact function $f$ is applied to slightly perturbed data $\tilde{x}$.
The difference between $\tilde{x}$ and $x$ is the perturbation to the original input $x$.
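A single floating point addition is a small instance of this definition: the computed sum $(x + y)(1 + \delta)$ is the exact sum of the perturbed inputs $x(1 + \delta)$ and $y(1 + \delta)$. The sketch below makes this explicit with rational arithmetic; the inputs are arbitrary, and the perturbed inputs need not be floating point numbers themselves.

```python
from fractions import Fraction

x, y = 0.1, 0.2
u = Fraction(1, 2**53)

computed = Fraction(x + y)         # f_hat(x, y): addition in double precision
exact = Fraction(x) + Fraction(y)  # f(x, y): exact addition
delta = (computed - exact) / exact

# Perturbed inputs, each within a relative distance |delta| <= u of the originals.
x_tilde = Fraction(x) * (1 + delta)
y_tilde = Fraction(y) * (1 + delta)

print(x_tilde + y_tilde == computed, abs(delta) <= u)
```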
