Floating-Point Arithmetic

MCS 471 Lecture 1(b), Numerical Analysis
Jan Verschelde, 18 June 2018

1. Numerical Analysis
   a definition
   sources of error
2. Floating-Point Numbers
   floating-point representation of a real number
   machine precision
3. Floating-Point Arithmetic
   adding two floating-point numbers
   loss of significance
4. Arbitrary Precision and Interval Arithmetic
   extending floating-point arithmetic

Numerical Analysis: a definition

Definition (Nick Trefethen, SIAM News 1992). Numerical analysis is the study of algorithms for the problems of continuous mathematics.

We care about the efficiency and accuracy of algorithms. When we solve problems with continuous models, we obtain approximate answers for approximate input data.

Two related disciplines:
- Computer Algebra, to formulate and reformulate problems.
- Scientific Computing, for applications to science.

sources of error

Some sources of error are
- truncation errors in mathematical models;
- observed input data, which are approximate numbers;
- representation errors, e.g., 1/10 in binary;
- roundoff errors during calculations.

In numerical analysis, we ask two important questions:
1. How sensitive is the output to changes in the input?
2. Do roundoff errors in an algorithm propagate?

These two questions are addressed, respectively, by
1. numerical conditioning, a property of a problem;
2. numerical stability, a property of an algorithm.
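
The representation error of 1/10 in binary can be made visible with a short Julia sketch. It assumes 64-bit doubles; the variable name stored is only illustrative.

```julia
# 1/10 has no finite binary expansion, so the literal 0.1 is stored as the
# nearest 64-bit floating-point number, not as the real number 1/10.
stored = Rational{BigInt}(0.1)          # exact value of the stored double
println(stored == 1//10)                # the stored value differs from 1/10
println(Float64(abs(stored - 1//10)))   # size of the representation error

# The representation errors surface in ordinary arithmetic:
println(0.1 + 0.2 == 0.3)
```

The conversion to Rational{BigInt} is exact, so the difference printed is the true representation error and not another rounded quantity.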

absolute and relative error

Definition (absolute error). Let $\tilde{x}$ be an approximation for $x$. The absolute error $\Delta x$ is the absolute value of the difference of $\tilde{x}$ with $x$:

$\Delta x = |\tilde{x} - x|$.

Definition (relative error). Let $\tilde{x}$ be an approximation for $x \neq 0$. The relative error $\delta x$ is the absolute error divided by the absolute value of $x$:

$\delta x = \frac{|\tilde{x} - x|}{|x|}$.
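
As a small illustration of the two definitions, the sketch below measures how well 22/7 approximates pi. The function names abs_error and rel_error are hypothetical helpers, not part of Julia.

```julia
# Absolute and relative error of an approximation (exact must be nonzero
# for the relative error to make sense).
abs_error(approx, exact) = abs(approx - exact)
rel_error(approx, exact) = abs(approx - exact) / abs(exact)

approx = 22 / 7
println(abs_error(approx, pi))
println(rel_error(approx, pi))
```

The relative error is the more informative of the two when the magnitude of the exact value is far from 1.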

floating-point numbers

A floating-point number consists of
1. one sign bit,
2. a normalized fraction: the leading bit is nonzero, and
3. an exponent.

Definition. The floating-point representation $fl(x)$ of a real number $x \in \mathbb{R}$ is

$fl(x) = \pm .bb \ldots b \times 2^e$,

stored compactly as the tuple $(\pm, e, bb \ldots b)$. The representation error is $|fl(x) - x|$.
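
The decomposition into sign, fraction, and exponent can be inspected in Julia. The sketch below uses frexp, which returns a fraction in [1/2, 1) and so matches the normalization $.bb \ldots b \times 2^e$ used above; the value of x is an arbitrary choice.

```julia
x = -10.0
f, e = frexp(x)               # x == f * 2^e with 0.5 <= |f| < 1
println((signbit(x), f, e))   # sign, normalized fraction, exponent
println(ldexp(f, e) == x)     # ldexp reassembles the number exactly
```

Here 10 = 0.625 × 2^4, and 0.625 is .101 in binary, so the leading bit of the fraction is nonzero.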

floating-point formats

Hardware supports single precision (32-bit), double precision (64-bit), and long double precision (80-bit), summarized below:

                     number of bits
    precision      sign  exponent  fraction  total
    single            1         8        23     32
    double            1        11        52     64
    long double       1        15        64     80

A 64-bit floating-point number has
- 1 sign bit $s$, 0 for positive, 1 for negative,
- 11 bits $e_1, e_2, \ldots, e_{11}$ in the exponent, and
- 52 bits $f_1, f_2, \ldots, f_{52}$ in the fraction. For normalized numbers, the leading nonzero bit is implicit and is not stored among these 52 bits.

    s | e_1 e_2 ... e_11 | f_1 f_2 ... f_52
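
The 1 + 11 + 52 layout of a double can be checked by slicing the bit string of a number. This is a minimal sketch; the helper name fields is hypothetical.

```julia
# Split the 64 bits of a double into its three fields.
function fields(x::Float64)
    b = bitstring(x)    # 64 characters, most significant bit first
    return (sign = b[1:1], exponent = b[2:12], fraction = b[13:64])
end

println(fields(1.0))
println(fields(-2.0))
```

For 1.0 the stored exponent bits read 01111111111, which is 1023 in decimal, in agreement with the offset encoding of the exponent.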

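a number line

Consider a floating-point number system with basis 2,
1. with two bits in the (normalized) fraction, and
2. with exponents $-1, 0, +1, +2$.

We display all positive floating-point numbers in this system:

    $.10 \times 2^{-1} = 0.01  = 1/4$
    $.11 \times 2^{-1} = 0.011 = 3/8$
    $.10 \times 2^{0}  = 0.1   = 1/2$
    $.11 \times 2^{0}  = 0.11  = 3/4$
    $.10 \times 2^{1}  = 1     = 1$
    $.11 \times 2^{1}  = 1.1   = 3/2$
    $.10 \times 2^{2}  = 10    = 2$
    $.11 \times 2^{2}  = 11    = 3$

The numbers are not equally spaced, so the bound on the error $|fl(x) - x|$ grows with the magnitude of $x$: from $|fl(x) - x| \le 1/8$ among the smaller numbers to $|fl(x) - x| \le 1/2$ among the larger numbers.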
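
The numbers of this small system and the gaps between them can be enumerated with exact rational arithmetic. The sketch below assumes the convention $.b_1 b_2 \times 2^e$ with $b_1 = 1$; the variable names are illustrative.

```julia
fractions = [1//2, 3//4]    # .10 and .11 in binary
exponents = -1:2

# All positive numbers of the system, in increasing order.
numbers = sort([f * (2//1)^e for e in exponents for f in fractions])
gaps = diff(numbers)        # spacing between consecutive numbers

println(numbers)
println(gaps)
```

The gaps double each time the exponent increases by one, which is why the absolute representation error grows with the magnitude of the number.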

machine precision

Definition. The machine precision $\epsilon_{\rm mach}$ is the distance between 1 and the smallest floating-point number greater than 1. For basis $B$ and size $p$ of the fraction:

$\epsilon_{\rm mach} = B^{-p}$.

For $0 < \epsilon < \epsilon_{\rm mach}$: $(1 + \epsilon) - 1 \neq \epsilon + (1 - 1)$.

The machine precision as supported by hardware for single floats (32-bit), double floats (64-bit), and long double floats (80-bit) is below:

                     number of bits                 machine
    precision      sign  exponent  fraction  total  precision
    single            1         8        23     32  1.192e-07
    double            1        11        52     64  2.220e-16
    long double       1        15        64     80  5.421e-20
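
The definition can be checked for the hardware formats that Julia exposes directly. The sketch below uses eps and nextfloat from Julia's Base; the choice of one quarter of the machine precision for the small number e is arbitrary.

```julia
# Machine precision is the gap between 1 and the next larger float.
println(eps(Float64) == ldexp(1.0, -52))          # 2^-52 for doubles
println(nextfloat(1.0) - 1.0 == eps(Float64))
println(eps(Float32) == ldexp(1.0f0, -23))        # 2^-23 for singles

# Below the machine precision, addition to 1 loses the small term.
e = eps(Float64) / 4
println((1 + e) - 1)    # the sum 1 + e rounds back to 1
println(e + (1 - 1))    # the same terms in another order keep e
```

The two last lines evaluate the same real expression, yet give different floating-point results: floating-point addition is not associative.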

the smallest and largest exponent

An exponent $e \in [e_{\min}, e_{\max}]$, where $e_{\min}$ is the smallest exponent and $e_{\max}$ is the largest exponent.

                     number of bits                 exponent range
    precision      sign  exponent  fraction  total  e_min     e_max
    single            1         8        23     32    -126      127
    double            1        11        52     64   -1022     1023
    long double       1        15        64     80  -16382    16383

Special values for the exponent for double precision:
- sign bit 1, all exponent bits 1, nonzero fraction: -NaN, not a number;
- sign bit 1, all exponent bits 1, zero fraction: -Inf, represents $-\infty$;
- all exponent bits 0: numbers that are not normalized;
- sign bit 0, all exponent bits 1, zero fraction: +Inf, represents $+\infty$.
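
The special exponent patterns can be observed on actual doubles. In this sketch the helpers expbits and fracbits are hypothetical names; bitstring, exponent, floatmin, floatmax, and issubnormal are functions of Julia's Base.

```julia
expbits(x::Float64)  = bitstring(x)[2:12]     # the 11 exponent bits
fracbits(x::Float64) = bitstring(x)[13:64]    # the 52 fraction bits

for x in (Inf, -Inf, NaN, nextfloat(0.0))
    zero_fraction = all(==('0'), fracbits(x))
    println(x, ": exponent bits ", expbits(x), ", zero fraction: ", zero_fraction)
end

# Exponent range of the normalized doubles.
println(exponent(floatmin(Float64)), " ", exponent(floatmax(Float64)))
println(issubnormal(nextfloat(0.0)))    # not normalized
```

Infinities and NaN share the all-ones exponent and differ only in the fraction, while the all-zeros exponent marks the numbers that are not normalized.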

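exponent encoding

The exponents are encoded with an offset: the stored exponent bits, read as an integer, minus 1023 give the exponent of a double.

    julia> a = 2.0^(-1022)
    2.2250738585072014e-308

    julia> bitstring(a)
    "0000000000010000000000000000000000000000000000000000000000000000"

The stored exponent bits are 00000000001. We see that the smallest exponent is $1 - 1023 = -1022$.

    julia> b = 2.0^1023
    8.98846567431158e307

    julia> bitstring(b)
    "0111111111100000000000000000000000000000000000000000000000000000"

The stored exponent bits are 11111111110. We see that the largest exponent is $2046 - 1023 = 1023$.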
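
The offset encoding can be undone in a few lines. This is a sketch for doubles only; the function names stored_exponent and decoded_exponent are hypothetical.

```julia
# Read the 11 exponent bits as an unsigned integer, then remove the offset.
stored_exponent(x::Float64)  = parse(Int, bitstring(x)[2:12]; base = 2)
decoded_exponent(x::Float64) = stored_exponent(x) - 1023

for x in (ldexp(1.0, -1022), 1.0, ldexp(1.0, 1023))
    println(x, ": stored ", stored_exponent(x), ", decoded ", decoded_exponent(x))
end
```

For normalized doubles, the decoded value agrees with what the Base function exponent returns.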

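the smallest and largest double

    julia> a = nextfloat(0.0)
    5.0e-324

    julia> bitstring(a)
    "0000000000000000000000000000000000000000000000000000000000000001"

The smallest number is not normalized: $2^{-1074} = 2^{-1022} \times 2^{-52}$.

    julia> b = prevfloat(Inf)
    1.7976931348623157e308

    julia> bitstring(b)
    "0111111111101111111111111111111111111111111111111111111111111111"

The largest number has exponent 1023 and fraction $1 + (1 - 2^{-52})$.

    julia> bitstring(2.0^1023*(1 + (1-2.0^(-52))))
    "0111111111101111111111111111111111111111111111111111111111111111"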
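
The extremes of the double format can also be checked with exact scalings by powers of two. The sketch uses ldexp to avoid any rounding in forming the powers; the names tiny and huge are illustrative.

```julia
tiny = nextfloat(0.0)     # smallest positive double, not normalized
huge = prevfloat(Inf)     # largest finite double

println(tiny == ldexp(1.0, -1074))
println(huge == floatmax(Float64))
println(huge == ldexp(2.0 - ldexp(1.0, -52), 1023))   # (1 + (1 - 2^-52)) * 2^1023
println(floatmin(Float64) == ldexp(1.0, -1022))       # smallest normalized double

println(2 * huge)    # overflow
println(tiny / 2)    # underflow
```

Doubling the largest double overflows to Inf, and halving the smallest positive double underflows to zero.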

adding two floating-point numbers

Consider two numbers $x$ and $y$ in a system with 4 as the size of the fraction.

Four steps to add two floating-point numbers:
1. Align the numbers so they both have the same exponent.
2. Perform the addition.
3. Round the result.
4. Normalize the result.

Exercise 1: check the accuracy of the sum.
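
The four steps can be sketched for a toy system with a 4-bit fraction. Everything in this sketch is an assumption made for illustration: the operands $.1101 \times 2^2$ and $.1011 \times 2^0$ are hypothetical values, the names value and toy_add are not from any library, only positive numbers are handled, and normalization is done before rounding because the rounding position depends on the final exponent.

```julia
const P = 4    # size of the fraction in bits

# A toy float is a pair (m, e) with value (m / 2^P) * 2^e,
# normalized so that 2^(P-1) <= m < 2^P (leading bit nonzero).
value(m, e) = (m // 2^P) * (2 // 1)^e

function toy_add(mx, ex, my, ey)
    # Step 1: align both fractions on the larger exponent (exact rationals).
    e = max(ex, ey)
    ax = mx // 2^(e - ex)
    ay = my // 2^(e - ey)
    # Step 2: perform the addition, in units of 2^-P.
    s = ax + ay
    # Normalize: shift until the fraction fits in P bits.
    while s >= 2^P
        s = s // 2
        e += 1
    end
    # Step 3: round the fraction to an integer number of units.
    m = round(Int, s)
    # Step 4: rounding up may overflow the fraction, so normalize again.
    if m == 2^P
        m = 2^(P - 1)
        e += 1
    end
    return m, e
end

m, e = toy_add(0b1101, 2, 0b1011, 0)     # .1101 x 2^2 + .1011 x 2^0
println((m, e), " = ", value(m, e))
println("exact sum = ", value(0b1101, 2) + value(0b1011, 0))
```

With these operands the exact sum is 63/16, while the 4-bit result is $.1000 \times 2^3 = 4$: the accuracy check of the sum shows a rounding error of 1/16.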

loss of significance

Consider two numbers $x$ and $y$ in a system with 4 as the size of the fraction. Compute $x - y$ and normalize the result.

Problem: $x$ and $y$ each have 4 bits of significance, but the result $x - y$ has only one significant bit of accuracy.
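
A concrete instance of this cancellation, with operands chosen here for illustration (the values $.1101$ and $.1100$ are hypothetical), can be worked out exactly with rationals:

```julia
x = 13 // 16    # .1101 in binary (hypothetical operand)
y = 12 // 16    # .1100 in binary (hypothetical operand)
d = x - y       # .0001 in binary: the three leading bits cancel

# Normalize the difference: d == f * 2^e with 0.5 <= f < 1.
f, e = frexp(Float64(d))
println((d, f, e))
```

After normalization the difference is $.1000 \times 2^{-3}$. Only the leading bit comes from the operands; the three trailing zeros are padding and carry no information.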

restructuring a calculation

Consider $\sqrt{9.01} - 3$ in a three decimal digit number system. In this system, $\sqrt{9.01} \approx 3.0016662$ is represented by 3.00 and 3 is represented by 3.00. The subtraction will thus yield zero.

We can avoid the subtraction:

$\sqrt{9.01} - 3 = \frac{(\sqrt{9.01} - 3)(\sqrt{9.01} + 3)}{\sqrt{9.01} + 3} = \frac{9.01 - 9}{\sqrt{9.01} + 3}$.

The difference in the numerator is not zero: 9.01 minus 9.00 yields 0.01. Dividing by $3.00 + 3.00 = 6.00$ results in $1.67 \times 10^{-3}$.
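
The three decimal digit system can be imitated in Julia by rounding every intermediate result to three significant digits. This is only an emulation on top of 64-bit doubles; the helper name r3 is hypothetical.

```julia
r3(x) = round(x, sigdigits = 3)    # keep three significant decimal digits

s = r3(sqrt(9.01))                 # the square root in the 3-digit system
naive = s - r3(3.0)                # direct subtraction cancels completely
restructured = r3(r3(9.01 - 9.0) / r3(s + r3(3.0)))

println(naive)
println(restructured)
println(sqrt(9.01) - 3)            # reference value in double precision
```

The restructured formula recovers about three correct digits of the difference, where the direct subtraction returns none.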

extending floating-point arithmetic

Two ways to extend floating-point arithmetic:

1. arbitrary precision floating-point arithmetic
   The GNU Multiprecision Arithmetic Library (GMP) and the GNU MPFR library provide arbitrary-precision integers and floating-point numbers, wrapped in Julia by the types BigInt and BigFloat. See the methods precision() and setprecision() to query the precision (in bits) and to set the precision (also in bits).

2. interval arithmetic
   Instead of one number, we calculate with an interval $[a, b]$, where $a$ is the lower and $b$ the upper bound for the approximation.
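
A minimal sketch of querying and setting the precision of BigFloat follows. The choice of 100 bits and of the square root of 2 as test value is arbitrary; the do-block form of setprecision restores the previous precision afterwards.

```julia
println(precision(BigFloat))    # current default precision, in bits

# Compute with 100 bits inside the block only.
x = setprecision(BigFloat, 100) do
    sqrt(BigFloat(2))
end

println(precision(x))           # the result keeps its 100 bits
println(x)
println(big(2)^100)             # BigInt: integers of arbitrary size
```

Using the do-block form keeps the change of precision local, so later computations are not affected by it.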

setting the precision of BigFloat

Consider the following session:

    julia> setprecision(BigFloat, 5);

    julia> x = sqrt(BigFloat(9.01))
    3.00

Exercise 2: What is the smallest value for the precision of a BigFloat to see the difference between 3 and sqrt(BigFloat(9.01))?

Exercise 3: Is there a value for the precision of a BigFloat for which 3 is the same as sqrt(BigFloat(9.01)), but for which (9.01-9)/(sqrt(BigFloat(9.01)) + 3) yields a better result?

expression evaluation

The example is taken from the paper of Stefano Taschini: Interval Arithmetic: Python Implementation and Applications, Proceedings of the 7th Python in Science Conference (SciPy 2008).

    F(x, y) = (333//1 + 3//4 - x^2)*y^6 + x^2*(11//1*x^2*y^2 - 121//1*y^4 - 2//1) + (11//2)*y^8 + x/(2//1*y)
    (A, B) = (BigInt(77617)//1, BigInt(33096)//1)
    exact = F(A, B)

Exercise 4: Compare the exact result with the evaluation with 64-bit floating-point arithmetic. What is the difference?

Exercise 5: Find the smallest value for the precision of a BigFloat so the expression is evaluated correctly.

use interval arithmetic

Exercise 6: Use the package IntervalArithmetic to evaluate the expression. Interpret the result of the evaluation. Demonstrate how interval arithmetic can be applied to find the correct value for the precision.
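
To show the idea behind interval arithmetic without relying on the interface of any package, here is a hand-rolled sketch. It is not the API of IntervalArithmetic: the type Interval and the helper outward are hypothetical, and the outward rounding is done crudely by stepping each bound to the neighboring double.

```julia
struct Interval
    lo::Float64
    hi::Float64
end

# Outward rounding: move each computed bound one double outward, so the
# exact result of the operation on the bounds stays inside the interval.
outward(lo, hi) = Interval(prevfloat(lo), nextfloat(hi))

Base.:+(a::Interval, b::Interval) = outward(a.lo + b.lo, a.hi + b.hi)
Base.:-(a::Interval, b::Interval) = outward(a.lo - b.hi, a.hi - b.lo)
function Base.:*(a::Interval, b::Interval)
    p = (a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi)
    return outward(minimum(p), maximum(p))
end

Base.in(x::Real, a::Interval) = a.lo <= x <= a.hi
width(a::Interval) = a.hi - a.lo

tenth = Interval(prevfloat(0.1), nextfloat(0.1))   # encloses the real 1/10
s = tenth + tenth + tenth                          # encloses the real 3/10
println(s)
println(width(s))
```

The width of the result is a computed bound on the accumulated error. A wide interval signals that the working precision is too low for the expression at hand.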
