Chapter 3. Errors and numerical stability

1 Representation of numbers

Binary system: each micro-transistor is in one of two states, off (0) or on (1).
Smallest amount of stored data: the bit.
Object in memory: a chain of 1s and 0s, e.g. 10011000110101001111010010100010.
Byte: 8 bits.

On most micro-computers and workstations:
word = 4 bytes = 32 bits
double precision word = 8 bytes = 64 bits, so operations are done in double precision.

On certain super-computers (Cray):
word = 8 bytes = 64 bits
double precision on a Cray = quadruple precision on a PC or workstation!

Integers

An integer is stored in a word = 4 bytes = 32 bits.
In an n-bit system:
$$i = s_{n-1} 2^{n-1} + s_{n-2} 2^{n-2} + \dots + s_2 2^2 + s_1 2^1 + s_0 2^0, \qquad s_k = 0 \text{ or } 1$$
The bits $[s_{n-1}, s_{n-2}, \dots, s_2, s_1, s_0]$ are stored in memory.
Largest representable integer: $2^n - 1$.
In a 32-bit system: $2^{32} - 1 = 4\,294\,967\,295$.
If $i > 2^{32} - 1$: overflow.

Signed integers

What about negative integers?
Binary arithmetic on words of fixed length is arithmetic on a finite cyclic group.
Add 1 to the largest representable integer: 111...111 + 1 = 1 000...000. The leading 1 is lost, so the result is 0.
In an n-bit system, addition is defined modulo $2^n$.
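The wrap-around can be imitated in a few lines. This is a minimal sketch, not a description of any particular hardware: Python integers have arbitrary precision, so the fixed word length is simulated with a bit mask, and the helper name `add_unsigned` is a hypothetical one chosen for illustration.

```python
def add_unsigned(a: int, b: int, n_bits: int = 32) -> int:
    """Add two unsigned integers as an n-bit register would (modulo 2**n_bits)."""
    mask = (1 << n_bits) - 1      # n_bits ones: the largest representable integer
    return (a + b) & mask         # the carry out of the top bit is dropped

largest = (1 << 32) - 1           # 2**32 - 1
print(largest)                    # 4294967295
print(add_unsigned(largest, 1))   # 0: the leading 1 is lost
```

The same function with `n_bits=3` gives `add_unsigned(0b111, 0b001, 3) == 0`, the cyclic behaviour used for signed integers.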

Binary representation of a negative integer?
Invert each bit (0 → 1 and 1 → 0), then add 1: this is the two's complement representation.

Example in a 3-bit system:
Integer: 101
Add 010: 101 + 010 = 111
Add 001: 111 + 001 = 1000 = 0 (modulo $2^3$)
Opposite of 101: 010 + 001 = 011

Examples in an 8-bit system:
-2 : 11111110
-1 : 11111111
 0 : 00000000
 1 : 00000001
 2 : 00000010

Conclusions: the first bit indicates the sign of the integer.
1. If the first bit is 0: integer $\ge 0$ ($0 \le i \le 2^{n-1} - 1$)
2. If the first bit is 1: integer $< 0$ ($-2^{n-1} \le i < 0$)
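The two's complement rule can be checked with a short sketch. The function names `twos_complement` and `negate_bits` are hypothetical helpers introduced here, and bit patterns are handled as strings purely for readability.

```python
def twos_complement(i: int, n_bits: int = 8) -> str:
    """Return the n-bit two's complement bit pattern of i as a string."""
    if not -(1 << (n_bits - 1)) <= i < (1 << (n_bits - 1)):
        raise OverflowError(f"{i} does not fit in {n_bits} signed bits")
    # Masking maps a negative i to 2**n_bits + i, i.e. arithmetic modulo 2**n_bits.
    return format(i & ((1 << n_bits) - 1), f"0{n_bits}b")

def negate_bits(bits: str) -> str:
    """Invert each bit, then add 1 (modulo 2**n)."""
    n_bits = len(bits)
    mask = (1 << n_bits) - 1
    inverted = int(bits, 2) ^ mask
    return format((inverted + 1) & mask, f"0{n_bits}b")

for i in (-2, -1, 0, 1, 2):
    print(f"{i:2d} : {twos_complement(i)}")
print(negate_bits("101"))   # 011, as in the 3-bit example
```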

Real numbers

Floating point representation, as in scientific notation: $0.6022 \times 10^{24}$
1. base (10)
2. exponent (24)
3. mantissa (0.6022)

In computers the base is 2, so only the exponent and the mantissa are stored:
$$x = (-1)^s \times 2^e \times \sum_{k=1}^{p} f_k 2^{-k}$$
$s$: sign (0 for positive, 1 for negative)
$e$: exponent (integer)
$f_1 = 1$ for $x \neq 0$, $f_1 = 0$ for $x = 0$
$f_k = 0$ or 1 for $k > 1$

Single precision real: one word = 32 bits, laid out as sign | exponent | mantissa.

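8-bit exponent: $-128 \le e \le 127$ ($2^7 = 128$)
23-bit mantissa: $\frac{1}{2} \le M \le 1 - \left(\frac{1}{2}\right)^{23} \approx 0.99999988$

Upper limit:
$$M = 0.1111\dots111 = \sum_{k=1}^{23} 2^{-k} = \frac{1}{2}\,\frac{1 - \left(\frac{1}{2}\right)^{23}}{1 - \frac{1}{2}} = 1 - \left(\frac{1}{2}\right)^{23}$$

Largest representable real: $\left(1 - \left(\frac{1}{2}\right)^{23}\right) 2^{127} \approx 10^{38}$
Smallest non-zero representable real: $\frac{1}{2}\, 2^{-128} \approx 10^{-39}$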

In single precision, the accuracy on real numbers is about 6 digits: $(1/2)^{23} \approx 10^{-7}$.

In double precision, a real number takes 2 words = 64 bits:
11-bit exponent, 52-bit mantissa
$10^{-308} \le |x| \le 10^{308}$, since $2^{2^{10} - 1} = 2^{1023} \approx 10^{308}$
Accuracy: about 15 digits

N.B.: the number of bits in the exponent sets the range; the number of bits in the mantissa sets the precision.
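These orders of magnitude can be inspected from Python, whose `float` is a 64-bit double on common platforms. The sketch below uses only the standard library. Note that actual hardware follows the IEEE 754 layout, whose exponent range and implicit leading bit differ slightly from the simplified model above, so the limits agree in order of magnitude rather than bit for bit. The helper `to_single` is a hypothetical name.

```python
import math
import struct
import sys

# math.frexp returns (M, e) with x = M * 2**e and 1/2 <= |M| < 1,
# the same normalisation of the mantissa as in the model above.
mantissa, exponent = math.frexp(0.1)
print(mantissa, exponent)                 # 0.8 -3

# Double precision limits as reported by the interpreter.
print(sys.float_info.max)                 # about 1.8e308
print(sys.float_info.min)                 # smallest normalised double, about 2.2e-308
print(sys.float_info.dig)                 # 15 decimal digits

def to_single(x: float) -> float:
    """Round x to single precision by packing it into 4 bytes and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

# In single precision 0.1 is only correct to about 7 digits.
print(to_single(0.1) - 0.1)
```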

2 Consequences

1. Not every number is representable: rounding error.
Example: $(0.1)_{10} = (0.0001100110011\dots)_2$

2. There is a smallest representable number $\varepsilon > 0$.
If $0 < x < \varepsilon$, $x$ is replaced by 0: underflow.
Example: if $a > 2/\varepsilon$, then $a^{-1} < \varepsilon/2$ is stored as 0, so $a^{-1} a = 0$.
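Both consequences show up directly in double precision. The following is a small sketch using standard Python only; the value `1e-200` is an arbitrary choice whose square lies below the smallest positive double.

```python
from decimal import Decimal

# 0.1 has an infinite binary expansion, so the stored double is only close to 0.1.
print(Decimal(0.1))          # exact decimal value of the stored double
print(0.1 + 0.2 == 0.3)      # False: each operand carries a rounding error

# Underflow: the product is smaller than the smallest representable positive
# double, so it is replaced by 0 and the information cannot be recovered.
tiny = 1e-200
product = tiny * tiny        # mathematically 1e-400
print(product)               # 0.0
print(product / tiny)        # 0.0, not 1e-200
```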

3. Non-homogeneous distribution of real numbers: higher density near 0.
Example: 16-bit system with 7-bit mantissa and 8-bit exponent.
Distance between 2 successive values of the mantissa: $(0.0000001)_2 = 2^{-7} \approx (0.0078)_{10}$
Small exponent (e.g. $-100$): distance between 2 successive reals in floating point representation
$(0.0000001)_2 \times 2^{-100} \approx 0.0078 \times 10^{-30} = 7.8 \times 10^{-33}$
Large exponent (e.g. $+100$): distance between 2 successive reals in floating point representation
$(0.0000001)_2 \times 2^{100} \approx 7.8 \times 10^{27}$
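The same effect can be observed for 64-bit doubles rather than the 16-bit toy system. This sketch assumes Python 3.9 or later, where `math.ulp(x)` returns the gap between `x` and the next larger representable double.

```python
import math

# The absolute spacing grows with the exponent ...
for e in (-100, 0, 100):
    x = 2.0 ** e
    print(f"x = 2**{e:<4d}  spacing = {math.ulp(x):.3e}")

# ... while the relative spacing stays the same (2**-52 for normalised doubles).
print(math.ulp(2.0 ** 100) / 2.0 ** 100 == math.ulp(1.0))
```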

3 Numerical errors

Arithmetic with a finite number of digits: $\bar{x}$ is an approximation of $x$.
Absolute error: $\varepsilon_a = \bar{x} - x$
Relative error: $\varepsilon_r = \varepsilon_a / x = \bar{x}/x - 1$

Possible errors. Example: distributivity and associativity do not necessarily hold!

Examples

Associativity of multiplication in a 2-digit representation:
$(0.56 \times 0.65) \times 0.54 = 0.36 \times 0.54 = 0.19$
$0.56 \times (0.65 \times 0.54) = 0.56 \times 0.35 = 0.20$

Associativity of addition in a 6-digit representation:
$(0.243875 \times 10^{6} + 0.412648 \times 10^{1}) - 0.243826 \times 10^{6} = 0.243879 \times 10^{6} - 0.243826 \times 10^{6} = 0.000053 \times 10^{6} = 0.530000 \times 10^{2}$
$(0.243875 \times 10^{6} - 0.243826 \times 10^{6}) + 0.412648 \times 10^{1} = 0.000049 \times 10^{6} + 0.412648 \times 10^{1} = 0.531265 \times 10^{2}$

Distributivity in a 6-digit representation:
$(0.152755 - 0.152732)/(0.910939 \times 10^{-30}) = 0.252487 \times 10^{26}$
$0.152755/(0.910939 \times 10^{-30}) - 0.152732/(0.910939 \times 10^{-30}) = 0.167690 \times 10^{30} - 0.167664 \times 10^{30} = 0.260000 \times 10^{26}$
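These examples can be replayed with Python's `decimal` module, which lets the number of significant decimal digits be fixed per operation. This is a sketch under the assumption that each intermediate result is rounded to nearest, which is the default rounding of a `decimal.Context`.

```python
from decimal import Context, Decimal as D

# 2-digit arithmetic: associativity of multiplication fails.
c2 = Context(prec=2)
left = c2.multiply(c2.multiply(D("0.56"), D("0.65")), D("0.54"))
right = c2.multiply(D("0.56"), c2.multiply(D("0.65"), D("0.54")))
print(left, right)            # 0.19 0.20

# 6-digit arithmetic: associativity of addition fails.
c6 = Context(prec=6)
a, b, c = D("0.243875E6"), D("0.412648E1"), D("0.243826E6")
sum_first = c6.subtract(c6.add(a, b), c)
diff_first = c6.add(c6.subtract(a, c), b)
print(sum_first, diff_first)  # 53 53.1265

# 6-digit arithmetic: distributivity fails.
x, y, z = D("0.152755"), D("0.152732"), D("0.910939E-30")
subtract_then_divide = c6.divide(c6.subtract(x, y), z)
divide_then_subtract = c6.subtract(c6.divide(x, z), c6.divide(y, z))
print(subtract_then_divide, divide_then_subtract)
```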

3 classes of errors
1. Initial error
2. Truncation error
3. Rounding error

Examples of truncation error:
Calculate $e^x$ using
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + O(x^5)$$
where $O(x^n)$ denotes a quantity of the same order as $x^n$.
Calculate a definite integral using a quadrature rule:
$$\int_a^b dx\, f(x) \approx \sum_{i=1}^{n} w_i f(x_i)$$
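As an illustration of the truncation error of a quadrature rule, here is a minimal sketch. The text does not name a specific rule, so the composite trapezoidal rule is used as an assumed example, applied to the integral of $e^x$ over $[0, 1]$, whose exact value is $e - 1$.

```python
import math

def trapezoid(f, a: float, b: float, n: int) -> float:
    """Composite trapezoidal rule with n equal sub-intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))      # end points carry weight h/2
    for i in range(1, n):
        total += f(a + i * h)        # interior points carry weight h
    return h * total

exact = math.e - 1.0                 # integral of exp(x) over [0, 1]
errors = {}
for n in (10, 100, 1000):
    errors[n] = trapezoid(math.exp, 0.0, 1.0, n) - exact
    print(f"n = {n:5d}   truncation error = {errors[n]:.3e}")
```

For this rule the error shrinks roughly like $1/n^2$, so each tenfold increase in $n$ gains about two digits, at ten times the cost.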

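A numerical calculation comprises many steps!
Examples: suppose $\bar{x} = x + \delta$, $\bar{y} = y + \eta$.

$z = x - y$:
$$\bar{z} = (\bar{x} - \bar{y}) = x + \delta - (y + \eta) + \varepsilon, \qquad \varepsilon \text{: rounding error}$$
$$\bar{z} = z + \delta - \eta + \varepsilon$$
Depending on the signs of $\delta$, $\eta$, $\varepsilon$, the error can be large or small.
If $x$ and $y$ are almost equal, $z$ is almost zero and the relative error is large.

$z = x/y$:
$$\bar{z} = (\bar{x}/\bar{y}) = \frac{x + \delta}{y + \eta} + \varepsilon$$
$$\bar{z} \approx z + \frac{1}{y}\,\delta - \frac{x}{y^2}\,\eta + \varepsilon$$
The error on $z$ includes the errors from $x$ and $y$ plus the rounding error $\varepsilon$.
If $y$ is small, the error is large.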
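The approximation for the quotient follows from a first-order expansion. The intermediate step is spelled out below, assuming $x \neq 0$, $|\delta| \ll |x|$ and $|\eta| \ll |y|$, so that products of small quantities can be dropped.

```latex
\begin{align*}
\bar{z} &= \frac{x+\delta}{y+\eta} + \varepsilon
         = \frac{x}{y}\left(1+\frac{\delta}{x}\right)\left(1+\frac{\eta}{y}\right)^{-1} + \varepsilon \\
        &\approx \frac{x}{y}\left(1+\frac{\delta}{x}-\frac{\eta}{y}\right) + \varepsilon
         = z + \frac{1}{y}\,\delta - \frac{x}{y^{2}}\,\eta + \varepsilon
\end{align*}
```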

It is important to understand how errors can propagate in a calculation.
1. Small variations in the initial data can give rise to big differences in the final results: ill-conditioned problem.
Examples: weather forecast, butterfly effect.
2. Truncation errors usually depend on a parameter $N$: if $N \to \infty$, the calculated solution tends to the exact solution.
The truncation error can be reduced by choosing $N$ larger, but this might not be practical.
3. Rounding errors accumulate randomly and often cancel each other.
In certain cases they can grow rapidly: instability.

Examples: calculate $e^{1/3}$ in a 4-digit representation (exact value = 1.39561242508608951).

$1/3 \to 0.3333$, initial error $= -0.000033333\dots$
Propagated error:
$$e^{0.3333} - e^{1/3} = e^{0.3333}\left(1 - e^{0.00003333\dots}\right) \approx -0.0000465196$$

Calculate $e^x$ using the expansion
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + O(x^5)$$
with $x = 0.3333$. If the $O(x^5)$ terms are neglected, the truncation error is
$$-\left(\frac{0.3333^5}{5!} + \frac{0.3333^6}{6!} + \dots\right) \approx -0.0000362750$$

Summing truncated quantities: $1 + 0.3333 + 0.0555 + 0.0062 + 0.0005 = 1.3955$
In a 10-digit representation: result = 1.3955296304
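The three contributions can be separated numerically. This is a sketch that uses double precision as a stand-in for "exact" arithmetic and `round(t, 4)` to imitate terms kept to four decimal places; the variable names are illustrative.

```python
import math

x_exact = 1.0 / 3.0
x_4 = 0.3333                                    # 1/3 in a 4-digit representation

def taylor_terms(x: float, n_terms: int = 5) -> list:
    """Return the terms 1, x, x**2/2!, ... of the expansion of exp(x)."""
    return [x**k / math.factorial(k) for k in range(n_terms)]

terms = taylor_terms(x_4)
taylor_sum = sum(terms)                         # about 1.3955296304
rounded_sum = sum(round(t, 4) for t in terms)   # terms kept to 4 decimals

propagated = math.exp(x_4) - math.exp(x_exact)  # effect of the initial error
truncation = taylor_sum - math.exp(x_4)         # effect of dropping O(x**5)
rounding = rounded_sum - taylor_sum             # effect of 4-digit terms
total = rounded_sum - math.exp(x_exact)         # sum of the three above

for name, err in [("propagated", propagated), ("truncation", truncation),
                  ("rounding", rounding), ("total", total)]:
    print(f"{name:10s} {err: .10f}")
```

In this case the three errors have the same sign, so they add up rather than cancel.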

Important source of rounding errors: cancellation error, in particular when two very close numbers are subtracted.
Example: in a 3-digit representation
$$\frac{1}{15} - \frac{1}{16} = 0.667 \times 10^{-1} - 0.625 \times 10^{-1} = 0.420 \times 10^{-2}$$
Exact value: $0.417 \times 10^{-2}$
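The cancellation can be reproduced with three significant decimal digits. This sketch uses the standard `decimal` module, with round-to-nearest assumed for each operation.

```python
from decimal import Context, Decimal

c3 = Context(prec=3)                       # 3 significant decimal digits
a = c3.divide(Decimal(1), Decimal(15))     # 0.0667
b = c3.divide(Decimal(1), Decimal(16))     # 0.0625
diff = c3.subtract(a, b)                   # 0.0042: only 2 significant digits survive
exact = Decimal(1) / Decimal(240)          # 0.0041666..., default 28-digit context
rel_err = float((diff - exact) / exact)
print(a, b, diff, f"relative error = {rel_err:.1%}")
```

Each operand is correct to about 0.05 %, yet the difference is off by about 0.8 %: the leading digits cancel and the rounding errors of the operands are promoted to the leading digits of the result.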

Example: solve $x^2 - 178x + 2 = 0$
$$x_\pm = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$
$b^2 \gg 4ac$, so $x_-$ involves the subtraction of two very close numbers:
$x_+ = 1.779887634 \times 10^{2}$
$x_- = 1.123665 \times 10^{-2}$
Compare the number of significant digits!
Improve the accuracy by using $x_+ x_- = c/a$:
$$x_- = \frac{2}{1.779887634 \times 10^{2}} = 1.123666439 \times 10^{-2}$$
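The two ways of computing the small root can be compared in 10-digit decimal arithmetic. The precision of 10 digits is an assumption inferred from the number of digits quoted for $x_+$; the sketch relies on `decimal.Context` with its default round-to-nearest.

```python
from decimal import Context, Decimal

ctx = Context(prec=10)                      # assumed: 10 significant digits
a, b, c = Decimal(1), Decimal(-178), Decimal(2)

four_ac = ctx.multiply(Decimal(4), ctx.multiply(a, c))
disc = ctx.subtract(ctx.multiply(b, b), four_ac)
sqrt_disc = ctx.sqrt(disc)
two_a = ctx.multiply(Decimal(2), a)
minus_b = ctx.minus(b)

x_plus = ctx.divide(ctx.add(minus_b, sqrt_disc), two_a)
# Naive formula: subtracts two nearly equal numbers, digits are lost.
x_minus_naive = ctx.divide(ctx.subtract(minus_b, sqrt_disc), two_a)
# Stable alternative: use the product of the roots, x_plus * x_minus = c / a.
x_minus_stable = ctx.divide(c, ctx.multiply(a, x_plus))

print(x_plus, x_minus_naive, x_minus_stable)
```

For $b > 0$ the roles are reversed: $x_-$ is then the root free of cancellation and $x_+$ is best obtained from $c/(a\,x_-)$.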

Example: recurrence relations
Particularly sensitive to the propagation of initial and rounding errors.
Recurrence relation of the Bessel function $J_n(x)$:
$$J_{n+1}(x) = \frac{2n}{x} J_n(x) - J_{n-1}(x)$$
If $n > x$, the factor $2n/x > 1$ multiplies the errors in $J_n$: huge loss of accuracy.
For example, in a 6-digit representation, $J_0(1) = 0.765198$ and $J_1(1) = 0.440051$ give $J_7(1) = 0.008605$ instead of $0.000002$!
Remedy: set $J_8(1) = 0$, $J_7(1) = k$, use the backward recurrence relation and renormalise the result using
$$J_0(x) + 2J_2(x) + 2J_4(x) + \dots = 1$$
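Both directions of the recurrence can be tried in a few lines. This is a minimal sketch in double precision: the function names `forward` and `backward` and the parameter `n_start` are hypothetical, and the 6-digit starting values are the ones quoted above.

```python
def forward(j0: float, j1: float, x: float, n_max: int) -> list:
    """Upward recurrence J_{n+1} = (2n/x) J_n - J_{n-1}, starting from J_0 and J_1."""
    j = [j0, j1]
    for n in range(1, n_max):
        j.append(2 * n / x * j[n] - j[n - 1])
    return j

def backward(x: float, n_start: int, k: float = 1.0) -> list:
    """Downward recurrence from J_{n_start} = 0, J_{n_start-1} = k, then renormalise."""
    j = [0.0] * (n_start + 1)
    j[n_start - 1] = k
    for n in range(n_start - 1, 0, -1):
        j[n - 1] = 2 * n / x * j[n] - j[n + 1]
    norm = j[0] + 2 * sum(j[2::2])     # J_0 + 2 J_2 + 2 J_4 + ... = 1
    return [v / norm for v in j]

up = forward(0.765198, 0.440051, 1.0, 7)
down = backward(1.0, 8)
print(f"forward  J_7(1) = {up[7]:.6f}")    # dominated by amplified initial error
print(f"backward J_7(1) = {down[7]:.3e}")
print(f"backward J_0(1) = {down[0]:.6f}")
```

The arbitrary constant `k` drops out in the normalisation. Starting the downward recurrence at a larger `n_start` reduces the error caused by setting the first value to zero.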