Finite arithmetic and error analysis


Escuela de Ingeniería Informática de Oviedo (Dpto. de Matemáticas, UniOvi), Numerical Computation

Outline
1 Number representation: decimal and binary
2 Floating point representation: standard IEEE 754
3 Accuracy
4 Rounding
5 Error

Number representation: decimal and binary

Storing numbers
Numbers are stored in computers using two main formats:
- Integer format: exact storage of a finite set of integers
- Floating point format: exact storage of a finite set of rational numbers
The standard floating point representation is the IEEE 754 format.

Number representation: decimal
The decimal floating point representation of a real number x ≠ 0 is
x = σ · (x̄)₁₀ · 10^n,
with σ = ±1 the sign, x̄ ∈ R the mantissa, and n ∈ Z the exponent.
Example: the normalized floating point representation of x = 314.15 = 3.1415 · 10^2, with a precision of 5 digits, is
σ = +1, x̄ = 3.1415, n = 2

Number representation: binary
Similarly, the binary floating point representation of a real number x ≠ 0 is
x = σ · (x̄)₂ · 2^e
The representation is said to be normalized if
- Decimal case: (1)₁₀ ≤ x̄ < (10)₁₀
- Binary case: (1)₂ ≤ x̄ < (10)₂
Example: the normalized floating point representation of x = (10101.11001)₂ = (1.010111001)₂ · 2^4, with a precision of 10 digits, is
σ = +1, x̄ = (1.010111001)₂, e = (4)₁₀

Significant digits and precision
Significant digits of a number: digits of the mantissa, not counting leading zeros.
For normalized numbers, significant digits = number of digits in the mantissa.
Precision of a representation: maximum number p of significant digits that can be represented.
For a normalized representation, p = number of digits in the mantissa.
The precision may be finite, if p < ∞, or infinite, if there is no limit to the number of digits in the mantissa.

Example: x = (101.001101)₂ = (5.203125)₁₀
- Normalized decimal floating point representation: σ = +1, x̄ = 5.203125, n = 0
- Normalized binary floating point representation: σ = +1, x̄ = (1.01001101)₂, e = (2)₁₀ = (10)₂
Thus, the number of significant digits is 7 for the decimal representation and 9 for the binary representation.

Conversion from binary to decimal
In the decimal system the number 107.625 means:
(107.625)₁₀ = 1·10^2 + 0·10^1 + 7·10^0 + 6·10^-1 + 2·10^-2 + 5·10^-3
Similarly, in the binary system numbers are represented as an expansion of powers of 2, so conversion from binary to decimal is straightforward, performing the sum:
(1101011.101)₂ = 2^6 + 2^5 + 2^3 + 2^1 + 2^0 + 2^-1 + 2^-3 = (107.625)₁₀
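As a quick sketch of the sum above, a binary string can be evaluated digit by digit (the helper name `from_binary` is chosen here for illustration):

```python
def from_binary(s):
    """Evaluate a binary string like '1101011.101' as a sum of powers of 2."""
    int_part, _, frac_part = s.partition(".")
    # Integer digits contribute 2^i, counted from the right.
    value = sum(int(b) * 2 ** i for i, b in enumerate(reversed(int_part)))
    # Fractional digits contribute 2^-(i+1), counted from the left.
    value += sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(frac_part))
    return value
```

For instance, `from_binary("1101011.101")` reproduces the example and gives 107.625.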

Conversion from decimal to binary (two steps)
Integer part: we sequentially divide by 2 and keep the remainders as the digits in base 2. We first write the last quotient that is not zero (it is always 1) and then the remainders, from right to left:
Quotients:  107 53 26 13 6 3 1
Remainders:   1  1  0  1 0 1
Fractional part: we sequentially multiply by 2 and subtract the integer part. The binary digits are the integer parts, written from left to right:
Fractional: 0.625 0.25 0.5 0
Integer:        1    0   1
The result is: (107.625)₁₀ = (1101011.101)₂
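The two steps above can be sketched in Python as follows (a minimal illustration; the name `to_binary` and the `frac_bits` cap on non-terminating expansions are choices made here, not part of the slides):

```python
def to_binary(x, frac_bits=10):
    """Convert a non-negative number to a binary string by repeated
    division (integer part) and repeated multiplication (fractional part)."""
    ipart, fpart = int(x), x - int(x)
    # Integer part: divide by 2, collect remainders right-to-left.
    digits = []
    while ipart > 0:
        digits.append(str(ipart % 2))
        ipart //= 2
    int_bits = "".join(reversed(digits)) or "0"
    # Fractional part: multiply by 2, keep the integer parts left-to-right.
    frac_digits = []
    while fpart > 0 and len(frac_digits) < frac_bits:
        fpart *= 2
        frac_digits.append(str(int(fpart)))
        fpart -= int(fpart)
    return int_bits + ("." + "".join(frac_digits) if frac_digits else "")
```

Running `to_binary(107.625)` returns "1101011.101", matching the worked example.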

Example: integer representation with 4 bits (m = 4)

Binary   Unsigned   Signed             Biased               Biased (Exp)
                    (sign in 1st bit)  bias = 2^(m-1) = 8   bias = 2^(m-1) - 1 = 7
0000     0          +0                 -8                   Reserved
0001     1          +1                 -7                   -6
0010     2          +2                 -6                   -5
0011     3          +3                 -5                   -4
0100     4          +4                 -4                   -3
0101     5          +5                 -3                   -2
0110     6          +6                 -2                   -1
0111     7          +7                 -1                    0
1000     8          -0                  0                    1
1001     9          -1                  1                    2
1010     10         -2                  2                    3
1011     11         -3                  3                    4
1100     12         -4                  4                    5
1101     13         -5                  5                    6
1110     14         -6                  6                    7
1111     15         -7                  7                   Reserved

Floating point representation: standard IEEE 754

Standard IEEE 754
IEEE stands for Institute of Electrical and Electronics Engineers. The IEEE 754 standard is the one used for floating point representation in computers, and it is used by almost all processors.

Basic format floating-point numbers:

                 Binary format (b = 2)            Decimal format (b = 10)
parameter    binary32   binary64   binary128    decimal64   decimal128
p, digits       24         53         113          16           34
e_max          +127      +1023      +16383        +384        +6144
and e_min = 1 - e_max in every format.

There also exist extended and extendable precisions, recommended for extending the precision used for arithmetic beyond the basic formats.

Standard IEEE 754
The IEEE 754 floating point binary representation of a number x ≠ 0 is
x = σ · x̄ · 2^e
- First bit for the sign σ: 0 for positive, 1 for negative
- The exponent e is a signed integer following the IEEE 754 biased representation
- The mantissa is normalized: 1 ≤ x̄ < (10)₂ = 2
This implies that the first digit must be 1, so it is unnecessary to store it (a bit is saved). This is the hidden bit technique.

Single precision or binary32
x = σ · (1.a₁a₂…a₂₃)₂ · 2^e    (sign | exponent | mantissa)
The numbers are encoded with 32 bits (4 bytes):
- 1 bit for the sign
- 8 bits for the exponent (m = 8)
- 23 bits for the mantissa (plus hidden bit ⇒ precision p = 24)
Exponent bias is 2^(m-1) - 1 = 127 ⇒ e ∈ [-126, 127]

Double precision or binary64
x = σ · (1.a₁a₂…a₅₂)₂ · 2^e    (sign | exponent | mantissa)
The numbers are encoded with 64 bits (8 bytes):
- 1 bit for the sign
- 11 bits for the exponent (m = 11)
- 52 bits for the mantissa (plus hidden bit ⇒ precision p = 53)
Exponent bias is 2^(m-1) - 1 = 1023 ⇒ e ∈ [-1022, 1023]

Example: from decimal to binary32: (-118.625)₁₀
Mantissa. For the fractional part (0.625) we get
Fractional: 0.625 0.25 0.5 0
Integer:        1    0   1
and therefore we store (.101)₂. For the integer part (118) we obtain
Quotients:  118 59 29 14 7 3 1
Remainders:   0  1  1  0 1 1
and thus we store (1110110)₂. The complete mantissa is (1110110.101)₂. Following the IEEE standard, we normalize the mantissa as
1110110.101 = 1.110110101 · 2^6
Due to the hidden bit technique, the first 1 is omitted and the mantissa is stored as
11011010100000000000000

Example: from decimal to binary32 (continued)
Exponent. The bias is 2^(m-1) - 1 = 127, so the biased exponent is 6 + 127 = 133. Computing its binary representation we get (10000101)₂:
Quotients:  133 66 33 16 8 4 2 1
Remainders:   1  0  1  0 0 0 0
Sign. Since the number is negative, the sign bit is 1.
Therefore, the answer is
sign  exponent  mantissa
  1   10000101  11011010100000000000000
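The encoding above can be double-checked from Python with the standard `struct` module, which packs a value as IEEE 754 binary32 (the helper name `float32_bits` is chosen here for illustration):

```python
import struct

def float32_bits(x):
    """Bit pattern of x in IEEE 754 binary32: (sign, exponent, mantissa)."""
    (n,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret as uint32
    bits = f"{n:032b}"
    return bits[0], bits[1:9], bits[9:]
```

`float32_bits(-118.625)` returns ("1", "10000101", "11011010100000000000000"), exactly the fields derived by hand.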

Special values (single precision)
The largest exponent, e = (11111111)₂, is reserved for:
Infinity: all the mantissa digits are zero. It is due to overflow.
Value  sign  exponent  mantissa
 +∞     0   11111111  00000000000000000000000
 -∞     1   11111111  00000000000000000000000
NaN (Not a Number): the mantissa is not identically zero. There are two kinds: QNaN (Quiet NaN), meaning indeterminate, and SNaN (Signaling NaN), meaning invalid operation. Attempts to compute 0/0, 0·∞, or similar expressions result in NaN.
Value  sign  exponent  mantissa
 QNaN   0   11111111  10000000000000001000000
 SNaN   1   11111111  00000010000000010000000
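These special values can be inspected directly from Python (a sketch; `bits32` is a helper defined here, not part of the slides):

```python
import struct

def bits32(x):
    """32-bit pattern of x encoded as IEEE 754 binary32."""
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{n:032b}"

inf_bits = bits32(float("inf"))  # exponent all ones, mantissa all zeros
nan_bits = bits32(float("nan"))  # exponent all ones, mantissa nonzero

# NaN is the only value that compares unequal to itself.
assert float("nan") != float("nan")
```

Checking the fields confirms the table: infinity has an all-ones exponent with a zero mantissa, and any NaN has an all-ones exponent with a nonzero mantissa.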

Special values (single precision)
The smallest exponent, e = (00000000)₂, is reserved for:
Zero: since HB = 1, zero is not representable as a normalized number:
Value  sign  exponent  mantissa
 +0     0   00000000  00000000000000000000000
 -0     1   00000000  00000000000000000000000
Denormalized numbers: we set HB = 0 and e = e_min = -126 (biased (00000001)₂), although the exponent field still stores 00000000. For example,
sign  exponent  mantissa
  0   00000000  00001000010000000001000
  1   00000000  01000100000000000010000
Advantage: since HB = 0, numbers smaller than the smallest normalized number may be represented. However, the precision is smaller, since they have leading zeros (at least the hidden bit).

Example
Compute the base 10 value and the representation precision of the number
sign  exponent  mantissa
  0   00000000  00010110000000000000000
Since the exponent is 00000000 and the mantissa is not identically zero, the number is denormalized. Thus, the exponent is e_min = -126 and the hidden bit is 0. Therefore, it represents the number
(0.0001011)₂ · 2^-126,
with precision p = 24 - 4 = 20. In base 10, it is given by
(2^-4 + 2^-6 + 2^-7) · 2^-126 ≈ 1.0102 · 10^-39,
which is less than R_min = 2^-126 ≈ 1.1755 · 10^-38.

Example
Compute the smallest denormalized numbers in single and double precision.
In single precision, it is
sign  exponent  mantissa
  0   00000000  00000000000000000000001
representing, in binary,
(0.00000000000000000000001)₂ · 2^-126 = 2^-23 · 2^-126 = 2^-149 ≈ 1.4013 · 10^-45,
which has precision p = 1. Similarly, in double precision we get
2^-52 · 2^-1022 = 2^-1074 ≈ 4.9407 · 10^-324.
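Python floats are IEEE 754 binary64, so the double precision limit above can be observed directly (a quick sketch):

```python
# 2**-1074 is the smallest positive denormalized double.
smallest_denormal = 2.0 ** -1074
assert smallest_denormal > 0.0
# Halving it underflows: the exact result 2**-1075 rounds to 0.
assert smallest_denormal / 2 == 0.0
# Its usual decimal spelling is 5e-324 (the literal rounds to 2**-1074).
assert 5e-324 == smallest_denormal
```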

Accuracy

Accuracy
We have two main ways of measuring the accuracy of floating point arithmetic:
- The machine epsilon ε: the difference between 1 and the next representable number x > 1
- The largest integer M such that any other positive integer x ≤ M is representable

Machine epsilon: single precision
The normalized format in single precision is σ · (1.a₁a₂…a₂₂a₂₃)₂ · 2^e. If we write 1 in this format,
+1 · (1.00…00)₂ · 2^0,
the next number that can be stored in this format is
1 + ε = +1 · (1.00…01)₂ · 2^0.
The machine epsilon ε is the gap between these two numbers,
ε = +1 · (0.00…01)₂ · 2^0,
which normalized is written ε = +1 · (1.00…00)₂ · 2^-23. Then
ε = 2^-23 ≈ 1.19 · 10^-7

Machine epsilon: double precision
The normalized format in double precision is σ · (1.a₁a₂…a₅₁a₅₂)₂ · 2^e. If we write 1 in this format,
+1 · (1.00…00)₂ · 2^0,
the next number that can be stored in this format is
1 + ε = +1 · (1.00…01)₂ · 2^0.
The machine epsilon ε is the gap between these two numbers,
ε = +1 · (0.00…01)₂ · 2^0,
which normalized is written ε = +1 · (1.00…00)₂ · 2^-52. Then
ε = 2^-52 ≈ 2.22 · 10^-16

Machine epsilon
IEEE single precision: ε = (0.00…01)₂ = 2^-23 ≈ 1.19 · 10^-7 (22 zeros after the point), so we may store approximately 7 significant digits of a decimal number.
IEEE double precision: ε = (0.00…01)₂ = 2^-52 ≈ 2.22 · 10^-16 (51 zeros after the point), so we may store approximately 16 significant digits of a decimal number.
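The double precision epsilon can be found experimentally by halving a candidate gap until 1 stops distinguishing it (a sketch; the function name is chosen here for illustration):

```python
import sys

def machine_epsilon():
    """Smallest power of two eps such that 1.0 + eps != 1.0 in binary64."""
    eps = 1.0
    while 1.0 + eps / 2 != 1.0:  # stop once eps/2 is absorbed by 1.0
        eps /= 2
    return eps
```

The result agrees with both the slide value and the interpreter's own constant: `machine_epsilon() == 2**-52 == sys.float_info.epsilon`.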

Largest integer: single precision
The largest integer is M = 2^p. For instance, for single precision (p = 24):

Decimal                 Stored as   Exp   Representation
1                       1            0    Exact
2                       2            1    Exact
3                       3            1    Exact
4                       4            2    Exact
16777215                16777215    23    Exact
16777216 = 2^24 = M     16777216    24    Exact
16777217                16777216    24    Rounded
16777218                16777218    24    Exact
16777219                16777220    24    Rounded
16777220                16777220    24    Exact

Largest integer: double precision
For double precision (p = 53):

Decimal                         Stored as          Exp   Representation
1                               1                   0    Exact
2                               2                   1    Exact
3                               3                   1    Exact
4                               4                   2    Exact
9007199254740991                9007199254740991   52    Exact
9007199254740992 = 2^53 = M     9007199254740992   53    Exact
9007199254740993                9007199254740992   53    Rounded
9007199254740994                9007199254740994   53    Exact
9007199254740995                9007199254740996   53    Rounded
9007199254740996                9007199254740996   53    Exact

Largest integer
Single IEEE precision: M = 2^24 = 16777216, and we can store exactly all 7-digit integers.
Double IEEE precision: M = 2^53 ≈ 9.0 · 10^15, and we can store exactly all 15-digit integers and almost all 16-digit integers.
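Since Python floats are binary64, the double precision boundary M = 2^53 from the table can be verified directly:

```python
# Every integer up to M = 2**53 is exact in binary64;
# 2**53 + 1 is the first integer that cannot be represented.
M = 2 ** 53
assert float(M - 1) == M - 1       # 9007199254740991 is exact
assert float(M) + 1.0 == float(M)  # 2**53 + 1 rounds back to 2**53
assert float(M + 2) == M + 2       # the next even integer is exact again
```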

Overflow and underflow
The largest normalized number that can be represented in double precision is, in binary format,
±(1.11…1)₂ · 2^1023.
In decimal format,
R_max = ±(1 + 2^-1 + 2^-2 + 2^-3 + … + 2^-52) · 2^1023 ≈ ±1.7977 · 10^308.
The smallest positive normalized number that can be represented in double precision is, in binary format,
±(1.00…0)₂ · 2^-1022.
In decimal format,
R_min = ±2^-1022 ≈ ±2.2251 · 10^-308.

Overflow and underflow
An overflow error is produced when trying to use a number too large (greater than the corresponding R_max):
- In most computers, execution is aborted
- The IEEE format may support it by assigning the symbolic values ±∞ or NaN
An underflow error is produced when trying to use a number too small (less, in absolute value, than the corresponding R_min). Two possible behaviors:
- It lies in the range of denormalized numbers, so it is still representable; in this case the precision decreases, and it is called gradual underflow
- Otherwise, it is identified with 0
In both cases, execution continues.
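Both behaviors can be observed in Python, whose floats follow IEEE 754 double precision (a sketch using the limits exposed by `sys.float_info`):

```python
import sys

r_max = sys.float_info.max   # largest normalized double, ~1.7977e308
r_min = sys.float_info.min   # smallest positive normalized double, 2**-1022

assert r_min == 2.0 ** -1022
assert r_max * 2 == float("inf")   # overflow yields +inf; execution continues
assert 0.0 < r_min / 4 < r_min     # gradual underflow: denormal, not yet zero
```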

Rounding

Rounding
When operations lead to a number whose mantissa contains more digits than the precision allows, the number must be approximated by a representable number. The IEEE 754 standard provides five procedures to approximate x:
- Round up: taking the closest representable larger number
- Round down: taking the closest representable smaller number
- Round towards zero (truncation): replacing the non-representable digits by zero
- Round towards infinity: taking the closest representable number that is farthest from zero
- Round to the nearest (rounding): ties go to the representable number ending in an even digit
The most usual procedures are truncation and rounding.

Decimal representation rounding
Consider the base 10 number
x = ±d₀.d₁d₂… · 10^n = ±(Σ_{k=0}^∞ d_k 10^-k) · 10^n,    (1)
with d_k ∈ {0, 1, …, 9} for all k, and d₀ ≠ 0. For a precision of p digits we have:
Truncation:
x̃ = ±d₀.d₁d₂…d_{p-1} · 10^n
Rounding:
x̃ = ±d₀.d₁d₂…d_{p-1} · 10^n                   if 0 ≤ d_p ≤ 4,
x̃ = ±(d₀.d₁d₂…d_{p-1} + 10^-(p-1)) · 10^n     if d_p > 5, or d_p = 5 and d_{p+k} ≠ 0 for some k > 0,
x̃ = nearest number ending in an even digit    if d_p = 5 and d_{p+k} = 0 for all k > 0.

Example
x = 0.999953, p = 4: truncation x̃ = 0.9999; rounding x̃ = 1.000
x = 0.433309, p = 3: truncation x̃ = 0.433; rounding x̃ = 0.433
x = 0.433500, p = 3: truncation x̃ = 0.433; rounding x̃ = 0.434 (towards the nearest even ending digit)
x = 0.434500, p = 3: truncation x̃ = 0.434; rounding x̃ = 0.434 (towards the nearest even ending digit)
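These examples can be reproduced with Python's `decimal` module, whose `ROUND_DOWN` and `ROUND_HALF_EVEN` modes correspond to truncation and rounding above (a sketch; the helper name `keep_digits` is chosen here for illustration):

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def keep_digits(x, p, mode):
    """Keep p significant decimal digits of x by truncation or round-to-even."""
    d = Decimal(x)
    # adjusted() is the exponent of the leading digit d0, so the last
    # kept digit d_{p-1} sits at position adjusted() - p + 1.
    q = Decimal(1).scaleb(d.adjusted() - p + 1)
    return d.quantize(q, rounding=mode)
```

For instance, `keep_digits("0.999953", 4, ROUND_HALF_EVEN)` gives 1.0000 and `keep_digits("0.433500", 3, ROUND_HALF_EVEN)` gives 0.434, as in the slide.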

Binary representation rounding
In this case, the number takes the form
x = ±1.b₁b₂… · 2^e = ±(Σ_{k=0}^∞ b_k 2^-k) · 2^e,
with b_k ∈ {0, 1} for all k. For a precision p (including the hidden bit) we have:
Truncation:
x̃ = ±1.b₁b₂…b_{p-1} · 2^e
Rounding:
x̃ = ±1.b₁b₂…b_{p-1} · 2^e                 if b_p = 0,
x̃ = ±(1.b₁b₂…b_{p-1} + 2^-(p-1)) · 2^e    if b_p = 1 and b_{p+k} = 1 for some k > 0,
x̃ = nearest number ending in 0            if b_p = 1 and b_{p+k} = 0 for all k > 0.

Example
x = 1.1111, p = 3: truncation x̃ = 1.11; rounding x̃ = 10.0
x = 1.1101, p = 3: truncation x̃ = 1.11; rounding x̃ = 1.11
x = 1.0010, p = 3: truncation x̃ = 1.00; rounding x̃ = 1.00 (towards the nearest ending in 0)
x = 1.0110, p = 3: truncation x̃ = 1.01; rounding x̃ = 1.10 (towards the nearest ending in 0)

Truncation versus rounding in the binary system
If truncating, we have
|x - x_t| = (Σ_{k=p}^∞ b_k 2^-k) · 2^e ≤ 2^-(p-1) · 2^e,
where we used the formula for summing a geometric series. If rounding, x is always, at worst, halfway between the two nearest representable numbers. Thus,
|x - x_r| ≤ (1/2) · 2^-(p-1) · 2^e = 2^-p · 2^e.
Consequences:
- The largest truncation error is double the largest rounding error
- The truncation error x_t - x always has the same sign (non-positive for x > 0), while the rounding error may change sign (and compensate)
Therefore, errors are less amplified when using rounding.

Example
Let x = (1.1001101)₂. We approximate by:
Truncation to 5 binary digits, x_t = (1.1001)₂. Then
x - x_t = (0.0000101)₂ = 2^-5 + 2^-7 = 0.0390625
Rounding to 5 binary digits, x_r = (1.1010)₂. In this case
x_r - x = (0.0000011)₂ = 2^-6 + 2^-7 = 0.0234375
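Since all the values in this example are exact sums of powers of 2, the two error computations can be checked directly in Python:

```python
# x = (1.1001101)_2, truncated and rounded to p = 5 binary digits
x  = 1 + 2 ** -1 + 2 ** -4 + 2 ** -5 + 2 ** -7   # 1.1001101
xt = 1 + 2 ** -1 + 2 ** -4                       # 1.1001  (truncation)
xr = 1 + 2 ** -1 + 2 ** -3                       # 1.1010  (rounding)

assert x - xt == 2 ** -5 + 2 ** -7   # = 0.0390625
assert xr - x == 2 ** -6 + 2 ** -7   # = 0.0234375
```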

Error

Numerical instability
Rounding errors arising in finite arithmetic are small in each computation, but they may accumulate and propagate when an algorithm consists of many computations or iterations, resulting in a large difference between the exact solution and the numerically computed solution. This effect is known as the numerical instability of an algorithm.

Example
Consider the sequence s_k = 1 + 2 + … + k, for k = 1, 2, …, and compute
x_k = 1/s_k + 2/s_k + … + k/s_k,
whose exact result is just x_k = 1 for all k. However, in single precision we obtain:

k      x̄_k         |x_k - x̄_k|
10^1   1.000000     0.0
10^3   0.999999     1.0 · 10^-7
10^6   0.9998996    1.004 · 10^-4
10^7   1.002663     2.663 · 10^-3
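The experiment can be reproduced by simulating single precision in Python: rounding every intermediate result through a binary32 pack/unpack round-trip mimics float32 arithmetic (a sketch; helper names and the error tolerances below are choices made here, and the exact error values depend on summation order):

```python
import struct

def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def x_k(k):
    """Compute 1/s_k + 2/s_k + ... + k/s_k in simulated single precision."""
    s = 0.0
    for i in range(1, k + 1):
        s = f32(s + i)            # s_k accumulated in float32
    total = 0.0
    for i in range(1, k + 1):
        total = f32(total + f32(i / s))  # each term and each sum rounded
    return total
```

The accumulated error grows with k, in line with the table: the deviation from 1 is negligible for small k and becomes visible as the number of rounded operations increases.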

Absolute and relative errors
There are two main measures of the error made when approximating a number x by an approximation x̄:
Absolute error: e_a = |x - x̄|
Relative error: e_r = |x - x̄| / |x|
The relative error is independent of the scale and thus often more meaningful than the absolute error, as we see in the following example:

x             x̄             e_a           e_r
0.3 · 10^1    0.31 · 10^1    0.1           0.333 · 10^-1
0.3 · 10^-3   0.31 · 10^-3   0.1 · 10^-4   0.333 · 10^-1
0.3 · 10^4    0.31 · 10^4    0.1 · 10^3    0.333 · 10^-1

Significant digits
We say that x̄ approximates x with p significant digits if p is the largest nonnegative integer such that
|x - x̄| / |x| ≤ 5 · 10^-p
Examples:
x̄ = 1.2445 approximates x = 1.2345 with p = 2 significant digits:
|x - x̄| / |x| = 0.01 / 1.2345 = 0.0081 ≤ 0.05 = 5 · 10^-2
x̄ = 0.0012445 approximates x = 0.0012345 with p = 2 digits:
|x - x̄| / |x| = 0.00001 / 0.0012345 = 0.0081 ≤ 0.05 = 5 · 10^-2
x̄ = 999.8 approximates x = 1000 with p = 4 significant digits:
|x - x̄| / |x| = 0.2 / 1000 = 0.0002 ≤ 0.0005 = 5 · 10^-4
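The definition translates directly into code: the largest p with |x - x̄|/|x| ≤ 5·10^-p is floor(log10(5/e_r)). A minimal sketch (the function name is chosen here for illustration; it assumes x ≠ 0 and x̄ ≠ x):

```python
import math

def significant_digits(x, approx):
    """Largest nonnegative p with |x - approx| / |x| <= 5 * 10**-p."""
    rel = abs(x - approx) / abs(x)
    # rel <= 5 * 10**-p  <=>  p <= log10(5 / rel)
    return math.floor(math.log10(5.0 / rel))
```

Applied to the slide's examples, it returns 2, 2, and 4 respectively.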