Finite arithmetic and error analysis


Escuela de Ingeniería Informática de Oviedo (Dpto de Matemáticas-UniOvi), Numerical Computation

Outline
1 Number representation: decimal and binary
2 Floating point representation: standard IEEE 754
3 Accuracy
4 Rounding
5 Error

1 Number representation: decimal and binary

Storing numbers
Numbers are stored in computers using two main formats:
- Integer format: exact storage of a finite set of integers.
- Floating point format: exact storage of a finite set of rational numbers.
The standard floating point representation is the IEEE 754 format.

Number representation: decimal
The decimal floating point representation of a real number $x \neq 0$ is
$$x = \sigma \cdot \bar{x} \cdot 10^{n},$$
with $\sigma = \pm 1$ the sign, $\bar{x} \in \mathbb{R}$ the mantissa, and $n \in \mathbb{Z}$ the exponent.
Example: the normalized floating point representation of $x = 314.15 = 3.1415 \cdot 10^{2}$ with a precision of 5 digits is $\sigma = +1$, $\bar{x} = 3.1415$, $n = 2$.

Number representation: binary
Similarly, the binary floating point representation of a real number $x \neq 0$ is
$$x = \sigma \cdot \bar{x} \cdot 2^{e}.$$
The representation is said to be normalized if
- Decimal case: $(1)_{10} \le \bar{x} < (10)_{10}$
- Binary case: $(1)_{2} \le \bar{x} < (10)_{2}$
Example: the normalized floating point representation of $x = (\ldots)_2 = (\ldots)_{10}$ with a precision of 5 digits is $\sigma = +1$, $\bar{x} = (\ldots)_2$, $e = (4)_{10}$.

Significant digits and precision
- Significant digits of a number: the digits of the mantissa, not counting leading zeros. For normalized numbers, the number of significant digits equals the number of digits in the mantissa.
- Precision of a representation: the maximum number, $p$, of significant digits that can be represented. For a normalized representation, $p$ is the number of digits in the mantissa.
The precision may be finite, if $p < \infty$, or infinite, if there is no limit to the number of digits in the mantissa.

Significant digits and precision: example
For $x = (\ldots)_2 = (\ldots)_{10}$:
- Normalized decimal floating point representation: $\sigma = +1$, $\bar{x} = \ldots$, $n = 0$.
- Normalized binary floating point representation: $\sigma = +1$, $\bar{x} = (\ldots)_2$, $e = (2)_{10} = (10)_2$.
Thus, the number of significant digits is 7 for the decimal representation and 9 for the binary representation.

Conversion from binary to decimal
In the decimal system, the number 107.625 means
$$(107.625)_{10} = 1\cdot 10^{2} + 0\cdot 10^{1} + 7\cdot 10^{0} + 6\cdot 10^{-1} + 2\cdot 10^{-2} + 5\cdot 10^{-3}.$$
Similarly, in the binary system numbers are represented as an expansion in powers of 2:
$$(1101011.101)_{2} = 1\cdot 2^{6} + 1\cdot 2^{5} + 0\cdot 2^{4} + 1\cdot 2^{3} + 0\cdot 2^{2} + 1\cdot 2^{1} + 1\cdot 2^{0} + 1\cdot 2^{-1} + 0\cdot 2^{-2} + 1\cdot 2^{-3}.$$
Conversion from binary to decimal is straightforward, performing the sum:
$$(1101011.101)_{2} = 64 + 32 + 8 + 2 + 1 + 0.5 + 0.125 = (107.625)_{10}.$$
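The positional expansion can be checked mechanically. The sketch below sums the powers of 2 digit by digit; the function name `binary_to_decimal` and the string input format are illustrative choices, and the value 107.625 assumes the decimal point of the slide's example falls there.

```python
def binary_to_decimal(bits):
    """Evaluate a binary string such as '1101011.101' as a sum of powers of 2."""
    int_part, _, frac_part = bits.partition(".")
    value = 0.0
    # Integer digits carry weights 2^0, 2^1, ... read from right to left.
    for k, digit in enumerate(reversed(int_part)):
        value += int(digit) * 2.0 ** k
    # Fractional digits carry weights 2^-1, 2^-2, ... read from left to right.
    for k, digit in enumerate(frac_part, start=1):
        value += int(digit) * 2.0 ** -k
    return value


print(binary_to_decimal("1101011.101"))  # 107.625
```

Every term here is an exact power of 2, so the floating point sum introduces no rounding for inputs of this size.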

Conversion from decimal to binary (two steps)
Integer part. We sequentially divide by 2 and keep the remainders as the digits in base 2. We first write the last quotient that is not zero (it is always 1) and then the remainders, from right to left:
Quotients:   107  53  26  13   6   3   1
Remainders:    1   1   0   1   0   1
Fractional part. We sequentially multiply by 2 and subtract the integer part. The binary digits are the integer parts obtained, written from left to right:
Fractional:  0.625  0.25  0.5  0
Integer:         1     0    1
The result is $(107.625)_{10} = (1101011.101)_2$.
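The two-step procedure can be sketched in Python as follows. The function name `to_binary` and the cap `max_frac_bits` are assumptions made for this illustration (a cap is needed because most decimal fractions have no finite binary expansion), and the test value 107.625 assumes that is how the slide's example reads.

```python
def to_binary(value, max_frac_bits=20):
    """Convert a non-negative number to a binary string, integer and fractional part separately."""
    integer = int(value)
    frac = value - integer

    # Integer part: divide by 2 repeatedly; the remainders are the digits, right to left.
    int_bits = ""
    while integer > 0:
        integer, remainder = divmod(integer, 2)
        int_bits = str(remainder) + int_bits
    int_bits = int_bits or "0"

    # Fractional part: multiply by 2 repeatedly; the integer parts are the digits, left to right.
    frac_bits = ""
    while frac > 0 and len(frac_bits) < max_frac_bits:
        frac *= 2
        bit = int(frac)
        frac_bits += str(bit)
        frac -= bit

    return int_bits + ("." + frac_bits if frac_bits else "")


print(to_binary(107.625))  # 1101011.101
```

For a fraction such as 0.1 the loop stops at `max_frac_bits` digits, which is exactly the truncation discussed in the Rounding section.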

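Example: integer representation with 4 bits

| Binary representation (m = 4 bits) | Unsigned integers | Signed integers (sign in first bit) | Signed integers, bias = $2^{m-1}$ | Signed integers (Exp), bias = $2^{m-1} - 1$ |
|---|---|---|---|---|
| 0000 | 0 | +0 | -8 | Reserved |
| 0001 | 1 | +1 | -7 | -6 |
| 0010 | 2 | +2 | -6 | -5 |
| 0011 | 3 | +3 | -5 | -4 |
| 0100 | 4 | +4 | -4 | -3 |
| 0101 | 5 | +5 | -3 | -2 |
| 0110 | 6 | +6 | -2 | -1 |
| 0111 | 7 | +7 | -1 | 0 |
| 1000 | 8 | -0 | 0 | 1 |
| 1001 | 9 | -1 | 1 | 2 |
| 1010 | 10 | -2 | 2 | 3 |
| 1011 | 11 | -3 | 3 | 4 |
| 1100 | 12 | -4 | 4 | 5 |
| 1101 | 13 | -5 | 5 | 6 |
| 1110 | 14 | -6 | 6 | 7 |
| 1111 | 15 | -7 | 7 | Reserved |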

2 Floating point representation: standard IEEE 754

Standard IEEE 754
IEEE stands for Institute of Electrical and Electronics Engineers. The IEEE 754 standard defines the floating point representation used in computers and is implemented by almost all processors.
Basic format floating-point numbers:

| Parameter | binary32 | binary64 | binary128 | decimal64 | decimal128 |
|---|---|---|---|---|---|
| Base $b$ | 2 | 2 | 2 | 10 | 10 |
| $p$, digits | 24 | 53 | 113 | 16 | 34 |
| $e_{max}$ | +127 | +1023 | +16383 | +384 | +6144 |

In every format, $e_{min} = 1 - e_{max}$.
Extended and extendable precisions also exist; they are recommended for extending the precision used for arithmetic beyond the basic formats.

Standard IEEE 754
The IEEE 754 floating point binary representation of a number $x \neq 0$ is
$$x = \sigma \cdot \bar{x} \cdot 2^{e}.$$
- The first bit stores the sign $\sigma$: 0 for positive, 1 for negative.
- The exponent $e$ is a signed integer stored in the IEEE 754 biased representation.
- The mantissa is normalized: $(1)_2 \le \bar{x} < (10)_2$. This implies that the first digit must be 1, so it is unnecessary to store it and a bit is saved. This is the hidden bit technique.

Single precision or binary32
Layout: sign, exponent, mantissa.
$$x = \sigma \cdot (1.a_1 a_2 \ldots a_{23})_2 \cdot 2^{e}$$
The numbers are encoded with 32 bits (4 bytes):
- 1 bit for the sign
- 8 bits for the exponent
- 23 bits for the mantissa (plus the hidden bit, which gives precision $p = 24$)
The exponent bias is $2^{m-1} - 1 = 127$ (with $m = 8$ exponent bits), so $e \in [-126, 127]$.

Double precision or binary64
Layout: sign, exponent, mantissa.
$$x = \sigma \cdot (1.a_1 a_2 \ldots a_{52})_2 \cdot 2^{e}$$
The numbers are encoded with 64 bits (8 bytes):
- 1 bit for the sign
- 11 bits for the exponent
- 52 bits for the mantissa (plus the hidden bit, which gives precision $p = 53$)
The exponent bias is $2^{m-1} - 1 = 1023$ (with $m = 11$ exponent bits), so $e \in [-1022, 1023]$.
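To see the three fields of both layouts in practice, the following sketch uses Python's standard `struct` module to reinterpret a float as an integer and slice out the bits. The helper names `decode_binary32` and `decode_binary64` and the sample value -6.5 are assumptions chosen for illustration, not part of the slides.

```python
import struct


def decode_binary32(x):
    """Return (sign, stored exponent, stored mantissa) of x rounded to binary32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 bits, hidden bit not stored
    return sign, exponent, mantissa


def decode_binary64(x):
    """Return (sign, stored exponent, stored mantissa) of a binary64 number."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF      # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)    # 52 bits, hidden bit not stored
    return sign, exponent, mantissa


# -6.5 = -(1.101)_2 * 2^2, so the stored exponent is 2 + 127 = 129.
sign, exponent, mantissa = decode_binary32(-6.5)
print(f"{sign} {exponent:08b} {mantissa:023b}")  # 1 10000001 10100000000000000000000
```

Subtracting the bias from the stored exponent (129 - 127 = 2) recovers $e$, and prepending the hidden bit to the stored mantissa recovers $\bar{x} = (1.101)_2$.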

Example: from decimal to binary32
Convert $(\ldots)_{10}$ to binary32.
Mantissa. Converting the fractional part (by repeated multiplication by 2) we get $(0.0101)_2$. Converting the integer part (by repeated division by 2) we obtain $(\ldots)_2$. The complete mantissa is written as $(\ldots)_2$. Following the IEEE standard, we normalize the mantissa as $(\ldots)_2 = (1.\ldots)_2 \cdot 2^{6}$. Due to the hidden bit technique, the leading 1 is omitted and only the digits after the point are stored.

Example: from decimal to binary32
Exponent. The bias is $2^{m-1} - 1 = 127$. The base 10 biased exponent is then $6 + \text{bias} = 6 + 127 = 133$. Computing its binary representation we get $(10000101)_2$:
Quotients:   133  66  33  16   8   4   2   1
Remainders:    1   0   1   0   0   0   0
Sign. Since the number is negative, the sign bit is 1.
Therefore, the answer is: sign 1, exponent 10000101, followed by the 23 stored mantissa bits.

Special values (single precision)
The largest exponent is $e = (11111111)_2$. This exponent is reserved for:
- Infinity. All the mantissa digits are zeros. It arises from overflow.

| Value | sign | exponent | mantissa |
|---|---|---|---|
| $+\infty$ | 0 | 11111111 | 00000000000000000000000 |
| $-\infty$ | 1 | 11111111 | 00000000000000000000000 |

- NaN (Not a Number). The mantissa is not identically zero. There are two kinds: QNaN (Quiet NaN), meaning indeterminate, and SNaN (Signaling NaN), meaning invalid operation. Attempts to compute $0/0$, $0 \cdot \infty$, or similar expressions result in NaN. Both kinds have exponent 11111111 and a nonzero mantissa.
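As a hedged illustration of these reserved patterns, the sketch below builds binary32 values directly from their bit patterns with the standard `struct` module. The helper `from_bits32` is a name invented here, and `0x7FC00000` is just one of many valid NaN encodings (exponent all ones, mantissa nonzero).

```python
import math
import struct


def from_bits32(bits):
    """Interpret a 32-bit integer as an IEEE 754 binary32 number."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


pos_inf = from_bits32(0x7F800000)   # sign 0, exponent 11111111, mantissa all zeros
neg_inf = from_bits32(0xFF800000)   # sign 1, exponent 11111111, mantissa all zeros
nan = from_bits32(0x7FC00000)       # exponent 11111111, mantissa not identically zero

print(pos_inf, neg_inf, nan)

# Indeterminate expressions produce NaN. (Plain Python raises ZeroDivisionError
# for 0.0 / 0.0, so the indeterminate forms used here are 0 * inf and inf - inf.)
print(0.0 * pos_inf, pos_inf - pos_inf)
```

NaN compares unequal to everything, including itself, so `math.isnan` is the reliable way to detect it.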

Special values (single precision)
The smallest exponent is $e = (00000000)_2$. This exponent is reserved for:
- Zero. Since the hidden bit is HB = 1, zero is not representable as a normalized number:

| Value | sign | exponent | mantissa |
|---|---|---|---|
| $+0$ | 0 | 00000000 | 00000000000000000000000 |
| $-0$ | 1 | 00000000 | 00000000000000000000000 |

- Denormalized numbers. We set HB = 0 and take the exponent $e = e_{min} = -126$, although the stored exponent field is still $(00000000)_2$.
Advantage: since HB = 0, numbers smaller than the smallest normalized number may be represented. However, the precision is smaller, since they have leading zeros (at least, the hidden bit).

Example
Compute the base 10 value and the precision of the number whose exponent field is 00000000 and whose mantissa begins with three zeros followed by a 1.
Since the exponent is 00000000 and the mantissa is not identically zero, the number is denormalized. Thus, the exponent is $e_{min} = -126$, and the hidden bit is 0. Therefore, it represents the number $(0.0001\ldots)_2 \cdot 2^{-126}$, with precision $p = 24 - 4 = 20$. Its decimal value is less than $R_{min} = 2^{-126} \approx 1.1755 \cdot 10^{-38}$.

Example
Compute the smallest denormalized numbers in single and double precision.
In single precision, it is: sign 0, exponent 00000000, mantissa 00000000000000000000001, representing, in binary base,
$$(0.00\ldots01)_2 \cdot 2^{-126} = 2^{-23} \cdot 2^{-126} = 2^{-149} \approx 1.4013 \cdot 10^{-45},$$
which has a precision $p = 1$.
Similarly, in double precision we get
$$2^{-52} \cdot 2^{-1022} = 2^{-1074} \approx 4.9407 \cdot 10^{-324}.$$
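These two values can be reproduced by setting only the last mantissa bit. The following sketch uses the standard `struct` module; the variable names are illustrative.

```python
import math
import struct

# Bit pattern 00...01: sign 0, exponent field all zeros, mantissa 00...01.
smallest_single = struct.unpack(">f", struct.pack(">I", 1))[0]
smallest_double = struct.unpack(">d", struct.pack(">Q", 1))[0]

print(smallest_single == math.ldexp(1.0, -149))    # True: 2^-23 * 2^-126
print(smallest_double == math.ldexp(1.0, -1074))   # True: 2^-52 * 2^-1022
```

`math.ldexp(1.0, n)` computes $2^{n}$ exactly, which avoids any doubt about how a power operator behaves this close to zero.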

3 Accuracy

Accuracy
We have two main ways of measuring the accuracy of floating point arithmetic:
- The machine epsilon, $\epsilon$, which is the difference between 1 and the next representable number $x > 1$.
- The largest integer, $M$, such that every positive integer $x \le M$ is representable.

Machine epsilon: single precision
The normalized format in single precision is
$$\sigma \cdot (1.a_1 a_2 \ldots a_{22} a_{23})_2 \cdot 2^{e}.$$
If we write 1 in this format,
$$1 = +1 \cdot (1.00\ldots00)_2 \cdot 2^{0},$$
the next number that can be stored in this format is
$$1 + \epsilon = +1 \cdot (1.00\ldots01)_2 \cdot 2^{0}.$$
The machine epsilon $\epsilon$ is the gap between these two numbers,
$$\epsilon = +1 \cdot (0.00\ldots01)_2 \cdot 2^{0},$$
which in normalized form is written
$$\epsilon = +1 \cdot (1.00\ldots00)_2 \cdot 2^{-23},$$
so $\epsilon = 2^{-23} \approx 1.19 \cdot 10^{-7}$.

Machine epsilon: double precision
The normalized format in double precision is
$$\sigma \cdot (1.a_1 a_2 \ldots a_{51} a_{52})_2 \cdot 2^{e}.$$
If we write 1 in this format,
$$1 = +1 \cdot (1.00\ldots00)_2 \cdot 2^{0},$$
the next number that can be stored in this format is
$$1 + \epsilon = +1 \cdot (1.00\ldots01)_2 \cdot 2^{0}.$$
The machine epsilon $\epsilon$ is the gap between these two numbers,
$$\epsilon = +1 \cdot (0.00\ldots01)_2 \cdot 2^{0},$$
which in normalized form is written
$$\epsilon = +1 \cdot (1.00\ldots00)_2 \cdot 2^{-52},$$
so $\epsilon = 2^{-52} \approx 2.22 \cdot 10^{-16}$.

Machine epsilon
- IEEE single precision: $\epsilon = (0.\underbrace{0\ldots0}_{22}1)_2 = 2^{-23} \approx 1.19 \cdot 10^{-7}$, so we may store approximately 7 digits of a decimal number.
- IEEE double precision: $\epsilon = (0.\underbrace{0\ldots0}_{51}1)_2 = 2^{-52} \approx 2.22 \cdot 10^{-16}$, so we may store approximately 16 digits of a decimal number.
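The machine epsilon can also be found experimentally by halving a candidate until adding half of it to 1 no longer changes the result. This is a minimal sketch: `to_single` emulates binary32 by rounding a Python float through `struct`, and both helper names are assumptions made for this example.

```python
import struct
import sys


def to_single(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def machine_epsilon(round_fn=lambda x: x):
    """Halve eps while 1 + eps/2, stored in the target format, still exceeds 1."""
    eps = 1.0
    while round_fn(1.0 + eps / 2) > 1.0:
        eps /= 2
    return eps


eps_double = machine_epsilon()
eps_single = machine_epsilon(to_single)

print(eps_single == 2.0 ** -23)                 # True
print(eps_double == sys.float_info.epsilon)     # True, i.e. 2^-52
```

The loop stops because $1 + \epsilon/2$ lies exactly halfway between 1 and $1 + \epsilon$, and the tie is rounded to the neighbour with an even mantissa, which is 1.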

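Largest integer: single precision
The largest integer is $M = 2^{p}$. For instance, for single precision ($p = 24$), $M = 2^{24} = 16777216$. Every positive integer up to $M$ is represented exactly. Beyond $M$ the integers need 25 binary digits, one more than the 1 + 23 bits of the mantissa, so only every second integer is exact:

| Decimal | Binary (25 digits) | Representation |
|---|---|---|
| 16777216 ($M$) | 1000000000000000000000000 | Exact |
| 16777217 | 1000000000000000000000001 | Rounded |
| 16777218 | 1000000000000000000000010 | Exact |
| 16777219 | 1000000000000000000000011 | Rounded |
| 16777220 | 1000000000000000000000100 | Exact |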

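Largest integer: double precision
For double precision ($p = 53$), $M = 2^{53} = 9007199254740992$. Every positive integer up to $M$ is represented exactly. Beyond $M$ the integers need 54 binary digits, one more than the 1 + 52 bits of the mantissa, so only every second integer is exact:

| Decimal | Representation |
|---|---|
| 9007199254740992 ($M$) | Exact |
| 9007199254740993 | Rounded |
| 9007199254740994 | Exact |
| 9007199254740995 | Rounded |
| 9007199254740996 | Exact |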

Largest integer
- Single IEEE precision: $M = 2^{24} = 16777216$, and we can store exactly all integers of up to 7 digits.
- Double IEEE precision: $M = 2^{53} = 9007199254740992$, and we can store exactly all 15 digit integers and almost all 16 digit integers.
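The exact/rounded pattern around $M$ can be checked by converting integers to floating point and back. In this sketch, single precision is emulated by rounding through `struct`; the helper names are illustrative assumptions.

```python
import struct

M_SINGLE = 2 ** 24
M_DOUBLE = 2 ** 53


def to_single(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def exact_in_double(n):
    """True if the integer n survives a round trip through binary64."""
    return int(float(n)) == n


def exact_in_single(n):
    """True if the integer n survives a round trip through binary32."""
    return int(to_single(float(n))) == n


for offset in range(5):
    print(M_SINGLE + offset, exact_in_single(M_SINGLE + offset),
          M_DOUBLE + offset, exact_in_double(M_DOUBLE + offset))
```

The printed flags alternate True, False, True, False, True in both formats: above $M$ the spacing between consecutive representable numbers is 2.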

Overflow and underflow
The largest normalized number that can be represented in double precision is, in binary format,
$$(\pm 1) \cdot (1.11\ldots11)_2 \cdot 2^{1023}.$$
In decimal format,
$$R_{max} = \pm (2 - 2^{-52}) \cdot 2^{1023} \approx \pm 1.7977 \cdot 10^{308}.$$
The smallest positive normalized number that can be represented in double precision is, in binary format,
$$(\pm 1) \cdot (1.00\ldots00)_2 \cdot 2^{-1022}.$$
In decimal format,
$$R_{min} = \pm 2^{-1022} \approx \pm 2.2251 \cdot 10^{-308}.$$

Overflow and underflow
An overflow error is produced when trying to use a number that is too large (greater than the corresponding $R_{max}$):
- In most computers, execution is aborted.
- The IEEE format may support them by assigning the symbolic values $\pm\infty$ or NaN.
An underflow error is produced when trying to use a number that is too small (less, in absolute value, than the corresponding $R_{min}$). There are two possible behaviors:
- The number lies in the range of denormalized numbers, so it is still representable. In this case precision decreases, and this is called gradual underflow.
- Otherwise, it is identified with 0.
In both cases, execution continues.
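Both behaviors can be observed in double precision with a few lines of Python. This is only a sketch of how one environment reacts: in CPython, float multiplication and division follow the IEEE behavior shown here, while some other operations (for example `2.0 ** 1024`) raise `OverflowError` instead.

```python
import math
import sys

r_max = sys.float_info.max   # largest normalized binary64 number
r_min = sys.float_info.min   # smallest positive normalized binary64 number

overflowed = r_max * 2                 # beyond R_max: replaced by +infinity
gradual = r_min / 1024                 # below R_min: denormalized but still nonzero
flushed = math.ldexp(1.0, -1074) / 2   # below the smallest denormal: identified with 0

print(overflowed, gradual > 0.0, flushed)
```

The value `gradual` equals $2^{-1032}$ and keeps only 43 of the 53 significant bits, which is the loss of precision that gradual underflow refers to.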

4 Rounding

Rounding
When operations lead to a number whose mantissa contains more digits than the precision allows, the number must be approximated by another representable number. In the IEEE 754 standard we have five procedures to approximate $x$:
- Round up: take the closest larger representable number.
- Round down: take the closest smaller representable number.
- Round towards zero (truncation): replace the non-representable digits by zero.
- Round towards infinity: of the two neighbouring representable numbers, take the one farthest from zero.
- Round to nearest, with ties going to the even representable digit (rounding).
The most usual procedures are truncation and rounding.

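Decimal representation rounding
Consider the base 10 number
$$x = \pm d_0.d_1 d_2 \ldots \cdot 10^{n} = \pm \Big( \sum_{k=0}^{\infty} d_k 10^{-k} \Big) \cdot 10^{n}, \qquad (1)$$
with $d_k \in \{0, 1, \ldots, 9\}$ for all $k$, and $d_0 \neq 0$. For a precision of $p$ digits, we have
Truncation:
$$x^* = \pm d_0.d_1 d_2 \ldots d_{p-1} \cdot 10^{n}$$
Rounding:
$$x^* = \begin{cases} \pm d_0.d_1 d_2 \ldots d_{p-1} \cdot 10^{n} & \text{if } 0 \le d_p \le 4, \\ \pm \big( d_0.d_1 d_2 \ldots d_{p-1} + 10^{-(p-1)} \big) \cdot 10^{n} & \text{if } 5 < d_p \le 9, \text{ or } d_p = 5 \text{ and } d_{p+k} \neq 0 \text{ for some } k > 0, \\ \text{nearest number ending in an even digit} & \text{if } d_p = 5 \text{ and } d_{p+k} = 0 \text{ for all } k > 0. \end{cases}$$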

Example
- $x = \ldots$ and $p = 4$: truncation $x^* = \ldots$, rounding $x^* = 1.000$.
- $x = \ldots$ and $p = 3$: truncation $x^* = 0.433$, rounding $x^* = 0.433$.
- $x = 0.4335$ and $p = 3$: truncation $x^* = 0.433$, rounding $x^* = 0.434$ (towards the nearest even representable digit).
- $x = 0.4345$ and $p = 3$: truncation $x^* = 0.434$, rounding $x^* = 0.434$ (towards the nearest even representable digit).
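Decimal truncation and rounding to $p$ significant digits can be reproduced with Python's standard `decimal` module, whose context precision counts significant digits. The helper `approximate` is a name made up for this sketch, and the inputs 0.4335 and 0.4345 are the tie cases implied by the examples above, not values confirmed by the slides.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_EVEN


def approximate(x, p, rounding):
    """Return the decimal string x stored with p significant digits."""
    return Context(prec=p, rounding=rounding).create_decimal(x)


for x in ("0.4335", "0.4345"):
    truncated = approximate(x, 3, ROUND_DOWN)
    rounded = approximate(x, 3, ROUND_HALF_EVEN)
    print(x, truncated, rounded)
```

`ROUND_DOWN` is the module's name for rounding towards zero, and `ROUND_HALF_EVEN` resolves ties towards the even last digit, so both ties end at 0.434.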

Binary representation rounding
In this case, the number takes the form
$$x = \pm 1.b_1 b_2 \ldots \cdot 2^{e} = \pm \Big( \sum_{k=0}^{\infty} b_k 2^{-k} \Big) \cdot 2^{e},$$
with $b_0 = 1$ and $b_k \in \{0, 1\}$ for all $k$. For a precision $p$ (including the hidden bit), we have
Truncation:
$$x^* = \pm 1.b_1 b_2 \ldots b_{p-1} \cdot 2^{e}$$
Rounding:
$$x^* = \begin{cases} \pm 1.b_1 b_2 \ldots b_{p-1} \cdot 2^{e} & \text{if } b_p = 0, \\ \pm \big( 1.b_1 b_2 \ldots b_{p-1} + 2^{-(p-1)} \big) \cdot 2^{e} & \text{if } b_p = 1 \text{ and } b_{p+k} = 1 \text{ for some } k > 0, \\ \text{nearest number ending in 0} & \text{if } b_p = 1 \text{ and } b_{p+k} = 0 \text{ for all } k > 0. \end{cases}$$

Example
- For $x = (\ldots)_2$ and $p = 3$: truncation $x^* = (1.11)_2$, rounding $x^* = (10.0)_2$.
- For $x = (\ldots)_2$ and $p = 3$: truncation $x^* = (1.11)_2$, rounding $x^* = (1.11)_2$.
- For $x = (1.001)_2$ and $p = 3$: truncation $x^* = (1.00)_2$, rounding $x^* = (1.00)_2$ (towards the nearest even representable digit).
- For $x = (1.011)_2$ and $p = 3$: truncation $x^* = (1.01)_2$, rounding $x^* = (1.10)_2$ (towards the nearest even representable digit).

Truncation versus rounding in the binary system
If truncating, we have
$$|x - x_t^*| = \Big( \sum_{k=p}^{\infty} b_k 2^{-k} \Big) \cdot 2^{e} \le 2^{-(p-1)} \cdot 2^{e},$$
where we used the formula for summing a geometric series, $\sum_{k=p}^{\infty} 2^{-k} = 2^{-(p-1)}$.
If rounding, $x$ is always, at worst, halfway between the two nearest representable numbers. Thus,
$$|x - x_r^*| \le \tfrac{1}{2} \cdot 2^{-(p-1)} \cdot 2^{e} = 2^{-p} \cdot 2^{e}.$$
Consequences:
- The largest truncation error is double the largest rounding error.
- The truncation error $x_t^* - x$ always has the same sign (it is non-positive for $x > 0$), while the rounding error may change sign (and compensate).
Therefore, errors are less amplified when using rounding.
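The two bounds can be verified with exact rational arithmetic, which avoids any rounding in the check itself. This is a sketch using the standard `fractions` module; the helper names and the sample value $(1.0111)_2$ are assumptions chosen for illustration.

```python
from fractions import Fraction


def exponent(x):
    """Return e such that 1 <= x / 2**e < 2, for a positive Fraction x."""
    e = 0
    while x >= 2:
        x, e = x / 2, e + 1
    while x < 1:
        x, e = x * 2, e - 1
    return e


def truncate(x, p):
    """Keep the first p binary digits of x > 0 (round towards zero)."""
    ulp = Fraction(2) ** (exponent(x) - (p - 1))
    return (x // ulp) * ulp


def round_half_even(x, p):
    """Round x > 0 to p binary digits, ties to the even last digit."""
    ulp = Fraction(2) ** (exponent(x) - (p - 1))
    return round(x / ulp) * ulp   # round() on a Fraction resolves ties to even


x = Fraction(0b10111, 2 ** 4)   # (1.0111)_2
p = 3
e = exponent(x)
x_t = truncate(x, p)            # (1.01)_2 = 5/4
x_r = round_half_even(x, p)     # (1.10)_2 = 3/2

print(x - x_t, Fraction(2) ** (e - (p - 1)))   # 3/16 1/4
print(abs(x - x_r), Fraction(2) ** (e - p))    # 1/16 1/8
```

Each printed pair shows the actual error followed by its bound, $2^{-(p-1)} \cdot 2^{e}$ for truncation and $2^{-p} \cdot 2^{e}$ for rounding.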

Example
Let $x = (\ldots)_2$. We approximate by:
- Truncation to 5 binary digits, $x_t^* = (1.1001)_2$. Then $x - x_t^* = (\ldots)_2 = \ldots$
- Rounding to 5 binary digits, $x_r^* = (1.1010)_2$. In this case $x - x_r^* = (\ldots)_2 = \ldots$

5 Error

Numerical instability
Rounding errors arising in finite arithmetic are small in each individual computation, but they may accumulate and propagate when an algorithm consists of many computations or iterations, resulting in a large difference between the exact solution and the solution computed numerically. This effect is known as numerical instability of an algorithm.

Example
Consider the sequence $s_k = 1 + 2 + \cdots + k$, for $k = 1, 2, \ldots$, and compute
$$x_k = \frac{1}{s_k} + \frac{2}{s_k} + \cdots + \frac{k}{s_k},$$
whose result is just $x_k = 1$ for all $k = 1, 2, \ldots$
However, in single precision we obtain computed values $x_k^*$ that deviate from 1 (table columns: $k$, $x_k^*$, $|x_k - x_k^*|$).
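The experiment can be sketched in pure Python by rounding every intermediate result to binary32. The helper `to_single` and the function name `x_k_single` are assumptions for this illustration; because each operation is done in binary64 and then rounded, this is an emulation of single precision arithmetic rather than the authors' original computation.

```python
import struct


def to_single(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def x_k_single(k):
    """Accumulate 1/s_k + 2/s_k + ... + k/s_k, rounding each step to single precision."""
    s = to_single(k * (k + 1) / 2)   # s_k = 1 + 2 + ... + k
    total = 0.0
    for j in range(1, k + 1):
        term = to_single(j / s)
        total = to_single(total + term)
    return total


for k in (10, 100, 1000, 10000):
    x = x_k_single(k)
    print(k, x, abs(1.0 - x))
```

The last column is the absolute error $|x_k - x_k^*|$; how it evolves with $k$ depends on how the individual rounding errors accumulate or cancel.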

Absolute and relative errors
There are two main measures of the error made when approximating a number $x$ by an approximation $x^*$:
- Absolute error: $e_a = |x - x^*|$
- Relative error: $e_r = \dfrac{|x - x^*|}{|x|}$
Relative error is independent of the scale and thus often more meaningful than absolute error, as the example table shows (table columns: $x$, $x^*$, $e_a$, $e_r$).

Significant digits
We say that $x^*$ approximates $x$ with $p$ significant digits if $p$ is the largest nonnegative integer such that
$$\frac{|x - x^*|}{|x|} \le 5 \cdot 10^{-p}.$$
Examples:
- $x^* = \ldots$ approximates $x = \ldots$ with $p = 2$ significant digits.
- $x^* = \ldots$ approximates $x = \ldots$ with $p = 2$ significant digits.
- $x^* = 9.998$ approximates $x = 10.00$ with $p = 4$ significant digits: $\dfrac{|x - x^*|}{|x|} = \dfrac{0.002}{10} = 2 \cdot 10^{-4} \le 5 \cdot 10^{-4}$.
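The definition translates directly into a short loop. In this sketch the function name `significant_digits` and the cap `max_p` (which stops the search when the approximation is exact) are assumptions, and the sample values 10.00 and 9.998 assume that reading of the last example.

```python
def significant_digits(x, x_approx, max_p=17):
    """Largest p >= 0 with |x - x*| / |x| <= 5 * 10**-p, or None if no such p exists."""
    rel = abs(x - x_approx) / abs(x)
    if rel > 5:
        return None   # even p = 0 fails
    p = 0
    while p < max_p and rel <= 5 * 10.0 ** -(p + 1):
        p += 1
    return p


print(significant_digits(10.0, 9.998))   # 4
print(significant_digits(100.0, 101.0))  # 2
```

A loop is used instead of a logarithm so that the comparison is exactly the inequality in the definition.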
