P1 Engineering Computation


1 1EC / 1 P1 Engineering Computation David Murray david.murray@eng.ox.ac.uk dwm/courses/1ec Hilary 2001

Why does Computer Arithmetic introduce Error?

Arithmetic Operations

Whenever we write ×, /, + and − in a computer program we invoke built-in subroutines to perform the associated arithmetic operation between two numbers. Can we rely on the results returned? Not entirely. Because computers store and operate on numbers of some restricted wordlength, predictable uncertainties are introduced into the results. These effects are felt most keenly when working with floating point numbers, but integer operations can be affected too, and we touch on this first.


Representing Integers

Within a computer, the representation of integers is exact, as is the performance of integer arithmetic, but only within a finite range of numbers determined by the wordlength.

Consider an $n$-bit word. Each bit is 0 or 1 (2 items), and so $n$ bits can represent $2^n$ different items. One obvious range is from 0 to $2^n - 1$, the range of unsigned integers. In some applications positive numbers suffice. For example, a grey-level image uses an 8-bit number to represent image brightness on a scale of 0 (black) to 255 (white) at each pixel.

More usually, though, we need to split the range evenly between positive and negative numbers. Representing the ± sign in front of a number takes 1 bit of information, so within the $n$-bit wordlength we now have only $(n-1)$ bits to represent the number's modulus.

Integers /ctd

Again let us use 8 bits as a concrete example. As shown in the tables, three possible methods are

1. to use sign and modulus directly, setting the most significant bit as the sign bit. (Not nice.)
2. to bias the result by a fixed offset
3. to use the 2's complement representation

[Tables: Sign,Mod; Biassed; 2's complement]

Integers /ctd

You will learn more about 2's complement arithmetic in the 1P2 course. Our sole aim here has been to bring out three points:

1. Finite wordlength imposes restrictions (here on range). Although the wordlength of most desktop microprocessors is 32 bits, with a maximum integer therefore of order $10^9$, this is not that large!
2. Within range, integer arithmetic is EXACT.
3. There is no pre-ordained computer representation for numbers.

Floating point arithmetic

The 1st and 3rd points just made hold also for floating point arithmetic, but the 2nd does not. That is:

1. Finite wordlength imposes restrictions (on range AND accuracy).
2. Even within range, floating point arithmetic is not necessarily exact.
3. There is no pre-ordained representation.

What does "not necessarily exact" mean? Although many floating point numbers can be represented exactly within a restricted wordlength, others cannot. Now, if an exact number occurs as the result of some extended calculation, one cannot be sure whether the number is an exact representation of one of these special numbers, or an inexact representation of one slightly different. Thus we must assume uncertainty in ALL floating point numbers. The error is called roundoff error.

Representation of Floating Point Numbers

To assess this uncertainty, we need to specify a representation for a floating point number. We also need a recipe for converting from a decimal representation to the bit pattern and vice versa. Although there is no God-given representation, the IEEE (the Institute of Electrical and Electronics Engineers: at God's right hand in these matters!) in the US has defined standards to which computer manufacturers adhere.

You'll know already that on a calculator scientific notation is an effective way of representing numbers in a large range even though the number of digits on the display is small. That is, one writes a few significant digits multiplied by $10^{-7}$ rather than a long string of leading zeros. The significant digits are called the mantissa, and the -7 is called the exponent.

The binary recipe is similar except that 2 is raised to a power, not 10, and the exponent is biassed. The number is represented by Sign, Exponent and Mantissa as follows:

$$f = \text{SIGN} \times 2^{(\text{EXPONENT} - \text{Bias})} \times \text{MANTISSA}$$

IEEE representation using 32 bits

Consider now a 32 bit wordlength:

The Sign need only occupy 1 bit.
The Exponent occupies (by choice) 8 bits. You might think that this would be a 2's complement number, but no. It is a biassed number, and the bias is 127.
The Mantissa occupies the remaining bits, and is normalized so that the most significant bit (msb) is 1.

At first sight the mantissa must be 23 bits, but actually it is 24 bits long. How can this be? The trick is to notice that if the mantissa is always normalized so that its leading bit is 1, then there is no need to represent that bit explicitly! It is referred to as the hidden bit. This gain in accuracy is paid for by the need for a special representation for 0. The binimal point in the mantissa comes just after the hidden bit.

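Examples

According to these rules, the following are the representations for 1 and 0.5:

$1.0 = [+] \times 2^{0} \times [1].000\ldots = [+] \times 2^{(127-127)} \times [1].000\ldots$, so S = 0, E = 01111111, M = [1]000...0
$0.5 = [+] \times 2^{-1} \times [1].000\ldots = [+] \times 2^{(126-127)} \times [1].000\ldots$, so S = 0, E = 01111110, M = [1]000...0

It is quite easy to write a program to write out the bits from numbers stored in a computer. Let's see if everything agrees. Look at the 32-bit representations of 1.0, -1.0, 0.25, 0.26 and 0.0 on a PC:

% bit_decomp f 1.0
S:0 E:01111111 M:(1)00000000000000000000000
(S is the sign; the bracketed (1) is the hidden bit.)

% bit_decomp f -1.0
S:1 E:01111111 M:(1)00000000000000000000000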

More examples

% bit_decomp f 0.25
S:0 E:01111101 M:(1)00000000000000000000000

% bit_decomp f 0.26
S:0 E:01111101 M:(1)00001010001111010111000

The special representation of 0.0 mentioned earlier is that if ALL the actual hardware bits are zero, the number is zero.

% bit_decomp f 0.0
S:0 E:00000000 M:(1)00000000000000000000000

Remember that the bracketed (1) is NOT a hardware bit!

Converting from decimal to binary

How does one convert 0.26 to the IEEE format? One way is to find repeatedly the largest fraction $1/2^m$ which is not greater than the unconverted part of the number:

0.26 = 1/4 + 0.01
     = 1/4 + 1/128 + 0.0021875
     = 1/4 + 1/128 + 1/512 + 0.000234375
     = 1/4 + 1/128 + 1/512 + 1/8192 + ... etc

In binary, 0.26 = 0.0100001010001... etc = 1.00001010001... etc $\times\ 2^{-2}$, so

E = 125 = 01111101
M = [1]00001010001... etc

in agreement with the output of the decomposition program.

% bit_decomp f 0.26
S:0 E:01111101 M:(1)00001010001111010111000

If you carried on, you would find that there is still a remainder even after all the bits were filled in the 24-bit mantissa: 0.26 cannot be represented exactly in a 32 bit floating point number.

The largest and smallest numbers

The manual pages tell me that the largest floating point number representable in 32 bits is 3.40282347e+38. Let's try the program...

%bit_decomp f 3.40282347e+38
S:0 E:11111110 M:(1)11111111111111111111111

Notice that not all the bits are 1. An exponent field of all 1s is reserved for special values such as Infinity, a number not infrequently produced in computer arithmetic by accidentally dividing by zero!

The smallest positive number is written as 1.40129846e-45 and this has the bit pattern

%bit_decomp f gibber-gibber...e-45
S:0 E:00000000 M:(1)00000000000000000000001

Notice the 1 in the least significant bit (rightmost bit) of the mantissa. If it had been zero, the number would be zero.

The accuracy of any number

The smallest number is determined by the number of bits in the exponent (here 8 bits). However, it is the number of bits in the mantissa that determines the accuracy of any number. In the IEEE representation, the leading hidden bit is 1, and each of the explicit 23 bits of the mantissa is either 0 or 1. However, we have no knowledge of bit 24, and so any finite number is uncertain by an amount a factor of $2^{24}$ smaller than the leading bit.

H | 0/1 | 0/1 | ... | 0/1 | 0/1 | ?

So, any finite number $V$ is actually $V(1 \pm \epsilon)$, where $\epsilon = 2^{-24} \approx 6 \times 10^{-8}$. $\epsilon$ is called the ROUNDING or ROUNDOFF ERROR. It is a relative or fractional error, not an absolute error.

This is similar to the uncertainty when using scientific notation on a calculator. Suppose your calculator has a five digit mantissa. A displayed value with exponent E13 is uncertain by a small amount in its last digit, also scaled by E13, and so the number is $V(1 \pm \epsilon)$: again a relative error.

But surely these round off errors are small...

Two pieces of code you have designed for an embedded processor each perform a summation every second. Unfortunately, they seem to produce wildly different results. After careful analysis, you find that in a loop in version 1 you are effectively summing (5 − 10 × 0.5) every second, whereas version 2 involves summing (3 − 10 × 0.3). But both should be summing zero... and therefore give the same results... er, shouldn't they?

You write a test program to see the difference over the period of a year... and find that Sec_in_year * (5-10 * 0.5) and Sec_in_year * (3-10 * 0.3) need not agree. The first is exactly zero, because 5 and 0.5 are stored exactly, but the second need not be, because 0.3 has no exact binary representation.

These sorts of errors can easily build up in shorter periods of time. For example, the loops in the controller for aircraft flight surfaces might run at 1 kHz. A progressive shift on a flight from London to New York does not sound too attractive...

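Here's the program...

    #include <stdio.h>

    int main(void)
    {
        /* All operands are 32-bit floats; 0.5 is exact in binary, 0.3 is not. */
        float pointfive = 0.5f, pointthree = 0.3f,
              three = 3.0f, five = 5.0f, ten = 10.0f,
              siy = 365.0f * 24.0f * 3600.0f;   /* seconds in a year */
        float answer1, answer2;

        /* Both expressions are zero in exact arithmetic. Whether answer2 comes
           out non-zero depends on the precision at which the compiler holds
           the intermediate product ten*pointthree. */
        answer1 = siy * (five - ten * pointfive);
        answer2 = siy * (three - ten * pointthree);

        printf("sec_in_year x (5-10 x 0.5) = %20.18f\n", answer1);
        printf("sec_in_year x (3-10 x 0.3) = %20.18f\n", answer2);
        return 0;
    }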

Sources of Error in Computational Engineering

1. Roundoff error

2. Modelling error. In engineering, mathematical equations are generated to represent some physical system. These mathematical models will rarely represent the physical situation precisely, and must therefore give rise to modelling error. One should always question whether making some piece of numerical analysis ever more precise is justified if the underlying model is imprecise.

3. Approximation/Truncation/Discretization Error. Once you have decided on your model, there may be aspects where you have to make mathematical approximations. For example you may use the approximation

$$e^{x^2} \approx 1 + x^2 + x^4/2! + x^6/3! + x^8/4!$$

Even with zero roundoff error in the calculation of $x^n$, $e^{x^2}$ will still be in error.

Sources of Error /ctd

4. Measurement Error. Often the raw data for a piece of analysis comes from experiment. In lecture 2 we saw some data from a mass spectrometer which you wished to integrate. There, round-off and discretization errors might be insignificant compared with the measurement uncertainty in each datum.

5. Gross Error a.k.a. Bloopers. These arise in experimental measurements, but lie outside the expected probabilistic variation of a quantity. They may arise from human error (say, writing down 3.8 rather than 8.3), or from an intermittent fault occurring in a piece of apparatus.

In the graph we see points with measurement error. The point far away from the fitted line is likely to be a gross error. But then again it could be the first observation of that Nobel

Summary so far...

We have seen that using an exponent/mantissa representation (or scientific notation) leads to fractional errors. Suppose your calculator has a five digit mantissa, using decimal: a displayed value with exponent E13 is actually that value plus or minus a small uncertainty in its last digit, all scaled by E13. Indeed any number is $V(1 \pm \epsilon)$. So, ROUNDOFF ERROR is a FRACTIONAL or RELATIVE ERROR.

Having explored the IEEE 32-bit representation, we found that numbers $V$ are actually $V(1 \pm 2^{-24}) \approx V(1 \pm 6 \times 10^{-8})$.

We also saw that there were several other sources of error, and the question arises of how to combine errors.

Combining Errors

When we combine quantities with error in the computation of an expression, the errors interact in a way that depends on the sizes of the quantities and errors, and on the formula itself. So, for example, if $x_1$ and $x_2$ are in error by some amount, the error in $x_1 x_2$ is different from that in $x_1 + x_2$, and so on. How can we evaluate the error?

Let us suppose that our quantities $x_1, x_2, x_3, \ldots$ are combined into a function $f = f(x_1, x_2, x_3, \ldots)$. Recall from your lectures on Partial Differentiation that an expression for the total or perfect differential of a function of several variables $f = f(x_1, x_2, x_3, \ldots)$ is

$$df = \left(\frac{\partial f}{\partial x_1}\right) dx_1 + \left(\frac{\partial f}{\partial x_2}\right) dx_2 + \left(\frac{\partial f}{\partial x_3}\right) dx_3 + \ldots$$

This is taken in the limit as $dx_1$ etc tend to zero.

Correlated Errors

If we relax this condition, and $df \rightarrow \delta f$, and $dx_1 \rightarrow \delta x_1$ and so on, then we have a 1st order Taylor expansion of the function of several variables. If we equate the $\delta$'s with the errors in the various quantities, this would give a way of assessing the size of error in a quantity $f$. But it assumes we know the relative signs of all the $\delta x_i$'s. That is, we know about the correlation between the quantities and their errors.

For correlated errors:

$$\delta f \approx \delta x_1 \left(\frac{\partial f}{\partial x_1}\right) + \delta x_2 \left(\frac{\partial f}{\partial x_2}\right) + \delta x_3 \left(\frac{\partial f}{\partial x_3}\right) + \ldots$$

Example of Correlated Errors

[Figure: daft example of Correlated Errors, parallax]

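For uncorrelated errors

If the errors are uncorrelated, we add the terms incoherently, by summing the squares of each term. This is also known as adding in quadrature.

$$(\delta f)^2 \approx (\delta x_1)^2 \left(\frac{\partial f}{\partial x_1}\right)^2 + (\delta x_2)^2 \left(\frac{\partial f}{\partial x_2}\right)^2 + (\delta x_3)^2 \left(\frac{\partial f}{\partial x_3}\right)^2 + \ldots$$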

Example 1: Uncorrelated Errors in a Sum

Summation. Suppose $f = x_1 + x_2 + \ldots + x_N = \sum_{i=1}^{N} x_i$. Then each $\partial f/\partial x_i = 1$ and the formula for uncorrelated errors reduces to

$$(\delta f)^2 = \sum_{i=1}^{N} (\delta x_i)^2.$$

In other words, add the absolute errors in quadrature.

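Example 2: Uncorrelated Errors in a Product

Products. Suppose $f = \prod_{i=1}^{N} x_i$. Each $\partial f/\partial x_i = f/x_i$, so for uncorrelated errors

$$(\delta f)^2 = (\delta x_1)^2 \left(\frac{f}{x_1}\right)^2 + (\delta x_2)^2 \left(\frac{f}{x_2}\right)^2 + (\delta x_3)^2 \left(\frac{f}{x_3}\right)^2 + \ldots$$

Divide through by $f^2$ and you'll find

$$\left(\frac{\delta f}{f}\right)^2 = \sum_{i=1}^{N} \left(\frac{\delta x_i}{x_i}\right)^2.$$

In other words, add the fractional errors in quadrature.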

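Example 3: Slightly more involved...

You perform an experiment to measure the viscosity of custard.

[Figure: apparatus, labelled Custard, θ, Radius a, Radius b, Ω]

Viscosity is given by

$$V = \frac{\Gamma}{4\pi\Omega}\left[\frac{1}{a^2} - \frac{1}{b^2}\right]$$

What is the error in $V$ given errors in $\Gamma$, $\Omega$, $a$ and $b$? The perfect differential is:

$$dV = \frac{d\Gamma}{4\pi\Omega}\left[\frac{1}{a^2} - \frac{1}{b^2}\right] - \frac{\Gamma\, d\Omega}{4\pi\Omega^2}\left[\frac{1}{a^2} - \frac{1}{b^2}\right] + \frac{\Gamma}{4\pi\Omega}\left[\frac{-2\,da}{a^3}\right] + \frac{\Gamma}{4\pi\Omega}\left[\frac{2\,db}{b^3}\right]$$

$$\phantom{dV} = \frac{d\Gamma}{\Gamma} V - \frac{d\Omega}{\Omega} V + \frac{\Gamma}{4\pi\Omega}\left[\frac{-2\,da}{a^3}\right] + \frac{\Gamma}{4\pi\Omega}\left[\frac{2\,db}{b^3}\right]$$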

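Viscosity of custard

Let's assume uncorrelated errors! So sum the squares instead...

$$(\delta V)^2 = V^2\left(\frac{\delta\Gamma}{\Gamma}\right)^2 + V^2\left(\frac{\delta\Omega}{\Omega}\right)^2 + \left(\frac{\Gamma}{4\pi\Omega}\right)^2 4\left[\frac{\delta a^2}{a^6} + \frac{\delta b^2}{b^6}\right]$$

Divide Left and Right by $V^2$:

$$\left(\frac{\delta V}{V}\right)^2 = \left(\frac{\delta\Gamma}{\Gamma}\right)^2 + \left(\frac{\delta\Omega}{\Omega}\right)^2 + 4\left[\frac{\delta a^2}{a^6} + \frac{\delta b^2}{b^6}\right] \Big/ \left[\frac{1}{a^2} - \frac{1}{b^2}\right]^2$$

And so

$$\left(\frac{\delta V}{V}\right)^2 = \left(\frac{\delta\Gamma}{\Gamma}\right)^2 + \left(\frac{\delta\Omega}{\Omega}\right)^2 + \frac{4}{(b^2 - a^2)^2}\left[b^4\left(\frac{\delta a}{a}\right)^2 + a^4\left(\frac{\delta b}{b}\right)^2\right]$$

So, suppose the errors in $\Gamma$ and $\Omega$ were each 1%. Let $a$ be 2cm and $b$ be 3cm, each known to 1%. Then

$$\left(\frac{\delta V}{V}\right)^2 = \left[1 + 1 + \frac{4}{25}(81 + 16)\right] \times 10^{-4} \approx 10^{-4}\,(18)$$

so the fractional error is about 4.2%.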

Accumulation of roundoff errors

Now suppose we performed some computation with $N$ multiplications between numbers each with roundoff error $\epsilon$:

$$P = x_1 x_2 \ldots x_N$$

Recall that the roundoff error is a relative or fractional error. If the errors are uncorrelated, we sum the squares of the fractional errors. This gives the roundoff error in the product as

$$\epsilon_P \approx \sqrt{\epsilon^2 + \epsilon^2 + \ldots} = \sqrt{N}\,\epsilon,$$

where we assume that the roundoff error is similar in each number $x_i$.

However, the errors can easily be correlated, most obviously when computing $x_1^3$ using $P = x_1 x_1 x_1$. Then, using the earlier formula,

$$\epsilon_P \approx \epsilon + \epsilon + \ldots = N\epsilon = 3\epsilon$$

Accumulation of roundoff errors /ctd

In fact roundoff error is a little more unkind. The result is in error not only because the individual numbers are in error, but also because of roundoff error in the result itself. If $N$ is large this doesn't matter much, but most often in computer code $N = 2$.

Example using Product. Suppose we take the product of two numbers $x_1$ and $x_2$ with roundoff errors $\epsilon_1$ and $\epsilon_2$. Assume that the machine roundoff error is $\epsilon$. Taking the worst case, i.e. correlated errors, we have the product

$$\text{Prod} = x_1(1+\epsilon_1)\,x_2(1+\epsilon_2) + \epsilon\, x_1 x_2 \approx x_1 x_2 (1 + \epsilon_1 + \epsilon_2 + \epsilon)$$

The roundoff error in the product is

$$\epsilon_{12} = \frac{\text{Prod}}{x_1 x_2} - 1 = \epsilon_1 + \epsilon_2 + \epsilon.$$

Same for DIVISION of course!

Roundoff error in Sums and differences of quantities

Example using Sum. Now consider the sum with correlated errors

$$\text{Sum} = x_1(1+\epsilon_1) + x_2(1+\epsilon_2) + \epsilon\,(|x_1| + |x_2|)$$

Now the roundoff error is (TYPO in the NOTES!!)

$$\frac{\text{Sum}}{x_1 + x_2} - 1 = \frac{x_1(1+\epsilon_1) + x_2(1+\epsilon_2) + \epsilon\,(|x_1| + |x_2|)}{x_1 + x_2} - 1$$

Assuming that $\epsilon_1 \ll 1$ and $\epsilon_2 \ll 1$, so that the errors in the inputs can be neglected,

$$\text{Roundoff} = \epsilon\,\frac{|x_1| + |x_2|}{|x_1 + x_2|}$$

which is the machine roundoff error AMPLIFIED by

$$\frac{|x_1| + |x_2|}{|x_1 + x_2|}$$

This is very bad news when we are subtracting similarly sized numbers, because then $x_1 \approx -x_2$ and the amplification is LARGE!

Practical example

Consider finding the roots of a quadratic $f = ax^2 + bx + c$. The roots are at

$$x_{1,2} = \left(-b \pm \sqrt{b^2 - 4ac}\right)/2a$$

When $ac \ll b^2$, the $x_2$ root should tend to $-c/b$ (an exercise for the reader) and hence approach zero. However the obvious code becomes prone to rounding error, as the jagged line in the plot shows.

[Plot: curves "out12", "out22" and "out32"] The modulus of the $x_2$ root plotted as a function of $a$. As $a$ and $c$ are reduced, the first routine shows the effects of rounding error. In the plot, $b = 10^4$, $a = c$ and the abscissa is $\log_{10}(a)$. The ordinate is $\log_{10}(|x_2|)$.

1: Quadroots1(a, b, c; x_1, x_2)   BAD!!
2: x_1

Modified code

Also shown are the results for $x_2 = -c/b$ and for the routine taken from Numerical Recipes in C. Note that these latter results agree with each other: they continue to do so for at least 20 more orders of magnitude of reduction in $a$ and $c$. How small the code changes appear!

    Quadroots2(a, b, c; x_1, x_2)
    if (b < 0.0) then
        sign ← −1.0;
    else
        sign ← 1.0;
    end if
    q ← −(b + sign × Sqrt(b × b − 4.0 × a × c))/2.0;
    x_1 ← q/a;
    x_2 ← c/q;

[Plot: curves "out12", "out22" and "out32"]

And you thought the viscosity of custard was a joke

Learn more about the viscosity of custard at ... and search on custard.

Viscosity is an important determinant of custard quality. Two commercial samples of starch based custard powders were cooked on a hot plate according to their packet directions. Viscosity of Brand 1 was three times higher than that of Brand 2. The same brands were then tested in the RVA using milk and the following profile at the recommended concentrations (28.0 g total weight). Brand 1 gave a final viscosity about twice that of Brand 2 (Figure 1), reflecting the trend in the hot plate prepared samples.
