Introduction to numerical algorithms

Given an algebraic equation or formula, we may want to approximate its value. In calculus, we deal with equations and formulas that are well defined at each point and that have properties such as continuity and differentiability; in engineering applications, such information is not always available. Instead, we may understand that the underlying behaviour is continuous and differentiable, but we may only have samples of the values of the equation or formula. Our goal will be to find algorithms that, under many circumstances, give us good approximations of solutions to an equation or of the value of a formula. In this introductory chapter, we will look at:

1. techniques used in numerical algorithms,
2. sources of error, and
3. the representation of floating-point numbers.

We will begin with the techniques.

Techniques in numerical algorithms

Algorithms for finding numerical approximations to solutions of algebraic equations and formulas generally use at least one of six techniques:

1. iteration,
2. linear algebra,
3. interpolation,
4. Taylor series,
5. bracketing, and
6. weighted averages.

We will look at each of these six techniques, and while each is relatively straightforward on its own, we will see that together they allow solutions to some of the most complex algebraic equations and formulas to be computed.

Iteration

This section will first introduce the concept of iteration, then look at a straightforward example, the fixed-point theorem, and it will conclude with a discussion of initial points for such iterations.

Many numerical algorithms involve taking a poor approximation x_k and from it finding a better approximation x_{k+1}; this process can be repeated so that, under certain conditions and usually only in theory, the approximations get closer and closer to the correct answer. Problems with iterative approaches include:

1. the sequence converges, but very slowly,
2. the sequence converges to a solution that is not the one we are looking for,
3. the sequence diverges (approaches plus or minus infinity), or
4. the sequence does not converge, even though it does not diverge.

When we discuss the fixed-point theorem, we will see examples of each of these.

Fixed-point theorem

The easiest example of an iterative means of approximating a solution to an equation is finding a solution to the equation x = f(x) for some function f.

In this case, if we start out with any initial approximation x_0 and let x_{k+1} = f(x_k), then under specific circumstances the fixed-point theorem says that the sequence will converge to a solution of the equation x = f(x).

Example 1

As an example, suppose we want to approximate a solution to the equation x = cos(x). The only solution to this equation is approximately 0.73908513321516064166. Now, we know that π/4 ≈ 0.7853981635 and cos(π/4) = √2/2 ≈ 0.7071067810, so let's start with x_0 = √2/2; but as computers cannot store √2/2 exactly, we will start out with a 20-decimal-digit approximation, x_0 = 0.70710678118654752440, and so we begin:

x_1 = cos(x_0) = cos(0.70710678118654752440) = 0.76024459707563015125.

If we repeat this, we have the sequence of values presented here:

x_0  = 0.70710678118654752440
x_1  = 0.76024459707563015125
x_2  = 0.72466748088912622790
x_3  = 0.74871988578948429789
x_4  = 0.73256084459224179590
x_5  = 0.74346421131529366888
x_6  = 0.73612825650085194340
x_7  = 0.74107368708371021764
x_8  = 0.73774415899257467163
x_9  = 0.73998776479587092315
x_10 = 0.73847680872455379036
x_11 = 0.73949477113197436584
x_12 = 0.73880913418406974579
x_13 = 0.73927102133010927466
x_14 = 0.73895990397625177601
x_15 = 0.73916948334137422989
x_16 = 0.73902831132627283515
x_17 = 0.73912340792986369997
x_18 = 0.73905935036556661907
x_19 = 0.73910250060713648065
x_20 = 0.73908514737681824432

The easiest way to observe this is to take any older-generation calculator, randomly punch in any number, and then start hitting the cos key.
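To reproduce this experiment in code, the iteration is only a few lines. Here is a minimal C sketch; the starting value and the count of twenty iterations are taken from the example above.

#include <stdio.h>
#include <math.h>

int main( void ) {
    double x = 0.70710678118654752440;   /* x_0, a 20-digit approximation of sqrt(2)/2 */

    /* Apply the fixed-point iteration x_{k+1} = cos(x_k) twenty times */
    for ( int k = 1; k <= 20; ++k ) {
        x = cos( x );
        printf( "x_%-2d = %.20f\n", k, x );
    }

    return 0;
}

Because double-precision numbers carry only about sixteen significant decimal digits, the printed values will agree with the 20-digit table above only to roughly that precision.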

Example 2

If we try instead to approximate a solution to x = sin(x), we know the only solution to this equation is x = 0. If, however, we start with x_0 = 1.0, the first step looks hopeful, x_1 = sin(1.0) = 0.84147098480789650665, but we might rightfully have reason for concern when it takes nine iterations to get a value less than 0.5:

x_0 = 1.00000000000000000000
x_1 = 0.84147098480789650665
x_2 = 0.74562414166555788889
x_3 = 0.67843047736074022898
x_4 = 0.62757183204915913888
x_5 = 0.58718099657343098933
x_6 = 0.55401639075562963033
x_7 = 0.52610707550284170222
x_8 = 0.50217067626855534868
x_9 = 0.48132935526234634496

In this case, the convergence is very slow; after 10000 iterations, our approximation is still x_10000 = 0.017313621122353677159, far less accurate than twenty iterations with the equation x = cos(x).

Example 3

The equation x = e^x - 1 - cos(x) has a solution at x = 0.92713713388001711273, but if we start with x_0 = 0.9271, we find that we actually converge to the other solution at x = -1.1132554312382240490.

x_0  =  0.9271000000000000000
x_1  =  0.9270135853516848682
x_2  =  0.9267260909475873311
x_3  =  0.9257697888633520838
x_4  =  0.9225906690922362275
x_5  =  0.9120425734814909414
x_6  =  0.8772702797682118268
x_7  =  0.7650749295271742882
x_8  =  0.4278249398787326488
x_9  = -0.3759527882387797054
x_10 = -1.2435234699343957626
x_11 = -1.0330954581334626660
x_12 = -1.1562590805182166271
x_13 = -1.0881053100268105210
x_14 = -1.1273102952777548147
x_15 = -1.1051875733572243861
x_16 = -1.1178180735432224195
x_17 = -1.1106528702068170966
x_18 = -1.1147327749625783986
x_19 = -1.1124144944595464182
x_20 = -1.1137333605438615669

Example 4

If we take the same equation, x = e^x - 1 - cos(x), but instead we start with x_0 = 0.9272, we find a different result:

x_0  = 0.9272
x_1  = 0.9273463062488297850
x_2  = 0.9278331540644906092
x_3  = 0.9294536680953979168
x_4  = 0.9348530289767798115
x_5  = 0.9529024459698862581
x_6  = 1.0139056955772054701
x_7  = 1.2277962596009892473
x_8  = 2.0773844215152758665
x_9  = 7.4687566436661236317
x_10 = 1751.0506722832620736
x_11 = 2.9624055016857770800 × 10^760

Example 5

If we consider the equation x = 1 + cos(x) - e^x, we note that this equation has only one solution, approximately at x = 0.41010429603233999790; however, after 10000 iterations, the sequence neither converges to our solution, nor does it converge to any other solution, nor does it diverge to infinity. Instead, the values always remain bounded.

x_0  =  1.0000000000000000000
x_1  = -1.1779795225909055180
x_2  =  1.0748919751463956454
x_3  = -1.4538491475114394034
x_4  =  0.8830116594692874759
x_5  = -0.7833444047171225754
x_6  =  1.2516820387258261270
x_7  = -2.1824931015815898224
x_8  =  0.3129825494937142901
x_9  =  0.5839218156734147546
x_10 =  0.0412502764636847386
x_11 =  0.9570364387415620172
x_12 = -1.0280228200916764844
x_13 =  1.1587993460955800111
x_14 = -1.7856655640731183991
x_15 =  0.6190949071537133250
x_16 = -0.0428422865585101293
x_17 =  1.0410199321213100996
x_18 = -1.3267636954142391473
x_19 =  0.9762831542458109225
x_20 = -1.0944657282613892553

To demonstrate this more clearly, a plot of the first 2000 iterations together with the actual solution x = 0.41010429603233999790 shows that the approximations jump both above and below this value, but never converge to it.

Example 6

Finally, if we consider the equation x = 3.5x(1 - x), we note that this equation has two solutions, at x = 0 and approximately at x = 0.71428571428571428571; however, after twenty iterations, we note the points bounce between four values, none of which is either solution.

x_0  = 0.50000000000000000000
x_1  = 0.87500000000000000000
x_2  = 0.38281250000000000000
x_3  = 0.82693481445312500000
x_4  = 0.50089769484475255013
x_5  = 0.87499717950387996645
x_6  = 0.38281990377447189380
x_7  = 0.82694088767001590788
x_8  = 0.50088379589339714035
x_9  = 0.87499726616686585021
x_10 = 0.38281967628581869058
x_11 = 0.82694070106983887107
x_12 = 0.50088422294386791022
x_13 = 0.87499726352424938150
x_14 = 0.38281968322263632519
x_15 = 0.82694070675984845448
x_16 = 0.50088420992179774086
x_17 = 0.87499726360484968051
x_18 = 0.38281968301106208420
x_19 = 0.82694070658630209823
x_20 = 0.50088421031897331932

Convergence criteria

If a sequence of points converges to a point x, it is necessary that

lim_{k→∞} |x_k - x| = 0,

but for numerical solutions, we don't require the exact answer, only an approximation, and thus we may only require that

|x_k - x| < ε_abs.

Even then, we can't always guarantee the sequence will converge. Of course, we don't know when we're sufficiently close, because we don't know the actual value x; thus, we will instead look at

|x_{k+1} - x_k| < ε_abs.

Unfortunately, even this does not guarantee convergence, as we saw with the example with x = sin(x): |x_10000 - x_9999| ≈ 0.0000008651, but we are still quite far away from the solution, since |x_10000 - 0| ≈ 0.01731. Where possible, we may require other convergence criteria.
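Such a stopping rule is straightforward to code. The following C sketch applies the test |x_{k+1} - x_k| < ε_abs together with a cap on the number of iterations; the function name fixed_point, the tolerance 1e-10 and the cap of 1000 iterations are arbitrary illustrative choices.

#include <stdio.h>
#include <math.h>

/* Iterate x <- f(x) until successive iterates differ by less than eps_abs,
 * or until max_iterations have been performed; return the last iterate. */
double fixed_point( double (*f)( double ), double x0,
                    double eps_abs, int max_iterations ) {
    double x = x0;

    for ( int k = 0; k < max_iterations; ++k ) {
        double x_next = f( x );

        if ( fabs( x_next - x ) < eps_abs ) {
            return x_next;      /* the step is small, so accept the iterate */
        }

        x = x_next;
    }

    return x;                   /* the criterion was never satisfied */
}

int main( void ) {
    printf( "%.16f\n", fixed_point( cos, 0.70710678118654752440, 1.0e-10, 1000 ) );
    printf( "%.16f\n", fixed_point( sin, 1.0,                    1.0e-10, 1000 ) );
    return 0;
}

For x = cos(x), the first call returns an accurate approximation after a few dozen iterations; for x = sin(x), the cap of 1000 iterations is reached first and the returned value is still far from the solution 0, which is exactly the slow-convergence failure discussed above.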

Initial values

As two of the examples demonstrated, using slightly different initial values can result in drastically different behaviours: one sequence converged to the other solution, while the other diverged to infinity. In other cases, it can be shown that iterative methods will converge, but only if the initial approximation is sufficiently close to the actual solution being sought. In general, given an arbitrary iterative method, there are no conditions that tell you where to start. However, as an engineering student, when you use such techniques, you should already know approximately what the solution should be, and that information should give you reasonable initial values and reasonable tests as to whether or not the approximation is the desired one.

Summary of iteration

As you may have noticed, iteration can be very useful in finding approximations to solutions of equations, but its use allows for many possible failures. Consequently, any iterative algorithm must include checks for slow convergence, convergence to an unwanted solution, divergence and failure to converge.

Linear Algebra

The next tool for solving algebraic equations is finding approximations to solutions of linear equations. Given a system of n linear equations in n unknowns, the objective is to find a solution that satisfies all n linear equations. In general, these are the only systems of equations that we can reliably solve, and therefore in many cases we will linearize a problem: convert a non-linear equation into one that is linear, or a system of non-linear equations into a system of linear equations. In solving the linear system, we hope that it will give us information about the solution of the non-linear equations.

In your course on linear algebra, you have already been exposed to Gaussian elimination. While this technique can be used to find numeric approximations of solutions to a system of linear equations, it is slow (O(n^3) for a system of n linear equations in n unknowns) and it is subject to round-off error, and if certain precautions are not taken, the approximation can have a significant error associated with it. There are iterative techniques for approximating solutions to systems of linear equations that are particularly effective for large sparse systems.

Interpolation

Given a set of n points (x_1, y_1), ..., (x_n, y_n), if all the x values are different, there exists a polynomial of degree n - 1 that passes through all n points. This technique will often be used to convert a set of n observations into a continuous function.

Taylor series

A Taylor series describes the behaviour of a function near a point in terms of the value of the function and its derivatives at that point. Taylor series will be used primarily for error analysis, although with techniques such as automatic differentiation (where the derivative of a Matlab or C function can be deduced algorithmically), it is possible to use Taylor series in numerical computations. As an example of automatic differentiation, from the C function

#include <math.h>

double f( double x, double y ) {
    return 1.0 + x + x*(x*x - x*y*sin(x));
}

it could be deduced that the partial derivatives are

double f_x( double x, double y ) {
    return 1.0 + x*x - x*y*sin(x) + x*(2*x - y*sin(x) - x*y*cos(x));
}

double f_y( double x, double y ) {
    return -x*x*sin(x);
}

These could then be compiled and called directly.
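As a quick check that such machine-generated derivatives are sensible, f_x can be compared against a centred finite difference (f(x + h, y) - f(x - h, y))/(2h). The test point (1.2, 0.7) and the step size h = 1e-6 in the C sketch below are arbitrary choices, and the definitions above are repeated so the example is self-contained.

#include <stdio.h>
#include <math.h>

double f  ( double x, double y ) { return 1.0 + x + x*(x*x - x*y*sin(x)); }
double f_x( double x, double y ) { return 1.0 + x*x - x*y*sin(x) + x*(2*x - y*sin(x) - x*y*cos(x)); }

int main( void ) {
    double x = 1.2, y = 0.7, h = 1.0e-6;

    /* centred-difference approximation of the partial derivative df/dx */
    double approx = ( f( x + h, y ) - f( x - h, y ) )/( 2.0*h );

    printf( "automatic derivative: %.12f\n", f_x( x, y ) );
    printf( "centred difference:   %.12f\n", approx );
    return 0;
}

The two printed values should agree to roughly ten significant digits, the difference being the truncation and rounding error of the finite-difference formula.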

Also, given a set of n points (x_1, y_1), ..., (x_n, y_n), if we allow the x values to converge on a single point (again, without any repetition except in the limit), the limit of the interpolating polynomials will be the (n - 1)th-order Taylor series approximation of the function at that limit point.

Bracketing

In some cases, it is simply not possible to use interpolation or Taylor series to find approximations to solutions of equations. In such cases, it may be necessary to revert to the intermediate-value theorem. For example, if we are attempting to approximate a root of a function f(x) and we know that f(x_1) < 0 and f(x_2) > 0, then if the function is continuous, there must be a root on the interval [x_1, x_2]. If we let x_3 = (x_1 + x_2)/2, then the sign of f(x_3) will let us know whether the root is in [x_1, x_3] or in [x_3, x_2].

Weighted averages

Finally, another approach to finding numerical approximations is to use weighted averages. A simple average of n values is the sum of those values divided by n,

(x_1 + x_2 + ··· + x_n)/n,

but a simple average may not always be the best approximation of the value in question. In some cases, we may have a collection of weights c_1, ..., c_n where

c_1 + c_2 + ··· + c_n = 1;

then

c_1 x_1 + c_2 x_2 + ··· + c_n x_n

is a weighted average of the n x values. When c_1 = c_2 = ··· = c_n = 1/n, the weighted average is the simple average.

As an example, suppose we wanted to approximate the average value of the sine function on [1.0, 1.2] with three function evaluations. One solution may be to calculate

(sin(1.0) + sin(1.1) + sin(1.2))/3 ≈ 0.88823914361218606542;

however, the weighted average

(sin(1.0) + 2 sin(1.1) + sin(1.2))/4 ≈ 0.88898119772449838406

(here c_1 = 0.25, c_2 = 0.5 and c_3 = 0.25) is closer to the actual average value

(1/0.2) ∫_{1.0}^{1.2} sin(x) dx ≈ 0.88972275695733069880.

You may notice that the error of the weighted average is almost exactly half that of the simple average (0.00074156 versus 0.00148361). We will see later why there are good theoretical reasons for what may appear to be a coincidence.
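These three values are easy to reproduce; the exact average uses the antiderivative -cos(x), so no numerical integration is needed. A minimal C sketch:

#include <stdio.h>
#include <math.h>

int main( void ) {
    double s0 = sin( 1.0 ), s1 = sin( 1.1 ), s2 = sin( 1.2 );

    double simple   = ( s0 + s1 + s2 )/3.0;             /* equal weights 1/3, 1/3, 1/3 */
    double weighted = 0.25*s0 + 0.5*s1 + 0.25*s2;       /* weights 1/4, 1/2, 1/4       */
    double exact    = ( cos( 1.0 ) - cos( 1.2 ) )/0.2;  /* (1/0.2) * integral of sin   */

    printf( "simple   %.17f   error %.8f\n", simple,   fabs( exact - simple ) );
    printf( "weighted %.17f   error %.8f\n", weighted, fabs( exact - weighted ) );
    printf( "exact    %.17f\n", exact );
    return 0;
}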

Summary of numerical techniques

In summary, there are six techniques that we will be using to find numeric approximations to algebraic equations and formulas. Every algorithm we study will use at least one of these techniques, and often more. Next, we will look at the sources of error.

Sources of error

One source of error in numerical computations is rounding error, and this manifests itself in two ways:

1. Certain numbers, such as π, have non-terminating, non-repeating decimal representations. Such numbers cannot be stored exactly.
2. The result of many arithmetic operations, including most divisions, anything but integer multiplications, the sum of numbers with very different magnitudes, and the subtraction of very similar numbers, will either introduce additional rounding errors or amplify the effect of previous rounding errors.

As an example, suppose we want to calculate the average of two values that are approximately equal: the most obvious solution is to calculate

c = (a + b)/2,

but what happens if a + b results in a numeric overflow? If we assume b > a, then while

c = a + (b - a)/2

is algebraically equivalent to the straightforward calculation, this formula is not subject to numeric overflow.

There are other sources of error:

1. the values used may themselves be subject to error: a sensor may only be so precise, or the sensor itself could be subject to a bias (for example, always reading slightly high or low), and
2. the model being used may itself be incorrect.

Representation of numbers

This section will briefly describe the various means of storing numbers, including:

1. representations of integers,
2. floating-point representations of real numbers,
3. fixed-point representations of real numbers, and
4. the representation of complex numbers.

This course will focus on the second, the floating-point representation of real numbers, but we will at least introduce the other three.

Base 2 in favour of base 10

From early childhood, we have learned to count to 9, and having maxed out the number of digits available, we proceed to writing 10. This is referred to as base 10, as there are ten digits: 0, 1, 2, ..., 8 and 9. It would be possible to have a computer store a base-10 number using, for example, 10 different voltages, but it is easier to use just two voltages, thereby allowing only two digits: 0 and 1. Thus, 0 and 1 represent the first two numbers, but the next must be 10, after which we have 11, and then 100.

Thus, 10 represents two, 11 represents three, 100 represents four, and so on. The first seventeen numbers are shown in this table.

Decimal   Binary
   0           0
   1           1
   2          10
   3          11
   4         100
   5         101
   6         110
   7         111
   8        1000
   9        1001
  10        1010
  11        1011
  12        1100
  13        1101
  14        1110
  15        1111
  16       10000

To differentiate between base-10 numbers ("decimal numbers") and base-2 numbers ("binary numbers"), if the possibility of ambiguity exists, the base is appended as a subscript, so 11_10 = 1011_2.

You may wonder whether this is efficient, as it takes 5 digits to represent sixteen, whereas base 10 requires only two digits. The additional memory, however, is only a constant multiple: it requires approximately log_2(10) ≈ 3.3 times as many binary digits ("bits") as decimal digits to represent the same number. Thus, while one million may be represented with seven decimal digits, it requires 20 bits (1000000_10 = 11110100001001000000_2).

Examples of binary addition and subtraction are presented here; always remember that binary arithmetic is just like decimal arithmetic, only 1 + 1 = 10 and 11 + 1 = 100, etc.

    1000111001 + 10010101 = 1011001110    (569 + 149 = 718)
    1000111001 - 10010101 =  110100100    (569 - 149 = 420)

Long multiplication and long division work the same way, for example,

    1000111001 × 10010101 = 10100101100101101    (569 × 149 = 84781)
    1000111001 ÷ 10010101 ≈ 11.11010001          (569 ÷ 149 ≈ 3.8164, the quotient truncated after eight fractional bits)

To convert a decimal number into a binary number is tedious, and the reader is welcome to look this topic up on his or her own; however, we will make one comment: just because a number has a finite representation in base 10, such as 0.3, this does not mean that its binary representation will also be finite.

The conversion of a number from binary to decimal is quite straightforward. Recall that the decimal integer d_n ··· d_1 d_0 represents the number

d_n 10^n + ··· + d_1 10^1 + d_0 10^0,

so 5402 is 5000 + 400 + 0 + 2. Similarly, each bit corresponds to a power of two, so the binary integer b_n ··· b_1 b_0 represents the number

b_n 2^n + ··· + b_1 2^1 + b_0 2^0,

so 1101_2 is 8 + 4 + 0 + 1 = 13. This also works for real numbers, where 0.1_2 represents 2^(-1) or 0.5, 0.01_2 represents 0.25, and so on. Thus, 101010.0010111_2 represents 32 + 8 + 2 + 0.125 + 0.03125 + 0.015625 + 0.0078125 = 42.1796875.
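Since each digit is simply a coefficient of a power of two, this conversion is a short loop. The following C sketch converts a string of bits, with an optional radix point, into a double; the function name binary_to_double is an arbitrary choice, and no error checking is performed.

#include <stdio.h>

/* Convert a binary string such as "101010.0010111" to its value. */
double binary_to_double( const char *s ) {
    double value = 0.0;

    /* integer part: each new bit doubles the running value and adds 0 or 1 */
    for ( ; *s != '\0' && *s != '.'; ++s ) {
        value = 2.0*value + (*s - '0');
    }

    /* fractional part: successive bits are worth 1/2, 1/4, 1/8, ... */
    if ( *s == '.' ) {
        double place = 0.5;
        for ( ++s; *s != '\0'; ++s, place /= 2.0 ) {
            value += place*(*s - '0');
        }
    }

    return value;
}

int main( void ) {
    printf( "%f\n", binary_to_double( "1101" ) );            /* 13.000000 */
    printf( "%f\n", binary_to_double( "101010.0010111" ) );  /* 42.179688 */
    return 0;
}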

Representations of integers

Integers are generally stored in computers either as an n-bit unsigned integer, capable of storing values from 0 to 2^n - 1, or as a signed integer using 2's complement, capable of storing values from -2^(n-1) to 2^(n-1) - 1. In general, positive numbers are always stored using a base-2 representation, where the kth bit represents the coefficient of 2^k in the binary expansion of the number. For example, with the bits numbered from 15 (most significant) down to 0 (least significant), 0000000000101010 represents 2^1 + 2^3 + 2^5 = 42. Note, however, that if a system is little endian (discussed elsewhere), such a 16-bit binary representation would be stored in main memory as 0010101000000000.

The 2's-complement representation storing both positive and negative integers is as follows. Given n bits:

1. if the first bit is 0, the remaining n - 1 bits represent integers from 0 to 2^(n-1) - 1 using a base-2 representation, while
2. if the first bit is 1, the remaining n - 1 bits store a value b from 0 to 2^(n-1) - 1, and the number represented is -(2^(n-1) - b), so negative numbers range from -2^(n-1) to -1.

The easiest way to calculate the representation of a negative integer is to take the positive number from 1 to 2^(n-1) (from 000...001 to 100...000), take its bit-wise NOT (complement) (giving 111...110 to 011...111) and add 1 to the result (giving 111...111 to 100...000). Note this forces the first bit to be 1. For example, given the 16-bit representation of 42_10 = 101010_2, the 16-bit 2's-complement representation of -42 is

 x     = 0000000000101010
~x     = 1111111111010101
~x + 1 = 1111111111010110

All positive integers have a leading 0 and all negative numbers have a leading 1. Incidentally, the most negative number is 1000000000000000, while the representation of -1 is 1111111111111111. If you ask most libraries for the absolute value of the most negative number, it comes back unchanged: a negative number.

The most significant benefit of the 2's-complement representation is that addition does not require additional checks. For example, we can find -42 + 10 by calculating

1111111111010110 + 0000000000001010 = 1111111111100000.

This result is negative (the first bit is 1), and thus we calculate the additive inverse of the result:

 y     = 1111111111100000
~y     = 0000000000011111
~y + 1 = 0000000000100000

That is, the sum is -32.
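These bit manipulations can be verified directly in C. The sketch below assumes a two's-complement machine on which converting an out-of-range value to int16_t simply wraps, which is essentially universal in practice:

#include <stdio.h>
#include <stdint.h>

int main( void ) {
    int16_t x = 42;

    /* bit-wise complement plus one gives the two's-complement negation */
    int16_t neg = (int16_t)( ~x + 1 );
    printf( "~42 + 1   = %d\n", neg );                              /* -42 */

    printf( "-42 + 10  = %d\n", (int16_t)( neg + 10 ) );            /* -32 */

    /* negating the most negative value wraps back to itself */
    int16_t most_negative = INT16_MIN;                              /* 1000000000000000 */
    printf( "-(-32768) = %d\n", (int16_t)( ~most_negative + 1 ) );  /* -32768 */

    return 0;
}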

While there have previously been other digital formats (for example, binary-coded decimal), these representations for positive integers and signed integers are almost universal today.

One issue with integer representations is what happens if the result of an operation cannot be represented. For example, suppose we add 1 to the largest signed integer (say, 1 + 0111111111111111). There are two approaches:

1. The most common is to wrap and signal an overflow, so the result is 1000000000000000, which is the most negative integer. Most high-level programming languages do not allow the programmer to determine whether an overflow has occurred, and therefore checks must be made before an operation is performed to determine whether an overflow will occur.
2. The second is referred to as saturation arithmetic, where, for example, adding one to the largest integer simply returns the largest integer. This was discussed previously with the QADD operation, and a short sketch of a saturating addition appears below.

One operation that must, however, be avoided at all costs is a division-by-zero or modulo-zero operation. Such operations will throw an interrupt that will halt the currently executing task. The Clementine lunar mission that failed, in part due to the absence of a watchdog timer, had a second peculiarity: prior to the exception that caused the processor to hang, there had previously been almost 3000 similar exceptions. See Jack Ganssle's 2002 article "Born to Fail" for further details.

In summary, fixed-length base-2 representations of positive integers and the 2's-complement representation of negative numbers are near universal. Most applications use the usual arithmetic while checking for overflow; however, saturation arithmetic may be more appropriate in critical systems where an accidental overflow may result in a disaster (as in the Ariane 5 rocket). Allowing exceptions to result from invalid integer operations has also resulted in numerous issues.
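As an illustration of the second approach, a saturating 16-bit addition can be written in software as follows; this is a stand-in for what an instruction such as QADD does in hardware, and the function name is an arbitrary choice.

#include <stdio.h>
#include <stdint.h>

/* Add two 16-bit signed integers, clamping at the ends of the range
 * instead of wrapping on overflow. */
int16_t add_saturate( int16_t a, int16_t b ) {
    int32_t sum = (int32_t) a + (int32_t) b;   /* cannot overflow in 32 bits */

    if ( sum > INT16_MAX ) return INT16_MAX;
    if ( sum < INT16_MIN ) return INT16_MIN;

    return (int16_t) sum;
}

int main( void ) {
    printf( "%d\n", add_saturate( 32767, 1 ) );     /* stays at  32767 */
    printf( "%d\n", add_saturate( -32768, -1 ) );   /* stays at -32768 */
    printf( "%d\n", add_saturate( 42, 10 ) );       /* ordinary sum: 52 */
    return 0;
}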
Floating-point representations

Real numbers are generally approximated using floating- or fixed-point representations. We say approximated because almost every real number cannot be represented exactly using any finite-length representation. Floating-point approximations usually use one of two representations specified by IEEE 754: single- and double-precision floating-point numbers, or float and double, respectively. For general applications, double-precision floating-point numbers, which occupy eight bytes, have sufficient precision for most engineering and scientific computation, while single-precision floating-point numbers occupy only four bytes, have significantly less precision, and therefore should only be used when coarse approximations are sufficient, such as in the generation of graphics. In embedded systems, however, if it can be determined that the higher precision of the double format is not necessary, use of the float format can result in significant savings in memory and run time. Most larger microcontrollers have floating-point units (FPUs) which perform floating-point operations.

Issues such as those associated with integer operations are avoided in floating-point arithmetic by the introduction of three special floating-point numbers representing infinity, negative infinity and not-a-number. These numbers result from operations such as 1.0/0.0, -1e300*1e300 and 0.0/0.0, respectively. Consequently, there will never be an exception in any floating-point operation. Note that even zero is signed, where +0 and -0 represent all positive and negative real numbers, respectively, that are too small to be represented by any other floating-point number. Therefore, 1.0/(-0.0) should result in negative infinity.
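These special values are easy to observe, assuming IEEE 754 arithmetic (which essentially every modern processor provides):

#include <stdio.h>

int main( void ) {
    double zero = 0.0;    /* a variable avoids a compile-time division-by-zero warning */

    printf( " 1.0/0.0     = %f\n",  1.0/zero );      /* inf  */
    printf( "-1e300*1e300 = %f\n", -1e300*1e300 );   /* -inf */
    printf( " 0.0/0.0     = %f\n",  zero/zero );     /* nan  */
    printf( " 1.0/(-0.0)  = %f\n",  1.0/(-zero) );   /* -inf */

    return 0;
}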

For further information on floating-point numbers, see any good text on numerical analysis.

Fixed-point representations

Fixed-point representation of real numbers is usually restricted to smaller microcontrollers that lack an FPU, often with only 24- or 16-bit registers or smaller. In a fixed-point representation, the first bit is usually the sign bit, and the radix point is arbitrarily fixed at some location within the number. Thus, if a 16-bit number comprised a sign bit, 7 bits for the integer component, and 8 bits for the fractional component, the value of π would be represented by 0000001100100100, which is the approximation 11.001001_2 = 3.140625_10 with a 0.0308 % relative error. This format can represent real numbers in the range (-128, 128).

Adding two fixed-point representations can, for the most part, be done with integer addition, but multiplication requires a little more effort: the 16-bit numbers are multiplied as 32-bit integers, and the last 8 bits of the product are then truncated (a short sketch of such a multiplication appears at the end of this section). For example,

11.00100100_2 × 11.00100100_2 = 1001.1101110100010000_2,

and thus π² is approximately equal to 1001.11011101_2 = 9.86328125_10, whereas π² = 9.86960440.... Whether or not numbers like 0111111111111111 and 1111111111111111 represent plus or minus infinity is a question that must be addressed during the design phase.

Representation of complex numbers

The most usual means of representing a complex number is to store a pair of real numbers representing the real and imaginary components of the complex number. Fortunately, Matlab allows you to work seamlessly with complex numbers. By default, the variables i and j both represent the imaginary unit, √-1, but even if you assign to these variables, you may always enter a complex number by juxtaposing the imaginary unit with the imaginary component:

>> j = 4;
>> 3.25 + 4.79j
ans =
   3.2500 + 4.7900i

Indeed, Matlab recommends using 1j instead of j for entering the imaginary unit (to avoid the possibility that your code may at some later point fail if j is accidentally assigned a value earlier in your scripts). While it is possible to store a complex number as a pair of real numbers representing its magnitude and argument, this is seldom used in practice except in special circumstances.

Summary of the representation of numbers

In this section, we have reviewed or introduced various binary representations of integers and real numbers. Each representation has its limitations, and developers of real-time systems must be aware of those limitations. We will continue with the introduction of definitions related to real-time systems.
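The fixed-point multiplication described above can be sketched in C. Rather than the sign-and-magnitude layout shown earlier, the sketch below stores values as ordinary two's-complement 16-bit integers scaled by 2^8 (a "Q7.8" format), which is the more common convention; the names are illustrative only.

#include <stdio.h>
#include <stdint.h>

/* Q7.8 fixed point: a real number x is stored as the integer round(x * 256). */
typedef int16_t q7_8;

q7_8 q_multiply( q7_8 a, q7_8 b ) {
    int32_t product = (int32_t) a * (int32_t) b;   /* 32-bit intermediate, 16 fractional bits */
    return (q7_8)( product >> 8 );                 /* drop the last 8 fractional bits */
}

int main( void ) {
    q7_8 pi = 0x0324;   /* 0000001100100100 = 11.00100100_2 = 3.140625 */
    q7_8 pi_squared = q_multiply( pi, pi );

    printf( "%f\n", pi_squared/256.0 );   /* prints 9.863281, versus pi^2 = 9.869604... */
    return 0;
}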

Summary of our introduction to numerical algorithms

This first chapter discussed the techniques that will be used in numerical algorithms, including iteration, linear algebra, interpolation, Taylor series, bracketing and weighted averages; gave a brief discussion of the sources of error; and described the representation of numbers.