
ECS231 Handout
Computer Arithmetic

I: Floating-point numbers and representations

1. A floating-point representation of a number (scientific notation) has four components: sign, significand, base, and exponent. For example, in
$$+3.1416 \times 10^{1},$$
the sign is $+$, the significand is $3.1416$, the base is $10$, and the exponent is $1$.

2. Computers use binary representation of numbers. The floating-point representation of a nonzero binary number $x$ is of the form
$$x = \pm b_0.b_1 b_2 \cdots b_{p-1} \times 2^E.  \qquad (1)$$
(a) It is normalized, i.e., $b_0 = 1$ (the hidden bit).
(b) The precision $p$ is the number of bits in the significand (mantissa), including the hidden bit.
(c) The machine epsilon $\epsilon = 2^{-(p-1)}$ is the gap between the number 1 and the smallest floating-point number that is greater than 1.
(d) The unit in the last place is $\mathrm{ulp}(x) = 2^{-(p-1)} \times 2^E = \epsilon \times 2^E$. If $x > 0$, then $\mathrm{ulp}(x)$ is the gap between $x$ and the next larger floating-point number. If $x < 0$, then $\mathrm{ulp}(x)$ is the gap between $x$ and the next smaller floating-point number (larger in absolute value).

3. Special numbers: $+0$, $-0$, $+\infty$, $-\infty$, and NaN (= "Not a Number").

4. All computers designed since 1985 use the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985): they represent each number as a binary number and use binary arithmetic. Essentials of the IEEE standard:
- consistent representation of floating-point numbers by all machines adopting the standard;
- correctly rounded floating-point operations, using various rounding modes;
- consistent treatment of exceptional situations such as division by zero.
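These definitions are easy to probe numerically. Below is a minimal MATLAB sketch; it assumes MATLAB's default IEEE double type (the format of item 6 below, with $p = 53$), whose built-ins eps and eps(x) correspond to $\epsilon$ and $\mathrm{ulp}(x)$ for positive $x$:

    % Machine epsilon for IEEE double (p = 53): the gap between 1 and
    % the next larger floating-point number
    e = eps;                      % 2^(-52), approx 2.2204e-16
    disp(1 + e > 1)               % 1 (true): 1 + eps is the next float above 1
    disp(1 + e/2 == 1)            % 1 (true): anything closer rounds back to 1

    % ulp(x) = eps * 2^E; MATLAB's eps(x) returns this gap directly
    x = 2^10;                     % here E = 10
    disp(eps(x))                  % 2^(-42), approx 2.2737e-13
    disp(eps(x) == e * 2^10)      % 1 (true)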

5. The IEEE single format is 32 bits (= 4 bytes) long:

    | s |    E     |           f           |
      1      8                23
    sign  exponent  fraction (the binary point precedes f)

It represents the number
$$(-1)^s \times (1.f) \times 2^{E-127}.$$
Note that (1) the leading 1 in the significand need not be stored explicitly, because it is always 1; this hidden bit accounts for the "1." here. (2) The exponent is stored with a bias of 127 (the stored field $E$ represents the exponent $E - 127$), which avoids the need for a separate exponent sign bit. For normalized numbers, $E_{\min} = (00000001)_2 = (1)_{10}$ and $E_{\max} = (11111110)_2 = (254)_{10}$. The range of positive normalized numbers is from
$$N_{\min} = 1.00\cdots0 \times 2^{E_{\min}-127} = 2^{-126} \approx 1.2 \times 10^{-38}$$
to
$$N_{\max} = 1.11\cdots1 \times 2^{E_{\max}-127} = (2 - 2^{-23}) \times 2^{127} \approx 2^{128} \approx 3.4 \times 10^{38}.$$
Special representations for $\pm 0$, $\pm\infty$, and NaN:

    zero  =  ± | 00000000 | 00000000000000000000000
    ±∞    =  ± | 11111111 | 00000000000000000000000
    NaN   =  ± | 11111111 | any nonzero fraction

6. The IEEE double format is 64 bits (= 8 bytes) long:

    | s |      E      |             f             |
      1      11                   52
    sign  exponent  fraction (the binary point precedes f)

It represents the number
$$(-1)^s \times (1.f) \times 2^{E-1023}.$$
The range of positive normalized numbers is from $N_{\min} = 2^{-1022} \approx 2.2 \times 10^{-308}$ to $N_{\max} = 1.11\cdots1 \times 2^{1023} \approx 2^{1024} \approx 1.8 \times 10^{308}$. There are analogous special representations for $\pm 0$, $\pm\infty$, and NaN.

7. The IEEE extended format has at least 15 bits for the exponent and at least 63 bits for the fractional part of the significand. (The Pentium has an 80-bit extended format.)

8. Precision and machine epsilon of the IEEE formats:

    Format   | Precision p | Machine epsilon ǫ = 2^(-(p-1))
    single   | 24          | ǫ = 2^(-23) ≈ 1.2 × 10^(-7)
    double   | 53          | ǫ = 2^(-52) ≈ 2.2 × 10^(-16)
    extended | 64          | ǫ = 2^(-63) ≈ 1.1 × 10^(-19)
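MATLAB exposes these format parameters through built-ins; the following small sketch assumes the IEEE single and double formats (the values in the comments are the expected outputs):

    % Normalized range and machine epsilon, double vs. single
    disp([realmin, realmax])                      % 2.2251e-308  1.7977e+308
    disp([realmin('single'), realmax('single')])  % 1.1755e-38   3.4028e+38
    disp([eps('double'), eps('single')])          % 2.2204e-16   1.1921e-07

    % Stepping past the normalized range overflows to Inf
    disp(realmax * 2)                             % Inf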

9. Rounding. Let a positive real number $x$ be in the normalized range, i.e., $N_{\min} \le x \le N_{\max}$, and write it in the normalized form
$$x = (1.b_1 b_2 \cdots b_{p-1} b_p b_{p+1} \cdots) \times 2^E.$$
Then the closest floating-point number less than or equal to $x$ is
$$x_- = (1.b_1 b_2 \cdots b_{p-1}) \times 2^E,$$
i.e., $x_-$ is obtained by truncating. The next floating-point number bigger than $x_-$ (and therefore also the next one bigger than $x$) is
$$x_+ = \big( (1.b_1 b_2 \cdots b_{p-1}) + (0.00\cdots01) \big) \times 2^E.$$
If $x$ is negative, the situation is reversed.

The correct rounding modes are:
- round down: $\mathrm{round}(x) = x_-$;
- round up: $\mathrm{round}(x) = x_+$;
- round towards zero: $\mathrm{round}(x) = x_-$ if $x \ge 0$, and $\mathrm{round}(x) = x_+$ if $x \le 0$;
- round to nearest: $\mathrm{round}(x) = x_-$ or $x_+$, whichever is nearer to $x$, except that if $x > N_{\max}$ then $\mathrm{round}(x) = +\infty$, and if $x < -N_{\max}$ then $\mathrm{round}(x) = -\infty$. In the case of a tie, i.e., when $x_-$ and $x_+$ are the same distance from $x$, the one whose least significant bit equals zero is chosen.

When round to nearest (the IEEE default rounding mode) is in effect,
$$\mathrm{abserr}(x) = |\mathrm{round}(x) - x| \le \tfrac{1}{2}\,\mathrm{ulp}(x)$$
and
$$\mathrm{relerr}(x) = \frac{|\mathrm{round}(x) - x|}{|x|} \le \tfrac{1}{2}\epsilon.$$
Therefore the maximum relative representation error is $\tfrac{1}{2}\epsilon$:
- single: $\tfrac{1}{2} \cdot 2^{-23} = 2^{-24} \approx 5.96 \times 10^{-8}$;
- double: $\tfrac{1}{2} \cdot 2^{-52} = 2^{-53} \approx 1.11 \times 10^{-16}$.

II: Floating-point arithmetic

1. IEEE rules for correctly rounded floating-point operations: if $x$ and $y$ are floating-point numbers, then
$$\mathrm{fl}(x+y) = \mathrm{round}(x+y) = (x+y)(1+\delta),$$
$$\mathrm{fl}(x-y) = \mathrm{round}(x-y) = (x-y)(1+\delta),$$
$$\mathrm{fl}(x \times y) = \mathrm{round}(x \times y) = (x \times y)(1+\delta),$$
$$\mathrm{fl}(x/y) = \mathrm{round}(x/y) = (x/y)(1+\delta),$$
where $|\delta| \le \tfrac{1}{2}\epsilon$ for round to nearest. The IEEE standard also requires that correctly rounded remainder and square root operations be provided.
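Round to nearest explains why familiar decimal identities fail in binary arithmetic. A quick MATLAB illustration, assuming IEEE double precision (the printed values in the comments are approximate):

    % 0.1, 0.2, 0.3 each round to the nearest double, so an identity that
    % holds in exact arithmetic can be off by about an ulp
    disp(0.1 + 0.2 == 0.3)        % 0 (false)
    disp(0.1 + 0.2 - 0.3)         % approx 5.55e-17

    % The stored value of 0.1 differs from 1/10 by less than (1/2)*ulp(0.1)
    fprintf('%.20f\n', 0.1)       % 0.10000000000000000555...
    disp(eps(0.1)/2)              % approx 6.94e-18, the half-ulp bound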

2. IEEE standard responses to exceptions:

    Event              | Example                     | Result set to
    invalid operation  | 0/0, 0 × ∞                  | NaN
    division by zero   | (finite nonzero)/0          | ±∞
    overflow           | |x| > N_max                 | ±∞ or ±N_max
    underflow          | x ≠ 0, |x| < N_min          | ±0, ±N_min, or a subnormal number
    inexact            | whenever fl(x ∘ y) ≠ x ∘ y  | the correctly rounded value

3. Let $\hat{x}$ and $\hat{y}$ be floating-point numbers with $\hat{x} = x(1+\tau_1)$ and $\hat{y} = y(1+\tau_2)$, where $|\tau_i| \le \tau \ll 1$; the $\tau_i$ could be the relative errors made in collecting the data from the original source, or left over from previous operations. Question: how do the four basic arithmetic operations behave on such data?

4. Addition and subtraction:
$$\mathrm{fl}(\hat{x}+\hat{y}) = (\hat{x}+\hat{y})(1+\delta), \qquad |\delta| \le \tfrac{1}{2}\epsilon$$
$$= x(1+\tau_1)(1+\delta) + y(1+\tau_2)(1+\delta)$$
$$= x + y + x(\tau_1+\delta+O(\tau\epsilon)) + y(\tau_2+\delta+O(\tau\epsilon))$$
$$= (x+y)\left(1 + \frac{x}{x+y}(\tau_1+\delta+O(\tau\epsilon)) + \frac{y}{x+y}(\tau_2+\delta+O(\tau\epsilon))\right)$$
$$\equiv (x+y)(1+\hat{\delta}),$$
where $\hat{\delta}$ can be bounded as follows:
$$|\hat{\delta}| \le \frac{|x|+|y|}{|x+y|}\left(\tau + \tfrac{1}{2}\epsilon + O(\tau\epsilon)\right).$$
There are three possible cases:
(a) If $x$ and $y$ have the same sign, i.e., $xy > 0$, then $|x+y| = |x|+|y|$; this implies
$$|\hat{\delta}| \le \tau + \tfrac{1}{2}\epsilon + O(\tau\epsilon) \ll 1.$$
Thus $\mathrm{fl}(\hat{x}+\hat{y})$ approximates $x+y$ well.
(b) If $x \approx -y$, so that $x+y \approx 0$, then $(|x|+|y|)/|x+y| \gg 1$; this implies that $|\hat{\delta}|$ could be nearly 1 or much bigger. Thus $\mathrm{fl}(\hat{x}+\hat{y})$ may turn out to have nothing to do with the true $x+y$. This is the so-called catastrophic cancellation, which happens when a floating-point number is subtracted from another nearly equal floating-point number. Cancellation causes relative errors or uncertainties already present in $\hat{x}$ and $\hat{y}$ to be magnified.
(c) In general, if $(|x|+|y|)/|x+y|$ is not too big, $\mathrm{fl}(\hat{x}+\hat{y})$ provides a good approximation to $x+y$.
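Case (b) is easy to trigger. In the MATLAB sketch below (IEEE double assumed), the sum $1 + t$ is rounded with a relative error of order $\epsilon$, and the subsequent subtraction of the nearly equal number 1 magnifies that tiny error into a relative error of roughly 11 percent:

    % Catastrophic cancellation magnifying a prior rounding error
    t = 1e-15;
    x = (1 + t) - 1;          % the subtraction is exact; the damage was
                              % already done when fl(1 + t) was rounded
    disp(x)                   % approx 1.1102e-15, not 1e-15
    disp(abs(x - t)/t)        % approx 0.11, enormous compared with eps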

5. Examples of catastrophic cancellation.

Example 1. Computing $\sqrt{n+1} - \sqrt{n}$ straightforwardly causes a substantial loss of significant digits for large $n$ (a MATLAB sketch that regenerates the tables below appears at the end of Part II):

    n        | fl(sqrt(n+1))           | fl(sqrt(n))             | fl(fl(sqrt(n+1)) - fl(sqrt(n)))
    1.00e+10 | 1.00000000004999994e+05 | 1.00000000000000000e+05 | 4.99999441672116518e-06
    1.00e+11 | 3.16227766018419061e+05 | 3.16227766016837908e+05 | 1.58115290105342865e-06
    1.00e+12 | 1.00000000000050000e+06 | 1.00000000000000000e+06 | 5.00003807246685028e-07
    1.00e+13 | 3.16227766016853740e+06 | 3.16227766016837955e+06 | 1.57859176397323608e-07
    1.00e+14 | 1.00000000000000503e+07 | 1.00000000000000000e+07 | 5.02914190292358398e-08
    1.00e+15 | 3.16227766016838104e+07 | 3.16227766016837917e+07 | 1.86264514923095703e-08
    1.00e+16 | 1.00000000000000000e+08 | 1.00000000000000000e+08 | 0.00000000000000000e+00

Catastrophic cancellation can sometimes be avoided if a formula is properly reformulated. In the present case, one can compute $\sqrt{n+1} - \sqrt{n}$ almost to full precision by using the identity
$$\sqrt{n+1} - \sqrt{n} = \frac{1}{\sqrt{n+1} + \sqrt{n}}.$$
Consequently, the computed results are:

    n        | fl(1/(sqrt(n+1) + sqrt(n)))
    1.00e+10 | 4.999999999875000e-06
    1.00e+11 | 1.581138830080237e-06
    1.00e+12 | 4.999999999998749e-07
    1.00e+13 | 1.581138830084150e-07
    1.00e+14 | 4.999999999999987e-08
    1.00e+15 | 1.581138830084189e-08
    1.00e+16 | 5.000000000000000e-09

In fact, one can show that
$$\mathrm{fl}\left(\frac{1}{\sqrt{n+1}+\sqrt{n}}\right) = (\sqrt{n+1} - \sqrt{n})(1+\delta), \qquad |\delta| \le 5\epsilon + O(\epsilon^2)$$
(try it!).

Example 2. Consider the function
$$f(x) = \frac{1-\cos x}{x^2} = \frac{1}{2}\left(\frac{\sin(x/2)}{x/2}\right)^2.$$
Note that $0 \le f(x) < 1/2$ for all $x \ne 0$. Compare the computed values for $x = 1.2 \times 10^{-5}$ using the two expressions above (assume that the value of $\cos x$ is rounded to 10 significant figures).

6. Multiplication and division:
$$\mathrm{fl}(\hat{x} \times \hat{y}) = (\hat{x}\hat{y})(1+\delta) = xy(1+\tau_1)(1+\tau_2)(1+\delta) \equiv xy(1+\hat{\delta}'),$$
$$\mathrm{fl}(\hat{x}/\hat{y}) = (\hat{x}/\hat{y})(1+\delta) = (x/y)(1+\tau_1)(1+\tau_2)^{-1}(1+\delta) \equiv (x/y)(1+\hat{\delta}''),$$
where
$$\hat{\delta}' = \tau_1 + \tau_2 + \delta + O(\tau\epsilon), \qquad \hat{\delta}'' = \tau_1 - \tau_2 + \delta + O(\tau\epsilon).$$
Thus
$$|\hat{\delta}'| \le 2\tau + \tfrac{1}{2}\epsilon + O(\tau\epsilon), \qquad |\hat{\delta}''| \le 2\tau + \tfrac{1}{2}\epsilon + O(\tau\epsilon).$$
Therefore multiplication and division are very well behaved!
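As promised in Example 1, here is a minimal MATLAB sketch that regenerates both of its tables in IEEE double precision:

    % Naive sqrt(n+1) - sqrt(n) versus the reformulation 1/(sqrt(n+1)+sqrt(n))
    for n = 10.^(10:16)
        naive  = sqrt(n+1) - sqrt(n);
        stable = 1/(sqrt(n+1) + sqrt(n));
        fprintf('%8.2e  %.17e  %.15e\n', n, naive, stable);
    end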

III: Floating-point error analysis

1. Forward and backward error analysis. We illustrate the basic idea through a simple example. Consider the computation of the inner product of two vectors $x, y \in \mathbb{R}^3$,
$$x^T y \stackrel{\mathrm{def}}{=} x_1 y_1 + x_2 y_2 + x_3 y_3,$$
assuming the $x_i$ and $y_j$ are already floating-point numbers. Most likely, $\mathrm{fl}(x^T y)$ is computed in the following order:
$$\mathrm{fl}(x^T y) = \mathrm{fl}(\mathrm{fl}(\mathrm{fl}(x_1 y_1) + \mathrm{fl}(x_2 y_2)) + \mathrm{fl}(x_3 y_3)).$$
Adopting the floating-point arithmetic model, we have
$$\mathrm{fl}(x^T y) = \mathrm{fl}(\mathrm{fl}(x_1 y_1 (1+\epsilon_1) + x_2 y_2 (1+\epsilon_2)) + x_3 y_3 (1+\epsilon_3))$$
$$= \mathrm{fl}((x_1 y_1 (1+\epsilon_1) + x_2 y_2 (1+\epsilon_2))(1+\delta_1) + x_3 y_3 (1+\epsilon_3))$$
$$= ((x_1 y_1 (1+\epsilon_1) + x_2 y_2 (1+\epsilon_2))(1+\delta_1) + x_3 y_3 (1+\epsilon_3))(1+\delta_2)$$
$$= x_1 y_1 (1+\epsilon_1)(1+\delta_1)(1+\delta_2) + x_2 y_2 (1+\epsilon_2)(1+\delta_1)(1+\delta_2) + x_3 y_3 (1+\epsilon_3)(1+\delta_2),$$
where $|\epsilon_i| \le \tfrac{1}{2}\epsilon$ and $|\delta_j| \le \tfrac{1}{2}\epsilon$.

Now there are two ways to interpret the error in the computed $\mathrm{fl}(x^T y)$:

(a) We have $\mathrm{fl}(x^T y) = x^T y + E$, where
$$E = x_1 y_1 (\epsilon_1+\delta_1+\delta_2) + x_2 y_2 (\epsilon_2+\delta_1+\delta_2) + x_3 y_3 (\epsilon_3+\delta_2) + O(\epsilon^2).$$
It implies that
$$|E| \le \tfrac{1}{2}\epsilon \left(3|x_1 y_1| + 3|x_2 y_2| + 2|x_3 y_3|\right) + O(\epsilon^2) \le \tfrac{3}{2}\epsilon\, |x|^T |y| + O(\epsilon^2).$$
This bound on $|E|$ gives the worst-case difference between the exact $x^T y$ and its computed value. Such an error analysis is called forward error analysis.

(b) We can also write
$$\mathrm{fl}(x^T y) = \hat{x}^T \hat{y} = (x+\Delta x)^T (y+\Delta y),$$
where
$$\hat{x}_1 = x_1(1+\epsilon_1), \qquad \hat{y}_1 = y_1(1+\delta_1)(1+\delta_2) \equiv y_1(1+\hat{\delta}_1),$$
$$\hat{x}_2 = x_2(1+\epsilon_2), \qquad \hat{y}_2 = y_2(1+\delta_1)(1+\delta_2) \equiv y_2(1+\hat{\delta}_2),$$
$$\hat{x}_3 = x_3(1+\epsilon_3), \qquad \hat{y}_3 = y_3(1+\delta_2) \equiv y_3(1+\hat{\delta}_3).$$
It can be seen that $\hat{\delta}_1 = \hat{\delta}_2$ with $|\hat{\delta}_1| \le \epsilon + O(\epsilon^2)$, and $|\hat{\delta}_3| \le \tfrac{1}{2}\epsilon$. This says that the computed value $\mathrm{fl}(x^T y)$ is the exact inner product of slightly perturbed vectors $\hat{x}$ and $\hat{y}$. Such an error analysis is called backward error analysis.

Remark: there are many ways to distribute the factors $(1+\epsilon_i)$ and $(1+\delta_j)$ among the $x_i$ and $y_j$; in this case it is even possible to perturb only one vector, making either $\hat{x} = x$ or $\hat{y} = y$.
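The forward bound in (a) can be observed experimentally. The sketch below (the test vectors are my own choice) evaluates a 3-term inner product in IEEE single precision, treats the double-precision result as effectively exact, and checks the error against the bound $\tfrac{3}{2}\epsilon\, |x|^T |y|$ with $\epsilon$ = eps('single'):

    % Forward error of a 3-term inner product computed in single precision
    x = single([0.1; 0.2; 0.3]);
    y = single([0.4; 0.5; 0.6]);
    computed = x(1)*y(1) + x(2)*y(2) + x(3)*y(3);   % all in single
    exact    = double(x)' * double(y);              % reference in double
    err      = abs(double(computed) - exact);
    bound    = 1.5 * eps('single') * (abs(double(x))' * abs(double(y)));
    fprintf('error %.3e <= bound %.3e\n', err, bound)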

IV: Exercises

1. The polynomial $p_1(x) = (x-1)^6$ has the value zero at $x = 1$ and is positive elsewhere. The expanded form of the polynomial,
$$p_2(x) = x^6 - 6x^5 + 15x^4 - 20x^3 + 15x^2 - 6x + 1,$$
is mathematically equivalent. Plot $p_1(x)$ and $p_2(x)$ at 101 equally spaced points in the interval $[0.995, 1.005]$. Explain the plots. (You should evaluate $p_2(x)$ by Horner's rule; a minimal version is sketched after exercise 3.)

2. The power series for $\sin x$ is
$$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots.$$
Here is a MATLAB function that uses this series to compute $\sin x$:

    function s = powersin(x)
    % POWERSIN(x) tries to compute sin(x) from a power series
    s = 0;
    t = x;
    n = 1;
    while s + t ~= s
        s = s + t;
        t = -x.^2/((n+1)*(n+2)).*t;
        n = n + 2;
    end

(a) What causes the while loop to terminate?
(b) Answer each of the following questions for $x = \pi/2$, $11\pi/2$, $21\pi/2$, and $31\pi/2$: How accurate is the computed result? How many terms are required? What is the largest term (i.e., the largest value of t) in the series?
(c) What do you conclude about the use of floating-point arithmetic and power series to evaluate functions?

3. Show how to rewrite the following expressions to avoid catastrophic cancellation for the indicated arguments:
(a) $\sqrt{x+1} - 1$, $x \approx 0$;
(b) $\sin x - \sin y$, $x \approx y$;
(c) $x^2 - y^2$, $x \approx y$;
(d) $(1-\cos x)/\sin x$, $x \approx 0$;
(e) $c = (a^2 + b^2 - 2ab\cos\theta)^{1/2}$, $a \approx b$, $|\theta| \ll 1$.
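For exercise 1, Horner's rule evaluates a polynomial by nested multiplications. Below is a minimal MATLAB sketch; the function name horner and the highest-degree-first coefficient ordering are my choices (MATLAB's built-in polyval follows the same convention):

    function p = horner(c, x)
    % HORNER evaluates c(1)*x^n + c(2)*x^(n-1) + ... + c(n+1)
    % by Horner's rule; x may be a scalar or an array.
    p = c(1) * ones(size(x));
    for k = 2:length(c)
        p = p .* x + c(k);
    end

For example, the coefficients of $p_2$ above are [1 -6 15 -20 15 -6 1].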

4. Consider the function
$$f(x) = \frac{e^x - 1}{x},$$
which arises in various applications. By L'Hôpital's rule, we know that
$$\lim_{x \to 0} f(x) = 1.$$
(a) Compute the values of $f(x)$ for $x = 10^{-n}$, $n = 1, 2, \ldots, 16$. Do your results agree with the theoretical expectation? Explain why.
(b) Perform the computation in part (a) again, this time using the mathematically equivalent formulation
$$f(x) = \frac{e^x - 1}{\log(e^x)}$$
(evaluated as indicated, with no simplification). If this works any better, can you explain why?

V: Further reading

1. The following article, based on lecture notes of Prof. W. Kahan of the University of California at Berkeley, provides an excellent review of IEEE floating-point arithmetic:
D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5-48, 1991.

2. The following book gives a broad overview of numerical computing, with special focus on the IEEE standard for binary floating-point arithmetic:
M. Overton. Numerical Computing with IEEE Floating Point Arithmetic. SIAM, Philadelphia, 2001. ISBN 0-89871-482-6. (Student price $20.00 directly from www.siam.org.)

3. A classical book on error analysis, where the notion of backward error analysis was invented, is
J. H. Wilkinson. Rounding Errors in Algebraic Processes. Prentice-Hall, Englewood Cliffs, NJ, 1964. Reprinted by Dover, New York, 1994.

4. A contemporary treatment of error analysis and its applications to numerical analysis is
N. J. Higham. Accuracy and Stability of Numerical Algorithms. Second edition, SIAM, Philadelphia, 2002.

5. Websites discussing numerical disasters:
T. Huckle, Collection of software bugs: http://www5.in.tum.de/~huckle/bugse.html
K. Vuik, Some disasters caused by numerical errors: http://ta.twi.tudelft.nl/nw/users/vuik/wi211/disasters.html
D. Arnold, Some disasters attributable to bad numerical computing: http://www.ima.umn.edu/~arnold/disasters/disasters.html