Floating-point operations I


1 Floating-point operations I
The science of floating-point arithmetic
IEEE standard
Reference: D. Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Computing Surveys, 1991
Chih-Jen Lin (National Taiwan Univ.) Floating Point Operations 1 / 87

2 Why learn more about floating-point operations I
Example: a one-variable problem
    $\min_x f(x)$ subject to $x \ge 0$
In your program, should you set an upper bound on x?
x in your program may be wrongly increased to $\infty$
What is the largest representable number in the computer?

3 Why learn more about floating-point operations II
Is there anything called infinity?
Example: a ten-variable problem
    $\min_x f(x)$ subject to $0 \le x_i$, $i = 1, \ldots, 10$
After the problem is solved, we want to know how many of the $x_i$ are zero. Should you use

4 Why learn more about floating-point operations III
    for (i = 0; i < 10; i++)
        if (x[i] == 0) count++;
People say: don't do floating-point comparisons. Use instead
    epsilon = 1.0e-12;
    for (i = 0; i < 10; i++)
        if (x[i] <= epsilon) count++;
How do you choose $\epsilon$?
Is this true?
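The two counting strategies on this slide can be put side by side as below. This is only a sketch: the function names `count_exact_zeros` and `count_near_zeros` are made up for illustration, and the tolerance `eps` is an assumed, problem-dependent input rather than a recommended value.

```c
#include <stddef.h>
#include <math.h>

/* Count entries that compare equal to zero (both +0.0 and -0.0 do). */
size_t count_exact_zeros(const double *x, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (x[i] == 0.0) count++;
    return count;
}

/* Count entries whose magnitude is at most eps.
   eps is an assumption supplied by the caller; the slides leave its choice open. */
size_t count_near_zeros(const double *x, size_t n, double eps) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (fabs(x[i]) <= eps) count++;
    return count;
}
```

Which of the two is appropriate depends on whether the solver sets variables to exactly zero or only drives them close to zero; the rest of the lecture gives the background needed to decide.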

5 Floating-point Formats I
We know float (single): 4 bytes, double: 8 bytes. Why?
A floating-point system: base $\beta$, precision $p$, significand (mantissa) $d.d \cdots d$
Example: $0.1 = 1.00 \times 10^{-1}$ ($\beta = 10$, $p = 3$) $\approx 1.1001 \times 2^{-4}$ ($\beta = 2$, $p = 5$)
exponent: $-1$ and $-4$
Largest exponent $e_{\max}$, smallest $e_{\min}$

6 Floating-point Formats II
$\beta^p$ possible significands, $e_{\max} - e_{\min} + 1$ possible exponents
$\lceil \log_2(e_{\max} - e_{\min} + 1) \rceil + \lceil \log_2(\beta^p) \rceil + 1$ bits for storing a number
1 bit for $\pm$
But the practical setting is more complicated
See the discussion of IEEE standard later
Normalized: $1.00 \times 10^{-1}$ (yes), $0.01 \times 10^{1}$ (no)
The normalized representation is now the most used, but it cannot represent zero

7 Floating-point Formats III
A natural way for 0: $1.0 \times \beta^{e_{\min} - 1}$
preserves the ordering
Will use $p = 3$, $\beta = 10$ for most later explanation

8 Relative Errors and Ulps I
When $\beta = 10$, $p = 3$: 3.14159 is represented as $3.14 \times 10^0$
error $= 0.00159 = 0.159 \times 10^{-2}$, i.e. 0.159 units in the last place
$10^{-2}$: unit of the last place
ulps: units in the last place
relative error: $0.00159 / 3.14159 \approx 0.0005$
For a number $d.d \cdots d \times \beta^e$, the largest error is $0.\underbrace{0 \cdots 0}_{p-1}\beta' \times \beta^e$, $\beta' = \beta/2$

9 Relative Errors and Ulps II
Error $= \frac{\beta}{2} \beta^{-p} \times \beta^e$
$1 \times \beta^e \le$ value $< \beta \times \beta^e$
relative error between $\frac{\beta}{2} \beta^{-p} \beta^e / \beta^e$ and $\frac{\beta}{2} \beta^{-p} \beta^e / \beta^{e+1}$
relative error $\le \frac{\beta}{2} \beta^{-p}$   (1)
$\frac{\beta}{2} \beta^{-p} = \beta^{-p+1}/2$: machine epsilon $\epsilon$, the bound in (1)
When a number is rounded to the closest, relative error is bounded by $\epsilon$

10 ulps and $\epsilon$ I
$p = 3$, $\beta = 10$
Example: $x = 12.35$, $\tilde{x} = 1.24 \times 10^1$
error $= 0.05$, ulps $= 0.01 \times 10^1 = 0.1$, $\epsilon = \frac{1}{2} \times 10^{-2} = 0.005$
error $= 0.5$ ulps
relative error $0.05/12.35 \approx 0.004 = 0.8\epsilon$
$8x = 98.8$, $8\tilde{x} = 9.92 \times 10^1$
error $= 4.0$ ulps
relative error $= 0.4/98.8 \approx 0.004 = 0.8\epsilon$
ulps and $\epsilon$ may be used interchangeably (they differ by at most a factor of $\beta$)

11 Guard Digits I
$p = 3$, $\beta = 10$
Calculate $2.15 \times 10^{12} - 1.25 \times 10^{-5}$:
Compute and then round
    $x = 2.15 \times 10^{12}$
    $y = 0.0000000000000000125 \times 10^{12}$
    $x - y = 2.1499999999999999875 \times 10^{12}$
round to $2.15 \times 10^{12}$

12 Guard Digits II
Round and then compute
    $x = 2.15 \times 10^{12}$
    $y = 0.00 \times 10^{12}$
    $x - y = 2.15 \times 10^{12}$
Answer is the same
OK as $x \approx x - y$
Another example: $10.1 - 9.93 = 0.17$

13 Guard Digits III
Round and then compute
    $1.01 \times 10^1 - 0.99 \times 10^1 = 0.02 \times 10^1 = 0.2$
error $= 0.2 - 0.17 = 0.03$
ulps $= 0.01 \times 10^{-1} = 10^{-3}$
error $= 0.03 = 30$ ulps
Relative error $= 0.03/0.17 = 3/17$. The error is quite large

14 Guard Digits IV
Compute and round
    $10.1 - 9.93 = 0.17 = 1.70 \times 10^{-1}$, error $= 0$
The problem: hardware cannot compute exactly and then round
How big can the error be? (if round and then compute)
Theorem: Using $p$ digits with base $\beta$, the relative error can be as large as $\beta - 1$
Proof:

15 Guard Digits V
$x = 1.0 \cdots 0$, $y = .\eta \cdots \eta$, $\eta = \beta - 1$ ($p$ digits)
$x - y = \beta^{-p}$, computed solution $= \beta^{-p+1}$
Relative error $= |\beta^{-p} - \beta^{-p+1}| / \beta^{-p} = \beta - 1$
Example: $p = 3$, $\beta = 10$
$x = 1.00$, $y = 0.999$, $x - y = 0.001 = 10^{-3}$
Computed solution $= 1.00 - 0.99 = 0.01$
Relative error $= |0.001 - 0.01| / 0.001 = 9$
Such large errors occur if $x$ and $y$ are close
Single guard digit

16 Guard Digits VI
$p$ increased by 1 in the device for addition and subtraction
Round and then compute
    $1.010 \times 10^1 - 0.993 \times 10^1 = 0.017 \times 10^1$
Note $0.017 \times 10^1 = 1.70 \times 10^{-1}$ can be stored with $p = 3$
One additional digit for subtraction. All values still stored using $p = 3$
So in the device for subtraction, we should put additional digits

17 Guard Digits VII
Another example: $110 - 8.59$
    $1.100 \times 10^2 - 0.085 \times 10^2 = 1.015 \times 10^2$, rounded to $1.02 \times 10^2$
Correct answer: 101.41
Relative error around 0.006

18 Guard Digits VIII
Theorem: $\epsilon = \frac{1}{2} \beta^{-p+1}$ ($= 0.005$ for $\beta = 10$, $p = 3$)
Using $p + 1$ digits for $x - y$, the relative rounding error is $< 2\epsilon$ ($\epsilon$: machine epsilon)
Proof: Assume $x > y$
Assume $x = x_0.x_1 \cdots x_{p-1} \times \beta^0$ (why?)
If $y = y_0.y_1 \cdots y_{p-1}$: no error
If $y = 0.y_1 \cdots y_p$: 1 guard digit, exact $x - y$

19 Guard Digits IX
rounded to a closest number, relative error $\le \epsilon$
In general $y = 0.0 \cdots 0 y_{k+1} \cdots y_{k+p}$
$\bar{y}$: $y$ truncated to $p + 1$ digits
$y - \bar{y} < (\beta - 1)(\beta^{-p-1} + \beta^{-p-2} + \cdots + \beta^{-p-k})$
$-p - 1$: we have $p + 1$ digits now
(Think about $p = 3$, $\beta = 10$: the first truncated digit has weight $10^{-4}$)
$x - \bar{y}$ is rounded to $x - \bar{y} + \delta$

20 Guard Digits X
$|\delta| \le (\beta/2) \beta^{-p} = \epsilon$
error: $(x - y) - (x - \bar{y} + \delta) = \bar{y} - y - \delta$
case 1: if $x - y \ge 1$,
relative error $= \frac{|\bar{y} - y - \delta|}{x - y} \le \frac{|\bar{y} - y| + |\delta|}{1} \le \beta^{-p}[(\beta - 1)(\beta^{-1} + \cdots + \beta^{-k}) + \beta/2] < \beta^{-p}(1 + \beta/2) \le 2\epsilon$
case 2: $x - \bar{y} < 1$: enough digits, $\delta = 0$

21 Guard Digits XI
the smallest $x - y$: (smallest $x$ minus largest $y$)
    $1.0 - 0.0 \cdots 0 \rho \cdots \rho > (\beta - 1)(\beta^{-1} + \cdots + \beta^{-k})$
$k$ zeros, $p$ digits $\rho$, $\rho = \beta - 1$
the relative error
    $\frac{|\bar{y} - y - \delta|}{(\beta - 1)(\beta^{-1} + \cdots + \beta^{-k})} < \frac{(\beta - 1)\beta^{-p}(\beta^{-1} + \cdots + \beta^{-k})}{(\beta - 1)(\beta^{-1} + \cdots + \beta^{-k})} = \beta^{-p} < 2\epsilon$
case 3: $x - y < 1$ but $x - \bar{y} \ge 1$
If $x - \bar{y} = 1$, then $\delta = 0$: use case 2

22 Guard Digits XII
If $x - \bar{y} > 1$, then $x - y \ge 1$: a contradiction
Why $x - y$ must be $\ge 1$: $y - \bar{y} < \beta^{-p}$
Conclusion: adding some guard digits can reduce the error
Especially when subtracting two nearby numbers
Cost: the adder is one bit wider (cheap)
Most modern computers have guard digits

23 Cancellation I
Catastrophic cancellation and benign cancellation
Catastrophic cancellation: $b = 3.34$, $a = 1.22$, $c = 2.28$, $b^2 - 4ac = 0.0292$
$b^2 \approx 11.2$, $4ac \approx 11.1$
answer $= 0.1$
error $= 0.1 - 0.0292 = 0.0708$
answer $= 0.1 = 1.00 \times 10^{-1}$, ulps $= 0.01 \times 10^{-1} = 10^{-3}$
error $\approx 70$ ulps
Happens when subtracting two close numbers

24 Cancellation II
Benign cancellation: subtracting exactly known numbers; by guard digits, small relative error
In the example, $b^2$ and $4ac$ already contain errors
Avoid catastrophic cancellation by rearranging the formula
Example:
    $\frac{-b + \sqrt{b^2 - 4ac}}{2a}$   (2)
If $b^2 \gg 4ac$: no cancellation when calculating $b^2 - 4ac$, and $\sqrt{b^2 - 4ac} \approx |b|$

25 Cancellation III
$-b + \sqrt{b^2 - 4ac}$ has a catastrophic cancellation if $b > 0$
Multiplying numerator and denominator by $-b - \sqrt{b^2 - 4ac}$, if $b > 0$:
    $\frac{2c}{-b - \sqrt{b^2 - 4ac}}$   (3)
Use (2) if $b < 0$, (3) if $b > 0$
Difficult to remove all catastrophic cancellations, but possible to remove most by reformulations
Another example: $x^2 - y^2$
Assume $x \approx y$

26 Cancellation IV
$(x - y)(x + y)$ is better than $x^2 - y^2$
$x^2$, $y^2$ may be rounded, so $x^2 - y^2$ may be a catastrophic cancellation
$x - y$: accurate by guard digit
A catastrophic cancellation is replaced by a benign cancellation
Of course $x$, $y$ may have been rounded, and then $x - y$ is still a catastrophic cancellation
Again, difficult to remove all catastrophic cancellations, but possible to remove some

27 Cancellation V
Calculating the area of a triangle
    $A = \sqrt{s(s - a)(s - b)(s - c)}$, $s = \frac{a + b + c}{2}$   (4)
$a, b, c$: lengths of the three edges
If $a \approx b + c$, then $s = (a + b + c)/2 \approx a$, and $s - a$ may have a catastrophic error
Example: $a = 9.00$, $b = c = 4.53$
$s = 9.03$, $A \approx 2.342$
Computed solution: $A = 3.04$, error $\approx 0.7$
ulps $= 0.01$, error $= 70$ ulps

28 Cancellation VI
A new formulation by Kahan [1986], $a \ge b \ge c$:
    $A = \frac{\sqrt{(a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))}}{4}$   (5)
$A \approx 2.35$, close to 2.342
HW 1-1: Calculate $A = 3.04$ using (4) and $A = 2.35$ using (5)
Conclusion: sometimes a formula can be rewritten to have higher accuracy using benign cancellation

29 Cancellation VII
Only works if a guard digit is used; most computers use guard digits now
But reformulation is difficult!!
You may think that you will never need to do this
Two real cases:
Line of tron.cpp of LIBLINEAR http: //
HW 1-2: Check Eq. (13) of the paper logistic.pdf

30 Cancellation VIII
and explain how we avoid catastrophic cancellations
Probability outputs of LIBSVM
HW 1-3: Repeat the experiment on page 5, line 12 of the paper plattprob.pdf
Discuss what you found

31 Exactly Rounded Operations I
Round then calculate may not be very accurate
Exactly rounded: compute exactly, then round to the nearest
usually more accurate
The definition of rounding: 12.5 to 12 or 13?
Rounding up: 0, 1, 2, 3, 4 down; 5, 6, 7, 8, 9 up
Rounding even: on a tie (a trailing 5), round so that the last kept digit is even, i.e. up if the previous digit is odd, down otherwise
50% probability up, 50% down

32 Exactly Rounded Operations II
Reiser and Knuth [1975] show that rounding even may be better
Theorem: Let $x_0 = x$, $x_1 = (x_0 \ominus y) \oplus y$, ..., $x_n = (x_{n-1} \ominus y) \oplus y$. If $\oplus$ and $\ominus$ are exactly rounded using round to even, then $x_n = x$ for all $n$, or $x_n = x_1$ for all $n \ge 1$.
$x \ominus y$: computed solution of $x - y$
Consider rounding up, $\beta = 10$, $p = 3$, $x = 1.00$, $y = -0.555$

33 Exactly Rounded Operations III
Rounding up:
$x - y = 1.555$, $x \ominus y = 1.56$, $(x \ominus y) + y = 1.56 - 0.555 = 1.005$, $x_1 = (x \ominus y) \oplus y = 1.01$
$x_1 - y = 1.565$, $x_1 \ominus y = 1.57$, $(x_1 \ominus y) + y = 1.57 - 0.555 = 1.015$, $x_2 = (x_1 \ominus y) \oplus y = 1.02$
Increased by 0.01 until $x_n = 9.45$
Round even:
$x - y = 1.555$, $x \ominus y = 1.56$, $(x \ominus y) + y = 1.56 - 0.555 = 1.005$, $x_1 = (x \ominus y) \oplus y = 1.00$
so $x_1 = x$, and $x_n = 1.00$ for all $n$
How to implement exactly rounded operations

34 Exactly Rounded Operations IV
One can use an array of words or floating-point numbers
But you don't have an infinite amount of space
Goldberg [1990] showed that using 3 guard digits (two guard digits and a sticky bit) the result is the same as using exactly rounded operations

35 IEEE standard I
IEEE 754 was developed during the 80s, now standard everywhere
Two IEEE standards:
754: specifies $\beta = 2$, $p = 24$ for single, $\beta = 2$, $p = 53$ for double
854: $\beta = 2$ or 10, does not specify how floating-point numbers are encoded into bits
Why IEEE 854 allows $\beta = 2$ or 10 but not other numbers:
10 is the base we use
smaller $\beta$ causes smaller relative error

36 IEEE standard II
smaller $\beta$: more precision
e.g. $\beta = 16$, $p = 1$ vs. $\beta = 2$, $p = 4$: 4 bits for significand
$\epsilon = 16^{1-1}/2 = 1/2$ vs. $\epsilon = 2^{1-4}/2 = 1/16$
Why does IBM/370 use $\beta = 16$? Two possible reasons:
a number: 4 bytes = 32 bits
$\beta = 16$, $p = 6$, significand: $4 \times 6 = 24$ bits, exponent: $32 - 24 - 1 = 7$ bits (1 bit for sign), $16^{-2^6}$ to $16^{2^6} = 2^{2^8}$
for $\beta = 2$: 9 bits ($-2^8$ to $2^8$, $2^9$ values) for exponents, $32 - 9 - 1 = 22$ for significand
The same exponent range, smaller significand (24 vs. 22)

37 IEEE standard III
Shifting: with $\beta = 16$, exponents are adjusted less frequently when adding or subtracting two numbers
For modern computers, this saving is not important
Single precision: $\beta = 2$, $p = 24$ (23 bits as normalized), exponent 8, 1 bit for sign ($32 = 23 + 8 + 1$)
An example: the leading 1 of the significand $1.f$ is not stored (normalized)
Biased exponent (described later in detail)

38 IEEE standard IV
Stored exponent field $= 134$, actual exponent $134 - 127 = 7$
A summary
    IEEE             Fortran   C            Bits    Exp.    Mantissa
    Single           REAL*4    float        32      8       24
    Single-extended                         >= 43   >= 11   >= 32
    Double           REAL*8    double       64      11      53
    Double-extended  REAL*10   long double  >= 79   >= 15   >= 64
$32 = 1 + 8 + 23$ and $64 = 1 + 11 + 52$, but double-extended usually takes $1 + 15 + 64 = 80$ bits: hardware implementations of extended precision normally don't use a hidden bit

39 IEEE standard V
(Remember we normalized each number so the leading 1 is not stored)
It seems everyone is using double now
But single is still needed sometimes (if memory is not enough)
Minimal normalized positive number: $1.0 \times 2^{-126}$
8 bits for exponent: 0 to 255
IEEE uses a biased approach: exponent $= (0$ to $255) - 127 = -127$ to $128$

40 IEEE standard VI
However, $e_{\min} = -126$, $e_{\max} = 127$
reasons: $1/2^{e_{\min}}$ does not overflow; $1/2^{e_{\max}}$ underflows, but that is less serious
Thus, $-127$ for 0 and denormalized numbers (discussed later), $-126$ to $127$ for exponents, $128$ for special quantities
Motivation for extended precision: from calculators, which display 10 digits but use 13 internally
Some operations benefit from using more digits internally

41 IEEE standard VII
Example: binary-decimal conversion (details not discussed here)
Operations: the IEEE standard requires results of addition, subtraction, multiplication and division to be exactly rounded
Exactly rounded: an array of words or floating-point numbers, expensive
Goldberg [1990] showed that using 3 guard digits (two guard digits and a sticky bit) the result is the same as using exactly rounded operations
Only a little more cost

42 IEEE standard VIII
Reason to specify operations: when run on different machines, the results are the same
HW 2-1: write the binary format of -300 as a double floating-point number
IEEE: square root, remainder, conversion between integer and floating-point, and conversion between internal formats and decimal are correctly rounded (i.e. exactly rounded operations)
Binary to decimal conversion
Think about reading numbers from files
When writing a binary number as a decimal number

43 IEEE standard IX
and then reading it back, can we get the same binary number?
Writing 9 digits is enough for single
Though $10^8 > 2^{24}$, 8 digits are not enough
17 for double precision, example: numbers from Matrix Market:
    > tail s1rmq4m1.dat
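The claim that 17 significant decimal digits suffice for double precision can be tried out with a small round-trip check. This is a sketch using only the standard C library; the helper name `roundtrips` is made up, and the check relies on `snprintf` and `strtod` being correctly rounded, which is the case in glibc but is an assumption on other systems.

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if x survives double -> decimal text -> double when printed
   with `digits` significant digits (digits >= 1), 0 otherwise.
   A NaN never round-trips under this test because NaN != NaN. */
int roundtrips(double x, int digits) {
    char buf[64];
    snprintf(buf, sizeof buf, "%.*e", digits - 1, x);
    return strtod(buf, NULL) == x;
}
```

With this helper, `roundtrips(1.0 / 3.0, 17)` holds while `roundtrips(1.0 / 3.0, 15)` does not, which is why data files meant to be read back exactly are written with 17 digits.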

44 IEEE standard X
Matrix Market: a collection of matrix data
Transcendental functions: e.g., exp, log
IEEE does not require transcendental functions to be exactly rounded
Cannot specify the precision because their values are arbitrarily long

45 Special quantities I
On some computers (e.g. IBM 370) every bit pattern is a valid floating-point number
For IBM 370, $\sqrt{-4} = 2$, printing an error message
IEEE: NaN, not a number
Why $\sqrt{-4} = 2$: every pattern is a number
Special values of IEEE: $+0$, $-0$, denormalized numbers, $+\infty$, $-\infty$, NaNs (more than one NaN)
A summary

46 Special quantities II
    Exponent                          Significand   Represents
    $e = e_{\min} - 1$                $f = 0$       $+0$, $-0$
    $e = e_{\min} - 1$                $f \ne 0$     $0.f \times 2^{e_{\min}}$
    $e_{\min} \le e \le e_{\max}$                   $1.f \times 2^e$
    $e = e_{\max} + 1$                $f = 0$       $\pm\infty$
    $e = e_{\max} + 1$                $f \ne 0$     NaN
Why IEEE has NaN
Even if 0/0 occurs, the program can continue
Example: find $f(x) = 0$ by trying different $x$'s; even if 0/0 happens for some, other values may be ok

47 Special quantities III
If $b^2 - 4ac < 0$ in
    $\frac{-b + \sqrt{b^2 - 4ac}}{2a}$
the square root returns NaN
$-b +$ NaN should be NaN
In general when a NaN is in an operation, the result is NaN
Examples producing NaN:

48 Special quantities IV
    Operation   NaN produced by
    $+$         $\infty + (-\infty)$
    $\times$    $0 \times \infty$
    $/$         $0/0$, $\infty/\infty$
    REM         $x$ REM $0$, $\infty$ REM $y$
    $\sqrt{\ }$ $\sqrt{x}$ when $x < 0$

49 Infinity I
$\beta = 10$, $p = 3$, $e_{\max} = 98$, $x = 3 \times 10^{70}$: $x^2$ overflows and is replaced by $9.99 \times 10^{98}$??
In IEEE, the result is $\infty$
Note $0/0 =$ NaN, $1/0 = \infty$, $-1/0 = -\infty$
nonzero divided by 0 is $\infty$ or $-\infty$
Similarly, $-10/0 = -\infty$, and $-10/{-0} = +\infty$ ($\pm 0$ will be explained later)
$3/\infty = 0$, $4 - \infty = -\infty$, $\sqrt{\infty} = \infty$
Rule: replace $\infty$ with $x$, let $x \to \infty$
Example: $3/\infty$: $\lim_{x \to \infty} 3/x = 0$

50 Infinity II
If the limit does not exist: NaN
$x/(x^2 + 1)$ vs. $1/(x + x^{-1})$
$x/(x^2 + 1)$: if $x$ is large, $x^2$ overflows, $x/\infty = 0$ but not $1/x$
$1/(x + x^{-1})$: $x$ large, $1/x$ ok
$1/(x + x^{-1})$ looks better but what about $x = 0$?
$x = 0$: $1/(0 + 0^{-1}) = 1/(0 + \infty) = 1/\infty = 0$
If there is no infinity arithmetic, an extra instruction is needed to test if $x = 0$, which may interrupt the pipeline
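The comparison of $x/(x^2 + 1)$ and $1/(x + x^{-1})$ can be reproduced with two one-line functions. This is a sketch assuming IEEE double arithmetic evaluated in double precision (for example SSE2 on x86-64; with wider intermediate registers $x^2$ may not overflow). The function names are illustrative.

```c
/* x / (x^2 + 1): x*x overflows to +inf for large |x|, giving 0 instead of ~1/x. */
double f_naive(double x) {
    return x / (x * x + 1.0);
}

/* 1 / (x + 1/x): no overflow for large |x|; at x = 0 infinity arithmetic
   gives 1/(0 + inf) = 1/inf = 0 without an explicit test for zero. */
double f_rewritten(double x) {
    return 1.0 / (x + 1.0 / x);
}
```

At $x = 10^{200}$ the first function returns 0 while the second returns about $10^{-200}$; at $x = 0$ both return 0.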

51 Signed zero I
Why do we have $+0$ and $-0$?
First, it is available (1 bit for sign)
If there is no sign, $1/(1/x) = x$ fails when $x = \pm\infty$:
$x = \infty$, $1/x = 0$, $1/0 = +\infty$
$x = -\infty$, $1/x = 0$, $1/0 = +\infty$
Compare $+0$ and $-0$: if (x == 0)
IEEE defines $+0 = -0$
IEEE: $3 \times (+0) = +0$, $+0/(-3) = -0$

52 Signed zero II
For underflow:
    $\log x = -\infty$ if $x = 0$; NaN if $x < 0$
A small negative number that underflows: $\log x$ should be NaN
$x$ underflows and is rounded to 0; if there is no sign, $\log x$ is $-\infty$ but not NaN

53 Signed zero III
With $\pm 0$, we have
    $\log x = -\infty$ if $x = +0$; NaN if $x = -0$; NaN if $x < 0$
Positive underflow rounds to $+0$
Very useful in complex arithmetic
$\sqrt{1/z}$ and $1/\sqrt{z}$
$z = -1$: $\sqrt{1/(-1)} = \sqrt{-1} = i$, $1/\sqrt{-1} = 1/i = -i$
$\sqrt{1/z} \ne 1/\sqrt{z}$

54 Signed zero IV
Square root is multi-valued: $i^2 = (-i)^2 = -1$
However, by some restrictions (or ways of calculation), they can be equal
$z = -1 = -1 + 0i$, $1/z = 1/(-1 + 0i) = -1 + (-0)i$
so $\sqrt{1/z} = \sqrt{-1 + (-0)i} = -i$
$-0$ is useful
Disadvantage of $+0$ and $-0$:
$x = y \Leftrightarrow 1/x = 1/y$ is destroyed
$x = 0$, $y = -0$: $x = y$ under IEEE

55 Signed zero V
$1/x = +\infty$, $1/y = -\infty$, $+\infty \ne -\infty$
There are always pros and cons for floating-point design
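The behaviour of $\pm 0$ described above can be observed with the C99 macro `signbit`. The helpers below are a sketch with made-up names; they assume IEEE arithmetic with division by zero returning an infinity (the default non-stop mode).

```c
#include <math.h>

/* 1 if x is a zero whose sign bit is set. The comparison x == 0.0 alone
   cannot tell, because IEEE defines +0 == -0. */
int is_negative_zero(double x) {
    return x == 0.0 && signbit(x) != 0;
}

/* 1/x: maps +0 to +inf and -0 to -inf, which is how the sign of zero
   becomes visible in ordinary arithmetic. */
double reciprocal(double x) {
    return 1.0 / x;
}
```

With these, `reciprocal(reciprocal(-INFINITY))` gives back `-INFINITY`, the property $1/(1/x) = x$ that would fail without a signed zero.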

56 HW 2-2 I
If (a < 0) always holds and b is not too large or too small, how do we guarantee that a/max(b, 0.0) < 0 always holds?
If max(b, 0.0) returns -0.0, then it may not hold
The definition of your max:
It cannot be just a simple if statement
Your max needs to return +0.0 but not -0.0

57 HW 2-2 II
How to specifically assign +0.0 and -0.0?
How to use subroutines to get the sign of a number?
In a regular program, if you write 0.0, is it +0.0 or -0.0?
Find the statement in the manual saying that 0.0 means +0.0
Do some experiments to check your arguments
Use glibc but not other systems

58 Denormalized number I
$\beta = 10$, $p = 3$, $e_{\min} = -98$, $x = 6.87 \times 10^{-97}$, $y = 6.81 \times 10^{-97}$
$x$, $y$ are ok but $x - y = 0.06 \times 10^{-97} = 6.0 \times 10^{-99}$ is rounded to 0, even though $x \ne y$
How important is it to preserve $x = y \Leftrightarrow x - y = 0$?
    if (x != y) { z = 1/(x-y); }
The condition is true, but $x - y$ is 0, so z becomes $\infty$ (a spurious division by zero)
Tracking such bugs is frustrating

59 Denormalized number II
IEEE uses denormalized numbers
They guarantee $x = y \Leftrightarrow x - y = 0$
Details of how this is done are not discussed here
The most controversial part; it caused a long delay of the standard
If denormalized numbers are used, $0.06 \times 10^{-97}$ is also a floating-point number
Remember we do not store the 1 of $1.d \cdots d$
How to represent denormalized numbers?
If $e \ge e_{\min}$: $1.d \cdots d \times 2^e$

60 Denormalized number III
$d \cdots d$ are the stored digits
$e = e_{\min} - 1$: $0.d \cdots d \times 2^{e+1}$ ($= 0.d \cdots d \times 2^{e_{\min}}$)
Underflow: smaller than the smallest floating-point number
An example of using denormalized numbers

61 Denormalized number IV
Large relative error happens even without cancellation
    $\frac{a + bi}{c + di} = \frac{(a + bi)(c - di)}{(c + di)(c - di)} = \frac{ac + bd}{c^2 + d^2} + \frac{bc - ad}{c^2 + d^2} i$
If $c$ or $d > \sqrt{\beta}\,\beta^{e_{\max}/2}$: overflow
overflow: larger than the maximal floating-point number

62 Denormalized number V
Smith's formula
    $\frac{a + bi}{c + di} = \frac{a + b(d/c)}{c + d(d/c)} + \frac{b - a(d/c)}{c + d(d/c)} i$   if $|d| < |c|$
    $\frac{a + bi}{c + di} = \frac{b + a(c/d)}{d + c(c/d)} + \frac{-a + b(c/d)}{d + c(c/d)} i$   if $|d| \ge |c|$
avoids overflow
However, using Smith's formula without denormalized numbers:

63 Denormalized number VI
If $a = 2 \times 10^{-98}$, $b = 1 \times 10^{-98}$, $c = 4 \times 10^{-98}$, $d = 2 \times 10^{-98}$
then $d/c = 0.5$, $c + d(d/c) = 5 \times 10^{-98}$, $b(d/c) = 0.5 \times 10^{-98}$, flushed to 0
$a + b(d/c) = 2 \times 10^{-98}$
Solution (real part) $= 0.4$, wrong

64 Denormalized number VII
If denormalized numbers are used, $0.5 \times 10^{-98}$ can be stored,
$a + b(d/c) = 2.5 \times 10^{-98}$, giving the correct answer 0.5
Usually hardware does not support denormalized numbers directly
Software is used to simulate them
Programs may be slow if there is a lot of underflow
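Smith's formula from the slides above can be written out in C as follows. The struct `cplx` and the function name `smith_div` are introduced only for this sketch; C99 also has a built-in complex type, which is deliberately not used here so that the two branches stay visible.

```c
#include <math.h>

typedef struct { double re, im; } cplx;   /* illustrative type, not from the slides */

/* (a + b i) / (c + d i) by Smith's formula: divide by the larger of |c|, |d|
   first so that c^2 + d^2 is never formed. Assumes c and d are not both zero. */
cplx smith_div(double a, double b, double c, double d) {
    cplx q;
    if (fabs(d) < fabs(c)) {
        double r = d / c, den = c + d * r;
        q.re = (a + b * r) / den;
        q.im = (b - a * r) / den;
    } else {
        double r = c / d, den = d + c * r;
        q.re = (b + a * r) / den;
        q.im = (-a + b * r) / den;
    }
    return q;
}
```

For $c = d = 10^{200}$ the textbook formula overflows in $c^2 + d^2$, while `smith_div(1e200, 1e200, 1e200, 1e200)` returns $1 + 0i$.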

65 Exception, Flags, Trap handlers I
We have mentioned things like overflow, underflow
What are other exceptional situations?
Motivation: usually when an exceptional condition like 1/0 happens, you may want to know
IEEE requires vendors to provide a way to get status flags
IEEE defines five exceptions: overflow, underflow, division by zero, invalid operation, inexact
overflow: larger than the maximal floating-point number

66 Exception, Flags, Trap handlers II
Underflow: smaller than the smallest floating-point number
Invalid: $\infty + (-\infty)$, $0 \times \infty$, $0/0$, $\infty/\infty$, $x$ REM $0$, $\infty$ REM $y$, $\sqrt{x}$ with $x < 0$, any comparison involving a NaN
Invalid returns NaN; a NaN may not be from invalid operations
Inexact: the result is not exact
$\beta = 10$, $p = 3$: $3.5 \times 4.2 = 14.7$ exact, $3.5 \times 4.3 = 15.05$ not exact

67 Exception, Flags, Trap handlers III
The inexact exception is raised so often that usually we do not care about it
    Exception          Result when trap disabled                  Argument to handler
    overflow           $\pm\infty$ or $\pm x_{\max}$              round($x 2^{-\alpha}$)
    underflow          0, $\pm 2^{e_{\min}}$, or denormal         round($x 2^{\alpha}$)
    division by zero   $\pm\infty$                                operands
    invalid            NaN                                        operands
    inexact            round($x$)                                 round($x$)
Trap handler: special subroutines to handle exceptions

68 Exception, Flags, Trap handlers IV
You can design your own trap handlers
In the above table, "when trap disabled" means the results of operations if trap handlers are not used
$\alpha = 192$ for single, $\alpha = 1536$ for double
reason: you cannot really store $x$
Examples of using trap handlers are described later

69 Compiler Options I
A compiler may provide a way so that the program stops if an exception occurs
Easy for debugging
Example: SUN's C compiler (I learned this on an old machine)
Reason: gcc doesn't have this to explicitly detect exceptions
-ftrap=t

70 Compiler Options II
t: %all, %none, common, [no%]invalid, [no%]overflow, [no%]underflow, [no%]division, [no%]inexact.
common: invalid, division by zero, and overflow.
The default is -ftrap=%none.
Example: -ftrap=%all,no%inexact means set all traps, except inexact.
If you compile one routine with -ftrap=t, compile all routines of the program with the same -ftrap=t option; otherwise, you can get unexpected results.

71 Compiler Options III
Example: on the screen you will see
    Note: IEEE floating-point exception flags raised:
        Inexact; Underflow;
    See the Numerical Computation Guide, ieee_flags(3M)
gcc:
-fno-trapping-math: the default is -ftrapping-math
Setting this option may allow faster code if one relies on non-stop IEEE arithmetic
-ftrapv: generates traps for signed (integer) overflow on addition, subtraction, multiplication

72 Trap Handler I
Example:
    do {...} while (!(x >= 100));
If x = NaN, an infinite loop
Any comparison involving a NaN is false
A trap handler can be installed to abort it
Example: Calculating $x_1 \times \cdots \times x_n$ may overflow in the middle (the total may be ok!):

73 Trap Handler II
    for (i = 1; i <= n; i++)
        p = p * x[i];
$x_1 \times \cdots \times x_r$, $r \le n$, may overflow, but $x_1 \times \cdots \times x_n$ may be in the range
$e^{\sum_i \log(x_i)}$: a solution, but less accurate and costs more
A possible solution

74 Trap Handler III
    /* pseudocode: "overflow" stands for a test done by the trap handler */
    for (i = 1; i <= n; i++) {
        if (p * x[i] overflow) {
            p = p * pow(10, -a);      /* scale down by 10^a */
            count = count + 1;        /* remember how often we scaled */
        }
        p = p * x[i];
    }
    p = p * pow(10, a*count);         /* undo the scaling at the end */
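Without a trap handler, a similar effect can be obtained by keeping the exponent of the running product in a separate integer. This is a different technique from the handler-based scaling on the slide: it rescales at every step with the standard function `frexp` instead of only on overflow. The name `scaled_product` is hypothetical, and the sketch assumes finite inputs.

```c
#include <math.h>

/* Computes x[0] * ... * x[n-1] as m * 2^(*exp2), with 0.5 <= |m| < 1 or m == 0.
   Since |m| <= 1 before each multiplication, m * x[i] cannot overflow.
   (Very small subnormal inputs may still lose some precision.) */
double scaled_product(const double *x, int n, long *exp2) {
    double m = 1.0;
    long e = 0;
    for (int i = 0; i < n; i++) {
        int ei;
        m = frexp(m * x[i], &ei);   /* renormalize the mantissa to [0.5, 1) */
        e += ei;                    /* accumulate the binary exponent */
    }
    *exp2 = e;
    return m;
}
```

If the final exponent is within range, `ldexp(m, (int)e)` converts the pair back to an ordinary double; for example the product of $10^{200}$, $10^{200}$ and $10^{-300}$ comes out near $10^{100}$ even though the first two factors alone would overflow.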

75 An Example of Handlers I
Example using SUN's Numerical Computation Guide
Again, old. Reason for not using glibc: so you can have HW
Standard math library libm.a: exp, pow, log, ...
Additional math library libsunmath.a: exp2, exp10, ..., ieee_flags, ieee_handler, ieee_retrospective
A program:

76 An Example of Handlers II
    #include <stdio.h>
    #include <sys/ieeefp.h>
    #include <sunmath.h>
    #include <siginfo.h>
    #include <ucontext.h>

    void handler(int sig, siginfo_t *sip, ucontext_t *uap)
    {
        unsigned code, addr;
        code = sip->si_code;

77 An Example of Handlers III
        addr = (unsigned) sip->si_addr;
        fprintf(stderr, "fp exception %x at address %x \n", code, addr);
    }

    int main()
    {
        double x;
        /* trap on common floating point exceptions */
        if (ieee_handler("set", "common", handler) != 0)

78 An Example of Handlers IV
            printf("did not set exception handler \n");
        /* cause an underflow exception (not reported) */
        x = min_normal();
        printf("min_normal = %g \n", x);
        x = x / 13.0;
        printf("min_normal / 13.0 = %g \n", x);
        /* cause an overflow exception (reported) */

79 An Example of Handlers V
        x = max_normal();
        printf("max_normal = %g \n", x);
        x = x * x;
        printf("max_normal * max_normal = %g \n", x);
        ieee_retrospective(stderr);
        return 0;
    }
Result:

80 An Example of Handlers VI
    min_normal = 2.22507e-308
    min_normal / 13.0 = 1.7116e-309
    max_normal = 1.79769e+308
    fp exception 4 at address 10d0c
    max_normal * max_normal = 1.79769e+308
    Note: IEEE floating-point exception flags raised:
        Inexact; Underflow;
    IEEE floating-point exception traps enabled:
        overflow; division by zero; invalid operation;
    See the Numerical Computation Guide, ieee_flags(3M), ieee_handler(3M)

81 An Example of Handlers VII
invalid, division, and overflow are sometimes called common exceptions here
ieee_handler("set", "common", handler) means the handler is used for common exceptions
handler: subroutines to handle exceptions
HW 3-1: regenerate this example using the GNU C library
How to find GNU C library information: on linux, type
    % info libc
and check the categories of Arithmetic and Signal Handling

82 The Use of Flags: An Example I
Calculate $x^n$, $n$: integer
    /* x^n for n >= 0 by repeated squaring */
    double pow(double x, int n)
    {
        double tmp = x, ret = 1.0;
        for (int t = n; t > 0; t /= 2) {
            if (t % 2 == 1) ret *= tmp;   /* current bit of n is set */
            tmp = tmp * tmp;              /* x, x^2, x^4, x^8, ... */
        }
        return ret;
    }

83 The Use of Flags: An Example II
$x^{16} = (x^2)^8 = \cdots$, $x^{15} = x(x^2)^7$, treat $x^2$ as the new $x$
$x^{15} = x(x^2)^7 = x(x^2)(x^4)^3 = x(x^2)(x^4)(x^8)^1$
If $n < 0$, we need to use $x^n = (1/x)^{-n} = 1/x^{-n}$
pow(1/x, -n) is less accurate, 1/pow(x, -n) is better
There is already an error in 1/x

84 The Use of Flags: An Example III
Example: $(1/2)^5$ and $1/(2^5)$
A small problem with using 1/pow(x, -n): if pow(x, -n) underflows (i.e. when $x < 1$, $n < 0$), either the underflow trap handler is called or the underflow status flag is set
This is incorrect: if $x^{-n}$ underflows, $x^n$ overflows or is in range
($e_{\min} = -126$, $2^{-e_{\min}} = 2^{126} < 2^{127} = 2^{e_{\max}}$)
Turn off the overflow & underflow trap enable bits, save the overflow & underflow status bits
Compute 1/pow(x, -n)

85 The Use of Flags: An Example IV
If neither the overflow nor the underflow status is set, restore them
If one is set, restore & calculate pow(1/x, -n), which causes the correct exception to occur
Practically the calculation of pow() is more complicated
e.g. google e_pow.c and e_log.c

86 The Use of Flags: An Example V
Another example: calculate
    $\arccos x = 2 \arctan \sqrt{\frac{1 - x}{1 + x}}$
$\cos\theta = x = 2\cos^2\frac{\theta}{2} - 1 = 1 - 2\sin^2\frac{\theta}{2}$
$\cos\frac{\theta}{2} = \sqrt{\frac{1 + x}{2}}$, $\sin\frac{\theta}{2} = \sqrt{\frac{1 - x}{2}}$, $\tan\frac{\theta}{2} = \sqrt{\frac{1 - x}{1 + x}}$
Hence $\arccos x = 2 \arctan \sqrt{\frac{1 - x}{1 + x}}$

87 The Use of Flags: An Example VI
Consider $x = -1$
$\arctan(\infty) = \pi/2$, so $\arccos(-1) = \pi$
A small problem: $\frac{1 - x}{1 + x}$ causes the divide-by-zero flag to be set though $\arccos(-1)$ is not exceptional
Solution: save the divide-by-zero flag, restore it after the arccos computation
