Floating-Point Arithmetic

Floating-Point Arithmetic, ECS30, Winter 2017. January 27, 2017

Floating point numbers
Floating-point representation of numbers (scientific notation) has four components: sign, significand, base, and exponent; for example, $3.1416 \times 10^{1}$. Computers use binary (base 2) representation; each digit is the integer 0 or 1, e.g.,
$$(10110)_2 = 1 \cdot 2^4 + 0 \cdot 2^3 + 1 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 = 22 \text{ in base } 10$$
and
$$\frac{1}{10} = \frac{1}{16} + \frac{1}{32} + \frac{1}{256} + \frac{1}{512} + \frac{1}{4096} + \frac{1}{8192} + \cdots = 2^{-4} + 2^{-5} + 2^{-8} + 2^{-9} + 2^{-12} + 2^{-13} + \cdots = (1.10011001\ldots)_2 \times 2^{-4}.$$
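
To see both facts in practice, here is a short Python sketch (any language with IEEE doubles behaves the same way):

from decimal import Decimal

# (10110)_2 = 22, as in the expansion above
print(int("10110", 2))   # 22

# 1/10 has no finite binary expansion, so the stored double is only
# the nearest representable number:
print(Decimal(0.1))      # 0.1000000000000000055511151231257827021181583404541015625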

Floating point numbers
The representation of a floating point number:
$$x = \pm (b_0 . b_1 b_2 \cdots b_{p-1})_2 \times 2^E,$$
where:
- It is normalized, i.e., $b_0 = 1$ (the hidden bit).
- Precision ($= p$) is the number of bits in the significand (mantissa), including the hidden bit.
- Machine epsilon $\epsilon_m = 2^{-(p-1)}$ is the gap between 1 and the smallest floating point number that is greater than 1.
Special numbers: $0$, $-0$, $\mathrm{Inf} = \infty$, $-\mathrm{Inf} = -\infty$, $\mathrm{NaN}$ = Not a Number.
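
A quick check of these definitions in Python (assuming IEEE double precision, where $p = 53$):

import sys

eps = sys.float_info.epsilon     # machine epsilon of the double format
print(eps == 2.0**-52)           # True: eps_m = 2^-(p-1) with p = 53
print(1.0 + eps > 1.0)           # True: 1 + eps is the next double after 1
print(1.0 + eps/4 == 1.0)        # True: a much smaller addend is rounded away
print(float("inf"), -float("inf"), float("nan"))  # special numbers Inf, -Inf, NaN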

IEEE standard
All computers designed since 1985 use the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985): they represent each number as a binary number and use binary arithmetic. Essentials of the IEEE standard:
1. consistent representation of floating-point numbers by all machines adopting the standard;
2. correctly rounded floating-point operations, using various rounding modes;
3. consistent treatment of exceptional situations such as division by zero.
Citation of the 1989 Turing Award to William Kahan: "... his devotion to providing systems that can be safely and robustly used for numerical computations has impacted the lives of virtually anyone who will use a computer in the future."

IEEE single precision format
Single format is 32 bits (= 4 bytes) long:
| s (1 bit, sign) | E (8 bits, exponent) | f (23 bits, fraction) |
It represents the number
$$(-1)^s \times (1.f)_2 \times 2^{E-127}.$$
The leading 1 in the fraction need not be stored explicitly since it is always 1 (hidden bit). Precision $p = 24$ and machine epsilon $\epsilon_m = 2^{-23} \approx 1.2 \times 10^{-7}$. The $E - 127$ in the exponent is to avoid the need for storage of a sign bit for the exponent. $E_{\min} = (00000001)_2 = (1)_{10}$ and $E_{\max} = (11111110)_2 = (254)_{10}$. The range of positive normalized numbers:
$$N_{\min} = (1.00\cdots0)_2 \times 2^{E_{\min}-127} = 2^{-126} \approx 1.2 \times 10^{-38},$$
$$N_{\max} = (1.11\cdots1)_2 \times 2^{E_{\max}-127} \approx 2^{128} \approx 3.4 \times 10^{38}.$$
Special representations for $0$, $\pm\infty$ and NaN.
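
The three fields $s$, $E$, $f$ can be unpacked directly. A minimal Python sketch using the struct module (the helper decode_single is ours, not part of any standard API):

import struct

def decode_single(x):
    # reinterpret the 32 bits of an IEEE single as an unsigned integer
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31               # 1 sign bit
    E = (bits >> 23) & 0xFF      # 8 exponent bits, bias 127
    f = bits & 0x7FFFFF          # 23 fraction bits
    return s, E, f

print(decode_single(1.0))        # (0, 127, 0): (-1)^0 x (1.0)_2 x 2^(127-127)
print(decode_single(-0.15625))   # (1, 124, 2097152): -(1.01)_2 x 2^-3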

IEEE double precision format
IEEE double format is 64 bits (= 8 bytes) long:
| s (1 bit, sign) | E (11 bits, exponent) | f (52 bits, fraction) |
It represents the number
$$(-1)^s \times (1.f)_2 \times 2^{E-1023}.$$
Precision $p = 53$ and machine epsilon $\epsilon_m = 2^{-52} \approx 2.2 \times 10^{-16}$. The range of positive normalized numbers is from
$$N_{\min} = 2^{-1022} \approx 2.2 \times 10^{-308}$$
to
$$N_{\max} = (1.11\cdots1)_2 \times 2^{1023} \approx 2^{1024} \approx 1.8 \times 10^{308}.$$
Special representations for $0$, $\pm\infty$ and NaN.
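
These double-format constants can be read straight off the runtime (Python shown; sys.float_info mirrors the C float.h values):

import sys

print(sys.float_info.mant_dig)  # 53: precision p, hidden bit included
print(sys.float_info.epsilon)   # 2.220446049250313e-16 = 2^-52
print(sys.float_info.min)       # 2.2250738585072014e-308 = 2^-1022 (N_min)
print(sys.float_info.max)       # 1.7976931348623157e+308 (N_max)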

Rounding modes
Let a positive real number $x$ be in the normalized range, i.e., $N_{\min} \leq x \leq N_{\max}$, and write it in the normalized form
$$x = (1.b_1 b_2 \cdots b_{p-1} b_p b_{p+1} \ldots)_2 \times 2^E.$$
Then the closest floating-point number less than or equal to $x$ is
$$x_- = (1.b_1 b_2 \cdots b_{p-1})_2 \times 2^E,$$
i.e., $x_-$ is obtained by truncating. The next floating-point number bigger than $x_-$ (also the next one bigger than $x$) is
$$x_+ = \left((1.b_1 b_2 \cdots b_{p-1})_2 + (0.00\cdots01)_2\right) \times 2^E.$$
If $x$ is negative, the situation is reversed.
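
The neighbors $x_-$ and $x_+$ can be inspected with math.nextafter (Python 3.9+), which steps to the adjacent floating-point number; a small sketch:

import math

x = 0.1                                 # fl(1/10) under round to nearest
x_minus = math.nextafter(x, -math.inf)  # closest double below
x_plus = math.nextafter(x, math.inf)    # closest double above
print(x_minus, x, x_plus)
print(x_plus - x == x - x_minus)        # True: uniform spacing inside a binade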

Rounding modes, cont'd
Four rounding modes:
1. round down: $\mathrm{fl}(x) = x_-$
2. round up: $\mathrm{fl}(x) = x_+$
3. round towards zero: $\mathrm{fl}(x) = x_-$ if $x \geq 0$; $\mathrm{fl}(x) = x_+$ if $x \leq 0$
4. round to nearest (IEEE default rounding mode): $\mathrm{fl}(x) = x_-$ or $x_+$, whichever is nearer to $x$.
Note: except that if $x > N_{\max}$, $\mathrm{fl}(x) = \infty$, and if $x < -N_{\max}$, $\mathrm{fl}(x) = -\infty$. In the case of a tie, i.e., $x_-$ and $x_+$ are the same distance from $x$, the one with its least significant bit equal to zero is chosen.

Rounding error
When round to nearest is in effect,
$$\mathrm{relerr}(x) = \frac{|\mathrm{fl}(x) - x|}{|x|} \leq \frac{1}{2}\epsilon_m.$$
Therefore, we have
$$\mathrm{relerr} \leq \tfrac{1}{2} \cdot 2^{-23} = 2^{-24} \approx 5.96 \times 10^{-8} \quad \text{(single precision)},$$
$$\mathrm{relerr} \leq \tfrac{1}{2} \cdot 2^{-52} = 2^{-53} \approx 1.1 \times 10^{-16} \quad \text{(double precision)}.$$
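
This bound can be checked empirically for double precision, using exact rational arithmetic as the reference (a sketch; float(Fraction(...)) performs one correctly rounded conversion):

from fractions import Fraction
import random

bound = Fraction(1, 2**53)               # (1/2) eps_m for doubles
for _ in range(10000):
    x = Fraction(random.randint(1, 10**12), random.randint(1, 10**12))
    relerr = abs(Fraction(float(x)) - x) / x
    assert relerr <= bound
print("relerr <= 2^-53 held on all samples")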

Floating-point arithmetic
IEEE rules for correctly rounded floating-point operations: if $x$ and $y$ are correctly rounded floating-point numbers, then
$$\mathrm{fl}(x + y) = (x + y)(1 + \delta),$$
$$\mathrm{fl}(x - y) = (x - y)(1 + \delta),$$
$$\mathrm{fl}(x \times y) = (x \times y)(1 + \delta),$$
$$\mathrm{fl}(x / y) = (x / y)(1 + \delta),$$
where $|\delta| \leq \tfrac{1}{2}\epsilon_m$ for round to nearest. The IEEE standard also requires that correctly rounded remainder and square root operations be provided.
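
The model $\mathrm{fl}(x \circ y) = (x \circ y)(1 + \delta)$ can be observed on a single operation, again with exact rationals as the reference:

from fractions import Fraction

x, y = 0.1, 0.2
exact = Fraction(x) + Fraction(y)        # exact sum of the two stored doubles
computed = x + y                         # one correctly rounded double add
delta = (Fraction(computed) - exact) / exact
print(float(delta))                      # about 9.3e-17
print(abs(delta) <= Fraction(1, 2**53))  # True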

Floating-point arithmetic, cont'd
IEEE standard response to exceptions:

Event | Example | Set result to
Invalid operation | $0/0$, $0 \times \infty$ | NaN
Division by zero | finite nonzero$/0$ | $\pm\infty$
Overflow | $|x| > N_{\max}$ | $\pm\infty$ or $\pm N_{\max}$
Underflow | $x \neq 0$, $|x| < N_{\min}$ | $\pm 0$, $\pm N_{\min}$ or subnormal
Inexact | whenever $\mathrm{fl}(x \circ y) \neq x \circ y$ | correctly rounded value
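
These default responses are easy to observe. Plain Python floats raise an exception on division by zero, so the sketch below uses numpy scalars (assuming numpy is available):

import numpy as np

with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
    print(np.float64(0.0) / 0.0)      # nan : invalid operation
    print(np.float64(1.0) / 0.0)      # inf : division by zero
    print(np.float64(1e308) * 10.0)   # inf : overflow
    print(np.float64(1e-308) / 1e10)  # ~1e-318 : underflow to a subnormal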

Floating-point arithmetic error
Let $\hat{x}$ and $\hat{y}$ be floating-point numbers with
$$\hat{x} = x(1 + \tau_1) \quad \text{and} \quad \hat{y} = y(1 + \tau_2), \qquad |\tau_i| \leq \tau \ll 1,$$
where the $\tau_i$ could be the relative errors introduced in the process of collecting/getting the data from the original source, or by previous operations. Question: how do the four basic arithmetic operations behave?

Floating-point arithmetic error: +, −
Addition and subtraction:
$$\mathrm{fl}(\hat{x} + \hat{y}) = (\hat{x} + \hat{y})(1 + \delta), \qquad |\delta| \leq \tfrac{1}{2}\epsilon_m$$
$$= x(1 + \tau_1)(1 + \delta) + y(1 + \tau_2)(1 + \delta)$$
$$= x + y + x(\tau_1 + \delta + O(\tau\epsilon_m)) + y(\tau_2 + \delta + O(\tau\epsilon_m))$$
$$= (x + y)\left(1 + \frac{x}{x+y}(\tau_1 + \delta + O(\tau\epsilon_m)) + \frac{y}{x+y}(\tau_2 + \delta + O(\tau\epsilon_m))\right)$$
$$\equiv (x + y)(1 + \hat{\delta}),$$
where $\hat{\delta}$ can be bounded as follows:
$$|\hat{\delta}| \leq \frac{|x| + |y|}{|x + y|}\left(\tau + \tfrac{1}{2}\epsilon_m + O(\tau\epsilon_m)\right).$$

Floating-point arithmetic error: +, −
Three possible cases:
1. If $x$ and $y$ have the same sign, i.e., $xy > 0$, then $|x + y| = |x| + |y|$; this implies
$$|\hat{\delta}| \leq \tau + \tfrac{1}{2}\epsilon_m + O(\tau\epsilon_m).$$
Thus $\mathrm{fl}(\hat{x} + \hat{y})$ approximates $x + y$ well.
2. If $x \approx -y$, i.e., $x + y \approx 0$, then $(|x| + |y|)/|x + y| \gg 1$; this implies that $|\hat{\delta}|$ could be nearly 1 or much bigger than 1. This is so-called catastrophic cancellation: it causes relative errors or uncertainties already present in $\hat{x}$ and $\hat{y}$ to be magnified.
3. In general, if $(|x| + |y|)/|x + y|$ is not too big, $\mathrm{fl}(\hat{x} + \hat{y})$ provides a good approximation to $x + y$.

Catastrophic cancellation: example 1
Computing $\sqrt{x+1} - \sqrt{x}$ straightforwardly causes a substantial loss of significant digits for large $x$:

$x$ | fl($\sqrt{x+1}$) | fl($\sqrt{x}$) | fl(fl($\sqrt{x+1}$) − fl($\sqrt{x}$))
1.00e+10 | 1.00000000004999994e+05 | 1.00000000000000000e+05 | 4.99999441672658e-06
1.00e+11 | 3.1622776601841906e+05 | 3.16227766016837908e+05 | 1.5811529005342865e-06
1.00e+12 | 1.00000000000050000e+06 | 1.00000000000000000e+06 | 5.00003807246685028e-07
1.00e+13 | 3.16227766016853740e+06 | 3.16227766016837955e+06 | 1.5785976397323608e-07
1.00e+14 | 1.00000000000000503e+07 | 1.00000000000000000e+07 | 5.02914190292358398e-08
1.00e+15 | 3.16227766016838104e+07 | 3.16227766016837917e+07 | 1.86264514923095703e-08
1.00e+16 | 1.00000000000000000e+08 | 1.00000000000000000e+08 | 0.00000000000000000e+00

Catastrophic cancellation: example 1
Catastrophic cancellation can sometimes be avoided if a formula is properly reformulated. For example, one can compute $\sqrt{x+1} - \sqrt{x}$ almost to full precision by using the equality
$$\sqrt{x+1} - \sqrt{x} = \frac{1}{\sqrt{x+1} + \sqrt{x}}.$$
Consequently, the computed results are:

$n$ | fl($1/(\sqrt{n+1} + \sqrt{n})$)
1.00e+10 | 4.999999999875000e-06
1.00e+11 | 1.581138830080237e-06
1.00e+12 | 4.999999999998749e-07
1.00e+13 | 1.581138830084150e-07
1.00e+14 | 4.999999999999987e-08
1.00e+15 | 1.581138830084189e-08
1.00e+16 | 5.000000000000000e-09
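
Both tables can be reproduced with a few lines of Python (the exact trailing digits may vary slightly across platforms and math libraries):

import math

for k in range(10, 17):
    x = 10.0**k
    naive = math.sqrt(x + 1) - math.sqrt(x)           # catastrophic cancellation
    stable = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))  # reformulated, accurate
    print(f"{x:.2e}  {naive:.17e}  {stable:.17e}")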

Catastrophic cancellation: example 2
Consider the function
$$h(x) = \frac{1 - \cos x}{x^2}.$$
Note that $0 \leq h(x) < 1/2$ for all $x \neq 0$. Let $x = 1.2 \times 10^{-8}$; then the computed value
$$\mathrm{fl}(h(x)) = 0.770988\ldots$$
is completely wrong! Alternatively, the function can be re-written as
$$h(x) = \frac{1}{2}\left(\frac{\sin(x/2)}{x/2}\right)^2.$$
Consequently, for $x = 1.2 \times 10^{-8}$, the computed function value $\mathrm{fl}(h(x)) = 0.499999\ldots < 1/2$ is fine!
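
The same comparison in code (the constants follow the slide; the value 0.770988... arises because fl($\cos x$) here equals $1 - 2^{-53}$, and $2^{-53}/x^2 \approx 0.77$):

import math

x = 1.2e-8
naive = (1 - math.cos(x)) / x**2               # about 0.770988  (wrong)
stable = 0.5 * (math.sin(x / 2) / (x / 2))**2  # about 0.5       (correct)
print(naive, stable)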

Floating-point arithmetic error: ×, /
Multiplication and division:
$$\mathrm{fl}(\hat{x} \times \hat{y}) = (\hat{x} \times \hat{y})(1 + \delta) = xy(1 + \tau_1)(1 + \tau_2)(1 + \delta) \equiv xy(1 + \hat{\delta}_\times),$$
$$\mathrm{fl}(\hat{x} / \hat{y}) = (\hat{x} / \hat{y})(1 + \delta) = (x/y)(1 + \tau_1)(1 + \tau_2)^{-1}(1 + \delta) \equiv (x/y)(1 + \hat{\delta}_\div),$$
where
$$\hat{\delta}_\times = \tau_1 + \tau_2 + \delta + O(\tau\epsilon_m), \qquad \hat{\delta}_\div = \tau_1 - \tau_2 + \delta + O(\tau\epsilon_m).$$
Thus $|\hat{\delta}_\times| \leq 2\tau + \tfrac{1}{2}\epsilon_m + O(\tau\epsilon_m)$ and $|\hat{\delta}_\div| \leq 2\tau + \tfrac{1}{2}\epsilon_m + O(\tau\epsilon_m)$. Multiplication and division are very well-behaved!

Reading Section.7 of Numerical Computing with MATLAB by C. Moler Websites discussions of numerical disasters: T. Huckle, Collection of software bugs http://www5.in.tum.de/ huckle/bugse.html K. Vuik, Some disasters caused by numerical errors http://ta.twi.tudelft.nl/nw/users/vuik/wi2/disasters.html D. Arnold, Some disasters attributable to bad numerical computing http://www.ima.umn.edu/ arnold/disasters/disasters.html In-depth material: D. Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Computing Survey, Vol.23(), pp.5-48, 99

LU factorization revisited: the need for pivoting in LU factorization, numerically
LU factorization without pivoting:
$$A = \begin{bmatrix} 1.000 \times 10^{-4} & 1 \\ 1 & 1 \end{bmatrix} = LU = \begin{bmatrix} 1 & 0 \\ l_{21} & 1 \end{bmatrix} \begin{bmatrix} u_{11} & u_{12} \\ 0 & u_{22} \end{bmatrix}.$$
In three-decimal-digit floating-point arithmetic, we have
$$\hat{L} = \begin{bmatrix} 1 & 0 \\ \mathrm{fl}(1/10^{-4}) & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 10^{4} & 1 \end{bmatrix}, \qquad \hat{U} = \begin{bmatrix} 10^{-4} & 1 \\ 0 & \mathrm{fl}(1 - 10^{4}) \end{bmatrix} = \begin{bmatrix} 10^{-4} & 1 \\ 0 & -10^{4} \end{bmatrix}.$$
Check:
$$\hat{L}\hat{U} = \begin{bmatrix} 1 & 0 \\ 10^{4} & 1 \end{bmatrix} \begin{bmatrix} 10^{-4} & 1 \\ 0 & -10^{4} \end{bmatrix} = \begin{bmatrix} 10^{-4} & 1 \\ 1 & 0 \end{bmatrix} \neq A.$$

LU factorization revisited: the need for pivoting in LU factorization, numerically
Consider solving $Ax = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$ for $x$ using this LU factorization.
Solving $\hat{L}y = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$:
$$\hat{y}_1 = \mathrm{fl}(1/1) = 1, \qquad \hat{y}_2 = \mathrm{fl}(2 - 10^{4} \cdot 1) = -10^{4};$$
note that the value 2 has been lost by subtracting $10^{4}$ from it.
Solving $\hat{U}x = \hat{y} = \begin{bmatrix} 1 \\ -10^{4} \end{bmatrix}$:
$$\hat{x}_2 = \mathrm{fl}((-10^{4})/(-10^{4})) = 1, \qquad \hat{x}_1 = \mathrm{fl}((1 - 1)/10^{-4}) = 0.$$
So $\hat{x} = \begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$ is a completely erroneous solution; the correct answer is $x \approx \begin{bmatrix} 1 \\ 1 \end{bmatrix}$.

LU factorization revisited: the need for pivoting in LU factorization, numerically
LU factorization with partial pivoting: $PA = LU$. By exchanging the order of the rows of $A$, i.e.,
$$P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \qquad PA = \begin{bmatrix} 1 & 1 \\ 10^{-4} & 1 \end{bmatrix},$$
the LU factorization of $PA$ is given by
$$\hat{L} = \begin{bmatrix} 1 & 0 \\ \mathrm{fl}(10^{-4}/1) & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 10^{-4} & 1 \end{bmatrix}, \qquad \hat{U} = \begin{bmatrix} 1 & 1 \\ 0 & \mathrm{fl}(1 - 10^{-4}) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}.$$
The computed $\hat{L}\hat{U}$ approximates $PA$ very accurately. As a result, the computed solution $\hat{x}$ is also nearly perfect!
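
The same failure and cure can be reproduced in double precision; a sketch in which the small pivot 1e-20 plays the role of the slide's $10^{-4}$ under 3-digit arithmetic (the 2 in the right-hand side is lost the same way):

# Solve [[eps, 1], [1, 1]] x = [1, 2]; the true solution is ~ [1, 1].
eps = 1e-20

# Without pivoting:
l21 = 1.0 / eps            # 1e20
u22 = 1.0 - l21 * 1.0      # fl(1 - 1e20) = -1e20: the 1 is lost
y2 = 2.0 - l21 * 1.0       # fl(2 - 1e20) = -1e20: the 2 is lost
x2 = y2 / u22              # 1.0
x1 = (1.0 - 1.0 * x2) / eps
print(x1, x2)              # 0.0 1.0 -- x1 is completely wrong

# With partial pivoting (rows of A and the right-hand side exchanged):
l21 = eps / 1.0
u22 = 1.0 - l21 * 1.0      # ~1
y2 = 1.0 - l21 * 2.0       # ~1
x2 = y2 / u22              # ~1
x1 = (2.0 - 1.0 * x2) / 1.0
print(x1, x2)              # 1.0 1.0 (accurate to machine precision)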