Lecture 3-4: Floating-point representation and arithmetic

Floating-point representation

The notion of real numbers in mathematics is convenient for hand computations and formula manipulations. However, real numbers are not well-suited for general-purpose computation, because their numeric representation as a string of digits expressed in, say, base 10 can be very long or even infinitely long. Examples include π, √2, e, and 1/3. In practice, computers store numbers with finite precision.

Numbers and arithmetic used in scientific computation should meet a few general criteria:

- Numbers should have modest storage requirements.
- Arithmetic operations should be efficient to carry out.
- A level of standardization, or portability, is desirable: results obtained on one computer should closely match the results of the same computation on other computers.

Internationally standardized methods for representing numbers on computers have been established by the IEEE to satisfy these basic goals. These notes are concerned with double-precision floating-point numbers, which are available on most modern computing systems (notable exceptions include graphics processors). A precise definition appears below. In these notes, the term floating-point number always means a number that can be exactly represented by a double-precision floating-point number.

Unlike real numbers, there are only finitely many floating-point numbers. In particular, there is a smallest (and a largest) floating-point number. In between there are also necessarily many gaps of numbers that cannot be represented exactly.

In hand computations, we usually represent numbers in terms of decimal digits. For instance,

    x = 1.9 · 10^1                                    (1)

is one way to write the number 19. This way of writing numbers is sometimes called scientific notation. The 1.9 part is called the mantissa, 10 the base, and 1 the exponent. Another way to write x is 1 · 10^1 + 9 · 10^0.

If, as in equation (1), the mantissa always has 3 decimal digits and the exponent has one decimal digit, then the distance between the number 1, which can be represented exactly, and a closest different representable number, 0.99 · 10^0, is 1 · 10^-2. The largest representable number in this scheme is 9.99 · 10^9, and the closest distinct positive number is 9.98 · 10^9. Their distance is 9.99 · 10^9 − 9.98 · 10^9 = 1 · 10^7, which is large. Thus, the distance between distinct representable numbers grows with their magnitude; it can be large for numbers of large magnitude.

Similarly to (1), in base 2 the number y = 1.01 · 2^2 may be a suitable representation of 101 = 1 · 2^2 + 0 · 2^1 + 1 · 2^0, which is the binary representation of the number 5 in base 10. A binary digit is referred to as a bit.

Computers represent floating-point numbers using base 2. Floating-point numbers are generally stored in the form

    x = ±(1 + f) · 2^e,                               (2)

where f is referred to as the fractional part and e as the exponent. The fractional part satisfies 0 ≤ f < 1 and is stored in 52 bits; that is, 2^52 · f is an integer in the interval [0, 2^52). The exponent is restricted to be an integer in the interval −1022 ≤ e ≤ 1023. Instead of storing e, computers store e + 1023 using 11 bits. (This obviates the need to store the sign of e.) The sign of x is stored in a separate bit. Thus, the floating-point number x is stored in 52 + 11 + 1 = 64 bits. Note that the leading 1 in the (1 + f) part does not have to be stored explicitly. The binary representation (2) of x is said to be normalized because of this leading 1. Also note that 11 bits can store more integer values than the exponent e is allowed to take on as defined above. The extra values are reserved for some special cases discussed below.

The representable floating-point numbers are equidistant in each interval 2^e ≤ x ≤ 2^(e+1), with distance 2^(e-52). The constant eps, or machine epsilon, is defined to be the distance between 1 and the next larger floating-point number. In a relative sense, eps is as large as the gaps between floating-point numbers get.

Let the function fl(x) return the closest floating-point number to the real number x. Then

    |x − fl(x)| ≤ (eps/2) |x|.                        (3)

In IEEE arithmetic, we have eps = 2^-52, i.e., eps ≈ 2.2 · 10^-16. The symbol ≈ stands for "approximately equal to." You can find out the actual value of eps by first typing format long and then eps in Octave or Matlab.

The property (3) also can be stated in the form that for every real number x there is an ε with |ε| ≤ eps/2, such that

    fl(x) = x(1 + ε).                                 (4)

Example: Neither of the decimal numbers 0.7 and 0.1 can be stored exactly as a floating-point number, and therefore fl(0.7)/fl(0.1) ≠ 7.

Example: The largest representable normalized floating-point number, realmax, corresponds to f = 1 − eps and e = 1023 in (2), i.e., realmax = (2 − eps) · 2^1023. Thus, realmax ≈ 1.80 · 10^308. Real numbers larger than realmax cannot be represented by floating-point numbers. The special floating-point number defined by f = 0 and e = 1024 is called Inf. If in the course of arithmetic operations a variable, say y, becomes larger than realmax, then y is set to Inf. Replacing a large real number by Inf is referred to as overflow. The expression 1/Inf is defined to be 0, and Inf + Inf = Inf. Computations that yield undefined results, such as 0/0, Inf − Inf, and Inf/Inf, are represented by yet another special case, NaN. NaN is defined by f ≠ 0 and e = 1024. Arithmetic operations with NaN always give back NaN; for example, NaN + 1 = NaN.

The smallest representable positive normalized floating-point number, 2^-1022, is denoted by realmin and is represented by f = 0 and e = −1022. Thus, realmin ≈ 2.2 · 10^-308. If during arithmetic operations an intermediate result, say y, becomes smaller than realmin, then y is set to zero, unless unnormalized floating-point numbers are used. Replacing a tiny real number by zero is referred to as underflow.

Example: Let y = 10^-308. First divide y by 10 and then multiply the result by 10^309. On a computer that does not use unnormalized floating-point numbers, we obtain the answer zero due to underflow. This illustrates that underflow may lead to large errors if the underflowed intermediate quantity subsequently is multiplied by a large number.

Some computers allow the representation of numbers between eps · realmin and realmin by denormalized floating-point numbers. Then numbers as small as 0.49 · 10^-323 can be represented. Generally, this feature is not important, but it leads to more accurate answers in certain computations.
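These constants and special values can be inspected directly in Octave or Matlab. The following minimal sketch is illustrative only; the values in the comments are those produced under IEEE double precision:

    format long
    eps                  % 2.220446049250313e-16, i.e., 2^-52
    eps(1000)            % 1.136868377216160e-13: the gap grows with magnitude
    realmax              % 1.797693134862316e+308, i.e., (2 - eps)*2^1023
    realmin              % 2.225073858507201e-308, i.e., 2^-1022
    realmax*2            % Inf (overflow)
    1/Inf                % 0
    0/0                  % NaN, and NaN + 1 gives NaN again
    0.7/0.1 == 7         % 0 (false): fl(0.7)/fl(0.1) is not exactly 7
    y = 1e-308;
    (y/1e10)*1e10        % about 1e-308: rescued by denormalized numbers
    (y/1e17)*1e17        % 0: the intermediate 1e-325 underflows to zero

On a system without denormalized numbers, the first of the two products would already be zero, as in the example above.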

Exercises

Exercise 3.1: Determine the decimal representation of the binary numbers (a) 10, (b) 1011.

Exercise 3.2: Determine the binary representation of the decimal numbers (a) 3, (b) 17.

Exercise 3.3: Determine whether your computer uses unnormalized floating-point numbers.

Exercise 3.4: If you own a hand-held calculator or have access to one (for example, one of the graphing calculators popular these days), determine its machine epsilon.

Floating-point arithmetic

Arithmetic operations on floating-point numbers often do not result in another floating-point number. For example, 1 and 10 are floating-point numbers, but 1/10 is not. The result of an arithmetic computation is stored by the computer as a floating-point number; if the exact result is not a floating-point number, an error is introduced.

Let the symbol ∘ represent one of the four basic arithmetic operations (addition, subtraction, multiplication, or division), and let ⊛ denote its floating-point analogue. Computers use these operations to carry out most computations. We would like that, for any floating-point numbers x and y,

    x ⊛ y = fl(x ∘ y).

Combining this property and (4) shows that for all floating-point numbers x and y there is an ε with |ε| ≤ eps/2, such that

    x ⊛ y = (x ∘ y)(1 + ε).                           (5)

The properties (4) and (5) are satisfied by arithmetic carried out on computer systems that satisfy the IEEE standard, except when overflow occurs.

Example: Computers do not know the functions cos(x), sin(x), and e^x for general values of x. Approximations of their values are computed by evaluating rational functions r(x) = p(x)/q(x), where p and q are suitable polynomials. Note that the evaluation of r(x) requires only the four basic arithmetic operations.

Often the use of floating-point numbers and floating-point arithmetic does not affect the outcome significantly when compared with applying exact arithmetic to the corresponding real numbers. However, some care should be taken in the design of numerical methods to maintain high accuracy.

We conclude with definitions of the absolute and relative errors in an available approximation x̃ of an error-free real quantity x:

    absolute error in x̃ = |x − x̃|                     (6)

and

    relative error in x̃ = |x − x̃| / |x|.              (7)

Exercises

Exercise 3.5: Give an example in which the properties (3) and (4) are violated.

Exercise 3.6: What can we say about the property (5) when (4) is violated?
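The definitions (6) and (7) are easy to experiment with in Octave or Matlab. A minimal sketch follows; the choice of π and the five-digit approximation 3.14159 is purely illustrative:

    % Absolute and relative errors, definitions (6) and (7).
    x  = pi;                        % error-free quantity (up to rounding)
    xt = 3.14159;                   % available approximation of x
    abs_err = abs(x - xt)           % about 2.6536e-06
    rel_err = abs(x - xt)/abs(x)    % about 8.4466e-07
    % For comparison, rounding a real number to a double keeps the
    % relative error below eps/2, about 1.1e-16, cf. (3) and (4).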

Exercise 3.7: Discuss the computation of f_1(x) = 1 − √(1 − x^2) for values of x > 0 close to zero. Assume that the computed value of the square-root function satisfies fl(√x) = √x(1 + ε), |ε| ≤ eps/2. Compare the results with those obtained by evaluating f_2(x) = 1 − √((1 − x)(1 + x)) and f_3(x) = x^2/(1 + √(1 − x^2)). (A warm-up sketch on a related function appears after Exercise 3.10.)

Exercise 3.8: The solutions of the quadratic equation

    x^2 + ax + b = 0

are given by

    x_{1,2} = (−a ± √(a^2 − 4b)) / 2.

Use this formula to determine the solutions when a = 100000000 and b = 1 in Octave or Matlab. Which root is determined accurately and which one is not? Here we measure the accuracy by the number of correct significant digits in the computed solution. Why is one of the roots determined to low accuracy? Let x_1 denote the computed zero with high accuracy. In this example, one can achieve higher accuracy in x_2 by using the relation x_1 x_2 = b. Explain how this can be done. What is the computed root determined in this manner?

Exercise 3.9: Consider a computer with 4 digits of precision, on which floating-point numbers x = d_1.d_2 d_3 d_4 d_5 ... · 10^e are represented by fl(x) = d_1.d_2 d_3 d_4 · 10^e. Here each d_j and e represent one decimal digit. (a) Determine the computed value of 100.0 + 0.001 on this computer. (b) The exact solution of the linear system of equations

    [ 0.001  100.0 ]       [ 100.0 ]
    [ 100.0  100.0 ] x  =  [   0.0 ]

is close to x = [−1, 1]^T. Solve this system by Gaussian elimination (without pivoting) on this computer. What is the answer? What is the error in the Euclidean norm?

Exercise 3.10: We would like to determine the value of the sum

    S = Σ_{j=1}^∞ 1/j

and therefore write an Octave or Matlab program to sum the terms of the series. Specifically, sum the first n terms in order to compute the partial sum

    S_n = 1 + 1/2 + 1/3 + ... + 1/n,

where n is chosen so that fl(S_n + 1/(n+1)) = fl(S_n). What is n? To determine whether the computed value S_n is an accurate approximation of S, compute the sum

    Ŝ_n = 1/n + 1/(n−1) + ... + 1/2 + 1,

where the summation is carried out in the order indicated. Are Ŝ_n and S_n the same? What can we conclude about S?
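The cancellation at issue in Exercise 3.7 can be previewed on a related function. The sketch below is illustrative only (g is not one of the functions f_1, f_2, f_3 above); it computes g(x) = √(1 + x) − 1 both naively and in the algebraically equivalent rationalized form x/(√(1 + x) + 1):

    % Subtracting nearly equal numbers destroys significant digits.
    x = 1e-15;
    g_naive  = sqrt(1 + x) - 1      % 4.440892098500626e-16: barely 1 digit correct
    g_stable = x/(sqrt(1 + x) + 1)  % 4.999999999999999e-16: nearly full accuracy
    % The exact value is g(x) = x/2 - x^2/8 + ..., about 5.0e-16 here.

The rationalized form avoids subtracting two numbers close to 1, which is the same remedy that applies in Exercise 3.7.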

Exercise 3.11: Generate 100 normally distributed random numbers with the Octave/Matlab function randn. Sum them in the order generated to yield s_1. Also sum the positive numbers and the negative numbers separately, and then add these sums to give s_2. Are s_1 and s_2 the same? If not, explain why this is the case. Which ordering is better? Are there other, better orderings of the numbers? Run numerical experiments with several sets of 100 normally distributed random numbers. Justify your answers.
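A possible starting point for the experiments in Exercise 3.11 is sketched below; the variable names s1 and s2 mirror the text, and everything else is illustrative:

    % Compare two summation orders for the same 100 random numbers.
    x  = randn(100, 1);                  % 100 normally distributed numbers
    s1 = sum(x);                         % sum in the order generated
    s2 = sum(x(x > 0)) + sum(x(x < 0));  % positives and negatives separately
    fprintf('s1 = %.17g\ns2 = %.17g\n', s1, s2);
    s1 == s2                             % often false: the two orders incur
                                         % different sequences of roundings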