Lecture Notes: Floating-Point Numbers

CS227 Scientific Computing, September 8, 2010

What this Lecture is About

How computers represent numbers, and how this affects the accuracy of computation.

Positional Number Systems

What do we mean when we write a number like 23708? This is shorthand for 2 × 10^4 + 3 × 10^3 + 7 × 10^2 + 0 × 10^1 + 8 × 10^0. Each digit has a value, but that value is weighted by the position the digit is in. The weight associated with each position is a power of ten, so this is a radix-ten, or base-ten, positional number system. Every positive integer has a unique representation in this scheme, and we need only a fixed repertory of ten digits to represent arbitrarily large numbers. Observe that the system practically forces you to have a symbol for zero!
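The weighted-sum idea can be checked mechanically. The notes use MATLAB, but here is a small Python sketch (the helper name `positional_value` is ours):

```python
def positional_value(digits, radix=10):
    """Expand a digit sequence (most significant first) into a weighted
    sum of powers of the radix."""
    value = 0
    for d in digits:
        value = value * radix + d   # Horner form of d_k*r^k + ... + d_0
    return value

# 23708 = 2*10^4 + 3*10^3 + 7*10^2 + 0*10^1 + 8*10^0
print(positional_value([2, 3, 7, 0, 8]))  # 23708
```

The Horner form avoids computing the powers of the radix explicitly.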

Positional Number Systems

This system originated in India between the 4th and 6th centuries AD, then spread to the Arab world in the 8th and 9th centuries, and to Europe as early as the 10th century. (This explains the terms "Hindu-Arabic numerals" and "Arabic numerals".) A positional number system with sixty as its radix was developed by the Babylonians as early as 3000 BC. The Maya of Central America used a radix-twenty positional number system dating from the first century BC. Computers use a radix-two, or binary, system.

Integers in Binary

There are only two digits, 0 and 1. These are called bits: "bit" is a contraction of "binary digit". For instance, 110100101 is the binary representation of 2^8 + 2^7 + 2^5 + 2^2 + 2^0, which is four hundred twenty-one. Sometimes, to keep the radices straight, you will see the equation written as 110100101_two = 421_ten, or something similar. Everything in a computer's memory (numbers of many different kinds, text, images, audio, program instructions) is represented as sequences of bits. Everything.
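A quick sanity check of this expansion, in Python (used here only because its literal syntax makes base conversion easy):

```python
# 110100101 in base two equals 2^8 + 2^7 + 2^5 + 2^2 + 2^0 = 421
assert int('110100101', 2) == 2**8 + 2**7 + 2**5 + 2**2 + 2**0 == 421
print(bin(421))  # 0b110100101
```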

Hexadecimal Notation

It is hard to read something like 110100101 at a glance. Hexadecimal notation is "binary for humans," designed to solve this problem. The idea is to break the bits into groups of four, beginning at the rightmost bit: 1 1010 0101. Each group represents an integer between 0 and fifteen, which we denote by a single digit, using A, B, C, D, E, F for the values ten through fifteen. So for the number above we get 1A5_hex = 421_ten. It's really just radix sixteen!
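The four-bit grouping can be verified in Python, whose numeric literals allow digit grouping with underscores:

```python
# Group the bits in fours from the right: 1 1010 0101 -> digits 1, A, 5
assert 0b1_1010_0101 == 0x1A5 == 421
print(hex(421))  # 0x1a5
```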

But that is just part of the story, and not how MATLAB usually represents numbers!

Scientific Notation

You all know this.

Distance from Earth to Sun: 1.496 × 10^8 kilometers
Gravitational constant: 6.673 × 10^-11 m^3 kg^-1 sec^-2
U.S. balance of trade deficit: 3.801 × 10^9 dollars

Roundoff and Precision

Each number is rounded to show just four significant decimal digits, or four decimal digits of precision. Another way to look at precision is through the relative error

|x̂ − x| / |x|,

where x is the value we are approximating and x̂ is the rounded approximation. For example, a more accurate measurement of the mean distance to the sun is 149597871 km, so the relative error is

(149600000 − 149597871) / 149597871 ≈ 1.42 × 10^-5.

(The factor 10^-5 shows the rounded value is actually accurate to about five digits of precision.) Note that relative error is not affected by the choice of units.
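The relative-error computation above, carried out in Python (the variable names are ours):

```python
x_true = 149_597_871    # more accurate mean Earth-Sun distance, in km
x_hat  = 149_600_000    # the four-digit rounded value, 1.496e8
rel_err = abs(x_hat - x_true) / abs(x_true)
print(rel_err)  # about 1.42e-5
```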

Calculations with Fixed Precision

Suppose you perform each operation between two values as accurately as possible, but then round off the result to a fixed precision, say two decimal digits of precision. You get some anomalous results, for instance

(1/3 + 1/3) - 2/3 = (0.33 + 0.33) - 0.67 = 0.66 - 0.67 = -0.01
(1.4 + 0.46) + 0.06 = 1.9 + 0.06 = 2.0.

Note the loss of precision: the second result is NOT equal to the true sum (1.92) correctly rounded to two digits of precision. On the other hand,

1.4 + (0.46 + 0.06) = 1.4 + 0.52 = 1.9,

which is correct.
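A toy two-digit arithmetic like this can be simulated in Python by rounding after every operation (`round2` is our helper; real machines do the same thing, but in binary):

```python
def round2(x):
    """Round x to two significant decimal digits."""
    return float(f"{x:.1e}")   # one digit after the point in scientific notation

# Round after every operation, as in the examples above:
a = round2(round2(1.4 + 0.46) + 0.06)   # (1.4 + 0.46) + 0.06
b = round2(1.4 + round2(0.46 + 0.06))   # 1.4 + (0.46 + 0.06)
print(a, b)  # 2.0 1.9 -- the grouping changes the answer
```

This shows directly that rounded addition is not associative.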

IEEE Double-Precision Floating-Point Representation

MATLAB uses this by-now near-universal standard to represent numbers in a kind of binary version of scientific notation. To see how this works, let's return to our earlier example of four hundred twenty-one. This is

110100101_two = 1.10100101_two × 2^8.

The idea is then to use 64 bits to represent the number. The leftmost bit represents the sign (0 for positive, 1 for negative). The next eleven bits represent the exponent (in this case 8), and the remaining 52 bits represent the fractional piece .10100101... Note that there is no point in representing the 1 preceding the radix point. (We can't call it a decimal point!)

IEEE Double-Precision Floating-Point Representation

Here is the result, illustrated with MATLAB:

>> num = 421;
>> format hex
>> num
num = 407a500000000000

If we dissect the resulting hexadecimal value and write it in binary, we see the three pieces:

0 (sign) | 10000000111 (exponent) | 1010010100000...0 (mantissa, 52 bits)

Observe that the representation of the exponent is a bit strange: it's not obvious how this represents 8.
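The mystery resolves once you know that IEEE 754 stores the exponent in biased form: the 11-bit field holds the true exponent plus 1023, so 8 is stored as 1031 = 10000000111_two. A Python dissection of the same bit pattern (the field variable names are ours):

```python
import struct

bits = int.from_bytes(struct.pack('>d', 421.0), 'big')  # raw 64-bit pattern
sign     = bits >> 63                 # 1 bit
exponent = (bits >> 52) & 0x7FF       # 11 bits, biased by 1023
mantissa = bits & ((1 << 52) - 1)     # 52 fraction bits

print(f"{bits:016x}")             # 407a500000000000, matching format hex
print(sign, exponent - 1023)      # 0 8
```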

Another Example

What is one-fifth in binary? Observe that 16 × 1/5 = 3 + 1/5, so shifting the radix point four bits to the right should be the same as adding 3. This gives

0.001100110011..._two = 1.100110011001..._two × 2^-3.

The mantissa .100110011001... should be 999... in hexadecimal, but it has to be rounded after 52 bits, which can change the final bits.

>> num=-1/5
num = bfc999999999999a

We parse this as

1 (sign) | 01111111100 (exponent) | 1001100110011001...10011010 (mantissa)
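Python exposes the same double-precision pattern, so the dissection can be checked directly (`struct` gives the raw bytes; `float.hex` shows the hexadecimal mantissa and the exponent):

```python
import struct

print(struct.pack('>d', -1/5).hex())  # bfc999999999999a, as in MATLAB
print((1/5).hex())                    # 0x1.999999999999ap-3
```

Note the final `a`: the infinite string of 9s has been rounded at the 52nd bit.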

Special Values

If you start with a positive value x and keep doubling, you will eventually exceed the largest representable value (overflow). The result will appear as Inf. If instead you keep dividing by 2, you will eventually get a result smaller than the smallest representable positive number (underflow). The result will appear as 0. You should note that in the scheme described above, 0 is not actually representable, since it cannot be written in the form 1.xxxx × 2^exp. However, the bit pattern 00...0 is reserved to represent this value. If you try, say, to evaluate 0/0, you will not get an error message. Instead the result will be represented by a special bit pattern, which appears as NaN ("Not a Number"). A measure of the precision of the system is given by the smallest positive number x such that 1 + x gives a different value than 1. This is called machine epsilon. Note that this is NOT the same thing as the smallest representable positive number.
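These special values behave the same way in any IEEE 754 implementation. A Python sketch (Python raises an exception on a literal 0/0, so we produce NaN as inf − inf instead):

```python
import math, sys

x = 1.0
while x * 2 != math.inf:   # keep doubling until the next step would overflow
    x *= 2
print(x)                   # the largest representable power of two, 2^1023

nan = math.inf - math.inf  # an indeterminate form yields NaN, not an error
print(nan != nan)          # True: NaN compares unequal even to itself

eps = sys.float_info.epsilon                  # machine epsilon
print(eps == 2.0**-52)                        # True
print(1.0 + eps > 1.0, 1.0 + eps/2 == 1.0)    # True True
```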

In the IEEE standard, the exponent can vary between -2^10 and 2^10 - 1, meaning that the maximum representable value is on the order of 2^1000 ≈ 10^300, and the minimum representable positive value about 10^-300. (But there's a catch here; see Assignment 2.) Machine epsilon, on the other hand, is 2^-52 ≈ 2 × 10^-16. If you add two numbers and one is more than 10^16 times the other, the smaller number will have no effect on the sum. See the floatgui application in Moler's book for a look at the floating-point system with toy parameters (small precision and a small range on the exponent).
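Python's `sys.float_info` reports the actual limits of the same IEEE doubles MATLAB uses (the "catch" alluded to above is likely that even smaller subnormal values exist below `float_info.min`; that is our reading, not the notes'):

```python
import sys

print(sys.float_info.max)       # about 1.8e308: the overflow threshold
print(sys.float_info.min)       # about 2.2e-308: smallest normalized positive
print(sys.float_info.epsilon)   # 2^-52, about 2.2e-16

# If one addend is more than ~1e16 times the other, the small one is absorbed:
print(1e16 + 1 == 1e16)         # True
```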

Loss of Precision

When MATLAB performs an arithmetic operation on two floating-point numbers, it computes the result precisely, then rounds it to the nearest value representable with the given precision. This leads to some anomalous results. For instance, try evaluating 1/10 + 1/10 + 1/10 - 3/10 in MATLAB. After many iterations, precision can erode. One particular situation to watch out for is the addition of two quantities of very different magnitudes: most of the bits of precision of the smaller value will be lost. Another is catastrophic cancellation, when you take the difference of two quantities of nearly equal value.
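The same experiment in Python, which uses identical IEEE arithmetic:

```python
r = 1/10 + 1/10 + 1/10 - 3/10
print(r)        # a tiny nonzero residue of order 1e-17, not 0
print(r == 0)   # False
```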

Example of Catastrophic Cancellation

Let's plot the polynomial (x - 1)^8 in an interval about 1:

>> ezplot(@(x)(x-1).^8,[0.99,1.01])

This is what you'd expect.

Example of Catastrophic Cancellation

Suppose you expanded the polynomial and wrote it in the form

x^8 - 8x^7 + 28x^6 - 56x^5 + 70x^4 - 56x^3 + 28x^2 - 8x + 1

>> f=@(x)(x.^8-8*x.^7+28*x.^6-56*x.^5+70*x.^4-56*x.^3+28*x.^2-8*x+1);
>> ezplot(f,[0.99,1.01])

The wild oscillation in the graph is due entirely to magnification of roundoff error when we find the difference between two nearly equal large quantities.
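The effect behind the plots can be measured without graphics. A Python sketch evaluating both forms on a grid near 1 (the variable names are ours):

```python
def direct(x):
    return (x - 1)**8

def expanded(x):
    return (x**8 - 8*x**7 + 28*x**6 - 56*x**5 + 70*x**4
            - 56*x**3 + 28*x**2 - 8*x + 1)

# On [0.99, 1.01] the true value never exceeds (0.01)^8 = 1e-16, but the
# expanded form adds and subtracts terms of size up to 70, so its roundoff
# noise swamps the values being computed.
xs = [0.99 + k * 0.001 for k in range(21)]
worst = max(abs(expanded(x) - direct(x)) for x in xs)
print(worst)
```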