Real Numbers

range -- need to deal with a larger range than integers provide
[not just -2^(n-1) to +2^(n-1)-1]:
- large numbers
- small numbers (fractions)

Real Numbers

von Neumann recommended using fixed point, with the programmer doing the
scaling:

  fixed point = integer part <assumed binary point> fraction part

use positional representation to convert a binary fraction to a decimal
value:

  .d_-1 d_-2 d_-3 ... = sum over i [ d_i * 2^i ]

Real Numbers

use positional representation to convert a binary fraction to a decimal
value:

  .d_-1 d_-2 d_-3 ... = sum over i [ d_i * 2^i ]

  .011 (base 2) = 0 * 2^(-1) + 1 * 2^(-2) + 1 * 2^(-3)
                = 0 * 1/2 + 1 * 1/4 + 1 * 1/8
                = 0 * 0.5 + 1 * 0.25 + 1 * 0.125
                = 0 + 0.25 + 0.125
                = 0.375

Real Numbers

converting a decimal fraction to a binary fraction: repeatedly multiply the
decimal fraction by two, taking the integer parts (0 or 1) as the bits of the
binary fraction from left to right; stop when the fraction is zero

e.g., 0.375 (base 10):

  0.375 x 2 = 0.75
  0.75  x 2 = 1.5         integer parts are 0, 1, 1,
  0.5   x 2 = 1.0  ^stop  so the binary fraction is .011
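
both conversion procedures are easy to mechanize; here is a minimal C sketch
(the helper names are mine, not from the lecture):

#include <stdio.h>

/* evaluate a binary fraction string, e.g. "011" -> 0.375 */
double bin_frac_to_dec(const char *bits)
{
    double v = 0.0, w = 0.5;               /* w runs 2^(-1), 2^(-2), ... */
    for (; *bits; bits++, w /= 2)
        if (*bits == '1')
            v += w;
    return v;
}

/* print the binary expansion of 0 < x < 1 by repeated doubling;
   note x is a double, so it is itself already rounded */
void dec_frac_to_bin(double x, int max_bits)
{
    printf(".");
    for (int i = 0; i < max_bits && x != 0.0; i++) {
        x *= 2.0;
        int bit = (x >= 1.0);              /* integer part is the next bit */
        printf("%d", bit);
        if (bit)
            x -= 1.0;                      /* keep only the fraction */
    }
    printf("\n");
}

int main(void)
{
    printf("%g\n", bin_frac_to_dec("011"));  /* 0.375 */
    dec_frac_to_bin(0.375, 32);              /* terminates: .011 */
    dec_frac_to_bin(0.1, 32);                /* nonterminating: .000110011... */
    return 0;
}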

Real Numbers

note that a terminating decimal fraction may be nonterminating when converted
to binary, e.g., 0.1 base 10 = 0.00011001100... base 2 (the 0011 block
repeats forever)

Floating Point

automatic scaling by hardware, like scientific notation; requires an exponent
field (i.e., the scale factor), a mantissa (i.e., 1.fraction), and one sign
bit

range is governed by the number of bits in the exponent
precision is governed by the number of bits in the mantissa

Floating Point

normal form - number represented as 1.fraction times an appropriate exponent

  s | exp | fraction      value = (-1)^s x 1.fraction x 2^exp

the exponent uses bias notation so that negative exponents have leading 0s
(the historical reason is that integer compare operations could then be used
on this type of floating-point representation)
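
to see the integer-compare property on real hardware, a small C sketch (using
memcpy to view a float's bits; as stated, the property holds for non-negative
values):

#include <stdio.h>
#include <string.h>

/* view a float's bits as an unsigned int (memcpy avoids aliasing issues) */
static unsigned bits_of(float f)
{
    unsigned u;
    memcpy(&u, &f, sizeof u);
    return u;
}

int main(void)
{
    float a = 0.5f, b = 1.0f, c = 2.0f;

    /* biased exponents are ordered like the values themselves, so for
       non-negative floats an integer compare of the bit patterns agrees
       with a floating-point compare */
    printf("0x%08x 0x%08x 0x%08x\n", bits_of(a), bits_of(b), bits_of(c));
    printf("int compare: %d, float compare: %d\n",
           bits_of(a) < bits_of(b), a < b);
    return 0;
}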

Floating Point

consider a 2-bit exponent using bias 2 (binary 10):

  00 exp means zero
  01 exp means 2^(01 base 2 - bias 10 base 2) = 2^(-1) = 1/2
  10 exp means 2^(10 base 2 - bias 10 base 2) = 2^0 = 1
  11 exp means 2^(11 base 2 - bias 10 base 2) = 2^1 = 2

and a 2-bit fraction with the leading 1. hidden; for sign=0 (positive
numbers):

  0 00 00 = 0.0
  0 01 00 = 1.00 (base 2) x 2^(01-10 base 2) = 1.00 x 2^(-1) = 0.500
  0 01 01 = 1.01 (base 2) x 2^(01-10 base 2) = 1.25 x 2^(-1) = 0.625
  0 01 10 = 1.10 (base 2) x 2^(01-10 base 2) = 1.50 x 2^(-1) = 0.750
  0 01 11 = 1.11 (base 2) x 2^(01-10 base 2) = 1.75 x 2^(-1) = 0.875

Floating Point

for sign=0 (positive numbers), continued:

  0 10 00 = 1.00 (base 2) x 2^(10-10 base 2) = 1.00 x 2^0 = 1.00
  0 10 01 = 1.01 (base 2) x 2^(10-10 base 2) = 1.25 x 2^0 = 1.25
  0 10 10 = 1.10 (base 2) x 2^(10-10 base 2) = 1.50 x 2^0 = 1.50
  0 10 11 = 1.11 (base 2) x 2^(10-10 base 2) = 1.75 x 2^0 = 1.75
  0 11 00 = 1.00 (base 2) x 2^(11-10 base 2) = 1.00 x 2^1 = 2.0
  0 11 01 = 1.01 (base 2) x 2^(11-10 base 2) = 1.25 x 2^1 = 2.5
  0 11 10 = 1.10 (base 2) x 2^(11-10 base 2) = 1.50 x 2^1 = 3.0
  0 11 11 = 1.11 (base 2) x 2^(11-10 base 2) = 1.75 x 2^1 = 3.5

Floating Point

representing the preceding points on the number line:

[number line: .5 .625 .75 .875 (spacing 0.125, under 2^(-1)), then
 1 1.25 1.5 1.75 (spacing 0.25, under 2^0), then
 2 2.5 3 3.5 (spacing 0.5, under 2^1)]

notice the geometric increase in spacing between representable numbers
(0.125, then 0.25, then 0.5)

Floating Point

precision - how many digits are used to represent a number
accuracy - how correct the number is; e.g., 4.56789 is a number with
six-digit precision but rather inaccurate as an approximation to pi, while 3
is less precise but more accurate
rounding - choose the nearest representable neighbor

Floating Point

while the absolute error in representation increases as the numbers get
larger, the relative error remains the same; e.g., consider representing
numbers one fourth of the distance between the "pickets" in the "picket
fence" on the previous slide:

  A = 1.0625 will be represented by 1.0, so
      absolute error = 0.0625, relative error = 0.0625/1.0625 = 0.0588
  B = 2.125 will be represented by 2.0, so
      absolute error = 0.125, relative error = 0.125/2.125 = 0.0588

Floating Point

the maximum relative error in nonzero representations occurs for a number
that is halfway between pickets; thus, for a binary mantissa, the relative
error is bounded by 2^-(number of bits in fraction + 1)

for a number that gets truncated or rounded to zero, the maximum relative
error can reach a factor of 1, e.g., if 0.1 is represented as 0

denormal numbers are used for gradual underflow rather than just truncating
values less than 2^(minimum exponent) to zero

Floating Point

in the previous example, for values like 0.1, use the 00 exp for the smallest
exp value (2^(-1)), but with no hidden bit; thus, assign
0 00 01 = 0.125, 0 00 10 = 0.250, and 0 00 11 = 0.375

[number line as before, with the denormals .125, .25, and .375 filling the
gap between 0 and .5]
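
IEEE single precision does the same thing below its smallest normalized
number, 2^(-126); a small demonstration (assuming the hardware does not flush
denormals to zero):

#include <stdio.h>
#include <float.h>

int main(void)
{
    float x = FLT_MIN;             /* smallest normalized single: 2^(-126) */

    for (int i = 0; i < 5; i++) {
        printf("%g\n", x);
        x /= 2;                    /* gradual underflow: each halving below
                                      FLT_MIN drops one bit of precision */
    }
    /* the smallest denormal is 2^(-149) (FLT_TRUE_MIN in C11) */
    return 0;
}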

Floating Point

a five-bit floating-point format with denormals yields 32 patterns:

  0 00 00 = zero      1 00 00 = -zero
  0 00 01 = 0.125     1 00 01 = -0.125
  0 00 10 = 0.25      1 00 10 = -0.25
  0 00 11 = 0.375     1 00 11 = -0.375
  0 01 00 = 0.5       1 01 00 = -0.5
  0 01 01 = 0.625     1 01 01 = -0.625
  0 01 10 = 0.75      1 01 10 = -0.75
  0 01 11 = 0.875     1 01 11 = -0.875
  0 10 00 = 1.0       1 10 00 = -1.0
  0 10 01 = 1.25      1 10 01 = -1.25
  0 10 10 = 1.5       1 10 10 = -1.5
  0 10 11 = 1.75      1 10 11 = -1.75
  0 11 00 = 2.0       1 11 00 = -2.0
  0 11 01 = 2.5       1 11 01 = -2.5
  0 11 10 = 3.0       1 11 10 = -3.0
  0 11 11 = 3.5       1 11 11 = -3.5

the relative error bound of 2^-(number of bits in fraction + 1) = 2^(-3) =
0.125 holds down to the smallest normalized number, 2^(min exp) = 2^(-1) = 0.5

Floating Point

this table also shows the encoding of the 2-bit exponent and 2-bit fraction
(bit layout: bit 4 = s, bits 3-2 = exp, bits 1-0 = frac; s = 0 shown):

                      fraction
  exp      00      01      10      11      scale
  00       0.0     0.125   0.25    0.375   2^(-1) (denormal)
  01       0.5     0.625   0.75    0.875   2^(-1)
  10       1.0     1.25    1.5     1.75    2^0
  11       2.0     2.5     3.0     3.5     2^(+1)
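
the whole table can be regenerated mechanically; a minimal C sketch of a
decoder for this toy format (names and layout are mine, assuming the bias-2,
denormal-at-exp-00 rules above):

#include <stdio.h>

/* decode every pattern of the toy 5-bit format: 1 sign bit, 2-bit exponent
   (bias 2, exp 00 = denormal with scale 2^(-1)), 2-bit fraction */
int main(void)
{
    for (unsigned pat = 0; pat < 32; pat++) {
        unsigned s    = (pat >> 4) & 1;
        unsigned exp  = (pat >> 2) & 3;
        unsigned frac =  pat       & 3;
        double v;

        if (exp == 0)
            v = (frac / 4.0) * 0.5;                     /* 0.frac x 2^(-1) */
        else
            v = (1.0 + frac / 4.0) * (1 << exp) / 4.0;  /* 1.frac x 2^(exp-2) */
        if (s)
            v = -v;
        printf("%u %u%u %u%u = %g\n",
               s, (exp >> 1) & 1, exp & 1, (frac >> 1) & 1, frac & 1, v);
    }
    return 0;
}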

Floating Point Operations

addition -- shift the smaller number right until the exponents are equal,
then add:

    1.5   = 0 10 10 =   1.10 x 2^0    =   1.10 x 2^0
  + 0.875 = 0 01 11 = + 1.11 x 2^(-1) = + 0.11 x 2^0
    2.375                                10.01 x 2^0 = 1.00 x 2^1
                                                     = 2.0 = 0 11 00

note the loss of bits when shifting to match exponents and when truncating to
fit into the 5-bit format (absolute error of 0.375, relative error of 15.8%)
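
the align-add-renormalize steps can be sketched with integer significands
(3 bits: hidden bit + 2 fraction bits); this hypothetical toy_add handles
positive values only and truncates, reproducing the error above:

#include <stdio.h>

/* toy addition of two positive values of the 5-bit format: sig is the
   3-bit significand (hidden bit + 2 fraction bits, value sig/4) and
   exp is the biased 2-bit exponent -- names and layout are mine */
static void toy_add(unsigned ea, unsigned sa, unsigned eb, unsigned sb)
{
    unsigned t;

    if (ea < eb) {                  /* make a the larger-exponent operand */
        t = ea; ea = eb; eb = t;
        t = sa; sa = sb; sb = t;
    }
    sb >>= ea - eb;                 /* align: shifted-out bits are simply lost */
    sa += sb;
    if (sa > 7) {                   /* carry out: renormalize, losing another bit */
        sa >>= 1;
        ea++;
    }
    printf("sig %u, exp %u, value %g\n", sa, ea, (sa / 4.0) * (1 << ea) / 4.0);
}

int main(void)
{
    toy_add(2, 6, 1, 7);            /* 1.5 + 0.875: prints value 2, not 2.375 */
    return 0;
}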

Floating Point Operations

error magnification in the case of effective subtraction of two approximately
equal numbers [that is, a-b or a+(-b), where a and b are approximately
equal]; this is called catastrophic cancellation: the leading (most
significant) bits cancel out, and all that is left are the trailing (least
significant) bits, which are most affected by representation errors

    2.00 = 0 11 00 =   1.00 x 2^1 =   1.00 x 2^1
  - 1.75 = 0 10 11 = - 1.11 x 2^0 = - 0.11 x 2^1
    0.25                              0.01 x 2^1 = 1.00 x 2^(-1)
                                                 = 0.5 = 0 01 00

(absolute error of 0.25, relative error of 100%)

Floating Point Operations

to maintain accuracy in effective subtraction, most FP hardware adds three
bits: a guard bit, a round bit, and a sticky bit

e.g., the error in the previous subtraction is eliminated with the addition
of the guard bit:

    2.00 = 0 11 00 =   1.00 x 2^1 =   1.000 x 2^1
  - 1.75 = 0 10 11 = - 1.11 x 2^0 = - 0.111 x 2^1
    0.25                              0.001 x 2^1 = 0.10 x 2^(-1)
                                                  = 0.25 = 0 00 10

(these extra bits would also help in the first addition above to produce a
result of 2.5, which would then have less absolute error, 0.125, and less
relative error, 5.3%)

Floating Point Operations

the extra bits also help to implement round to even:

  fraction bits    g r s
  x x ... x x x    0 0 1  \
  x x ... x x x    0 1 1  /  round down (truncate)
  x x ... x x 0    1 0 0     (0 100: tie, round to even -> truncate)
  x x ... x x 1    1 0 0     (1 100: tie, round to even -> add one to
                              the last bit of the fraction)
  x x ... x x x    1 0 1  \
  x x ... x x x    1 1 1  /  round up

Floating Point Operations

in summary, to round to the nearest even using g, r, and s ('x' means any bit
value):

  0xx - round down = do nothing
  100 - a tie: round up if the mantissa's lsb is 1, else round down = do nothing
  101 - round up
  110 - round up
  111 - round up
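
as code, the decision reduces to a few tests; a sketch (the function name is
mine):

#include <stdio.h>

/* round-to-nearest-even decision from the fraction's lsb and the g, r, s
   bits -- a direct transcription of the summary above */
int round_up(int lsb, int g, int r, int s)
{
    if (!g)
        return 0;          /* 0xx: below the halfway point, round down */
    if (r || s)
        return 1;          /* 101, 110, 111: above halfway, round up */
    return lsb;            /* 100: exact tie, round toward the even lsb */
}

int main(void)
{
    printf("%d\n", round_up(0, 1, 0, 0));   /* tie, lsb 0: truncate -> 0 */
    printf("%d\n", round_up(1, 1, 0, 0));   /* tie, lsb 1: round up -> 1 */
    printf("%d\n", round_up(1, 0, 1, 1));   /* below half: truncate -> 0 */
    return 0;
}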

Floating Point Operations

multiplication -- multiply the 1.fraction parts, add the exponents:

    1.5 = 0 10 10 =   1.10 x 2^0
  x 1.5 = 0 10 10 = x 1.10 x 2^0
    2.25             10.01 x 2^0 = 1.00 x 2^1 = 2.0 = 0 11 00 (truncating)

or, using an extra bit and rounding up:

    10.01 x 2^0 = 1.001 x 2^1 -> 1.01 x 2^1 = 2.5 = 0 11 01
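
the same toy-format approach sketches multiplication: multiply the 3-bit
significands as integers, add the true exponents, and renormalize (positive
normalized inputs assumed; truncation shown, with the rounding alternative
noted in a comment):

#include <stdio.h>
#include <math.h>

/* toy multiplication of two positive normalized values of the 5-bit format:
   3-bit significands multiply to a product p with value p/16 in [1, 3.0625],
   and the true (unbiased) exponents add */
static void toy_mul(unsigned ea, unsigned sa, unsigned eb, unsigned sb)
{
    unsigned p = sa * sb;                /* e.g. 6 x 6 = 36, i.e. 2.25 */
    int e = ((int) ea - 2) + ((int) eb - 2);
    unsigned sig;

    if (p >= 32) {                       /* product >= 2: renormalize */
        e++;
        sig = p >> 3;                    /* the bits shifted out (100 here)
                                            are the g, r, s bits; rounding
                                            up would give 2.5 instead */
    } else {
        sig = p >> 2;
    }
    printf("value %g\n", ldexp(sig / 4.0, e));
}

int main(void)
{
    toy_mul(2, 6, 2, 6);                 /* 1.5 x 1.5: prints 2 (truncated) */
    return 0;
}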

IEEE Standards

the first hardware floating-point support was on the IBM 704 in the 1950s;
each manufacturer had its own FP format until standardization by the IEEE in
the 1980s

IEEE formats:
  single precision - 32-bit format = 1-bit sign, 8-bit exp, 23-bit fraction
  double precision - 64-bit format = 1-bit sign, 11-bit exp, 52-bit fraction
  also extended precision - 80-bit format
  and quad precision - 128-bit format = 1-bit sign, 15-bit exp, 112-bit fraction
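
these widths can be checked from C's <float.h>; a small sketch (note that C's
exponent macros are defined one above the IEEE exponent bounds, hence the
- 1):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* mantissa digits count the hidden bit: 24 = 1 + 23, 53 = 1 + 52 */
    printf("float:  %d significand bits, exponents %d..%d\n",
           FLT_MANT_DIG, FLT_MIN_EXP - 1, FLT_MAX_EXP - 1);
    printf("double: %d significand bits, exponents %d..%d\n",
           DBL_MANT_DIG, DBL_MIN_EXP - 1, DBL_MAX_EXP - 1);
    return 0;
}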

IEEE Standards

special codes for:
  NaN - Not a Number (propagates itself through any operation)
  infinity (also propagates in most cases)
  denormal numbers

IEEE Standards

IEEE single-precision floating-point encoding (exponent is bias-127):

                          23-bit fraction
  8-bit exponent   000...00    000...01 ... 111...11
  000...00         0           2^(-149) ... almost 2^(-126)   (zero and denormals)
  000...01         2^(-126)    ...          almost 2^(-125)
  ...              ...         ...          ...
  011...10         0.5         ...          ...
  011...11         1.0         ...          ...
  100...00         2.0         ...          ...
  ...              ...         ...          ...
  111...10         2^(+127)    ...          almost 2^(+128)
  111...11         inf         NaN      ... NaN               (infinity and NaNs)

"almost" means that instead of exactly two, the significand = 2 - 2^(-23)
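
as a sketch of reading this table mechanically, the following hypothetical
decode helper (assuming a 32-bit unsigned int) classifies a bit pattern and
prints its value:

#include <stdio.h>
#include <math.h>

/* classify and print a single-precision bit pattern per the table above */
void decode(unsigned a)
{
    unsigned sign = a >> 31, exp = (a >> 23) & 0xff, frac = a & 0x7fffff;
    double v;

    if (exp == 0xff) {                       /* top exponent: inf or NaN */
        printf("0x%08x: %s%s\n", a, sign ? "-" : "", frac ? "NaN" : "inf");
        return;
    }
    if (exp == 0)                            /* zero or denormal */
        v = ldexp((double) frac, -149);      /* 0.frac x 2^(-126) */
    else                                     /* normal: 1.frac x 2^(exp-127) */
        v = ldexp((double)(frac | 0x800000), (int) exp - 150);
    printf("0x%08x: %g\n", a, sign ? -v : v);
}

int main(void)
{
    decode(0x00000001);     /* smallest denormal, 2^(-149) */
    decode(0x00800000);     /* smallest normal, 2^(-126) */
    decode(0x3f800000);     /* 1.0 */
    decode(0x7f800000);     /* infinity */
    return 0;
}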

Examples

consider this C program:

#include <stdio.h>

int main(void)
{
    float a = 2.5, b = 0.1, c;
    int *pa = (int *)&a, *pb = (int *)&b, *pc = (int *)&c;

    c = a + b;
    printf("a = %f (0x%x)\n", a, *pa);
    printf("b = %f (0x%x)\n", b, *pb);
    printf("c = %f (0x%x)\n", c, *pc);
    return 0;
}

it prints the float and hex values (printing the hex requires dereferencing
an int pointer):

a = 2.500000 (0x40200000)
b = 0.100000 (0x3dcccccd)
c = 2.600000 (0x40266666)

Examples

next, consider this C program:

#include <stdio.h>

int main(void)
{
    int i;
    float sum;

    sum = 0.0;
    for (i = 1; i <= 10000000; i++) {
        sum = sum + 1.0 / ((float) i);
    }
    printf("decreasing order: %f\n", sum);

    sum = 0.0;
    for (i = 10000000; i > 0; i--) {
        sum = sum + 1.0 / ((float) i);
    }
    printf("increasing order: %f\n", sum);
    return 0;
}

when run, it prints:

decreasing order: 15.403683
increasing order: 16.686031

can you explain why? what is the correct answer?

Examples

next, this program:

#include <stdio.h>

int main(void)
{
    double hd, xd, yd;
    float hs, xs, ys;
    int i;

    hs = 0.1;
    xs = 0.0;
    for (i = 0; i < 10; i++) {
        xs += hs;
    }
    ys = 1.0 - xs;

    hd = 0.1;
    xd = 0.0;
    for (i = 0; i < 10; i++) {
        xd += hd;
    }
    yd = 1.0 - xd;

    printf("left-overs: %g %g\n", ys, yd);
    return 0;
}

prints "left-overs: -1.19209e-07 1.11022e-16" -- note the difference in the
signs; can you explain why?

Examples

for the single precision case, ten adds of 0.1 give a sum slightly larger
than 1:

             32-bit SP rep     28-bit addition (carry, hidden, fraction, grs)
    1.0    = 0x3f800000          0x04000000
  - xs     = 0x3f800001        - 0x04000008
                               ------------
  absolute difference            0x00000008   (grs bits = 0x0)

ys = 0xb4000000 (normalize by 23 left shifts)
ys = -1.0 * 2^(-23) = -1/8,388,608 = -1.19209e-07

Examples

double sub(int a)
{
    register double x;
    x = a;
    return (x);
}

which, if you compile (only) with "armc -O2 -S", gives (edited) in the .s
file:

    .global __aeabi_i2d
    .text
    .align  2
    .global sub
    .type   sub, %function
sub:
    stmfd   sp!, {r3, lr}
    bl      __aeabi_i2d
    ldmfd   sp!, {r3, pc}

where double __aeabi_i2d(int) is a standard ARM function to convert an
integer to a double

Extracting and Printing Fields and Values

finally, to extract and print the fields and values of a float
representation:

#include <stdio.h>

/* display fields of a normalized, non-zero, single-precision float */
/* -- does not range test to verify input value assumptions */
void show_flt(unsigned int a)
{
    union {
        unsigned int i;
        float f;
    } v;
    unsigned int sign, exp, fraction, significand;
    int int_exp;

    sign = (a >> 31) & 1;                 /* sign is leftmost bit */
    exp = (a >> 23) & 0xff;               /* exp is next 8 bits */
    int_exp = ((int) exp) - 127;          /* exp is encoded bias-127 */
    fraction = a & 0x007fffff;            /* fraction is last 23 bits */
    significand = fraction | 0x00800000;  /* assume a normalized value, */
                                          /* so insert the hidden bit */
    v.i = a;
    printf("%4.1f, 0x%08x, ", v.f, a);
    printf("sign %x, hex exp %x, int exp %+d, hex significand %x\n",
           sign, exp, int_exp, significand);
}

Extracting and Printing Fields and Values (continued)

int main()
{
    union {
        float f;
        unsigned int i;
    } a;

    a.f = 10.0; show_flt(a.i);
    a.f = 2.0;  show_flt(a.i);
    a.f = 1.0;  show_flt(a.i);
    a.f = 0.5;  show_flt(a.i);
    a.f = 0.1;  show_flt(a.i);
    return 0;
}

note that C will promote (coerce) floats to doubles when passing to printf()
and other library routines, so you need to use a union (or an int pointer
dereference) to get at the bits
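
(an aside, not from the lecture: instead of a union or pointer dereference,
the bits can also be copied out with memcpy, which avoids type-punning
concerns in modern C)

#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 0.1f;
    unsigned int u;

    memcpy(&u, &f, sizeof u);       /* copy the bits; well-defined in C */
    printf("%f = 0x%08x\n", f, u);  /* 0.100000 = 0x3dcccccd */
    return 0;
}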