3.5 Floating Point: Overview

- Floating point (FP) numbers
- Scientific notation
    o Decimal scientific notation
    o Binary scientific notation
- IEEE 754 FP Standard
    o Floating point representation inside a computer
    o Greater range vs. precision
- Decimal to floating point conversion
- Type is not associated with data
- MIPS floating point instructions, registers

Computer Numbers
- Computers are made to deal with numbers
- What can we represent in n bits?
    o Unsigned integers: 0 to 2^n - 1
    o Signed integers: -2^(n-1) to 2^(n-1) - 1
- What about other numbers?
    o Very large numbers? (seconds/century) 3,155,760,000 (3.15576 x 10^9)
    o Very small numbers? (atomic diameter) on the order of 10^-8
    o Rationals (repeating pattern): 2/3 (0.666666...)
    o Irrationals: 2^(1/2) (1.414213...)
    o Transcendentals: e (2.718281...), π (3.141592...)
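The n-bit integer ranges quoted above can be checked with a minimal C sketch. The function names are hypothetical, chosen only for this illustration, and the signed ranges assume two's complement.

```c
#include <stdint.h>

/* Largest unsigned value that fits in n bits: 2^n - 1 (for 0 <= n <= 64). */
uint64_t unsigned_max(unsigned n) {
    return (n >= 64) ? UINT64_MAX : ((UINT64_C(1) << n) - 1);
}

/* Largest two's complement value in n bits: 2^(n-1) - 1 (for 1 <= n <= 64). */
int64_t signed_max(unsigned n) {
    return (int64_t)unsigned_max(n - 1);
}

/* Smallest two's complement value in n bits: -2^(n-1) (for 1 <= n <= 64). */
int64_t signed_min(unsigned n) {
    return -signed_max(n) - 1;
}
```

For n = 8 this gives 0 to 255 unsigned and -128 to 127 signed. None of these ranges can hold a fraction such as 2/3, which is what motivates floating point.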
Scientific Notation
- Example: 6.02 x 10^23
    o Mantissa: 6.02
    o Radix (base): 10
    o Exponent: 23
    o The "." in 6.02 is the decimal point
- Normalized form: no leading 0s (exactly one nonzero digit to the left of the decimal point)
- Alternatives to representing 1/1,000,000,000
    o Normalized: 1.0 x 10^-9
    o Not normalized: 0.1 x 10^-8, 10.0 x 10^-10

Binary Scientific Notation
- Example: 1.0two x 2^-1
    o Mantissa: 1.0two
    o Radix (base): 2
    o Exponent: -1
    o The "." in 1.0two is the binary point
- Floating point arithmetic: the binary point is not fixed (as it is for integers)
- Declare such a variable in C as float or double
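FP Decimal and FP Binary
- Decimal to binary: repeatedly multiply the fraction by 2; the integer part of each product is the next bit to the right of the binary point
- Binary to decimal: add up each bit times its power of 2 (bits to the right of the binary point have weights 2^-1, 2^-2, 2^-3, 2^-4, ...)

Floating Point Representation (single precision)
- Use a word (32 bits)
- Normal format: +/-1.xxxxxxxxxx two x 2^(yyyy two)

    Bit 31    Bits 30-23    Bits 22-0
    S         Exponent      Fraction
    1 bit     8 bits        23 bits

- S represents the sign, Exponent represents the y's, Fraction represents the x's
- Represents numbers as small as 2.0 x 10^-38 and as large as 2.0 x 10^38
- C variable declared as float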
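The decimal-to-binary rule (repeatedly multiplying the fraction) can be sketched in C as follows. This is an illustration only: `frac_to_binary` is a hypothetical helper name, and the input is assumed to satisfy 0 <= frac < 1.

```c
#include <stddef.h>

/* Writes up to max_bits binary digits of frac (0 <= frac < 1) into out,
   stopping early once the remaining fraction is exactly zero.
   out must have room for max_bits + 1 chars. Returns the number of bits. */
size_t frac_to_binary(double frac, size_t max_bits, char *out) {
    size_t n = 0;
    while (n < max_bits && frac > 0.0) {
        frac *= 2.0;                 /* shift one binary place to the left */
        if (frac >= 1.0) {           /* integer part is 1: emit a 1 bit */
            out[n++] = '1';
            frac -= 1.0;
        } else {                     /* integer part is 0: emit a 0 bit */
            out[n++] = '0';
        }
    }
    out[n] = '\0';
    return n;
}
```

For 0.625 the products are 1.25, 0.5, and 1.0, so the bits are 101, i.e. 0.625 = 0.101two. A value such as 0.1 never reaches zero, which is why the sketch takes a bit limit.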
Overflow and Underflow
- Overflow: result is too large in magnitude (> 2.0 x 10^38)
    o Exponent is larger than can be represented in the 8-bit Exponent field
- Underflow: result is too small in magnitude (> 0, < 2.0 x 10^-38)
    o Negative exponent is larger than can be represented in the 8-bit Exponent field
- How to reduce the chances of overflow or underflow?

Double Precision FP
- Use two words (64 bits)

    First word:    Bit 31    Bits 30-20    Bits 19-0
                   S         Exponent      Fraction
                   1 bit     11 bits       20 bits
    Second word:   Fraction (cont'd), 32 bits

- C variable declared as double
- Represents numbers almost as small as 2.0 x 10^-308 and almost as large as 2.0 x 10^308
- Primary advantage is greater accuracy (52-bit fraction)
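A minimal C sketch of overflow and underflow in single precision, and of how double precision avoids both for the same inputs. The function names are hypothetical, and the sketch assumes float and double are IEEE 754 single and double precision with the default rounding mode.

```c
#include <float.h>

/* x * x rounded to single precision. The volatile store forces the
   product into a real 32-bit float even if the compiler would otherwise
   keep extra precision in a register. */
float square_single(float x) {
    volatile float r = x * x;
    return r;
}

/* The same computation carried out in double precision. */
double square_double(double x) {
    volatile double r = x * x;
    return r;
}
```

With these definitions, `square_single(1e30f)` overflows to infinity (it compares greater than `FLT_MAX`) and `square_single(1e-30f)` underflows to 0, while `square_double` returns a finite, nonzero result for both 1e30 and 1e-30.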
IEEE 754 Exponent
- Want to use FP numbers even without FP hardware
    o e.g., sort records with FP numbers using integer compares
- Could break an FP number into 3 parts: compare signs, then compare exponents, then compare fractions
- Faster: a single comparison, ideally
    o Highest-order bit is the sign (negative < positive)
    o Exponent next, so bigger exponent => bigger number
    o Fraction last: if exponents are the same, bigger fraction => bigger number

Floating Point Representation
- Normalized scientific notation: +/-1.xxxx two x 2^(yyyy two)
- Single precision: S (1 bit, bit 31), Exponent (8 bits), Fraction (23 bits, down to bit 0)
- Double precision: S (1 bit), Exponent (11 bits), Fraction (20 bits), then Fraction (cont'd) (32 bits)
- Exponent: biased notation
    o Bias is 127 (SP), 1023 (DP)
- Fraction: sign-magnitude notation
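The field order (sign, then exponent, then fraction) is what lets an integer compare order FP values. A minimal sketch in C, assuming float is IEEE 754 single precision; the function names are hypothetical and only non-negative inputs are handled.

```c
#include <stdint.h>
#include <string.h>

/* Copy the 32 bits of a float into an integer without changing them. */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Compare two non-negative floats using only an integer compare.
   Returns -1, 0, or +1. Negative values and NaNs are NOT handled here. */
int compare_nonnegative(float a, float b) {
    uint32_t x = float_bits(a);
    uint32_t y = float_bits(b);
    return (x > y) - (x < y);
}
```

Because the representation is sign-magnitude, two negative numbers compare in the opposite order as raw integers, so a full comparison would check the sign bits first.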
IEEE 754 FP Standard
- Used in almost all computers (since 1980)
    o Eases porting of FP programs
    o Improves quality of FP computer arithmetic
- Sign bit: 1 means negative, 0 means positive
- Fraction / Significand:
    o Leading 1 is implicit for normalized numbers
    o Significand = 1 + Fraction (1 + 23 bits single, 1 + 52 bits double, i.e., 24 bits for single, 53 bits for double)
    o Always true: 0 <= Fraction < 1
- 0 has no leading 1
    o Reserve exponent value 0 just for the number 0
- Value: (-1)^S x (1 + Fraction) x 2^Exp

Biased Notation for Exponents
- Two's complement does not work for the exponent: a negative exponent would look like a large unsigned value and break integer compares
- Most negative exponent: 0000 0001two
- Most positive exponent: 1111 1110two
- Bias: number added to the real exponent
    o 127 for single precision
    o 1023 for double precision
- Value: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
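The value formula (-1)^S x (1 + Fraction) x 2^(Exponent - Bias) can be applied directly to a 32-bit pattern. The following C sketch covers normalized single-precision numbers only; `decode_single` and `pow2` are hypothetical helper names, not part of any library.

```c
#include <stdint.h>

/* 2^e as a double, for small |e|, using only multiplication and division. */
static double pow2(int e) {
    double r = 1.0;
    for (; e > 0; e--) r *= 2.0;
    for (; e < 0; e++) r /= 2.0;
    return r;
}

/* Value of a normalized single-precision bit pattern
   (Exponent field between 1 and 254). */
double decode_single(uint32_t bits) {
    uint32_t sign     = bits >> 31;             /* 1 bit   */
    uint32_t exponent = (bits >> 23) & 0xFFu;   /* 8 bits  */
    uint32_t fraction = bits & 0x7FFFFFu;       /* 23 bits */

    /* Implicit leading 1 plus the fraction scaled by 2^-23. */
    double significand = 1.0 + (double)fraction / 8388608.0;
    double magnitude = significand * pow2((int)exponent - 127);
    return sign ? -magnitude : magnitude;
}
```

For example, `decode_single(0x3F800000)` has sign 0, exponent field 127, and fraction 0, so it evaluates to 1.0 x 2^0 = 1.0.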
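Binary to Decimal FP
- Bit pattern: 0 0110 1000 101 0101 0100 0011 0100 0010
- Sign: 0 => positive
- Exponent: 0110 1000two = 104ten
    o Bias adjustment: 104 - 127 = -23
- Significand: 1 + 1x2^-1 + 0x2^-2 + 1x2^-3 + 0x2^-4 + 1x2^-5 + ...
    o = 1 + 2^-1 + 2^-3 + 2^-5 + 2^-7 + 2^-9 + 2^-14 + 2^-15 + 2^-17 + 2^-22
    o = 1.0 + 0.666115
- Represents: 1.666115ten x 2^-23 ~ 1.986 x 10^-7

Decimal to Binary FP
- Binary FP representation of -0.75
    o -0.75 = -0.11two
    o Normalized to -1.1two x 2^-1
- (-1)^S x (1 + Fraction) x 2^(Exponent - 127)
- (-1)^1 x (1 + .100 0000 0000 0000 0000 0000) x 2^(126 - 127)
- Bit pattern: 1 0111 1110 100 0000 0000 0000 0000 0000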
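Going the other way, the three fields can be packed into a word once the sign, real exponent, and fraction bits are known. A minimal C sketch, assuming a normalized single-precision value with a real exponent between -126 and 127; `pack_single` is a hypothetical name.

```c
#include <stdint.h>

/* Assemble a single-precision bit pattern from its three fields.
   sign:          0 or 1
   real_exponent: unbiased exponent, -126..127 (the bias of 127 is added here)
   fraction:      the 23 bits to the right of the binary point */
uint32_t pack_single(uint32_t sign, int real_exponent, uint32_t fraction) {
    uint32_t biased = (uint32_t)(real_exponent + 127);
    return (sign << 31) | (biased << 23) | (fraction & 0x7FFFFFu);
}
```

For the decimal-to-binary example normalized to 1.1two x 2^-1 with sign bit 1 (the value -0.75), `pack_single(1, -1, 0x400000)` yields 0xBF400000: sign 1, exponent field 126, fraction 100...0.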
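Decimal to Binary
- Convert the magnitude to binary and normalize it to 1.xxx two x 2^1
- Fraction: the bits to the right of the binary point
- Sign: negative => 1
- Exponent: 1 + 127 = 128ten = 1000 0000two

Types and Data
- What does the bit pattern 0011 0100 0101 0101 0100 0011 0100 0010 mean?
    o 1.986 x 10^-7? 878,003,010? "4UCB"? ori $s5, $v0, 17218?
- Data can be anything; the operation of the instruction that accesses the operand determines its type!
- Power/danger of unrestricted addresses/pointers: use ASCII as FP, instructions as data, integers as instructions, ...
    o A source of security holes in programs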
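The point that type is not associated with data can be illustrated by reading one 32-bit word, 0x34554342, in several ways. This is a sketch under stated assumptions: float is IEEE 754 single precision, the ASCII reading takes the most significant byte first, and the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* The same 32 bits read as an IEEE 754 single-precision number. */
float as_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* The same 32 bits read as four ASCII characters, most significant byte first. */
void as_ascii(uint32_t bits, char out[5]) {
    for (int i = 0; i < 4; i++)
        out[i] = (char)((bits >> (24 - 8 * i)) & 0xFFu);
    out[4] = '\0';
}

/* The same 32 bits read as the fields of a MIPS I-format instruction. */
unsigned mips_opcode(uint32_t bits)    { return bits >> 26; }
unsigned mips_rs(uint32_t bits)        { return (bits >> 21) & 0x1Fu; }
unsigned mips_rt(uint32_t bits)        { return (bits >> 16) & 0x1Fu; }
unsigned mips_immediate(uint32_t bits) { return bits & 0xFFFFu; }
```

For 0x34554342 the integer reading is 878,003,010, the float reading is about 1.986 x 10^-7, the ASCII reading is "4UCB", and the instruction fields are opcode 13 (ori), rs 2 ($v0), rt 21 ($s5), immediate 17218.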
Special Values
[Figure: number line showing, from left to right, negative overflow, expressible negative numbers, negative underflow, zero, positive underflow, expressible positive numbers, positive overflow]

    Special Value          Exponent    Fraction
    +/- 0                  0           0
    Denormalized number    0           Nonzero
    NaN                    255         Nonzero
    +/- infinity           255         0

Not a Number
- What is the result of sqrt(-4.0) or 0/0?
- If infinity is not an error, these shouldn't be either
- Called Not a Number (NaN)
    o Exponent = 255, Fraction nonzero
- Applications
    o NaNs help with debugging
    o They contaminate: op(NaN, X) = NaN
    o Don't use the NaN result itself
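The special-value table can be turned into a small classifier over the Exponent and Fraction fields. A minimal C sketch for single precision; the enum and function names are hypothetical, and `zero_over_zero` assumes IEEE 754 arithmetic without aggressive floating-point optimization flags.

```c
#include <stdint.h>

typedef enum {
    SV_ZERO, SV_DENORMALIZED, SV_NORMALIZED, SV_INFINITY, SV_NAN
} sv_class;

/* Classify a single-precision bit pattern by its Exponent and Fraction fields. */
sv_class classify_single(uint32_t bits) {
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t fraction = bits & 0x7FFFFFu;
    if (exponent == 0)   return fraction ? SV_DENORMALIZED : SV_ZERO;
    if (exponent == 255) return fraction ? SV_NAN : SV_INFINITY;
    return SV_NORMALIZED;
}

/* 0/0 computed at run time; volatile keeps the compiler from folding it. */
float zero_over_zero(void) {
    volatile float zero = 0.0f;
    return zero / zero;
}
```

The value returned by `zero_over_zero()` is a NaN: it compares unequal to itself, and adding 1.0f to it gives another NaN, which is the contamination behavior described above.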
FP Addition / Subtraction
- Much more difficult than with integers
- Can't just add fractions
- Algorithm
    o De-normalize to match exponents
    o Add (subtract) significands to get the resulting one
    o Keep the same exponent
    o Normalize (possibly changing exponent)
- Note: if signs differ, just perform a subtract instead

FP Addition Algorithm (figure)
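Example: 0.5 + (-0.4375)
- 0.5 = 0.1two = 1.000two x 2^-1
- -0.4375 = -0.0111two = -1.110two x 2^-2
- Shift right the significand of the number with the smaller exponent so that the smaller exponent equals the exponent of the other number
    o -1.110two x 2^-2 = -0.111two x 2^-1
- Add the significands: 1.000two - 0.111two = 0.001two, giving 0.001two x 2^-1
- Normalize: 1.000two x 2^-4 = 0.0625

Floating Point Hardware (figure)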
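The align, add, normalize steps of the addition example can be sketched with a toy format that keeps a 4-bit significand (1.xxx) as a scaled integer. This is an illustration under assumptions, not how FP hardware is built: `toy_fp` and `toy_add` are hypothetical names, bits shifted out during alignment are simply dropped, and exponent overflow is not checked.

```c
#define FRAC_BITS 3   /* significand is 1.xxx, stored as an integer scaled by 2^3 */

typedef struct {
    int sign;       /* +1 or -1 */
    unsigned sig;   /* significand scaled by 2^FRAC_BITS; 1.000two is 8 */
    int exp;        /* unbiased exponent */
} toy_fp;

toy_fp toy_add(toy_fp a, toy_fp b) {
    /* Step 1: de-normalize the operand with the smaller exponent. */
    if (a.exp < b.exp) { toy_fp t = a; a = b; b = t; }
    int shift = a.exp - b.exp;
    unsigned b_sig = (shift < 32) ? (b.sig >> shift) : 0u;

    /* Step 2: add the signed significands. */
    long sum = (long)a.sign * (long)a.sig + (long)b.sign * (long)b_sig;

    /* Step 3: keep the larger exponent. */
    toy_fp r;
    r.sign = (sum < 0) ? -1 : +1;
    r.sig = (unsigned)((sum < 0) ? -sum : sum);
    r.exp = a.exp;

    /* Step 4: normalize so that the significand is again 1.xxx. */
    if (r.sig == 0) { r.exp = 0; return r; }
    while (r.sig >= (2u << FRAC_BITS)) { r.sig >>= 1; r.exp++; }
    while (r.sig <  (1u << FRAC_BITS)) { r.sig <<= 1; r.exp--; }
    return r;
}
```

Adding 1.000two x 2^-1 (sig 8, exp -1) and -1.110two x 2^-2 (sig 14, exp -2) aligns the second operand to 0.111two x 2^-1, produces 0.001two x 2^-1, and normalizes to 1.000two x 2^-4, i.e. 0.0625.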
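Example: 0.5 x (-0.4375)
- 0.5 = 0.1two = 1.000two x 2^-1
- -0.4375 = -0.0111two = -1.110two x 2^-2
- Add the exponents: -1 + (-2) = -3
- Multiply the significands: 1.000two x 1.110two = 1.110two
- Signs differ, so the product is negative: -1.110two x 2^-3 = -0.21875

Rounding with Guard Digits
- To maintain accuracy in rounding, IEEE 754 uses two extra bits, guard and round
- Example (three significant decimal digits): 2.56 x 10^0 + 2.34 x 10^2
    o Without guard and round digits: 0.02 x 10^2 + 2.34 x 10^2 = 2.36 x 10^2
    o With guard and round digits: 0.0256 x 10^2 + 2.3400 x 10^2 = 2.3656 x 10^2, which rounds to 2.37 x 10^2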
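The effect of guard digits in the decimal example (2.36 without them, 2.37 with them) can be reproduced with integer arithmetic. This sketch is only an illustration in decimal: the function names are hypothetical, operands are assumed non-negative, the final step rounds half up rather than to even, and a carry out of the top digit is not handled.

```c
/* 10^n for small n >= 0. */
static long ten_to(int n) {
    long p = 1;
    while (n-- > 0) p *= 10;
    return p;
}

/* Adds two decimal significands whose exponents differ by exp_diff.
   sig_large belongs to the operand with the larger exponent.
   guard_digits extra digits are kept while aligning and adding, then the
   sum is rounded back to the original number of digits. */
long add_with_guard(long sig_large, long sig_small, int exp_diff, int guard_digits) {
    long scale = ten_to(guard_digits);
    /* Align: digits shifted past the guard positions are dropped. */
    long aligned_small = (sig_small * scale) / ten_to(exp_diff);
    long sum = sig_large * scale + aligned_small;
    /* Round half up to remove the guard digits. */
    return (sum + scale / 2) / scale;
}
```

With significands 234 (for 2.34 x 10^2) and 256 (for 2.56 x 10^0), `add_with_guard(234, 256, 2, 0)` returns 236 and `add_with_guard(234, 256, 2, 2)` returns 237, matching 2.36 x 10^2 and 2.37 x 10^2.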
FP Fallacy
- FP add, subtract are associative: FALSE!
    o x = -1.5 x 10^38, y = 1.5 x 10^38, and z = 1.0
    o x + (y + z) = -1.5 x 10^38 + (1.5 x 10^38 + 1.0) = -1.5 x 10^38 + (1.5 x 10^38) = 0.0
    o (x + y) + z = (-1.5 x 10^38 + 1.5 x 10^38) + 1.0 = (0.0) + 1.0 = 1.0
- Floating point add, subtract are not associative!
    o Why? The FP result approximates the real result
    o 1.5 x 10^38 is so much larger than 1.0 that 1.5 x 10^38 + 1.0 in floating point representation is still 1.5 x 10^38

MIPS FP Architecture (1/2)
- Separate floating point instructions:
    o Single precision: add.s, sub.s, mul.s, div.s
    o Double precision: add.d, sub.d, mul.d, div.d
- These instructions are far more complicated than their integer counterparts
- Problems:
    o It's inefficient to have different instructions take vastly differing amounts of time
    o Generally, a particular piece of data will not change from FP to int, or vice versa, within a program
    o Some programs do not do floating point calculations
    o It takes lots of hardware relative to integers to do FP fast
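Returning to the associativity fallacy, the two groupings can be evaluated in single precision with a short C sketch. The function names are hypothetical; the volatile store is there to make sure each intermediate sum is rounded to a 32-bit float, and IEEE 754 single precision is assumed.

```c
/* One single-precision addition, with the result rounded to a 32-bit float. */
float add_single(float a, float b) {
    volatile float r = a + b;
    return r;
}

/* (x + y) + z */
float sum_left(float x, float y, float z) {
    return add_single(add_single(x, y), z);
}

/* x + (y + z) */
float sum_right(float x, float y, float z) {
    return add_single(x, add_single(y, z));
}
```

With x = -1.5e38f, y = 1.5e38f, and z = 1.0f, `sum_left` returns 1.0 while `sum_right` returns 0.0, because y + z rounds back to 1.5 x 10^38.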
MIPS FP Architecture (2/2)
- 1990 solution: a separate chip that handles only FP
- Coprocessor 1: FP chip
    o Contains 32 32-bit registers: $f0, $f1, ...
    o Most registers specified in .s and .d instructions refer to this set ($f)
    o Separate load and store: lwc1 and swc1 ("load word coprocessor 1", "store ...")
    o Double precision: by convention, an even/odd pair contains one DP FP number: $f0/$f1, $f2/$f3, ..., $f30/$f31
- Computer contains multiple separate chips:
    o Processor: handles all the normal stuff
    o Coprocessor 1: handles FP and only FP
    o Move data between main processor and coprocessors: mfc0, mtc0, mfc1, mtc1, etc.

C => MIPS (Fahrenheit to Celsius)

    /* Float literals (f suffix) keep the arithmetic in single precision,
       matching the .s instructions below. */
    float f2c(float fahr) {
        return (5.0f / 9.0f) * (fahr - 32.0f);
    }

    f2c:
        lwc1  $f16, const5($gp)     # $f16 = 5.0
        lwc1  $f18, const9($gp)     # $f18 = 9.0
        div.s $f16, $f16, $f18      # $f16 = 5.0 / 9.0
        lwc1  $f20, const32($gp)    # $f20 = 32.0
        sub.s $f20, $f12, $f20      # $f20 = fahr - 32.0
        mul.s $f0, $f16, $f20       # $f0 = (5/9) * (fahr - 32.0)
        jr    $ra                   # return
Rounding
- Math on real numbers => rounding
- Rounding also occurs when converting types
    o Double precision, single precision, integer
- IEEE 754 has 4 rounding options
    o Round towards +infinity: ALWAYS round "up": 2.001 => 3; -2.001 => -2
    o Round towards -infinity: ALWAYS round "down": 1.999 => 1; -1.999 => -2
    o Truncate: just drop the last bits (round towards 0)
    o Round to (nearest) even (default): 2.5 => 2; 3.5 => 4

Rounding with Guard Digits
- To maintain accuracy in rounding, IEEE 754 uses two extra bits, guard and round
- Remember, all floating-point numbers are approximations: a value that may need infinitely many significant digits has been rounded to one that is representable in a machine
- Example (three significant decimal digits): 2.56 x 10^0 + 2.34 x 10^2
    o With guard and round digits: 0.0256 x 10^2 + 2.3400 x 10^2 = 2.3656 x 10^2, which rounds to 2.37 x 10^2
    o Without guard and round digits: 0.02 x 10^2 + 2.34 x 10^2 = 2.36 x 10^2
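The four rounding options can be illustrated by rounding a double to an integer in each mode. This C sketch implements the modes by hand rather than switching the hardware rounding mode; the enum and function names are hypothetical, and the input is assumed to be finite and small enough to fit in a long long.

```c
typedef enum {
    ROUND_UP,            /* towards +infinity */
    ROUND_DOWN,          /* towards -infinity */
    ROUND_TRUNCATE,      /* towards 0 */
    ROUND_NEAREST_EVEN   /* to nearest, ties to even (the IEEE 754 default) */
} round_mode;

long long round_to_integer(double x, round_mode mode) {
    long long t  = (long long)x;                  /* C casts truncate towards 0 */
    long long lo = (x < (double)t) ? t - 1 : t;   /* floor of x */
    long long hi = (x > (double)t) ? t + 1 : t;   /* ceiling of x */

    switch (mode) {
    case ROUND_UP:       return hi;
    case ROUND_DOWN:     return lo;
    case ROUND_TRUNCATE: return t;
    case ROUND_NEAREST_EVEN: {
        double below = x - (double)lo;            /* distance above the floor */
        if (below < 0.5) return lo;
        if (below > 0.5) return hi;
        return (lo % 2 == 0) ? lo : hi;           /* tie: pick the even neighbor */
    }
    }
    return t;
}
```

In round-to-nearest-even mode 2.5 goes to 2 and 3.5 goes to 4, so ties are not biased in one direction, which is the usual argument for making it the default.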
Conclusion
- Floating point numbers approximate values that we want to use
- IEEE 754 Floating Point Standard is the most widely accepted attempt to standardize FP arithmetic
- New MIPS architectural elements
    o Registers ($f0-$f31)
    o Single precision (32 bits, 2 x 10^-38 to 2 x 10^38): add.s, sub.s, mul.s, div.s
    o Double precision (64 bits, 2 x 10^-308 to 2 x 10^308): add.d, sub.d, mul.d, div.d
- Type is not associated with data; bits have no meaning unless given in context (e.g., int vs. float)