ECE232: Hardware Organization and Design Lecture 11: Floating Point & Floating Point Addition Adapted from Computer Organization and Design, Patterson & Hennessy, UCB

Last time: Single Precision Format
  32 bits in total:
    S: Sign of mantissa (1 bit)
    E: Exponent (8 bits)
    M: Mantissa (23 bits)
  Note that the exponent has no explicit sign bit
  Base?
ECE331: Floating Point 2

Last time: Normalization
  The mantissa M is a normalized fraction
    Has an implied binary point on the left
    Has an implied (hidden) 1 to the left of the binary point
    E.g., fraction 10100000000000000000000 represents $1.101_2 = 1.625_{10}$
  The significand 1.f lies in the range [1, 2 − ulp]
    ulp: unit in the last position (what remains to reach a whole number when all bits are set to one)
  $F = (-1)^S \times 1.f \times 2^{E - \text{Bias}}$
    E: value of the exponent field (unsigned integer)
    Bias: bias value (known; set by convention)

Binary Fractions
  To convert a binary fraction to decimal, add the weights of its bits
  $0.1110\,0010_2$: bit weights run from $2^{-1}$ to $2^{-8}$; the 1 bits sit at $2^{-1}$, $2^{-2}$, $2^{-3}$ and $2^{-7}$
    = 1*(0.5) + 1*(0.25) + 1*(0.125) + 0*(0.0625) + 0*(0.03125) + 0*(0.015625) + 1*(0.0078125) + 0*(0.00390625)
    = 0.8828125
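
The weighted sum above can be sketched in a few lines of Python. This is only an illustration; the function name `binary_fraction_to_decimal` is hypothetical and not part of the lecture.

```python
def binary_fraction_to_decimal(bits: str) -> float:
    """Value of the bits to the right of the binary point (spaces are ignored)."""
    bits = bits.replace(" ", "")
    if set(bits) - {"0", "1"}:
        raise ValueError("expected only the digits 0 and 1")
    total = 0.0
    # The first bit after the point has weight 2**-1, the next 2**-2, and so on.
    for position, bit in enumerate(bits, start=1):
        if bit == "1":
            total += 2.0 ** -position
    return total


print(binary_fraction_to_decimal("1110 0010"))
```

Every weight is a power of two, so the sum is exact in binary floating point and no rounding error creeps in for short bit strings like this one.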

Binary Fractions
  To convert a decimal number to binary: 9.625
    Whole number: 9 → $1001_2$
    Fraction (multiply by 2 repeatedly; the integer part of each product is the next bit):
      0.625 * 2 = 1.25 = 1 + 0.25   → 1
      0.25 * 2 = 0.50               → 0
      0.50 * 2 = 1.00               → 1
    9.625 → $1001.1010\,0000_2 = 0.1001\,1010\,0000_2 \times 2^4 = 1.0011\,0100\,0000_2 \times 2^3$
  Note that we can shift digits left and right of the binary point by multiplying by different powers of 2
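
The last step, moving the binary point until exactly one 1 is left in front of it, can be checked with Python's standard `math.frexp`. The sketch below handles only positive finite inputs, and the helper name `normalize` is an assumption made for illustration.

```python
import math


def normalize(x: float) -> tuple:
    """Return (significand, exponent) with 1 <= significand < 2
    and x == significand * 2**exponent. Positive finite x only."""
    if not (x > 0 and math.isfinite(x)):
        raise ValueError("this sketch handles positive finite values only")
    m, e = math.frexp(x)      # x == m * 2**e with 0.5 <= m < 1
    return m * 2, e - 1       # shift the point one place: 1 <= 2m < 2


significand, exponent = normalize(9.625)
print(significand, exponent)
```

`math.frexp` returns the $0.1001\ldots_2 \times 2^4$ form from the slide; doubling the significand and decrementing the exponent gives the $1.0011\ldots_2 \times 2^3$ form.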

Floating-Point Example
  To convert a decimal number to floating-point format: represent −0.75
    $-0.75 = -0.11_2 = (-1)^1 \times 1.1_2 \times 2^{-1}$
    S = 1
    Fraction = $1000\ldots00_2$
    Exponent = −1 + Bias
      Single: −1 + 127 = 126 = $01111110_2$
      Double: −1 + 1023 = 1022 = $01111111110_2$
  Bit patterns (sign, biased exponent, fraction):
    Single: 1 01111110 1000…00
    Double: 1 01111111110 1000…00
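
One way to check these bit patterns is to let Python's standard `struct` module do the encoding and then split the result into its three fields. The function names below are hypothetical helpers, not part of the lecture.

```python
import struct


def float32_fields(x: float) -> str:
    """Sign, 8-bit exponent and 23-bit fraction of x encoded as single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"


def float64_fields(x: float) -> str:
    """Sign, 11-bit exponent and 52-bit fraction of x encoded as double precision."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = f"{bits:064b}"
    return f"{s[0]} {s[1:12]} {s[12:]}"


print(float32_fields(-0.75))
print(float64_fields(-0.75))
```

Big-endian packing (`">"`) is used so that the sign bit comes first in the printed string, matching the order on the slide.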

Floating-Point Example
  To convert from floating-point format to decimal
  What number is represented by the single-precision float 1 10000001 01000…00 ?
  Identify the components
    S = 1
    Fraction = $01000\ldots00_2$
    Exponent = $10000001_2 = 129$
  Calculate the value
    $x = (-1)^1 \times (1 + 0.01_2) \times 2^{129 - 127} = (-1) \times 1.25 \times 2^2 = -5.0$
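
The decoding steps above can be written out directly. This is a minimal sketch for normalized numbers only (exponent field between 1 and 254); the name `decode_single` is an assumption made for illustration.

```python
def decode_single(bit_string: str) -> float:
    """Decode a normalized single-precision pattern given as 32 binary digits."""
    bits = bit_string.replace(" ", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected 32 binary digits")
    sign = int(bits[0])                    # S
    exponent = int(bits[1:9], 2)           # E, unsigned
    fraction = int(bits[9:], 2) / 2**23    # f, as a value in [0, 1)
    if not 1 <= exponent <= 254:
        raise ValueError("this sketch handles normalized numbers only")
    # x = (-1)^S * (1 + f) * 2^(E - 127)
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)


print(decode_single("1 10000001 01000000000000000000000"))
```

Zero, denormalized numbers, infinities and NaNs use the reserved exponent values 0 and 255 and are deliberately rejected here.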

Floating-Point Addition in Decimal
  Consider a 4-digit decimal example: 99.99 + 0.1610
  In scientific notation: $9.999 \times 10^{1} + 1.610 \times 10^{-1}$
  1. Align decimal points (shift the number with the smaller exponent)
       $9.999 \times 10^{1} + 0.016 \times 10^{1}$
  2. Add significands
       $9.999 \times 10^{1} + 0.016 \times 10^{1} = 10.015 \times 10^{1}$
  3. Normalize result & check for over/underflow
       $1.0015 \times 10^{2}$
  4. Round and renormalize if necessary
       $1.002 \times 10^{2}$
  5. Optionally convert back to non-exponential form
       100.2
  In this example the result is accurate to 4 significant digits
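
The same 4-digit result can be reproduced with Python's standard `decimal` module by limiting the working precision to 4 significant digits. Note one difference from the hand procedure: `decimal` rounds the exact sum once, whereas the slide drops digits of the smaller operand during alignment. For this example both give the same answer.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# A context with 4 significant digits, mirroring the 4-digit example.
ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)

a = Decimal("99.99")
b = Decimal("0.1610")
total = ctx.add(a, b)   # exact sum 100.1510, rounded to 4 significant digits

print(total)
```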

Floating-Point Addition in Binary
  Now consider a 4-digit binary example: $0.1000_2 + (-0.0111_2)$
    $1.000_2 \times 2^{-1} + (-1.110_2 \times 2^{-2})$, i.e. 0.5 + (−0.4375)
  1. Align binary points (shift the number with the smaller exponent)
       $1.000_2 \times 2^{-1} + (-0.111_2 \times 2^{-1})$
  2. Add significands
       $1.000_2 \times 2^{-1} + (-0.111_2 \times 2^{-1}) = 0.001_2 \times 2^{-1}$
  3. Normalize result & check for over/underflow
       $1.000_2 \times 2^{-4}$, with no over/underflow
  4. Round and renormalize if necessary
       $1.000_2 \times 2^{-4}$ (no change) = 0.0625
  5. When converted to floating-point representation:
       Mantissa: f = 000 0000 0000 0000 0000 0000 (23 bits of zeros)
       Exponent: E − Bias = −4; with Bias = 127, E = $123_{10} = 0111\,1011_2$
       Sign bit: S = 0
     The entire 32-bit floating-point representation in binary:
       0 0111 1011 000 0000 0000 0000 0000 0000
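
Steps 1 to 3 can be sketched on small integer significands. This is a simplified illustration, not a full IEEE 754 adder: shifted-out bits are truncated, step 4 (rounding) and the over/underflow checks are omitted, and the names `fp_add` and `to_float` are hypothetical. A significand is a signed integer with `frac_bits` bits after the binary point, so $1.000_2$ is `0b1000` when `frac_bits=3`.

```python
def fp_add(sig_a, exp_a, sig_b, exp_b, frac_bits=3):
    """Add sig_a * 2**(exp_a - frac_bits) and sig_b * 2**(exp_b - frac_bits).
    Returns a normalized (significand, exponent) pair."""
    # Step 1: align. Make operand a the one with the larger exponent,
    # then shift the magnitude of b right by the exponent difference.
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    shift = exp_a - exp_b
    sign_b = -1 if sig_b < 0 else 1
    sig_b = sign_b * (abs(sig_b) >> shift)   # truncates shifted-out bits

    # Step 2: add the aligned significands.
    total = sig_a + sig_b
    exp = exp_a
    if total == 0:
        return 0, 0

    # Step 3: normalize so the magnitude has the form 1.xxx.
    sign = -1 if total < 0 else 1
    mag = abs(total)
    hidden_one = 1 << frac_bits
    while mag >= 2 * hidden_one:   # too large: shift right, raise exponent
        mag >>= 1
        exp += 1
    while mag < hidden_one:        # too small: shift left, lower exponent
        mag <<= 1
        exp -= 1
    return sign * mag, exp


def to_float(sig, exp, frac_bits=3):
    """Numeric value of a (significand, exponent) pair."""
    return sig * 2.0 ** (exp - frac_bits)


# 1.000_2 x 2^-1  +  (-1.110_2 x 2^-2)
result = fp_add(0b1000, -1, -0b1110, -2)
print(result, to_float(*result))
```

For the slide's operands the function returns significand `0b1000` with exponent −4, which is $1.000_2 \times 2^{-4} = 0.0625$.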

Another Single Precision Example
  0 10000010 11010000000000000000000
    Sign: 0 = positive mantissa
    Exponent: 130 − 127 = 3
    Significand: $1.1101_2$
  $+1.1101_2 \times 2^3 = 1110.1_2 = 14.5_{10}$

Converting to IEEE format
  Example, decimal number: $-3.154 \times 10^{0}$
    What is the sign?
    What is the exponent?
    What is the mantissa?
Converting Mixed Numbers, Decimal to Binary
  How we do it in decimal:
    $456.78_{10} = 4 \times 10^{2} + 5 \times 10^{1} + 6 \times 10^{0} + 7 \times 10^{-1} + 8 \times 10^{-2}$
  How it is done in binary:
    $1011.11_2 = 1 \times 2^{3} + 0 \times 2^{2} + 1 \times 2^{1} + 1 \times 2^{0} + 1 \times 2^{-1} + 1 \times 2^{-2}$
    = 8 + 0 + 2 + 1 + 1/2 + 1/4
    = 11 + 0.5 + 0.25 = $11.75_{10}$

How to convert whole Decimal to Binary
  Successive division by 2: $57143_{10} = 1101111100110111_2$
  Start dividing by 2 at the bottom row and move upward; the binary value is the remainder column read downwards.

    value    remainder
        1    1
        3    1
        6    0
       13    1
       27    1
       55    1
      111    1
      223    1
      446    0
      892    0
     1785    1
     3571    1
     7142    0
    14285    1
    28571    1
    57143    1
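
Successive division is short enough to write out directly. The sketch below collects the remainders in the order they are produced (least significant bit first) and reverses them at the end; the name `to_binary` is hypothetical.

```python
def to_binary(n: int) -> str:
    """Binary digits of a non-negative integer via successive division by 2."""
    if n < 0:
        raise ValueError("expected a non-negative integer")
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)        # quotient carries on, remainder is the next bit
        remainders.append(str(r))  # least significant bit comes out first
    return "".join(reversed(remainders))


print(to_binary(57143))
```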

Converting fractional Decimal to Binary
  Successive multiplication by 2: double the fractional part at each step; the integer part of the product is the next bit.

    step   product   bit
      0    0.154
      1    0.308     0
      2    0.616     0
      3    1.232     1
      4    0.464     0
      5    0.928     0
      6    1.856     1
      7    1.712     1
      8    1.424     1
      9    0.848     0
     10    1.696     1
     11    1.392     1
     12    0.784     0
     13    1.568     1
     14    1.136     1
     15    0.272     0
     16    0.544     0
     17    1.088     1
     18    0.176     0
     19    0.352     0
     20    0.704     0
     21    1.408     1
     22    0.816     0
     23    1.632     1

  Decimal 0.154 ≈ $.0010\,0111\,0110\,1100\,1000\,101_2$ (first 23 bits; the expansion does not terminate)
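
The table can be regenerated with a short loop. This sketch uses Python's `fractions.Fraction` so that 0.154 is held exactly (a binary `float` could not store it exactly, which would defeat the purpose); the name `fraction_bits` is an assumption made for illustration.

```python
from fractions import Fraction


def fraction_bits(decimal_text: str, n_bits: int) -> str:
    """First n_bits binary digits of a decimal fraction in [0, 1),
    found by successive multiplication by 2."""
    value = Fraction(decimal_text)   # exact rational value of the decimal string
    if not 0 <= value < 1:
        raise ValueError("expected a value in [0, 1)")
    bits = []
    for _ in range(n_bits):
        value *= 2
        if value >= 1:               # integer part 1: emit a 1 and drop it
            bits.append("1")
            value -= 1
        else:
            bits.append("0")
    return "".join(bits)


print(fraction_bits("0.154", 23))
```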

Floating Point Special Representations
  $F = (-1)^S \times 1.f \times 2^{E - 127}$, with $1 \le 1.f < 2$
  There are two Zeroes, ±0, and two Infinities, ±∞
  NaN (Not-a-Number) may have a sign and has a non-zero fraction; used for program diagnostics
  NaNs and Infinities have all 1s in the Exp field, E = 255
  $F + \infty = \infty$, $F / \infty = 0$
  Source: I. Koren, Computer Arithmetic Algorithms, 2nd Edition, 2002
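
These rules can be observed from Python, whose `float` type is a double-precision value on common platforms (an assumption of this sketch; the slide describes single precision, but the special values behave the same way). The finite value `F = 3.5` is chosen arbitrarily.

```python
import math

F = 3.5  # arbitrary finite value, for illustration only

f_plus_inf = F + math.inf               # infinity absorbs any finite addend
f_div_inf = F / math.inf                # finite divided by infinity gives zero
inf_minus_inf = math.inf - math.inf     # no meaningful result: NaN
zeros_equal = (0.0 == -0.0)             # the two zeroes compare equal
neg_zero_sign = math.copysign(1.0, -0.0)  # but -0.0 still carries its sign
nan_equals_itself = (math.nan == math.nan)  # NaN compares unequal to everything

print(f_plus_inf, f_div_inf, inf_minus_inf,
      zeros_equal, neg_zero_sign, nan_equals_itself)
```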

Floating Point Special Representations
  $F = (-1)^S \times 1.f \times 2^{E - 127}$, with $1 \le 1.f < 2$ and $1 \le E \le 254$ (single precision)

    Single Precision        Double Precision        Object represented
    Exponent   Fraction     Exponent   Fraction
    0          0            0          0            0
    0          nonzero      0          nonzero      ± denormalized number
    1-254      anything     1-2046     anything     ± floating point number
    255        0            2047       0            ± infinity
    255        nonzero      2047       nonzero      NaN (not a number)
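
The single-precision half of the table translates into a small classifier on the raw 32-bit pattern. This is a sketch; the name `classify_single_bits` and the returned labels are assumptions made for illustration.

```python
def classify_single_bits(bits: int) -> str:
    """Classify a 32-bit single-precision pattern by its exponent and fraction fields."""
    if not 0 <= bits <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit unsigned value")
    exponent = (bits >> 23) & 0xFF   # 8-bit exponent field
    fraction = bits & 0x7FFFFF       # 23-bit fraction field
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"              # exponent in 1..254, any fraction


for pattern in (0x00000000, 0x00400000, 0x3F800000, 0x7F800000, 0x7FA00000):
    print(f"{pattern:#010x}", classify_single_bits(pattern))
```

The sign bit is ignored on purpose: every row of the table applies to both signs.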

Smallest & Largest Numbers
  The smallest non-zero positive and largest non-zero negative normalized numbers (represented by 1 in the Exp field and 0…0 in the Fraction field) are
    $\pm 2^{-126} \approx \pm 1.175494351 \times 10^{-38}$
  The smallest non-zero positive and largest non-zero negative denormalized numbers (represented by all 0s in the Exp field and 0…01 in the Fraction field) are
    $\pm 2^{-149} \approx \pm 1.4012985 \times 10^{-45}$
  The largest finite positive and smallest finite negative numbers (represented by 254 in the Exp field and 1…1 in the Fraction field) are
    $\pm (2 - 2^{-23}) \times 2^{127} \approx \pm 3.40 \times 10^{38}$
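
These three limits can be checked by decoding the corresponding bit patterns with Python's standard `struct` module. The helper name `bits_to_float` is hypothetical; the hexadecimal constants are the patterns described above (Exp = 1 with zero fraction, Exp = 0 with fraction 0…01, Exp = 254 with fraction all 1s).

```python
import struct


def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


smallest_normal = bits_to_float(0x00800000)    # Exp = 1,   Fraction = 0...0
smallest_denormal = bits_to_float(0x00000001)  # Exp = 0,   Fraction = 0...01
largest_finite = bits_to_float(0x7F7FFFFF)     # Exp = 254, Fraction = 1...1

print(smallest_normal, smallest_denormal, largest_finite)
```

All three values are exactly representable in double precision, so they can be compared for equality against the closed-form powers of two.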

FP Adder Hardware
  [Figure: floating-point adder hardware, labeled with Step 1 through Step 4 of the addition procedure]

Single Precision Summary

    Type                        Exponent    Mantissa                       Value
    Zero                        0000 0000   000 0000 0000 0000 0000 0000   0
    One                         0111 1111   000 0000 0000 0000 0000 0000   1
    Denormalized number         0000 0000   100 0000 0000 0000 0000 0000   $5.9 \times 10^{-39}$
    Largest normalized number   1111 1110   111 1111 1111 1111 1111 1111   $3.4 \times 10^{38}$
    Smallest normalized number  0000 0001   000 0000 0000 0000 0000 0000   $1.18 \times 10^{-38}$
    Infinity                    1111 1111   000 0000 0000 0000 0000 0000   Infinity
    NaN                         1111 1111   010 0000 0000 0000 0000 0000   NaN

Summary
  Floating point numbers can represent very large numbers as well as fractions
  Number formats are different from 2's complement
    Requires some memorization
  Addition requires aligning, adding, and then renormalizing
  Do examples! That is the best way to learn floating point operations