Floating-point representations

Lecture 10 Floating-point representations Methods of representing real numbers (1) 1. Fixed-point number system limited range and/or limited precision results must be scaled 100101010 1111010 100101010.1111010 2. Rational number system difficult arithmetic operations 1001010101 0110001011 Number = numerator denominator 1001010101 0110001011 Methods of representing real numbers (2) 3. Logarithmic number system 1 10010101010 sign logarithm wide range low precision Number = - 2 1001010.1010 2 4. Floating-point number systems 1

Fixed-point vs. floating point representation (1) Fixed-point ulp 0000 0000. 0000 1001 2 1001 0000. 0000 0000 2 Absolute representation error ulp with truncation ulp/2 with rounding Relative representation error large small Absolute representation error constant Relative representation error variable - large for small numbers - small for large numbers Fixed-point vs. floating point representation (2) Floating-point 1.001 2-5 1.001 2 7 Absolute representation error variable - large for large numbers - small for small numbers Relative representation error constant Floating point number (1) number = ± significand base exponent x = ± s b e Typical value of the base b=2 but in the past b=2, 8, 16, 256 Typical range of the significand s [1, 2) Other common choice s [0, 1) Such s is called mantissa 2

Floating point number (2) Exponent e typically represented using biased representation Exponent stored as e + bias where bias = 2 h-1-1, h number of bits of the exponent Extra cost compared to the 2 s complement representation is small because of relatively small sizes of exponents Range of floating point numbers [-max, -min] and [min, max] 0 where max = largest significand largest exponent max = s max b e max min = smallest significand smallest exponent min = s min b e min Current standard ANSI / IEEE Std 754-1985 Single-precision or short format, 32 bits Double-precision or long format, 64 bits 3

Operations on special numbers 0 NaN not-a-number 0 / 0 = NaN + 0 = NaN ± Ordinary Number = ± NaN + Ordinary Number = NaN Denormals 0.f 2-126 hidden 1 replaced by 0 smallest exponent Representation e+bias = 0 s = f 0 Implementation of denormals is optional Extended formats Used internally to reduce the effect of accumulated errors Number of bits Representation Single-precision Exponent Significand Double-precision Exponent Significand normal 8 23+1 11 52+1 extended 11 32+1 15 64+1 4

Unpacking Unpacking & packing numbers 1. Separate sign, exponent, significand 2. Reinstate hidden 1 3. Convert to internal extended format Packing 1. Test for special values 2. Remove hidden 1 3. Combine sign, exponent, and significand Addition (1) (± s1 b e1 ) + (± b e2 ) = = assuming e1 > e2 = = (± s1 b e1 ) + (± b e1 ) = b e1-e2 = (± s1 ± ) b e1 b e1-e2 Allignment shift or preshift e1>e2 s1 1.01110101 1.01110101 e1-e2 positions 0.00001011 Addition (2): Normalization shift or post shift s = ± s1 ± b e1-e2 s1, [1, 2) For the same sign of operands 1 s < 4 A single right shift of s and adding 1 to the exponent might be required For the different signs of operands 0 s < 2 Multiple left shifts of s and decrementing exponent might be necessary 5

Addition (3) If a common case of e1-e2 2 s1 [1, 2) [0, 1/2) s1- (1/2, 2) s1- might need to be shifted left by one position Addition (4) Each significand represented in two s complement notation using at least the following bits z 1 - sign bit c out z 1 z 0. z -1 z -2 z -l G R S G - guard bit: protects against the loss of precision R - round bit: basic bit used for rounding during left shift by one position S - sticky bit: auxiliary bit used for rounding during left shift by one position Addition (4) R - used for rounding in case of one bit left shift performed during normalization if R=0, set z -l to 0, discarded part < ulp/2 R=1 and S 0 set z -l to 1, discarded part < ulp/2 R=1 and S=0 round to the nearest even number discarded part = ulp/2, S - logical OR of all bits shifted through the given position during preshift (alignment shift) 6

Multiplication (± s1 b e1 ) (± b e2 ) = = (± s1 ) b e1+e2 = s b e 1 s1 < 4 A single right shift of s and adding 1 to the exponent might be required If e1 and e2 are of the same sign, multiplication may lead to overflow or underflow Division ± s1 b e1 = ± ± b e2 s1 be1-e2 1 2 s1 < < 2 A single left shift of s and subtracting 1 from the exponent might be required If e1 and e2 are of the opposite signs, multiplication may lead to overflow or underflow Comparisons and exceptions Valid comparisons < 0 + 0 = 0 NaN NaN NaN ± Comparisons that generate exceptions NaN < + NaN > - Operations that generate exceptions ( ) + (+ ) 0 0 / 0 / 7

Rounding schemes (1) Truncation Signed-magnitude representation 2 s complement representation Rounding schemes (2) Round to the nearest (rtn) Signed-magnitude representation 2 s complement representation Rounding schemes (3) Round to the nearest (rtn) Rounding errors 0.00 0 0 0.01 0-0.25 0.10 1 +0.5 0.11 1 +0.25 Average error +0.125 8

Rounding schemes (4) Round to the nearest even (rtne) Signed-magnitude representation 2 s complement representation 9