(252-0061-00) Session 9 Floating Point Systems Group Department of Computer Science ETH Zürich
Floating Point Recap for the Assignment
Floating Point Representation
Numerical form (scientific notation): $F = (-1)^s \cdot M \cdot 2^E$
    Sign s ∈ {0, 1}
    Significand M ∈ [1.0, 2.0)
    Exponent E
Encoding (bit pattern): s | exp | frac
    MSB is the sign bit s
    exp encodes E (but exp ≠ E)
    frac encodes M (but frac ≠ M)
Field widths in bits (s, exp, frac):
    float:    1, 8, 23
    double:   1, 11, 52
    extended: 1, 15, 63/64
Casting
Integer types: what happens here?
    unsigned int foo;
    long bar = (long) foo;
Floats: what happens here?
    int i;
    long long l;
    float f;
    double d;
    i = (int) f;
    i = (int) d;
    f = (float) d;
    d = (double) i;
    f = (float) f;
Floats <-> Integers
Casting between floats, doubles and integers generally changes the bit representation!
    int i = 0xABCDABCD;
    float f = (float)i;

    int *i2 = (int *)&f;

    printf("%x, %x\n", i, *i2);

    // Prints
    // abcdabcd, cea864a8
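The pointer cast on this slide reads a float object through an int pointer, which conflicts with C's strict-aliasing rule. A minimal sketch of the same experiment using memcpy instead, assuming a 32-bit IEEE 754 float and a two's-complement int32_t; the names float_bits and demo_float_bits are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw bit pattern of a 32-bit float. */
uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* copies the bytes, no value conversion */
    return u;
}

/* Reproduces the slide example: the value is converted, so the bits change. */
void demo_float_bits(void) {
    int32_t i = (int32_t)0xABCDABCDu;   /* same bits as on the slide, a negative value */
    float f = (float)i;                 /* value conversion, rounds to 24 significant bits */
    assert(float_bits(f) == 0xCEA864A8u);
    assert(float_bits(1.0f) == 0x3F800000u);
}
```

Calling demo_float_bits() from main() checks the bit pattern printed on the slide.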
Floats <-> Integers
From double/float to int:
    float f = 1.12345; float f2 = 1.999999;
    (int)f = ?  (int)f2 = ?
    Truncates the fractional part.
    Out of range or NaN -> TMin (typical x86 behavior; undefined in standard C).
From int to double:
    long long l = 0x7fffffffffffffff; long long l2 = 0xffffffff;
    double d = (double)l; double d2 = (double)l2;
    Does l == (long long)d hold? Does l2 == (long long)d2 hold?
    In general, the conversion is exact iff the int has fewer than 54 bits.
From int to float:
    Will round according to the rounding mode.
Example:
    float f2 = 1.50f; float f3 = 1.50f;
    printf("%f, %i, %i\n", f2+f3, (int)(f2+f3), (int)f2 + (int)f3);
    // Prints 3.000000, 3, 2
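The "fewer than 54 bits" rule can be tried out with a round trip through double. This is a small sketch, assuming an IEEE 754 double with a 53-bit significand; roundtrips_via_double and demo_roundtrip are hypothetical names.

```c
#include <assert.h>

/* Hypothetical helper: does x survive a round trip through double?
   Only call it for values well inside the long long range; converting an
   out-of-range double back to long long is undefined behavior in C. */
int roundtrips_via_double(long long x) {
    double d = (double)x;          /* rounds if x needs more than 53 bits */
    return (long long)d == x;
}

void demo_roundtrip(void) {
    assert(roundtrips_via_double(0xffffffffLL));        /* 32 bits: exact */
    assert(roundtrips_via_double(1LL << 53));           /* a power of two: exact */
    assert(!roundtrips_via_double((1LL << 53) + 1));    /* needs 54 bits: rounded */
    assert((int)1.999999f == 1);                        /* float -> int truncates */
}
```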
Normalized / Denormalized
Normalized: exp ∉ {000…0, 111…1}
    Good for bigger values
    Not equi-spaced
Denormalized: exp == 000…0
    Good for very small values
    Equi-spaced
    [-1 + eps, 1 - eps]
    And zero
NORMALIZED! Exponent
There must be a way to express negative exponents -> encode E as a biased value.
    $E = Exp - Bias$
    $Bias = 2^{e-1} - 1$, where e is the number of exponent bits
    For single precision? For double precision?
For normalized values the exp field is never all zeros or all ones!
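The bias formula and the relation E = Exp - Bias translate directly into code. The sketch below assumes a 32-bit IEEE 754 float and a normalized input; fp_bias, float_exponent and demo_exponent are illustrative names, not part of the assignment skeleton.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bias for an exponent field that is exp_bits wide: 2^(exp_bits - 1) - 1. */
int fp_bias(int exp_bits) {
    return (1 << (exp_bits - 1)) - 1;
}

/* Unbiased exponent E of a normalized single-precision value. */
int float_exponent(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);                   /* raw bit pattern */
    int exp_field = (int)((u >> 23) & 0xFFu);   /* the 8-bit exp field */
    return exp_field - fp_bias(8);              /* E = Exp - Bias */
}

void demo_exponent(void) {
    assert(float_exponent(1.0f) == 0);     /* 1.0  = 1.0 * 2^0  */
    assert(float_exponent(8.0f) == 3);     /* 8.0  = 1.0 * 2^3  */
    assert(float_exponent(0.75f) == -1);   /* 0.75 = 1.5 * 2^-1 */
}
```

Evaluating fp_bias with the exponent widths of float and double answers the two questions on the slide.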
NORMALIZED! Significand
We know that M ∈ [1.0, 2.0)
We always have one leading 1
Remove that leading 1 to save one bit!
What are the max and min values for the significand?
DENORMALIZED Exponent
There must be a way to express values very close to 0: the exponent must be as negative as possible.
The exp field is all zeros and the exponent is evaluated as $E = 1 - Bias$
DENORMALIZED Significand
We are close to zero: M ∈ [0.0, 1.0)
We always have one leading 0
Special Values
    Fraction    Exponent    Description
    000…0       111…1       Infinity (+/-): if an operation overflows
    ≠ 000…0     111…1       Not-a-Number (NaN): no numeric value can be determined, e.g. sqrt(-1)
    000…0       000…0       Zero: in fact all zero like integer zero (there is also a -0 in float)
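The exp and frac patterns for normalized, denormalized and special values can be checked without using any float at all. A minimal sketch for single-precision bit patterns; the enum and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum fp_kind { KIND_ZERO, KIND_DENORMALIZED, KIND_NORMALIZED, KIND_INFINITY, KIND_NAN };

/* Classify a single-precision bit pattern (1 sign, 8 exp, 23 frac bits). */
enum fp_kind classify_bits(uint32_t bits) {
    uint32_t exp  = (bits >> 23) & 0xFFu;
    uint32_t frac = bits & 0x7FFFFFu;
    if (exp == 0xFFu)                         /* exp all ones */
        return frac == 0 ? KIND_INFINITY : KIND_NAN;
    if (exp == 0)                             /* exp all zeros */
        return frac == 0 ? KIND_ZERO : KIND_DENORMALIZED;
    return KIND_NORMALIZED;
}

void demo_classify(void) {
    assert(classify_bits(0x00000000u) == KIND_ZERO);          /* +0 */
    assert(classify_bits(0x80000000u) == KIND_ZERO);          /* -0 */
    assert(classify_bits(0x00000001u) == KIND_DENORMALIZED);  /* smallest positive value */
    assert(classify_bits(0x3F800000u) == KIND_NORMALIZED);    /* 1.0 */
    assert(classify_bits(0x7F800000u) == KIND_INFINITY);      /* +infinity */
    assert(classify_bits(0x7FC00000u) == KIND_NAN);           /* a NaN */
}
```

The sign bit is ignored on purpose: each class exists with both signs.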
-0?
In IEEE arithmetic, it is natural to define log 0 = -∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return -∞.
Tiny floating point example
Typical exam question: an 8-bit floating point representation
    s | exp | frac with widths 1 | 4 | 3
    The sign bit is the most significant bit.
    The next four bits are the exponent, with a bias of 7.
    The last three bits are the frac.
Same general form as the IEEE format:
    normalized, denormalized
    representation of 0, NaN, infinity
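To get a feeling for the 8-bit format, it helps to decode a few bit patterns by machine. The following is a sketch only: it returns a double for readability (which the assignment itself does not allow inside the handler), it does not handle exp == 1111 (infinity/NaN), and tiny_decode is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Decode the 8-bit format (1 sign, 4 exp, 3 frac bits, bias 7) into a double.
   Sketch only: the caller must not pass exp == 15 (infinity or NaN). */
double tiny_decode(uint8_t bits) {
    int sign = (bits >> 7) & 1;
    int exp  = (bits >> 3) & 0xF;
    int frac = bits & 0x7;

    int e;       /* unbiased exponent E */
    double m;    /* significand M */
    if (exp == 0) {          /* denormalized: E = 1 - Bias, M = 0.frac */
        e = 1 - 7;
        m = frac / 8.0;
    } else {                 /* normalized: E = exp - Bias, M = 1.frac */
        e = exp - 7;
        m = 1.0 + frac / 8.0;
    }

    double v = m;            /* scale by 2^E without needing math.h */
    for (; e > 0; e--) v *= 2.0;
    for (; e < 0; e++) v /= 2.0;
    return sign ? -v : v;
}

void demo_tiny_decode(void) {
    assert(tiny_decode(0x38) == 1.0);           /* 0 0111 000 */
    assert(tiny_decode(0xB8) == -1.0);          /* 1 0111 000 */
    assert(tiny_decode(0x01) == 1.0 / 512.0);   /* smallest denormalized value */
    assert(tiny_decode(0x00) == 0.0);
}
```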
Conversion
Step 1: Normalize the numbers
Step 2: Round to fit in the fraction
Step 3: Post-normalize to deal with rounding effects
    Value   Binary
    128     1000 0000
    13      0000 1101
Conversion Step 1: Normalize the numbers
Set the binary point such that the number has a leading 1.
Start with exponent 7 (here equal to the bias) and decrement it for every left shift.
    Value   Binary      Fraction    Exponent
    128     1000 0000   1.0000000   7 (no shift)
    13      0000 1101   1.1010000   3 (4 shifts)
Conversion Step 2: Round to fit in the fraction
We have 3-bit fractions.
    Value   Fraction    GRS   Rounded
    128     1.0000000   000   1.000
    13      1.1010000   100   1.101
Conversion Step 3: Post-normalize to deal with rounding effects
Overflow in the fraction due to rounding? (Not here.)
If so: shift right and increment the exponent.
    Value   Binary
    128     1000 0000
    13      0000 1101
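The three conversion steps can be sketched in C for an unsigned 8-bit input. This assumes the 8-bit format with 4 exponent bits, bias 7 and 3 fraction bits, and round-to-nearest with ties to even; the guard bit is taken as the last kept bit, which is one common convention. tiny_encode_u8 is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Encode an unsigned 8-bit integer in the 8-bit format (1 | 4 | 3, bias 7). */
uint8_t tiny_encode_u8(uint8_t value) {
    if (value == 0) return 0;                  /* zero is all bits zero */

    /* Step 1: normalize so the leading 1 sits in bit 7 (1.xxxxxxx). */
    unsigned m = value;
    int e = 7;
    while ((m & 0x80u) == 0) { m <<= 1; e--; }

    /* Step 2: round to 3 fraction bits using guard, round and sticky bits. */
    unsigned sig       = m >> 4;               /* 1.xxx as a 4-bit integer */
    unsigned guard     = sig & 1u;             /* last kept bit */
    unsigned round_bit = (m >> 3) & 1u;        /* first dropped bit */
    unsigned sticky    = (m & 0x7u) != 0;      /* OR of the other dropped bits */
    if (round_bit && (sticky || guard)) sig++; /* nearest, ties to even */

    /* Step 3: post-normalize if rounding overflowed to 10.000. */
    if (sig == 16u) { sig >>= 1; e++; }

    unsigned exp = (unsigned)(e + 7);          /* add the bias; 15 means infinity */
    return (uint8_t)((exp << 3) | (sig & 0x7u));
}

void demo_tiny_encode(void) {
    assert(tiny_encode_u8(128) == 0x70);   /* 0 1110 000 */
    assert(tiny_encode_u8(13)  == 0x55);   /* 0 1010 101 */
}
```

Inputs from 248 upwards round to 256, which does not fit the normalized range; the sketch then produces exp == 1111 with frac == 000, the encoding of +infinity.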
A possible exam question?
You have an 8-bit floating point representation with 3 fraction bits. Give the floating point representations of
    138
    63
Multiplication
Exact result: $F_{new} = (-1)^{s_1 \oplus s_2} \cdot (M_1 \cdot M_2) \cdot 2^{E_1 + E_2}$
With $M = M_1 \cdot M_2$: while (M >= 2) { M = M >> 1; E++; }
Round M to fit the fraction bits
Check if the exponent is still in range
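The multiplication recipe can be tried on the 8-bit example format (1 | 4 | 3, bias 7). The sketch below is not the assignment solution and makes simplifying assumptions: both operands are normalized, overflow returns infinity, and results below the normalized range are flushed to zero instead of being denormalized. Rounding is to nearest, ties to even; tiny_mul is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two normalized values of the 8-bit format (1 | 4 | 3, bias 7). */
uint8_t tiny_mul(uint8_t a, uint8_t b) {
    unsigned sign = ((unsigned)(a ^ b) >> 7) & 1u;            /* s = s1 XOR s2 */
    int e = ((a >> 3) & 0xF) - 7 + ((b >> 3) & 0xF) - 7;      /* E = E1 + E2 */
    unsigned p = (8u | (a & 7u)) * (8u | (b & 7u));           /* M1*M2, scaled by 64 */

    /* M >= 2 means p >= 128: keep one bit more to the left and bump E.
       Instead of shifting p itself, drop one more bit when rounding. */
    int drop = 3;
    if (p >= 128u) { drop = 4; e++; }

    /* Round M to 3 fraction bits. */
    unsigned sig       = p >> drop;
    unsigned round_bit = (p >> (drop - 1)) & 1u;
    unsigned sticky    = (p & ((1u << (drop - 1)) - 1u)) != 0;
    if (round_bit && (sticky || (sig & 1u))) sig++;
    if (sig == 16u) { sig >>= 1; e++; }                       /* post-normalize */

    /* Check if the exponent is still in range. */
    if (e > 7)  return (uint8_t)((sign << 7) | 0x78u);        /* overflow: infinity */
    if (e < -6) return (uint8_t)(sign << 7);                  /* simplification: flush to zero */
    return (uint8_t)((sign << 7) | ((unsigned)(e + 7) << 3) | (sig & 7u));
}

void demo_tiny_mul(void) {
    assert(tiny_mul(0x40, 0x44) == 0x4C);   /* 2 * 3 = 6 */
    assert(tiny_mul(0x3C, 0x3C) == 0x41);   /* 1.5 * 1.5 = 2.25 */
    assert(tiny_mul(0xB8, 0x55) == 0xD5);   /* -1 * 13 = -13 */
}
```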
Addition
Signed align and add (assume $E_1 > E_2$):
    Shift the first operand by the difference of their exponents
    Add the M and s bits
    Apply shift and exponent adjustments until M is in [1.0, 2.0)
    Round
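Align-and-add can be sketched on the 8-bit example format as well. To keep it short, this version only handles two positive, normalized operands, so the signed part of the recipe is left out; overflow returns infinity. The larger significand is shifted left by the exponent difference so that the sum is exact before rounding. tiny_add_pos is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Add two positive, normalized values of the 8-bit format (1 | 4 | 3, bias 7). */
uint8_t tiny_add_pos(uint8_t a, uint8_t b) {
    /* For positive values the bit patterns are ordered like the values,
       so after the swap a is the larger operand and E1 >= E2. */
    if (a < b) { uint8_t t = a; a = b; b = t; }
    int e1 = ((a >> 3) & 0xF) - 7;
    int e2 = ((b >> 3) & 0xF) - 7;

    /* Align: shift the first significand left by E1 - E2, then add exactly. */
    uint32_t m = ((uint32_t)(8u | (a & 7u)) << (e1 - e2)) + (8u | (b & 7u));

    /* Normalize: find the leading 1; the sum equals m * 2^(e2 - 3). */
    int msb = 0;
    while ((m >> (msb + 1)) != 0) msb++;
    int e = e2 - 3 + msb;
    int drop = msb - 3;                      /* at least 1, since m >= 16 */

    /* Round to 3 fraction bits, nearest with ties to even. */
    uint32_t sig       = m >> drop;
    uint32_t round_bit = (m >> (drop - 1)) & 1u;
    uint32_t sticky    = (m & ((1u << (drop - 1)) - 1u)) != 0;
    if (round_bit && (sticky || (sig & 1u))) sig++;
    if (sig == 16u) { sig >>= 1; e++; }      /* post-normalize */

    if (e > 7) return 0x78;                  /* overflow: +infinity */
    return (uint8_t)(((unsigned)(e + 7) << 3) | (sig & 7u));
}

void demo_tiny_add(void) {
    assert(tiny_add_pos(0x38, 0x38) == 0x40);   /* 1 + 1 = 2 */
    assert(tiny_add_pos(0x55, 0x38) == 0x56);   /* 13 + 1 = 14 */
    assert(tiny_add_pos(0x70, 0x55) == 0x71);   /* 128 + 13 = 141, rounds to 144 */
}
```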
http://meseec.ce.rit.edu/eecc250-winter99/250-1-27-2000.pdf
What Every Computer Scientist Should Know About Floating-Point Arithmetic: http://docs.oracle.com/cd/e19957-01/806-3568/ncg_goldberg.html
Assignment 08 Floating Point
Now it's your turn!
Implement your floating point handler in C!
    No use of floats/doubles
    Use the given skeleton
Your float_t
You will represent the float as a struct:
    typedef struct float_t {
        uint8_t sign;
        uint8_t exponent;
        uint32_t mantissa;
    } float_t;
Challenge: Can you use bit fields for this and simply cast the pointer?
Conversion
The only time you are allowed to use floats is when converting to and from your float_t:
    float_t fp_encode(float x);
    float fp_decode(float_t x);
Approach
Create some float numbers and convert them into your float_t.
    Choose good representatives
Do some additions, multiplications and negations with your implemented functions and with the floats.
Compare at the end.
Approach Example
    int main(void) {
        float f1 = 1.123f;
        float f2 = 550;
        float f3;
        float_t ft1 = fp_encode(f1);
        float_t ft2 = fp_encode(f2);
        float_t ft3;

        f3 = f1 + f2;
        ft3 = fp_add(ft1, ft2);

        assert(f3 == fp_decode(ft3));
        return 0;
    }
Submission
Once you have committed your final solution, write an e-mail to me!
    Subject: [CASP] Submission
    Content: Briefly describe what is working / what is not working
Make sure your solution compiles! (with -Wall)
You can also submit your last homework
Have a nice weekend