Floating-Point Data Representation and Manipulation
198:231 Introduction to Computer Organization
Instructor: Nicole Hynes, nicole.hynes@rutgers.edu
Fixed Point Numbers

A fixed point number has an integer part and a fractional part, with a fixed number of digits to the left and right of the radix point.

In decimal: 23.784_10 (integer part 23, fraction .784)

23.784_10 = 2×10^1 + 3×10^0 + 7×10^-1 + 8×10^-2 + 4×10^-3

Similarly in binary:

10.1011_2 = 1×2^1 + 0×2^0 + 1×2^-1 + 0×2^-2 + 1×2^-3 + 1×2^-4
          = 2 + 0 + 0.5 + 0 + 0.125 + 0.0625
          = 2.6875_10

What about base b?
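The positional weighting above works for any base. A minimal Python sketch (`fixed_point_value` is a hypothetical helper name, not something from the course):

```python
def fixed_point_value(digits, base):
    """Evaluate a fixed-point numeral given as a string, e.g. "10.1011" in base 2.

    Digits left of the radix point are weighted by base**i (i >= 0, rightmost
    first); digits right of it are weighted by base**-i (i >= 1, leftmost first).
    """
    int_part, _, frac_part = digits.partition(".")
    value = 0.0
    for i, d in enumerate(reversed(int_part)):
        value += int(d) * base**i
    for i, d in enumerate(frac_part, start=1):
        value += int(d) * base**-i
    return value

print(fixed_point_value("10.1011", 2))  # 2.6875
```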
Converting a Decimal Fraction to a Binary Fraction

Algorithm illustration: 0.6875_10 = ?_2

                       int part   frac part
0.6875 × 2 = 1.375        1         0.375
0.375  × 2 = 0.75         0         0.75
0.75   × 2 = 1.5          1         0.5
0.5    × 2 = 1.0          1         0

Stop when the fractional part reaches 0; read off the integer parts in order.

Therefore, 0.6875_10 = 0.1011_2
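The repeated-multiplication algorithm can be sketched as follows. A `max_bits` cutoff is assumed here, since the fraction may never reach 0:

```python
def frac_to_binary(frac, max_bits=16):
    """Convert a decimal fraction (0 <= frac < 1) to a binary fraction string
    by repeatedly multiplying by 2 and reading off the integer parts.
    Stops when the fraction reaches 0 or max_bits bits have been produced.
    (Hypothetical helper name, for illustration.)"""
    bits = []
    while frac != 0 and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)        # integer part: the next binary digit
        bits.append(str(bit))
        frac -= bit            # keep only the fractional part
    return "0." + "".join(bits)

print(frac_to_binary(0.6875))  # 0.1011
```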
Decimal Fraction to Binary Fraction

Converting from decimal to binary may result in a nonterminating fraction.

Example: 0.1_10 = 0.000110011001..._2 (the block 0011 repeats)

May need to round to the desired number of fractional places.

Example: 0.1_10 = ?_2

                   int part   frac part
0.1 × 2 = 0.2         0         0.2
0.2 × 2 = 0.4         0         0.4
0.4 × 2 = 0.8         0         0.8
0.8 × 2 = 1.6         1         0.6
0.6 × 2 = 1.2         1         0.2
0.2 × 2 = 0.4         0         0.4
0.4 × 2 = 0.8         0         0.8
0.8 × 2 = 1.6         1         0.6
0.6 × 2 = 1.2         1         0.2   ← the sequence repeats from here
Rounding

Because computers represent numbers using a fixed number of bits, both the range and precision of numbers that can be represented are limited. Precision is usually associated with the number of fractional bits allowed by the computer representation. If a number has more fractional bits than the representation allows, it must be rounded to the required precision.

Example: How should 10.1011_2 be rounded to 2 fractional bits? 10.10_2? Or 10.11_2?
Rounding

Rounding modes. Let a be the number and ā its rounded value.

1. Round-toward-zero: round a to the nearest number ā of the desired precision such that |ā| ≤ |a|. Also called truncation, because it simply drops the excess fractional bits.
2. Round-down: round a to the nearest ā of the desired precision such that ā ≤ a. Also called round-toward-negative-infinity.
3. Round-up: round a to the nearest ā of the desired precision such that ā ≥ a. Also called round-toward-positive-infinity.
4. Round-to-even: round a to the ā of the desired precision such that |a − ā| is minimized. If there is a tie, choose the ā whose least significant digit/bit is even. Also called round-to-nearest. Default mode used in the IEEE floating point format, which we'll discuss next.
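A minimal sketch of the four modes, using exact `Fraction` arithmetic so that ties are genuine ties (the mode names here are made up for illustration):

```python
import math
from fractions import Fraction

def round_to_bits(x, nbits, mode):
    """Round Fraction x to nbits fractional binary bits under one of the four
    rounding modes described above. Returns the rounded value as a Fraction."""
    scaled = x * 2**nbits               # reduce the problem to rounding to an integer
    if mode == "toward_zero":
        n = math.trunc(scaled)          # drop excess bits (truncation)
    elif mode == "down":
        n = math.floor(scaled)          # toward negative infinity
    elif mode == "up":
        n = math.ceil(scaled)           # toward positive infinity
    elif mode == "to_even":
        n = round(scaled)               # Python's round() is round-half-to-even
    return Fraction(n, 2**nbits)

# -10.11100_2 = -2.875, rounded to 2 fractional bits:
x = Fraction(-23, 8)
print(round_to_bits(x, 2, "to_even"))   # -3  (= -11.00_2, the tie goes to even)
print(round_to_bits(x, 2, "up"))        # -11/4  (= -10.11_2)
```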
Rounding

Rounding examples. Assume the precision is 2 fractional bits.

Number         Round-toward-0   Round-down   Round-up   Round-to-even
1.4523_10      1.45_10          1.45_10      1.46_10    1.45_10
-2.1786_10     -2.17_10         -2.18_10     -2.17_10   -2.18_10
10.10011_2     10.10_2          10.10_2      10.11_2    10.10_2
-1.00110_2     -1.00_2          -1.01_2      -1.00_2    -1.01_2
-10.11100_2    -10.11_2         -11.00_2     -10.11_2   -11.00_2
1.10100_2      1.10_2           1.10_2       1.11_2     1.10_2
Fixed Point Arithmetic

Adapt the integer arithmetic algorithms; we illustrate for unsigned fixed point only.

Addition and Subtraction
Similar to integer addition/subtraction: just align the radix points.

Example: 100.101_2 + 10.1101_2

    1 0 0 . 1 0 1      = 4.625
  +   1 0 . 1 1 0 1    = 2.8125
  -----------------
    1 1 1 . 0 1 1 1    = 7.4375    (align the binary points before adding)
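Aligning the radix points amounts to padding the shorter fraction with zeros. One way to sketch it (`fx_add` is a hypothetical helper, not from the slides):

```python
def fx_add(a_bits, b_bits):
    """Add two unsigned binary fixed-point strings like '100.101' by scaling
    each to an integer count of 2**-f units, f = the wider fractional width."""
    def parse(s):
        i, _, f = s.partition(".")
        return int(i + f, 2), len(f)      # (integer value, fractional width)
    (a, fa), (b, fb) = parse(a_bits), parse(b_bits)
    f = max(fa, fb)
    total = (a << (f - fa)) + (b << (f - fb))   # align: pad the shorter fraction
    return total / 2**f

print(fx_add("100.101", "10.1101"))  # 7.4375
```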
Fixed Point Arithmetic

Multiplication
1. Ignore the radix points; multiply as integers.
2. Insert the radix point of the product: number of fractional places = sum of the fractional places of the two operands.

Example: 11.01_2 × 0.101_2

            1 1 0 1        (11.01  = 3.25)
         ×  0 1 0 1        (0.101  = 0.625)
         ----------
            1 1 0 1
          0 0 0 0
        1 1 0 1
      0 0 0 0
      --------------
        1 0 0 0 0 0 1      → 2 + 3 = 5 fractional places → 10.00001_2 = 2.03125_10
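The two-step rule carries over directly to code (again a hypothetical helper):

```python
def fx_mul(a_bits, b_bits):
    """Multiply binary fixed-point strings as integers, then place the radix
    point: the product has fa + fb fractional bits."""
    def parse(s):
        i, _, f = s.partition(".")
        return int(i + f, 2), len(f)      # (integer value, fractional width)
    (a, fa), (b, fb) = parse(a_bits), parse(b_bits)
    return (a * b) / 2**(fa + fb)         # shift the radix point fa+fb places left

print(fx_mul("11.01", "0.101"))  # 2.03125
```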
Fixed Point Arithmetic

Division
1. Shift the radix point of the divisor right until it is a whole integer.
2. Shift the radix point of the dividend right the same number of positions.
3. Divide as in integer division.
4. The radix point of the quotient is in the same position as that of the dividend.

Example: 10.1101_2 ÷ 1.01_2    (2.8125_10 ÷ 1.25_10)
Shift both right 2 places: 1011.01_2 ÷ 101_2    (11.25_10 ÷ 5_10)

              1 0 . 0 1
            -----------
  1 0 1 ) 1 0 1 1 . 0 1
         -1 0 1
          -----
              1 0 1
            - 1 0 1
              -----
                  0        quotient = 10.01_2 = 2.25_10

May result in a quotient with a non-terminating fractional part; round to the desired number of fractional places.
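The same shift-then-divide idea, sketched with integer arithmetic and truncation to a requested number of fractional bits (hypothetical helper):

```python
def fx_div(a_bits, b_bits, frac_bits):
    """Divide binary fixed-point strings. The value is (a/2**fa) / (b/2**fb);
    scaling the dividend by 2**(fb + frac_bits) and the divisor by 2**fa makes
    both integers and leaves frac_bits fractional bits in the quotient
    (truncated, i.e. round-toward-zero for unsigned operands)."""
    def parse(s):
        i, _, f = s.partition(".")
        return int(i + f, 2), len(f)      # (integer value, fractional width)
    (a, fa), (b, fb) = parse(a_bits), parse(b_bits)
    q = (a * 2**(fb + frac_bits)) // (b * 2**fa)
    return q / 2**frac_bits

print(fx_div("10.1101", "1.01", 2))  # 2.25
```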
Floating Point Numbers

Fixed point numbers can also be written in scientific notation, also referred to as floating point format: significand × base^exponent.

Decimal: 975.673  = 9.75673 × 10^2        0.000324 = 3.24 × 10^-4
Binary:  1101.011 = 1.101011 × 2^3        0.010111 = 1.0111 × 2^-2

The significand (a.k.a. mantissa) is normalized: exactly one digit/bit to the left of the decimal/binary point. This allows a more compact representation of real numbers than fixed point format.
Floating Point Representation

Most computers support the IEEE 754 standard for encoding floating point numbers:
- Single precision (32 bits): C type float
- Double precision (64 bits): C type double

Intel x86 processors also support an extended precision format (80 bits).
IEEE Single Precision FP Format

Normalized binary FP number: ±1.fraction × 2^exponent

Single precision FP format (32 bits):  s | b_exp | frac  (widths 1, 8, 23)

Field   # Bits   Value / Remarks
s         1      0 if the number is positive; 1 if negative
b_exp     8      exponent + bias, where bias = 2^(8-1) − 1 = 2^7 − 1 = 127; called the biased exponent
frac     23      fractional part of the significand; the 1 to the left of the binary point is not stored (hidden bit)
IEEE Single Precision FP Format

Problem: Find the single precision FP representation of -54.625_10.

Solution:
1. Convert the magnitude to binary FP: 54.625_10 = 110110.101_2
2. Normalize: 110110.101 = 1.10110101 × 2^5
3. Map to single precision FP format:
   s = 1 (negative)
   frac = 10110101000000000000000 (pad with zeros to make 23 bits)
   b_exp = 5 + 127 = 132 = 10000100
4. Answer: 1 10000100 10110101000000000000000
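The four steps can be cross-checked against Python's struct module, which exposes the raw single-precision bit pattern:

```python
import struct

# Pack -54.625 as an IEEE single-precision float and pull the three
# fields back out of the 32-bit pattern.
bits, = struct.unpack(">I", struct.pack(">f", -54.625))
s     = bits >> 31            # sign bit
b_exp = (bits >> 23) & 0xFF   # 8-bit biased exponent
frac  = bits & 0x7FFFFF       # 23-bit fraction (hidden bit not stored)

print(f"{s:b} {b_exp:08b} {frac:023b}")
# 1 10000100 10110101000000000000000
```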
IEEE Single Precision FP Format

FP numbers that can be represented in IEEE single precision format:

1. Normalized values
   Numbers of the form ±1.fraction × 2^exponent, with −126 ≤ exponent ≤ 127 (equivalently, 1 ≤ b_exp ≤ 254).
   Most positive/negative number: ±1.11...1 × 2^127
   Least positive/negative number: ±1.00...0 × 2^-126
   (Pattern: s | b_exp ≠ 0 and ≠ 255 | frac)

   Observations on b_exp:
   - always positive
   - 00000000 (all zeros) and 11111111 (all ones) are not used: these bit patterns represent special values
IEEE Single Precision FP Format

2. Denormalized values
   a. b_exp = 0 and frac = 0 represents the value ±0.0.
      Note: there are two representations of zero.
      (Pattern: s | 00000000 | 00000000000000000000000)
   b. b_exp = 0 and frac ≠ 0 represents a binary number of the form ±0.fraction × 2^-126.
      (Pattern: s | 00000000 | frac)
      Notes:
      - significand < 1 (the bit to the left of the binary point is 0)
      - the exponent must be −126 (= 1 − bias)
      - allows representation of numbers smaller in magnitude than the least positive/negative normalized number, ±1.00...0 × 2^-126
IEEE Single Precision FP Format

3. Special values
   a. b_exp = all 1's and frac = 0 represents ±∞. Typically used to represent results that overflow.
      (Pattern: s | 11111111 | 00000000000000000000000)
   b. b_exp = all 1's and frac ≠ 0 represents NaN ("Not a Number"). Typically used to represent results that can't be represented as a real number (e.g., √−1).
      (Pattern: s | 11111111 | frac ≠ 0)
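The three cases (normalized, denormalized, special) can be told apart from b_exp and frac alone. A small classifier, written as a sketch:

```python
def classify(bits):
    """Classify a 32-bit IEEE single-precision pattern using the rules above.
    (Hypothetical helper name, for illustration.)"""
    s, b_exp, frac = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
    if b_exp == 255:                      # all 1's: infinity or NaN
        return "NaN" if frac != 0 else ("-inf" if s else "+inf")
    if b_exp == 0:                        # all 0's: zero or denormalized
        return "zero" if frac == 0 else "denormalized"
    return "normalized"

print(classify(0x7F800000))  # +inf
print(classify(0xFF800001))  # NaN
print(classify(0x00000001))  # denormalized  (the smallest positive value, 2^-149)
print(classify(0x80000000))  # zero  (this pattern is -0.0)
```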
Why Use a Biased Representation?

The IEEE single precision FP format can be generalized to any number of exponent and fractional bits: ±1.fraction × 2^exponent, stored as s | b_exp | frac with widths 1, k, n.

For a k-bit biased exponent field:
- bias = 2^(k-1) − 1
- b_exp = exponent + bias
- the exponent of a normalized FP number is limited to [−(2^(k-1) − 2), 2^(k-1) − 1]
- as a result, 1 ≤ b_exp ≤ 2^k − 2
- as before, b_exp = all 0's and all 1's are used to represent denormalized values and special values
Why Use a Biased Representation?

By biasing the exponent, i.e. adding 2^(k-1) − 1 to the true exponent, the resulting biased exponent is always nonnegative and hence can be treated as an unsigned integer.

Comparing unsigned integers is easy. Treated as unsigned integers, which is larger: 10100111 or 10111010? Compare bitwise starting from the left (msb) and stop at the first bit position where the numbers differ; the number with a 1 in that position is larger.

1 0 1 0 0 1 1 1
1 0 1 1 1 0 1 0   ← larger

Two numbers in IEEE FP format with the same sign can be compared with the same algorithm:

0 10100111 01011100000000000000000   = +1.010111 × 2^40
0 10111011 01101100000000000000000   = +1.011011 × 2^60   (larger)
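The same-sign comparison trick is easy to confirm in Python; the values below mirror the slide's example:

```python
import struct

def float_bits(x):
    """32-bit IEEE pattern of a float, as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

# Because the biased exponent is nonnegative and sits in the high-order
# bits, unsigned comparison of the patterns matches numeric comparison
# for two positive floats.
a = 1.359375 * 2**40   # +1.010111_2 x 2^40
b = 1.421875 * 2**60   # +1.011011_2 x 2^60
assert (float_bits(a) < float_bits(b)) == (a < b)

print(f"{float_bits(a):032b}")  # 01010011101011100000000000000000
print(f"{float_bits(b):032b}")  # 01011101101101100000000000000000
```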
IEEE Double Precision FP Format

Normalized binary FP number: ±1.fraction × 2^exponent

Double precision FP format (64 bits):  s | b_exp | frac  (widths 1, 11, 52)

Field   # Bits   Value / Remarks
s         1      0 if the number is positive; 1 if negative
b_exp    11      exponent + bias, where bias = 2^(11-1) − 1 = 2^10 − 1 = 1023; called the biased exponent
frac     52      fractional part of the significand; the 1 to the left of the binary point is not stored (hidden bit)
x86 Extended Precision

Normalized binary FP number: ±1.fraction × 2^exponent

Extended precision FP format (80 bits):  s | b_exp | 1.frac  (widths 1, 15, 64)

Field   # Bits   Value / Remarks
s         1      0 if the number is positive; 1 if negative
b_exp    15      exponent + bias, where bias = 2^(15-1) − 1 = 2^14 − 1 = 16,383; called the biased exponent
frac     64      the entire significand 1.fraction — no hidden bit! (the leading 1 is stored explicitly)
Floating Point Arithmetic

Addition and Subtraction
1. Make the exponents equal
2. Add/subtract the significands
3. Normalize the result

Why? Let A = a × 2^e1 and B = b × 2^e2, and suppose e1 < e2.
Then A can be rewritten as A = (a × 2^-(e2-e1)) × 2^e2.
Therefore, A + B = ((a × 2^-(e2-e1)) + b) × 2^e2:
shift a right (e2−e1) places past the binary point, then add it to b.
Floating Point Arithmetic

Addition example (IEEE single precision format):

  0 01111101 00000000000000000000000   = 1.0 × 2^-2   = 0.25_10
+ 0 10000101 10010000000000000000000   = 1.1001 × 2^6 = 100.0_10

Don't forget the hidden bit! To simplify the illustration, let's show the hidden bit explicitly (the bit just left of the frac field forms the significand with it):

0 01111101 1 00000000000000000000000   = 0.25_10
0 10000101 1 10010000000000000000000   = 100.0_10
Floating Point Arithmetic

Addition example, cont.

  0 01111101 1 00000000000000000000000   = 0.25_10
+ 0 10000101 1 10010000000000000000000   = 100.0_10

1. Make the exponents equal.
   To leave a value unchanged:
   - shift the significand left by 1 bit → decrease the exponent by 1
   - shift the significand right by 1 bit → increase the exponent by 1
   Increase the smaller exponent to equal the larger one. Why? Shifting the significand right loses only the least significant bits.
   Therefore, increase the exponent of 0.25_10, shifting its significand right by 10000101 − 01111101 = 00001000 = 8_10 places.
Floating Point Arithmetic

Addition example, cont.

Shift the significand of 0.25_10 right by 8 places. Note that the hidden bit is shifted into the msb of the frac field:

0 01111101 1 00000000000000000000000   original value
0 01111110 0 10000000000000000000000   shift right 1 place
0 01111111 0 01000000000000000000000   shift right 2 places
0 10000000 0 00100000000000000000000   shift right 3 places
0 10000001 0 00010000000000000000000   shift right 4 places
0 10000010 0 00001000000000000000000   shift right 5 places
0 10000011 0 00000100000000000000000   shift right 6 places
0 10000100 0 00000010000000000000000   shift right 7 places
0 10000101 0 00000001000000000000000   shift right 8 places
Floating Point Arithmetic

Addition example, cont.

2. Add the significands:

  0 10000101 0 00000001000000000000000   = 0.25_10
+ 0 10000101 1 10010000000000000000000   = 100.0_10
  -------------------------------------
  0 10000101 1 10010001000000000000000

3. Normalize the result (already normalized; re-hide the hidden bit):

0 10000101 10010001000000000000000   = 100.25_10
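The whole worked example can be replayed at the bit level. This sketch assumes the inputs behave as in this example (the smaller exponent is first, no rounding is needed, and no carry propagates out of the top significand bit):

```python
import struct

def fields(x):
    """(sign, biased exponent, 23-bit frac) of a single-precision float."""
    bits, = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

(_, e1, f1), (_, e2, f2) = fields(0.25), fields(100.0)
sig1 = f1 | (1 << 23)            # restore the hidden bit: 24-bit significands
sig2 = f2 | (1 << 23)
sig = (sig1 >> (e2 - e1)) + sig2  # step 1: align exponents; step 2: add
# Step 3: the sum still has its leading 1 in bit 23, so it is already
# normalized; masking with 0x7FFFFF re-hides the hidden bit.
result_bits = (e2 << 23) | (sig & 0x7FFFFF)
value, = struct.unpack(">f", struct.pack(">I", result_bits))
print(value)  # 100.25
```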
Floating Point Arithmetic

Multiplication
1. Add the exponents
2. Multiply the significands
3. Normalize the result

Why? Let A = a × 2^e1 and B = b × 2^e2. Then A × B = (a × b) × 2^(e1+e2).
Floating Point Arithmetic

Multiplication example (IEEE single precision format):

0 01111100 01000000000000000000000   = 1.01 × 2^-3  = 0.15625_10
1 10000011 11000000000000000000000   = -1.11 × 2^4  = -28.0_10

As before, let's show the hidden bit:

0 01111100 1 01000000000000000000000   = 0.15625_10
1 10000011 1 11000000000000000000000   = -28.0_10
Floating Point Arithmetic

Multiplication example, cont.

1. Add the true exponents.

b_exp1: 0 01111100 1 01000000000000000000000   = 0.15625_10
b_exp2: 1 10000011 1 11000000000000000000000   = -28.0_10

Note that these are biased exponents:
b_exp1 = true_exponent1 + 127, so true_exponent1 = b_exp1 − 127
b_exp2 = true_exponent2 + 127, so true_exponent2 = b_exp2 − 127

Now true_exponent_result = true_exponent1 + true_exponent2. Therefore,

b_exp_result = true_exponent_result + 127
             = (b_exp1 + b_exp2) − 127
             = (01111100 + 10000011) − 01111111
             = 10000000
Floating Point Arithmetic

Multiplication example, cont.

2. Multiply the significands.

0 01111100 1 01000000000000000000000   = 0.15625_10
1 10000011 1 11000000000000000000000   = -28.0_10

significand_result = 1.01 × 1.11 = 10.0011
sign_result = 1 (the operands have opposite signs)
b_exp_result = 10000000 (from the previous step)

3. Normalize the result:
   shift significand_result right by 1 bit → 1.00011 (hide the hidden bit in IEEE format!)
   increase b_exp_result by 1 → 10000001

1 10000001 00011000000000000000000   = -4.375_10
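The biased-exponent bookkeeping can be verified directly from the bit patterns of this example:

```python
import struct

def fields(x):
    """(sign, biased exponent, 23-bit frac) of a single-precision float."""
    bits, = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

(s1, e1, _), (s2, e2, _) = fields(0.15625), fields(-28.0)
sr, er, fr = fields(0.15625 * -28.0)

assert sr == s1 ^ s2               # sign: XOR of the operand signs
assert er == (e1 + e2 - 127) + 1   # +1 from normalizing 10.0011 -> 1.00011

print(f"{sr} {er:08b} {fr:023b}")
# 1 10000001 00011000000000000000000
```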
Floating Point Arithmetic

Division
1. Subtract the exponents
2. Divide the significands
3. Normalize the result

Why? Let A = a × 2^e1 and B = b × 2^e2. Then A / B = (a / b) × 2^(e1−e2).
Floating Point Arithmetic

Division example (IEEE single precision format):

0 10000110 00011000000000000000000   = 1.00011 × 2^7  = 140.0_10
0 01111101 11000000000000000000000   = 1.11 × 2^-2    = 0.4375_10

As before, let's show the hidden bit:

0 10000110 1 00011000000000000000000   = 140.0_10
0 01111101 1 11000000000000000000000   = 0.4375_10
Floating Point Arithmetic

Division example, cont.

1. Subtract the true exponents.

b_exp1: 0 10000110 1 00011000000000000000000   = 140.0_10
b_exp2: 0 01111101 1 11000000000000000000000   = 0.4375_10

Note that these are biased exponents:
b_exp1 = true_exponent1 + 127, so true_exponent1 = b_exp1 − 127
b_exp2 = true_exponent2 + 127, so true_exponent2 = b_exp2 − 127

Now true_exponent_result = true_exponent1 − true_exponent2. Therefore,

b_exp_result = true_exponent_result + 127
             = (b_exp1 − b_exp2) + 127
             = (10000110 − 01111101) + 01111111
             = 10001000
Floating Point Arithmetic

Division example, cont.

2. Divide the significands.

0 10000110 1 00011000000000000000000   = 140.0_10
0 01111101 1 11000000000000000000000   = 0.4375_10

significand_result = 1.00011 ÷ 1.11 = 0.101
sign_result = 0
b_exp_result = 10001000 (from the previous step)

3. Normalize the result:
   shift significand_result left by 1 bit → 1.01 (hide the hidden bit in IEEE format!)
   decrease b_exp_result by 1 → 10000111

0 10000111 01000000000000000000000   = 320.0_10
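As with multiplication, the exponent arithmetic of this example can be checked against the actual bit patterns (the quotient 140.0 / 0.4375 = 320.0 is exactly representable, so no rounding interferes):

```python
import struct

def fields(x):
    """(sign, biased exponent, 23-bit frac) of a single-precision float."""
    bits, = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

(_, e1, _), (_, e2, _) = fields(140.0), fields(0.4375)
sr, er, fr = fields(140.0 / 0.4375)

assert er == (e1 - e2 + 127) - 1   # -1 from normalizing 0.101 -> 1.01

print(f"{sr} {er:08b} {fr:023b}")
# 0 10000111 01000000000000000000000
```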