# IEEE-754 floating-point


## Real and floating-point numbers

Real numbers ℝ form a continuum
- Rational numbers are a subset of the reals
- Some numbers are irrational, e.g. π

Floating-point numbers are an approximation of real numbers
- If finite in length, they are a subset of the rationals
- They consist of a sign, a significant-digits part (the mantissa or significand), and an exponent of the base (people usually use base 10)

## Floating Point

Floating-point numbers are represented by:
- a sign
- a significand or mantissa
- an exponent

The stored fields, in order: sign part, exponent part, significand part.

Sign is easy
- 0: number is positive
- 1: number is negative
- numerically, factor = $(-1)^{\text{sign}}$

Significand and exponent have structure
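As a worked equation, the three parts combine as shown here. The symbols $s$, $m$ and $e$ are labels introduced for this sketch (sign bit, significand, exponent), and base 2 is assumed, as in the slides that follow.

$$
x = (-1)^{s} \times m \times 2^{e}
\qquad\text{e.g.}\qquad
-6.5 = (-1)^{1} \times 1.625 \times 2^{2}
$$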

## Significand

Floating-point numbers are normalized
- Represent the value as a binary (fixed-point) number
- Multiply by a positive or negative power of 2, such that there is a single 1 bit to the left of the radix point

The leftmost bit (to the left of the radix point) is always 1, so it doesn't need to be stored
- The 1 is hidden or implicit
- Store only the bits to the right of the radix point as the significand

## Exponent

- The exponent is a power of 2
- Exponents can be positive or negative
- Exponents are stored in Excess-N notation
  - N is typically $2^{m-1} - 1$ for m-bit storage

Example, in 5 bits: Excess-$(2^{5-1} - 1)$ = Excess-15

Example, in 8 bits: Excess-$(2^{8-1} - 1)$ = Excess-127

## IEEE-754

A standard for representing floating-point (f.p.) numbers in computer systems
- Three binary formats, two decimal formats
- Additional "storage" formats
- Adopted in 1985; since updated in many operational details

All formats share some characteristics
- Normalized
- Implicit MSb (most significant bit)
- Sign-magnitude representation for significand
- Excess-N representation for exponent
- Special values for exceptional cases

## Formats

- binary16: "Half-precision", storage only
- binary32: "Single precision"
- binary64: "Double precision"
- binary128: "Quadruple precision"
- decimal32: storage only
- decimal64
- decimal128

Decimal formats are new to the 2008 revision. IBM z-systems implement these formats.

## IEEE-754 Binary Formats

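## Examples

1, in Binary16: `0 01111 0000000000` = 0x3c00
- sign bit: 0
- exponent: 01111 = 15, and 15 - 15 = 0
- significand: 0000000000 (the leftmost 1-bit is implicit)

-2, in Binary32: `1 10000000 00000000000000000000000` = 0xc0000000
- sign bit: 1
- exponent: 10000000 = 128, and 128 - 127 = 1
- significand: 00000000000000000000000

0.3125, in Binary32: `0 01111101 01000000000000000000000` = 0x3ea00000
- sign bit: 0
- exponent: 01111101 = 125, and 125 - 127 = -2
- significand: 01000000000000000000000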

## Exceptional Values: small

Exponent = all 0's
- 0 significand: true zero
  - positive and negative 0 are both legal
- non-zero significand: values are subnormal or denormalized
  - no implicit one bit
  - trade off precision for smaller exponents

Binary16 examples:
- `0 00000 0000000000` = +0, true (positive) zero
- `1 00000 1111111111` = $-1023 \times 2^{-24} \approx -6.1 \times 10^{-5}$, the largest (negative) subnormal
- `0 00000 0000000001` = $2^{-24} \approx 6.0 \times 10^{-8}$, the smallest possible positive number in Binary16

## Exceptional Values: large

Exponent = all 1's
- 0 significand: positive or negative infinity
- non-zero significand: NaN (Not a Number), an indication of an error condition, e.g. 0/0 (a non-zero value divided by zero yields ±infinity instead)

Binary16 examples:
- `1 11111 0000000000` = negative infinity
- quiet NaN, e.g. 0/0: indeterminate values; the sign doesn't matter
- signaling NaN: invalid operations, e.g. raising a machine exception

## Binary32 Format Again

| Field | sign | exponent | significand |
|---|---|---|---|
| Width | 1 bit | 8 bits | 23 bits |
| Meaning | 1 if negative | Excess-127 notation, range -126 to +127 | normalized to 1 ≤ value < 2, leftmost 1 bit not represented |

- All 0's in the exponent and significand fields represent ±0
- Other values with all 0's in the exponent field (looks like -127) are subnormal or denormalized values
  - exponent is -126; hidden bit is 0
- Values with all 1's in the exponent field (looks like 128) and significand 0 (all 0 bits) represent ±infinity
- Other values with all 1's in the exponent field represent NaNs, "Not a Number" values

## C types and IEEE-754

C's `float` datatype generally uses "single precision", a.k.a. Binary32
- about 7 decimal digits of precision
- dynamic range roughly $10^{-38}$ to $10^{38}$

C's `double` datatype generally uses "double precision", a.k.a. Binary64
- about 15 decimal digits of precision
- dynamic range roughly $10^{-308}$ to $10^{308}$
- `double` is frequently used for scientific calculations

## Show the Bits in Binary32

## Intel Processors: "Endian"-ness

Intel and AMD processors are "little-endian"
- Core i7, Opteron, etc.

Little-endian: Least Significant Byte (LSB) stored in lowest memory address

Big-endian: LSB stored in highest memory address
- Most Significant Byte (MSB) stored in lowest memory address

Multi-byte values are affected by the endianness
- That's everything except characters

## A routine to inspect endianness

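## Floating-Point -1.0 in 32-bit Intel Memory

| Memory address | Contents (hex) | Contents (binary) |
|---|---|---|
| 0x7fff1b5a4360 | 0x00 | 0000_0000 |
| 0x7fff1b5a4361 | 0x00 | 0000_0000 |
| 0x7fff1b5a4362 | 0x80 | 1000_0000 |
| 0x7fff1b5a4363 | 0xbf | 1011_1111 |

The f.p. value -1.0 is sign 1, exponent 0x7f, significand 0x000000: `1_0111_1111_000_0000_0000_0000_0000_0000`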

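## Floating-Point -2.0 in 32-bit Intel Memory

| Memory address | Contents (hex) | Contents (binary) |
|---|---|---|
| 0x7fff1b5a4360 | 0x00 | 0000_0000 |
| 0x7fff1b5a4361 | 0x00 | 0000_0000 |
| 0x7fff1b5a4362 | 0x00 | 0000_0000 |
| 0x7fff1b5a4363 | 0xc0 | 1100_0000 |

The f.p. value -2.0 is sign 1, exponent 0x80, significand 0x000000: `1_1000_0000_000_0000_0000_0000_0000_0000`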

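## Floating-Point 8.5 in 32-bit Intel Memory

| Memory address | Contents (hex) | Contents (binary) |
|---|---|---|
| 0x7fff1b5a4360 | 0x00 | 0000_0000 |
| 0x7fff1b5a4361 | 0x00 | 0000_0000 |
| 0x7fff1b5a4362 | 0x08 | 0000_1000 |
| 0x7fff1b5a4363 | 0x41 | 0100_0001 |

The f.p. value 8.5 is sign 0, exponent 0x82, significand 0x080000: `0_1000_0010_000_1000_0000_0000_0000_0000`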

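## Floating-Point 8.99 in 32-bit Intel Memory

| Memory address | Contents (hex) | Contents (binary) |
|---|---|---|
| 0x7fff1b5a4360 | 0x0a | 0000_1010 |
| 0x7fff1b5a4361 | 0xd7 | 1101_0111 |
| 0x7fff1b5a4362 | 0x0f | 0000_1111 |
| 0x7fff1b5a4363 | 0x41 | 0100_0001 |

The f.p. value 8.99 is sign 0, exponent 0x82, significand 0x0fd70a: `0_1000_0010_000_1111_1101_0111_0000_1010`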

