IEEE-754 floating-point

Real and floating-point numbers
- Real numbers (R) form a continuum
  - Rational numbers are a subset of the reals
  - Some numbers are irrational, e.g. π
- Floating-point numbers are an approximation of real numbers
  - If finite in length, they are a subset of the rationals
  - Consist of a sign, a significant-digits part (the mantissa or significand), and an exponent of the base (people usually use base 10)

Floating Point
- Floating-point numbers are represented by three parts:
  - a sign
  - a significand or mantissa
  - an exponent
- Stored layout, left to right: sign part, exponent part, significand part
- Sign is easy
  - 0: number is positive
  - 1: number is negative
  - numerically, factor = $(-1)^{\text{sign}}$
- Significand and exponent have structure

Significand
- Floating-point numbers are normalized
  - Represent as a binary (fixed-point) number
  - Multiply by a positive or negative power of 2, such that there is a single 1 bit to the left of the radix point
- Example: $14.5_{10} = 1110.1000_2 = 1.1101000_2 \times 2^{3}$
- The leftmost bit (to the left of the radix point) is always 1, so it doesn't need to be stored
  - The 1 is hidden or implicit
  - Store 1101000 as the significand
- Example 2: $0.3125_{10} = 0.0101_2 = (1.)0100000_2 \times 2^{-2}$
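The normalization step can be sketched in C as repeated halving or doubling until the value lies in [1, 2). This is a minimal illustration, not how hardware does it; the `normalized` struct and `normalize` function are hypothetical names, and the input is assumed to be finite and non-zero.

```c
/* Hypothetical result type: x = (-1)^sign * significand * 2^exponent. */
struct normalized {
    int sign;            /* 0 = positive, 1 = negative */
    double significand;  /* 1 <= significand < 2 */
    int exponent;        /* unbiased power of 2 */
};

/* Assumes x is finite and non-zero; zero, infinity and NaN are not handled. */
struct normalized normalize(double x)
{
    struct normalized n;
    n.sign = (x < 0.0) ? 1 : 0;
    n.significand = (x < 0.0) ? -x : x;
    n.exponent = 0;
    /* Too large: shift the radix point left, one bit at a time. */
    while (n.significand >= 2.0) {
        n.significand /= 2.0;
        n.exponent++;
    }
    /* Too small: shift the radix point right, one bit at a time. */
    while (n.significand < 1.0) {
        n.significand *= 2.0;
        n.exponent--;
    }
    return n;
}
```

For 14.5 this yields significand 1.8125 (binary 1.1101) with exponent 3, and for 0.3125 it yields 1.25 (binary 1.01) with exponent -2.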

Exponent
- Exponent is a power of 2
- Exponents can be positive or negative
- Exponents are stored in Excess-N notation
  - N is typically $2^{m-1} - 1$ for m-bit storage
- Example: $2^{3}$ in 5 bits
  - Excess-$(2^{5-1} - 1)$ = Excess-15
  - $3 + 15 = 18_{10} = 10010_2$
- Example: $2^{-2}$ in 8 bits
  - Excess-$(2^{8-1} - 1)$ = Excess-127
  - $-2 + 127 = 125_{10} = 01111101_2$
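The Excess-N arithmetic is small enough to write out directly. The helper names below (`excess_bias`, `excess_encode`, `excess_decode`) are made up for this sketch, and no range checking is done on the exponent.

```c
/* Bias N = 2^(m-1) - 1 for an m-bit exponent field. */
unsigned excess_bias(unsigned m)
{
    return (1u << (m - 1)) - 1u;
}

/* Stored field value for exponent e in an m-bit Excess-N field. */
unsigned excess_encode(int e, unsigned m)
{
    return (unsigned)(e + (int)excess_bias(m));
}

/* Recover the exponent from a stored field value. */
int excess_decode(unsigned stored, unsigned m)
{
    return (int)stored - (int)excess_bias(m);
}
```

With these, `excess_encode(3, 5)` gives 18 and `excess_encode(-2, 8)` gives 125, matching the two examples.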

IEEE-754
- A standard for representing floating-point (f.p.) numbers in computer systems
  - Three binary formats, two decimal formats
  - Additional "storage" formats
  - Adopted in 1985, updated in 2008
  - Many operational details
- The binary formats share some characteristics
  - Normalized
  - Implicit MSb
  - Sign-magnitude representation for significand
  - Excess-N representation for exponent
  - Special values for exceptional cases

Formats
- binary16: "Half precision", storage only
- binary32: "Single precision"
- binary64: "Double precision"
- binary128: "Quadruple precision"
- decimal32: storage only
- decimal64
- decimal128
- Decimal formats are new to the 2008 revision
  - IBM z-systems implement these formats

IEEE-754 Binary Formats

Examples
- 1, in Binary16: 0 01111 0000000000 = 0x3c00
  - sign bit: 0
  - exponent: 0, stored as $0 + 15 = 15_{10} = 01111_2$
  - significand: $0000000000_2$ (leftmost 1 bit is implicit)
- -2, in Binary32: 1 10000000 00000000000000000000000 = 0xc000 0000
  - sign bit: 1
  - exponent: 1, stored as $1 + 127 = 128_{10} = 10000000_2$
  - significand: $00000000000000000000000_2$
- 0.3125, in Binary32: 0 01111101 01000000000000000000000 = 0x3ea0 0000
  - sign bit: 0
  - exponent: -2, stored as $-2 + 127 = 125_{10} = 01111101_2$
  - significand: $01000000000000000000000_2$
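The Binary32 examples can be reproduced by packing the three fields into a 32-bit word. This is a sketch that assumes `float` is IEEE-754 Binary32 and 4 bytes wide, which is common but not guaranteed by C; `pack_binary32` and `bits_to_float` are hypothetical helper names, and only normalized values are handled.

```c
#include <stdint.h>
#include <string.h>

/* Pack sign (0 or 1), unbiased exponent (-126..127) and the 23 stored
   significand bits into a Binary32 bit pattern. Normalized values only. */
uint32_t pack_binary32(uint32_t sign, int exponent, uint32_t fraction)
{
    uint32_t biased = (uint32_t)(exponent + 127);   /* Excess-127 */
    return (sign << 31) | (biased << 23) | (fraction & 0x7FFFFFu);
}

/* Reinterpret a bit pattern as a float (assumes float is Binary32). */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

`pack_binary32(1, 1, 0)` gives 0xc0000000, which reads back as -2, and `pack_binary32(0, -2, 0x200000)` gives 0x3ea00000, which reads back as 0.3125.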

Exceptional Values: small
- Exponent = all 0s
  - 0 significand: true zero
    - positive and negative 0 are both legal
  - non-zero significand: values are subnormal or denormalized
    - no implicit one bit
    - trade off precision for smaller exponents
- Binary16 examples:
  - 0 00000 0000000000 = +0, true (positive) zero
  - 1 00000 1111111111 = $-0.1111111111_2 \times 2^{-14}$, the largest-magnitude negative subnormal
  - 0 00000 0000000001 = $0.0000000001_2 \times 2^{-14} = 1 \times 2^{-24}$, the smallest positive number in Binary16
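The value of a Binary16 pattern with an all-0s exponent field follows directly from the rule "no implicit one bit, exponent -14". The function below is a hypothetical sketch that evaluates only such patterns; it ignores the exponent field, so passing a normalized pattern gives a meaningless result.

```c
#include <stdint.h>

/* Value of a Binary16 bit pattern whose exponent field is all 0s
   (zero or subnormal): (-1)^sign * (fraction / 2^10) * 2^-14. */
double binary16_subnormal_value(uint16_t bits)
{
    int sign = (bits >> 15) & 1;
    unsigned fraction = bits & 0x3FFu;                    /* 10 stored bits */
    double magnitude = (double)fraction / 1024.0 / 16384.0; /* 2^10, 2^14 */
    return sign ? -magnitude : magnitude;
}
```

For the pattern 0x0001 this returns exactly 2^-24, the smallest positive Binary16 value.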

Exceptional Values: large
- Exponent = all 1s
  - 0 significand: positive or negative infinity, e.g. a non-zero value divided by zero
  - non-zero significand: NaN (Not a Number), an indication of an error condition, e.g. zero divided by zero
- Binary16 examples:
  - 1 11111 0000000000 = negative infinity, $-\infty$
  - 0 11111 1000000000 = quiet NaN, e.g. 0/0
    - indeterminate values
    - the sign doesn't matter
  - 0 11111 0100000000 = signaling NaN
    - invalid operations, e.g. a machine exception
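The exponent and significand tests for exceptional values can be combined into one classifier for Binary16 bit patterns. This is a sketch with invented names (`f16_class`, `classify_binary16`); it assumes the convention used in the examples, where a set leftmost significand bit marks a quiet NaN.

```c
#include <stdint.h>

enum f16_class {
    F16_ZERO, F16_SUBNORMAL, F16_NORMAL,
    F16_INFINITY, F16_QUIET_NAN, F16_SIGNALING_NAN
};

/* Classify a Binary16 bit pattern by its exponent and significand fields.
   The sign bit is ignored: it does not affect the class. */
enum f16_class classify_binary16(uint16_t bits)
{
    unsigned exponent = (bits >> 10) & 0x1Fu;  /* 5-bit field */
    unsigned fraction = bits & 0x3FFu;         /* 10-bit field */

    if (exponent == 0)
        return fraction == 0 ? F16_ZERO : F16_SUBNORMAL;
    if (exponent == 0x1Fu) {
        if (fraction == 0)
            return F16_INFINITY;
        /* Assumed convention: leftmost significand bit set means quiet. */
        return (fraction & 0x200u) ? F16_QUIET_NAN : F16_SIGNALING_NAN;
    }
    return F16_NORMAL;
}
```

The three patterns from the examples are 0xfc00 (negative infinity), 0x7e00 (quiet NaN) and 0x7d00 (signaling NaN).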

Binary32 Format Again
- sign: 1 bit; 1 if negative
- exponent: 8 bits; Excess-127 notation, range -126 to +127
- significand: 23 bits; normalized to $1 \le \text{value} < 2$, leftmost 1 bit not represented
- All 0s in the exponent and significand fields represent ±0
- Other values with all 0s in the exponent field (looks like -127) are subnormal or denormalized values
  - exponent is -126
  - hidden bit is 0
- Values with all 1s in the exponent field (looks like 128) and significand 0 (all 0 bits) represent ±infinity
- Other values with all 1s in the exponent field represent NaNs, "Not a Number" values
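Going the other way, the fields of a `float` can be pulled apart with shifts and masks, applying the subnormal rule for the exponent and hidden bit. This sketch assumes `float` is Binary32; the struct and function names are hypothetical, and for infinities and NaNs (stored exponent 255) the `exponent` and `hidden_bit` members carry no meaning.

```c
#include <stdint.h>
#include <string.h>

struct binary32_fields {
    unsigned sign;        /* 1 if negative */
    unsigned stored_exp;  /* raw 8-bit exponent field, 0..255 */
    int exponent;         /* effective power of 2; -126 for zero/subnormal */
    unsigned hidden_bit;  /* 1 for normalized values, 0 for zero/subnormal */
    uint32_t fraction;    /* the 23 stored significand bits */
};

struct binary32_fields unpack_binary32(float f)
{
    uint32_t bits;
    struct binary32_fields r;

    memcpy(&bits, &f, sizeof bits);        /* assumes float is 32 bits */
    r.sign = (unsigned)(bits >> 31);
    r.stored_exp = (unsigned)((bits >> 23) & 0xFFu);
    r.fraction = bits & 0x7FFFFFu;
    r.hidden_bit = (r.stored_exp != 0);
    r.exponent = (r.stored_exp == 0) ? -126 : (int)r.stored_exp - 127;
    return r;
}
```

For -2.0f this reports sign 1, stored exponent 128 and effective exponent 1; for 0.0f the hidden bit is 0.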

C types and IEEE-754
- C's float datatype generally uses "single precision", a.k.a. Binary32
  - about 7 decimal digits of precision
  - dynamic range roughly $10^{-45}$ to $10^{+38}$, counting subnormals at the low end
- C's double datatype generally uses "double precision", a.k.a. Binary64
  - about 15 decimal digits of precision
  - dynamic range roughly $10^{-324}$ to $10^{+308}$, counting subnormals at the low end
  - double frequently used for scientific calculations
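The precision and range figures can be checked against the macros in `<float.h>`. The sketch below uses hypothetical function names; `decimal_digits` estimates digits as significand bits times log10(2), approximated as 0.30103, and the `*_TRUE_MIN` macros are printed only where the compiler provides them (they were added in C11).

```c
#include <float.h>
#include <stdio.h>

/* Approximate decimal digits carried by a p-bit binary significand. */
double decimal_digits(int significand_bits)
{
    return significand_bits * 0.30103;   /* p * log10(2), rounded constant */
}

void print_ranges(void)
{
    printf("float : %d significand bits (~%.1f digits), normalized range %g to %g\n",
           FLT_MANT_DIG, decimal_digits(FLT_MANT_DIG), FLT_MIN, FLT_MAX);
    printf("double: %d significand bits (~%.1f digits), normalized range %g to %g\n",
           DBL_MANT_DIG, decimal_digits(DBL_MANT_DIG), DBL_MIN, DBL_MAX);
#ifdef FLT_TRUE_MIN
    /* Smallest subnormal values, where the implementation defines them. */
    printf("smallest subnormal float: %g, double: %g\n",
           FLT_TRUE_MIN, DBL_TRUE_MIN);
#endif
}
```

On a system where `float` is Binary32 and `double` is Binary64, the significand widths are 24 and 53 bits (including the hidden bit), which is where "about 7" and "about 15" digits come from.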

Show the Bits in Binary32
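One way to show the bits of a Binary32 value in C is to copy the `float` into a 32-bit integer and test each bit from the most significant end. This is a sketch assuming `float` is Binary32; `binary32_bits` is a hypothetical name, and the caller supplies the output buffer.

```c
#include <stdint.h>
#include <string.h>

/* Write the 32 bits of f as "s eeeeeeee fffffffffffffffffffffff" into out.
   out must hold at least 35 bytes: 32 bits, 2 spaces and the terminator. */
void binary32_bits(float f, char out[35])
{
    uint32_t bits;
    int pos = 0;

    memcpy(&bits, &f, sizeof bits);          /* assumes float is 32 bits */
    for (int i = 31; i >= 0; i--) {
        out[pos++] = ((bits >> i) & 1u) ? '1' : '0';
        if (i == 31 || i == 23)              /* after sign, after exponent */
            out[pos++] = ' ';
    }
    out[pos] = '\0';
}
```

Using `memcpy` rather than a pointer cast avoids aliasing problems, and working on the integer value rather than on individual bytes makes the output independent of the machine's byte order.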

Intel Processors "Endian"-ness
- Intel, AMD processors are "Little-endian"
  - Core i7, Opteron, etc.
- Little-endian: Least Significant Byte (LSB) stored in lowest memory address
- Big-endian: LSB stored in highest memory address
  - Most Significant Byte (MSB) stored in lowest memory address
- Multi-byte values are affected by the endianness
  - That's everything except single-byte characters

A routine to inspect endianness
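A common way to inspect endianness is to store a known multi-byte integer and look at the byte at its lowest address. The function name below is hypothetical, and the sketch distinguishes only little-endian from everything else.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the least significant byte of a multi-byte integer is
   stored at the lowest address (little-endian), 0 otherwise. */
int is_little_endian(void)
{
    uint32_t probe = 1;        /* bytes: 00 00 00 01, most to least significant */
    unsigned char first;

    memcpy(&first, &probe, 1); /* byte at the lowest address */
    return first == 1;
}
```

On Intel and AMD processors this is expected to return 1.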

Floating-Point -1.0 in 32-bit Intel Memory
- Memory address: contents
  - 0x7fff1b5a4360: 0x00 = 0000_0000
  - 0x7fff1b5a4361: 0x00 = 0000_0000
  - 0x7fff1b5a4362: 0x80 = 1000_0000
  - 0x7fff1b5a4363: 0xbf = 1011_1111
- f.p. value -1.0 is 1 7f 000000 (sign, exponent, significand)
  - in binary: sign 1, exponent 0111_1111, significand all 0 bits

Floating-Point -2.0 in 32-bit Intel Memory
- Memory address: contents
  - 0x7fff1b5a4360: 0x00 = 0000_0000
  - 0x7fff1b5a4361: 0x00 = 0000_0000
  - 0x7fff1b5a4362: 0x00 = 0000_0000
  - 0x7fff1b5a4363: 0xc0 = 1100_0000
- f.p. value -2.0 is 1 80 000000 (sign, exponent, significand)
  - in binary: sign 1, exponent 1000_0000, significand all 0 bits

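Floating-Point 8.5 in 32-bit Intel Memory
- Memory address: contents
  - 0x7fff1b5a4360: 0x00 = 0000_0000
  - 0x7fff1b5a4361: 0x00 = 0000_0000
  - 0x7fff1b5a4362: 0x08 = 0000_1000
  - 0x7fff1b5a4363: 0x41 = 0100_0001
- f.p. value 8.5 is 0 82 080000 (sign, exponent, significand)
  - in binary: sign 0, exponent 1000_0010, significand 0000_1000_0000_0000_0000_0000 (23 bits, shown with a leading 0 to fill 24 bits)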

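Floating-Point 8.99 in 32-bit Intel Memory
- Memory address: contents
  - 0x7fff1b5a4360: 0x0a = 0000_1010
  - 0x7fff1b5a4361: 0xd7 = 1101_0111
  - 0x7fff1b5a4362: 0x0f = 0000_1111
  - 0x7fff1b5a4363: 0x41 = 0100_0001
- f.p. value 8.99 is 0 82 0fd70a (sign, exponent, significand)
  - in binary: sign 0, exponent 1000_0010, significand 0000_1111_1101_0111_0000_1010 (23 bits, shown with a leading 0 to fill 24 bits)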
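Memory listings like these can be produced by copying a `float` into a byte array and printing the bytes in increasing address order. The sketch below assumes a 4-byte `float`; `float_bytes` and `dump_float` are hypothetical names, and it prints offsets from the start of the value rather than real addresses.

```c
#include <stdio.h>
#include <string.h>

/* Copy the bytes of f into out[0..3] in increasing address order.
   Assumes sizeof(float) == 4. */
void float_bytes(float f, unsigned char out[4])
{
    memcpy(out, &f, 4);
}

/* Print each byte of f with its offset from the lowest address. */
void dump_float(float f)
{
    unsigned char b[4];

    float_bytes(f, b);
    for (int i = 0; i < 4; i++)
        printf("offset +%d: 0x%02x\n", i, b[i]);
}
```

On a little-endian machine, `dump_float(-1.0f)` lists the bytes in the order 0x00, 0x00, 0x80, 0xbf, the reverse of how the value 0xbf800000 is normally written.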