Floating-Point Arithmetic


ENEE446 Lectures 4/10-15/08. A. Yavuz Oruç, Professor, UMD, College Park. Copyright 2007 A. Yavuz Oruç. All rights reserved.

Integer or fixed-point arithmetic provides a complete representation over a domain of integers or fixed-point numbers, but it is inadequate for representing the extreme ranges of real numbers. Example: With 4 bits we can represent the following sets of numbers, among many others:

{0, 1/16, 2/16, ..., 15/16} -- all sixteenths below 1
{0, 1/8, 2/8, ..., 7/8, 1, 1+1/8, 1+2/8, ..., 1+7/8} -- all eighths below 2
{0, 1/4, 2/4, 3/4, 1, 1+1/4, ..., 3, 3+1/4, 3+2/4, 3+3/4} -- all quarters below 4
{0, 1/2, 1, 1+1/2, 2, 2+1/2, ..., 7, 7+1/2} -- all halves below 8
{0, 1, 2, ..., 15} -- all integers up to 15

So we can represent numbers in any range, but we are always limited to 2^n numbers.

With a floating-point number system, we can represent very large numbers and very small numbers together. We use scientific notation:

u = ±m_u × b^(x_u)

where m_u is a p-digit number called the mantissa or significand, x_u is a k-digit number called the exponent, and b ≥ 2 is called the base.

The mantissa provides the precision or resolution of a floating-point number system, whereas the exponent gives its range. Example: With p = 10, k = 20 and b = 10, and assuming that mantissas are sign-magnitude decimal fractions and exponents are decimal integers, we can represent numbers in the interval

[-(1 - 10^-10) × 10^20, (1 - 10^-10) × 10^20]

In this representation, the least and most positive numbers are 10^-10 and (1 - 10^-10) × 10^20, and the least and most negative numbers are -10^-10 and -(1 - 10^-10) × 10^20.

In nearly all modern processors, m_u is a binary fraction, x_u is a binary exponent, and the base is b = 2. Very often m_u is normalized so that it lies between 1 and 2 (excluding 2). If mantissas are expressed in sign-magnitude notation, this means they always begin with a 1 followed by the binary point, as in 1.001101 or 1.111101100. In some representations, the 1 on the left of the binary point is removed from the notation and is called a hidden bit. (The hidden bit is always 1 for sign-magnitude mantissas.)

Machine Representation of Floating-Point Numbers

A floating-point number is stored as a sign bit S, a k-bit biased exponent X, and a p-bit mantissa M with a hidden bit of 1:

S | X (k-bit biased exponent) | M (p-bit mantissa, hidden bit = 1)

The true exponent x is found by subtracting a fixed number from the biased exponent X. This fixed number is called the bias. For a k-bit exponent the bias is 2^(k-1) - 1, and the true exponent x and X are related by

x = X - (2^(k-1) - 1)

Example: For k = 3, x = X - (2^(3-1) - 1) = X - 3:

X (biased)   x (2's complement)   algebraic value
000          101                  -3
001          110                  -2
010          111                  -1
011          000                   0
100          001                   1
101          010                   2
110          011                   3
111          100                   4
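As a quick check, the bias conversion can be written directly in C (an illustrative sketch, not part of the original notes; the function name is ours):

    #include <stdio.h>

    /* True exponent from a k-bit biased exponent: x = X - (2^(k-1) - 1). */
    int true_exponent(unsigned X, int k) {
        int bias = (1 << (k - 1)) - 1;
        return (int)X - bias;
    }

    int main(void) {
        /* Reproduces the k = 3 table above: X = 0..7 maps to x = -3..4. */
        for (unsigned X = 0; X < 8; X++)
            printf("X = %u -> x = %d\n", X, true_exponent(X, 3));
        return 0;
    }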

Example: With p = 2, k = 2, and a 1-bit sign, we have 32 floating-point numbers with biased exponents, as shown in the table below.

S = 0:
exponent -1 (denormalized): 0 00 00 = 0     0 00 01 = 1/8    0 00 10 = 1/4    0 00 11 = 3/8
exponent 0:                 0 01 00 = 1     0 01 01 = 5/4    0 01 10 = 3/2    0 01 11 = 7/4
exponent 1:                 0 10 00 = 2     0 10 01 = 5/2    0 10 10 = 3      0 10 11 = 7/2
exponent 2:                 0 11 00 = 4     0 11 01 = 5      0 11 10 = 6      0 11 11 = 7

S = 1:
exponent -1 (denormalized): 1 00 00 = -0    1 00 01 = -1/8   1 00 10 = -1/4   1 00 11 = -3/8
exponent 0:                 1 01 00 = -1    1 01 01 = -5/4   1 01 10 = -3/2   1 01 11 = -7/4
exponent 1:                 1 10 00 = -2    1 10 01 = -5/2   1 10 10 = -3     1 10 11 = -7/2
exponent 2:                 1 11 00 = -4    1 11 01 = -5     1 11 10 = -6     1 11 11 = -7

With a k-bit biased exponent and a p-bit mantissa, the most positive and most negative representable numbers are

±2^(2^(k-1)) × (1 - 1/2^p) without a hidden bit,
±2^(2^(k-1)) × (2 - 1/2^p) with a hidden bit.

Typical allocations of bits between the mantissa and exponent parts are shown below. (The last two rows are the IEEE-754 standard formats for single and double precision floating-point arithmetic.)

Representation size   Sign   Exponent   Mantissa
8                     1      2          5
16                    1      4          11
32                    1      8          23
64                    1      11         52

Precision of a Floating-Point Representation

In the IEEE-754 single precision floating-point representation, the mantissa is 23 bits long. This means that no two mantissas in this representation can be closer than 1/2^23 = 1.1920928955078125 × 10^-7. In double precision, this spacing reduces to 1/2^52 = 2.220446049250313080847263336181640625 × 10^-16. Given that 2 is a factor of 10, both binary fractions have an exact representation in decimal.

This can be seen by writing 1/2^p = 5^p / 10^p. Hence we can compute 5^p as a ⌈p × log10(5)⌉-digit decimal number, and then divide it by 10^p by shifting the radix point p places to the left, where p is 23 or 52. Indeed, the number of digits in each of the representations above is ⌈23 × log10(5)⌉ = 17 and ⌈52 × log10(5)⌉ = 37.
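This is easy to verify numerically. The short C program below (an illustrative sketch, not from the notes) prints 2^-23 from a double; since the value is a binary fraction, a double holds it exactly, and printing enough digits recovers its exact 17-significant-digit decimal expansion:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* 2^-23 is exactly representable in a double, so this prints its
           exact decimal expansion: 0.000000119209289550781250000000 */
        printf("%.30f\n", ldexp(1.0, -23));
        return 0;
    }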

The least and most positive and negative representations in the IEEE-754 single precision floating-point format (with the hidden bit of 1) are:

Most positive:  0 11111110 11111111111111111111111 = +(2 - 1/2^23) × 2^127 = +(1 - 1/2^24) × 2^128
Most negative:  1 11111110 11111111111111111111111 = -(2 - 1/2^23) × 2^127 = -(1 - 1/2^24) × 2^128
Least positive (denormalized): 0 00000000 00000000000000000000001 = +(1/2^23) × 2^-127 = +2^-150
Least negative (denormalized): 1 00000000 00000000000000000000001 = -(1/2^23) × 2^-127 = -2^-150

The exponent 11111111 is reserved to represent extreme values such as ∞, 0/0, ∞/∞, etc.

IEEE-754 Normalized and Denormalized Numbers

[Number-line diagrams: the denormalized numbers lie between -1/2^(p-1) and +1/2^(p-1), spaced 1/2^p apart around ±0; the normalized numbers extend outward from ±1; taken together, the normalized and denormalized numbers cover the line from near zero out to the largest representable magnitudes with no gap around zero.]

Extreme Numbers in Floating-Point Number Systems

In floating-point computations, besides the problem of precision, two other kinds of errors arise when results are either too large (overflow) or too small (underflow). Any result greater than the largest representable number is converted to +∞, and any positive result smaller than the least positive representable number is truncated to +0. Likewise, any result less than the most negative representable number is converted to -∞, and any negative result greater than the least negative representable number is converted to -0.

In mathematics, ∞ is used to represent a number that is greater than all real numbers. It is the limit point of the real numbers as they grow arbitrarily large, and it represents an arbitrarily large value rather than a specific value, as a finite real number would. For example, u^2 and u^3 both tend to ∞ as u becomes arbitrarily large, even though u^3 > u^2 for all u > 1. In real arithmetic, we also encounter numbers and/or computations such as 0/0, ∞ - ∞, 0 × ∞, and ∞/∞. Ratios such as 0/0 or ∞/∞ arise in the limit of computations such as (u - 1)/(u^3 - 1) as u tends to 1 or ∞. We can also get 0 × ∞ when we try to compute u × (1/u) as u tends to 0.

NaNs, QNaNs and SNaNs

Floating-point number systems set aside certain binary patterns to represent ∞ and other undefined expressions and values that involve ∞. In the IEEE-754 floating-point number system, the exponent 11111111 is reserved to represent undefined values such as ∞, 0/0, ∞ - ∞, 0 × ∞, and ∞/∞. The last four cases are referred to as Not-a-Number (NaN) and represent the outcomes of undefined real-number operations. These special values are represented by setting X to 2^k - 1, or equivalently x to 2^(k-1).

The mantissa of the representation is used to distinguish between ∞ and NaNs. If M = 0 and X = 2^k - 1, then the representation denotes ∞. If M ≠ 0 and X = 2^k - 1, then the representation is a NaN. In all of these special representations, the sign bit distinguishes positive and negative versions of these values, i.e., +0, -0, +∞, -∞, +NaN, -NaN. The NaNs are further refined into quiet NaNs (QNaNs) and signaling NaNs (SNaNs). A QNaN is designated by setting the most significant bit of the mantissa, and an SNaN by clearing the same bit. QNaNs can be tolerated during the course of a floating-point computation, whereas an SNaN forces the processor to signal an invalid operation, as in the division of 0 by 0.

Example: The numbers in the first row below represent +∞ and -∞, respectively, and those in the second row represent NaNs in a 16-bit floating-point number system:

0 1111 00000000000 = +∞         1 1111 00000000000 = -∞
0 1111 00000001000 = +SNaN      1 1111 10100010011 = -QNaN
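The same encodings can be probed in C for the standard IEEE-754 single-precision format (an illustrative sketch; the helper name is ours, and note that merely loading a signaling-NaN pattern may quiet it on some platforms):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    /* Reinterpret a 32-bit pattern as an IEEE-754 single. */
    static float from_bits(uint32_t w) {
        float f;
        memcpy(&f, &w, sizeof f);
        return f;
    }

    int main(void) {
        float pinf = from_bits(0x7F800000u); /* X all ones, M = 0:  +infinity */
        float ninf = from_bits(0xFF800000u); /* sign = 1,   M = 0:  -infinity */
        float qnan = from_bits(0x7FC00000u); /* M != 0, MSB of M set:   QNaN  */
        float snan = from_bits(0x7F800001u); /* M != 0, MSB of M clear: SNaN  */
        printf("%f %f\n", pinf, ninf);
        printf("isinf: %d %d  isnan: %d %d\n",
               isinf(pinf) != 0, isinf(ninf) != 0,
               isnan(qnan), isnan(snan));
        return 0;
    }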

Approximation of Real Numbers by Floating-Point Numbers

As p gets large, the distance between consecutive mantissas gets smaller, tending to 0 as p tends to ∞. However, regardless of how large p becomes, not all decimal fractions can be represented in a binary mantissa format. For example, any decimal fraction that includes 2^-s in its binary expansion, where s > p, cannot be represented in p bits. But this is not the end of the story: a whole family of other numbers cannot be represented either, even though they are greater than 2^-p.

In fact, for any decimal fraction d to have an exact binary mantissa representation in p bits, 2^p × d must be an integer, since

d = m_{p-1}/2 + m_{p-2}/2^2 + ... + m_0/2^p

if and only if

2^p × d = 2^(p-1) × m_{p-1} + 2^(p-2) × m_{p-2} + ... + m_0,

and the left-hand side of this equation must be an integer for the equation to hold, since the right-hand side is an integer.

Now suppose that d is an r-digit decimal fraction and it has an exact representation in p bits. It is easy to show that r ≤ p, and by the argument above,

2^p × d = (2^p × d × 10^r)/10^r = 2^(p-r) × (d × 10^r)/5^r

must be an integer. This implies that 5^r must evenly divide d × 10^r, or equivalently that d × 2^r must be an integer, since 5 is relatively prime to 2 and so 5^r cannot divide 2^(p-r). Conversely, it can be shown that if r ≤ p, d < 1, and 5^r evenly divides d × 10^r (equivalently, d × 2^r is an integer), then d must have an exact representation in p bits.

For example, 0.125 can be represented exactly in p = 3 bits, since 0.125 × 2^3 = 1 is an integer and r = 3 ≤ p. By the same token, all multiples of 0.125 that can be written in three or fewer digits can be represented exactly by a 3-bit mantissa. These are 0.125, 0.25, 0.375, 0.5, 0.625, 0.750, and 0.875. No other decimal fraction can be represented by a 3-bit mantissa. Likewise, when p = 4, only the integral multiples of the decimal fraction 0.0625 can be represented by 4-bit mantissas, since 5^4 = 625 evenly divides only 10^4 × 0.0625 = 625 and its integral multiples. Clearly, there are exactly 15 such proper fractions (excluding 0) when p = 4.

In general, it is easy to verify that, excluding 0, only the 2^p - 1 multiples of the fraction 1/2^p can be represented in p bits:

1/2^p, 2/2^p, 3/2^p, ..., (2^p - 2)/2^p, (2^p - 1)/2^p

These are all the fractions that can be represented in p bits.
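The divisibility argument above translates directly into a test. The sketch below (illustrative, not from the notes) represents the r-digit decimal fraction d as the integer n = d × 10^r and checks whether d is exactly representable with a p-bit mantissa:

    #include <stdio.h>
    #include <stdint.h>

    /* d = n / 10^r is exact in p bits iff r <= p and 5^r divides n
       (equivalently, d * 2^r -- and hence 2^p * d -- is an integer).
       Unsigned 64-bit arithmetic; keep r modest to avoid overflow. */
    int exact_in_p_bits(uint64_t n, int r, int p) {
        if (r > p) return 0;
        uint64_t pow5 = 1;
        for (int i = 0; i < r; i++) pow5 *= 5;
        return n % pow5 == 0;
    }

    int main(void) {
        /* 0.125 (n = 125, r = 3) is exact in 3 bits; 0.1 (n = 1, r = 1)
           is exact in no number of bits, e.g. not even in 24. */
        printf("%d %d\n", exact_in_p_bits(125, 3, 3),
                          exact_in_p_bits(1, 1, 24));
        return 0;
    }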

For each fraction m with an exact representation in p bits, there is an infinite set of numbers in the open interval (m - 1/2^p, m). None of these has an exact representation in p bits. Each such number must therefore be approximated one way or another, and the most natural choices are the boundary points of the interval.

This is because one of the boundary points is closer to the number being approximated than any other representable number. If, for a number m_u in this interval,

m_u - (m - 1/2^p) < m - m_u, i.e., m - 1/2^p < m_u < m - 1/2^(p+1),

then m_u is closer to m - 1/2^p than it is to m, and it should be approximated by m - 1/2^p. On the other hand, if

m - 1/2^(p+1) < m_u < m,

then m_u is closer to m, and it should be approximated by m. Finally, if

m_u = m - 1/2^(p+1),

then it can be approximated by either of the endpoints.

Example 2.1. Let p = 8 and m = 12/16 = 0.75. In binary, m has an exact representation, namely (0.11000000)_2. Now consider the numbers in the interval (12/16 - 1/256, 12/16). None of these numbers has an exact representation with an 8-bit mantissa. One such number is 12/16 - 1/256 + 3/1024 = 12/16 - 1/1024, which is clearly greater than the midpoint 12/16 - 1/512. Therefore, it should be approximated by 12/16.

The process of approximating a floating-point number is often carried out by rounding or truncating it. In both cases, digits outside the available number of digits are removed from the representation. When a (p+r)-bit mantissa is rounded to a p-bit mantissa, we add 1/2^p to it if the (p+1)st bit is 1 and then drop the last r bits; if that bit is 0, we simply drop the last r bits. When the mantissa is truncated, we drop the rightmost r bits unconditionally.
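Both policies are one-liners on integer mantissas. The sketch below (illustrative; it assumes the (p+r)-bit mantissa sits in a 64-bit unsigned integer) implements the rounding rule just described and plain truncation:

    #include <stdint.h>

    /* Truncate a (p+r)-bit mantissa to p bits: drop the low r bits. */
    uint64_t truncate_mantissa(uint64_t m, int r) {
        return m >> r;
    }

    /* Round to p bits per the rule above: if the (p+1)st bit (the first
       bit to be dropped) is 1, add one unit in the last kept place;
       adding 1/2^p before the shift equals adding 1 after it. */
    uint64_t round_mantissa(uint64_t m, int r) {
        uint64_t guard = (m >> (r - 1)) & 1u;
        return (m >> r) + guard;
    }

For the example that follows, round_mantissa(0x2FF, 2) gives (0.11000000)_2 = 192/256, while truncate_mantissa(0x2FF, 2) gives (0.10111111)_2 = 191/256.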

In the example above, the 10-bit fraction m_u = (0.1011111111)_2 = 12/16 - 1/256 + 3/1024 is approximated by the 8-bit fraction (0.11000000)_2, which represents 12/16. This amounts to rounding rather than truncation: the 8-bit fraction is obtained by adding (0.00000001)_2 to (0.1011111111)_2 and dropping the last two bits.

Approximating m_u by truncation would instead give (0.10111111)_2, with the last two bits removed and the rest of the digits unaltered. This yields 12/16 - 1/256, which is clearly not the closest 8-bit fraction to m_u in this case. On the other hand, if m_u = 12/16 - 1/512, i.e., m_u = (0.1011111110)_2, then m_u lies exactly in the middle of the interval (12/16 - 1/256, 12/16). Rounding approximates it by 12/16, and truncation carries it to 12/16 - 1/256; in this case, both approximations are equally far from m_u.

In general, rounding a real number always yields the closest representable floating-point number, except when the number lies at an equal distance between the two endpoints of the interval into which it falls; in that case, truncation is as precise as rounding. In rounding decimal numbers, this happens when the digit to be dropped is 5; by convention, the number is rounded up, so, for example, 49.5 is rounded to 50 rather than 49. Truncating it would give 49, which is as far from 49.5 as 50 is. Rounding or truncating a number introduces computational error into an operation. These errors are usually unavoidable and can have significant undesirable effects on the result of a computation.

Example: Consider the machine numbers

0 1101 10000000000 = (1.1)_2 × 2^6 = 96,
0 1101 10000000001 = (1.10000000001)_2 × 2^6 = 96.03125.

These representations are "adjacent": with an 11-bit mantissa, we cannot represent any number between 96 and 96.03125. Now suppose we want to add 1000 fractions to 96, all of them less than 0.03125, say around 0.02. If each fraction is added to 96 one after another, the result of the first addition will be about 96.02, but it will be truncated back to 96 with an 11-bit mantissa. Adding the second, the third, and all subsequent fractions likewise has no effect, so the result of the computation is 96, whereas the correct result should be about 96 + 20 = 116. Therefore, care should be taken when adding fractions or small numbers to large numbers. In this example, a result much closer to 116 can be obtained by first summing the thousand fractions and then adding this sum to 96.
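The same effect is easy to reproduce in IEEE-754 single precision, where the representable numbers just above 2^24 are spaced 2 apart (a minimal sketch with illustrative values, not the 11-bit format of the example):

    #include <stdio.h>

    int main(void) {
        float big = 16777216.0f;  /* 2^24: floats above it are 2 apart */
        float naive = big;
        for (int i = 0; i < 1000; i++)
            naive += 0.4f;        /* each sum rounds back to 2^24      */

        float sum = 0.0f;
        for (int i = 0; i < 1000; i++)
            sum += 0.4f;          /* about 400, accumulated separately */
        float better = big + sum;

        printf("naive  = %.1f\n", naive);   /* 16777216.0       */
        printf("better = %.1f\n", better);  /* about 16777616.0 */
        return 0;
    }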

2's Complement Floating-Point Number Systems

Most processors use a sign-magnitude representation for the mantissas of floating-point numbers. Instead, one can use 1's or 2's complement notation, as in fixed-point numbers, to represent signed mantissas. This makes the subtraction of mantissas easier to handle. Determining the value of a floating-point number with a 2's complement mantissa is only slightly more complex. If the sign bit of the mantissa is 0, the value of the number is the same as if the mantissa were expressed in sign-magnitude notation. When the leading bit is 1, the number is negative, and its magnitude is determined by complementing its bits and adding 1/2^p, where p is the number of bits in the mantissa part of the number.

Example: Consider the floating-point number 101011.01110111 in 2's complement notation. Its value is determined by complementing the bits and adding 0.00000001:

-(010100.10001000 + 0.00000001)_2 = -(010100.10001001)_2 = -(20.53515625)_10

Floating-Point Addition and Subtraction

When adding or subtracting two floating-point numbers, we must first align their exponents. This is done by shifting the mantissa of the number with the smaller exponent to the right, increasing its exponent until it equals the exponent of the other number. After the exponents are aligned, the operation (either addition or subtraction) is performed on the two mantissas, and the larger exponent becomes the exponent of the result. The final step is to shift the mantissa and increase or decrease the exponent so that the mantissa is in normalized form.

Example: Let u = 5.0 and v = 1.25 be represented as 16-bit floating-point numbers with a 4-bit biased exponent and an 11-bit sign-magnitude mantissa with a hidden bit. Let M_u, M_v, M_r denote the mantissas of u, v, and u - v, and let E_u, E_v, E_r denote their biased exponents. The difference u - v is computed as follows:

Step 1: Express u and v as floating-point numbers: u = 5.0 = (1.01)_2 × 2^2 = 0 1001 01000000000; v = 1.25 = (1.01)_2 × 2^0 = 0 0111 01000000000.
Step 2: Align: E_u - E_v = 2, so shift M_v right twice: 1.01000000000 becomes 0.01010000000; E_r = E_u = 1001.
Step 3: Subtract the mantissas: 1.01000000000 - 0.01010000000 = 0.11110000000.
Step 4: Normalize by shifting left once and decrementing the exponent: M_r = 1.11100000000, E_r = 1000.
Step 5: Combine the sign, E_r, M_r: u - v = 0 1000 11100000000 = (1.111)_2 × 2^1 = 3.75.

Design of a Floating-Point Adder/Subtractor

[Block diagram: M_u, M_v, S_u, S_v and the add/sub select enter bus and function-select logic driving an a-bus and b-bus into a (p+1)-bit complementer and a (p+1)-bit adder; E_u and E_v feed the alignment and shift logic and a k-bit adder with exponent-correction logic; control logic sequences (1) alignment, (2) mantissa addition/subtraction, (3) exponent computation, and (4) normalization, driven by the clock, producing S_r and M_r.]

Algorithm 2.1 (sign-magnitude floating-point addition)
{
    // Add the hidden bits to M_u and M_v if they are not denormalized
    if (E_u != 0) M_u = 1 + M_u;
    if (E_v != 0) M_v = 1 + M_v;

    // Align
    if (E_u > E_v)      { M_v = M_v * 2^(E_v - E_u); E_r = E_u; }
    else if (E_u < E_v) { M_u = M_u * 2^(E_u - E_v); E_r = E_v; }
    else E_r = E_u;

    // Add if operation = 0, subtract if operation = 1
    switch (S_u) {
    case 0:
        switch (S_v) {
        case 0:
            switch (operation) {
            case 0: M_r = M_u + M_v;       break;
            case 1: M_r = M_u + ~M_v + 1;  break;
            } break;
        case 1:
            switch (operation) {
            case 0: M_r = M_u + ~M_v + 1;  break;
            case 1: M_r = M_u + M_v;       break;
            } break;
        } break;
    case 1:
        switch (S_v) {
        case 0:
            switch (operation) {
            case 0: M_r = ~M_u + 1 + M_v;       break;
            case 1: M_r = ~M_u + 1 + ~M_v + 1;  break;
            } break;
        case 1:
            switch (operation) {
            case 0: M_r = ~M_u + 1 + ~M_v + 1;  break;
            case 1: M_r = ~M_u + 1 + M_v;       break;
            } break;
        } break;
    }

    // Normalize into [1, 2)
    if (M_r >= 2) { M_r = M_r / 2; E_r = E_r + 1; e_r = e_r + 1; }
    else while (M_r < 1) { M_r = M_r * 2; E_r = E_r - 1; e_r = e_r - 1; }

    // Overflow check (k = 8, p = 23 assumed here)
    if (E_r < 256) F = 0;
    else { F = 1; E_r = 255; e_r = 128; M_r = 2^23 - 1; }

    // Set the sign bit and magnitude
    if (M_r > 0) S_r = 0; else { S_r = 1; M_r = ~M_r + 1; }
}

Floating-Point Multiplication and Division

When multiplying or dividing two floating-point numbers, the exponents and mantissas are again treated separately. Unlike floating-point addition/subtraction, multiplication and division do not require aligning the exponents. The mantissas are simply multiplied (or divided), and the exponents are added (or subtracted). If the sign bits of the two numbers are the same, the resulting sign bit is 0; otherwise it is set to 1. Finally, the resulting number is normalized by shifting the mantissa and adjusting the exponent of the result, if needed.

In the case of multiplication, the biased exponent of the product must be corrected, since adding two biased exponents introduces an extra bias. That is, when two floating-point numbers u and v are multiplied, adding their biased exponents E_u = e_u + 2^(k-1) - 1 and E_v = e_v + 2^(k-1) - 1 results in

E_u + E_v = e_u + e_v + 2 × (2^(k-1) - 1)

This must be corrected by subtracting 2^(k-1) - 1 from it. In contrast, when two floating-point numbers u and v are divided, subtracting their exponents results in

E_u - E_v = e_u + 2^(k-1) - 1 - e_v - 2^(k-1) + 1 = e_u - e_v

Therefore, in this case, we need to add 2^(k-1) - 1 to restore the bias. These extra steps can be carried out concurrently while the mantissas are being multiplied or divided, since the exponent of the result is not needed in the computation of the mantissas. All of these ideas are formalized in the algorithm below:

Algorithm (floating-point multiplication/division)
{
    // u and v are sign-magnitude, biased-exponent floating-point numbers.
    // operation = 0 selects multiplication; operation = 1 selects division.

    // Add the hidden bit to M_u and M_v
    M_u = 1 + M_u;
    M_v = 1 + M_v;

    switch (operation) {
    case 0: M_r = M_u * M_v; E_r = E_u + E_v - 2^(k-1) + 1; break;
    case 1: M_r = M_u / M_v; E_r = E_u - E_v + 2^(k-1) - 1; break;
    }

    // Set the sign bit
    if (S_u == S_v) S_r = 0; else S_r = 1;

    // Normalize into [1, 2)
    if (M_r >= 2) { M_r = M_r / 2; E_r = E_r + 1; e_r = e_r + 1; }
    else while (M_r < 1) { M_r = M_r * 2; E_r = E_r - 1; e_r = e_r - 1; }

    // Overflow check (k = 8, p = 23 assumed here)
    if (E_r < 256) F = 0;
    else { F = 1; E_r = 255; e_r = 128; M_r = 2^23 - 1; }
}

The multiplication and division steps in this algorithm are left unspecified and can be carried out using any of the multiplication and division algorithms we described for integer operands. In the case of multiplication, the product of two p-bit mantissas is given by the expression

(1 + M_1/2^p) × (1 + M_2/2^p) = (2^p + M_1) × (2^p + M_2) / 2^(2p)

So, effectively, we multiply two (p+1)-bit integers to obtain a 2(p+1)-bit product, and then divide the product by 2^(2p). The division by 2^(2p) amounts to moving the binary point from the right-hand side of the rightmost bit of the product to the right of its second leftmost bit.

Furthermore, only the highest p+1 bits of the 2(p+1)-bit product are retained, since the precision of the representation is limited to p+1 bits. This makes it unnecessary to compute the lower p+1 bits of the product, which come from the multiplication of the lower (p+1)/2 bits of the two mantissas.

Example: Let u = -6.5 and v = 3.5 be represented as 16-bit floating-point numbers with a 4-bit biased exponent and an 11-bit sign-magnitude mantissa with a hidden bit. The product u × v is computed as follows:

Step 1: Express u and v as floating-point numbers: u = 1 1001 10100000000, v = 0 1000 11000000000.
Step 2: Compute the exponent E_r = E_u + E_v - 2^3 + 1: E_r = 1001 + 1000 - 0111 = 1010.
Step 3: Compute the mantissa M_r = (1 + M_u) × (1 + M_v): (1.10100000000)_2 × (1.11000000000)_2 = (10.1101100000)_2.
Step 4: Normalize M_r by shifting it right once: M_r = 1.01101100000, stored as 01101100000.
Step 5: Adjust the exponent by incrementing it by 1: E_r = 1011.
Step 6: Combine the sign, S_r, E_r, M_r: u × v = 1 1011 01101100000 (= -22.75).

The mantissa multiplication works out as follows, splitting each 12-bit mantissa into 6-bit halves:

(110100 000000) × (111000 000000)
  = (110100 × 111000) 000000000000
  + (110100 × 000000 + 000000 × 111000) 000000
  + (000000 × 000000)    (ignored: not just because it is 0, but also because it falls outside the 12-bit representation)
  = (52 × 56)/1024 = 2912/1024 = (2048 + 512 + 256 + 64 + 32)/1024 = (10.1101100000)_2
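For the toy 16-bit format of this example (1 sign bit, 4-bit biased exponent with bias 7, 11-bit mantissa with a hidden bit), the whole multiplication algorithm condenses to a few lines of C. This is an illustrative sketch that truncates the product and ignores overflow and denormalized operands:

    #include <stdio.h>
    #include <stdint.h>

    /* Multiply two numbers in the toy format s(1) | E(4, bias 7) | M(11). */
    uint16_t fp16_mul(uint16_t u, uint16_t v) {
        uint32_t mu = 0x800u | (u & 0x7FFu);   /* restore hidden bit: 1.M_u */
        uint32_t mv = 0x800u | (v & 0x7FFu);   /* restore hidden bit: 1.M_v */
        int      er = (int)((u >> 11) & 0xF) + (int)((v >> 11) & 0xF) - 7;
        uint32_t sr = ((u ^ v) >> 15) & 1u;    /* sign = XOR of sign bits   */

        uint32_t mr = (mu * mv) >> 11;         /* product, scaled by 2^11   */
        if (mr >= 0x1000u) { mr >>= 1; er++; } /* normalize into [1, 2)     */

        return (uint16_t)((sr << 15) | ((uint32_t)er << 11) | (mr & 0x7FFu));
    }

    int main(void) {
        uint16_t u = 0xCD00;  /* 1 1001 10100000000 = -6.5 */
        uint16_t v = 0x4600;  /* 0 1000 11000000000 =  3.5 */
        printf("0x%04X\n", fp16_mul(u, v));  /* 0xDB60 = 1 1011 01101100000 = -22.75 */
        return 0;
    }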

Division works similarly, with the following formula:

(1 + M_1/2^p) / (1 + M_2/2^p) = (2^p + M_1) / (2^p + M_2)

Again, the division of mantissas reduces to the division of two (p+1)-bit integers, and any of the division algorithms can be used to carry it out. The (p+1)-bit quotient obtained in the division becomes the mantissa of the result, and since the division involves two numbers between 2^p and 2^(p+1), any mantissa resulting from the division of two normalized numbers is always between 1/2 and 2.

Let u = -6.5 and v = 3.5 be represented as 16-bit floating-point numbers with a 4-bit biased exponent and an 11-bit sign-magnitude mantissa with a hidden bit. The division u / v is computed as follows:

Step 1: Express u and v as floating-point numbers: u = 1 1001 10100000000, v = 0 1000 11000000000.
Step 2: Compute the exponent E_r = E_u - E_v + 2^3 - 1: E_r = 1001 - 1000 + 0111 = 1000.
Step 3: Compute the mantissa M_r = (1 + M_u) / (1 + M_v): 1.10100000000 / 1.11000000000 = 0.11101101101.
Step 4: Normalize M_r by shifting it left once: M_r = 1.1101101101, stored as 11011011010.
Step 5: Adjust the exponent by decrementing it by 1: E_r = 0111.
Step 6: Combine the sign, S_r, E_r, M_r: u / v = 1 0111 11011011010.

The mantissa division works out as follows:

110100000000 / 111000000000 = 11010/11100 = (1/2) × (11010/1110)

Dividing 11010 (the dividend, u) by 1110 (the divisor, v) by shift-and-subtract:

  11010 - 1110 = 1100                  quotient so far: 0.1
  shift: 11000 - 1110 = 1010           quotient: 0.11
  shift: 10100 - 1110 = 0110           quotient: 0.111
  shift twice: 11000 - 1110 = 1010     quotient: 0.11101
  shift: 10100 - 1110 = 0110           quotient: 0.111011
  shift twice: 11000 - 1110 = 1010     quotient: 0.11101101
  shift: 10100 - 1110 = 0110           quotient: 0.111011011
  shift twice: 11000 - 1110 = 1010     quotient: 0.11101101101 = u/v; remainder 1010

We terminate the division after 12 bits, since the number of bits in the mantissa is limited to 12. Moreover, since the remainder is not equal to 0 after the last shift-and-subtract step, the ratio u/v does not have an exact representation in 12 bits. In fact, a closer examination of the process shows that the shift-and-subtract steps enter a repeating pattern once the remainder 0110 is obtained. Therefore, u/v cannot have an exact representation regardless of how many bits we use.

The multiplication and division algorithms can be implemented in hardware using a k-bit 2's complement adder/subtractor, a p-bit multiplier, and a p-bit divider. For multiplication, we can use the compact multiplier hardware described earlier, or design an algorithm that generates only the most significant p bits of the product, since the remaining p bits are discarded.

For division, we can use either restoring or non-restoring division and discard the remainder. Unlike floating-point addition and subtraction, floating-point multiplication and division operations are generally implemented separately in hardware. This stems from the fact that division takes more clock cycles to execute than multiplication. As we will see in subsequent chapters, in processors where several operations can be scheduled for execution in parallel, it is desirable to execute these operations on different hardware units to speed up computations.

Machine Arithmetic in Real Processors

[Table: Motorola integer arithmetic instructions]

PowerPC processors have four multiply and two divide instructions. Multiply instructions provide either the upper or lower half of a 64-bit product when two 32-bit numbers are multiplied. More specifically, in 32-bit mode, the mulhw and mulhwu instructions multiply two 32-bit signed or unsigned operands and store the upper 32 bits of the product in a register. Similarly, the mullw and mulli instructions retain the lower 32 bits of the 64-bit (or 48-bit) product of two 32-bit register operands, or of a 32-bit operand and a 16-bit signed immediate. A full 64-bit product can be obtained with a pair of multiply instructions; for example, mulhw and mullw can be used together to multiply two 32-bit signed numbers into a 64-bit signed product. It is also possible to obtain a 64-bit product using the 64-bit multiplication instructions with 32-bit operands, and these same instructions can be used together to obtain a signed 128-bit product of two 64-bit operands.
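In C, the mulhw/mullw pairing corresponds to taking the high and low halves of a widened product (a generic illustration of the idea, not PowerPC code):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        int32_t a = -123456789, b = 987654321;
        int64_t  p  = (int64_t)a * (int64_t)b;  /* full 64-bit signed product */
        int32_t  hi = (int32_t)(p >> 32);       /* what mulhw would produce   */
        uint32_t lo = (uint32_t)p;              /* what mullw would produce   */
        printf("hi = %d, lo = %u\n", hi, lo);
        printf("reassembled = %lld\n",
               (long long)(((int64_t)hi << 32) | lo));  /* equals p */
        return 0;
    }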

PowerPC's divide instructions divw, divd (signed division) and divwu, divdu (unsigned division) divide a 32- or 64-bit dividend by a 32- or 64-bit divisor to produce a 32- or 64-bit quotient without a remainder. Even though the remainder is not computed by these division instructions, it can be obtained by subtracting the product of the quotient and the divisor from the dividend. Division by 0 is not allowed, and sets the OV (overflow) flag when it is attempted. The OV flag is also set when the divw or divd instruction is used to divide -2^31 or -2^63 by -1 (can you guess why?).
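The remainder-by-reconstruction step, and the overflow hazard just mentioned, look like this in C (a generic sketch; C's / and % operators mirror what the PowerPC instruction sequence computes):

    #include <stdio.h>
    #include <limits.h>

    int main(void) {
        int u = 47, v = 5;
        int q = u / v;       /* quotient, as a divide instruction returns it */
        int r = u - q * v;   /* remainder reconstructed from the quotient    */
        printf("q = %d, r = %d\n", q, r);  /* q = 9, r = 2 */

        /* The overflow case: the true quotient of INT_MIN / -1 is 2^31,
           which does not fit in a 32-bit signed register. In C this is
           undefined behavior, so it is only shown in a comment:
           int bad = INT_MIN / -1; */
        return 0;
    }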

Instruction          Operation                  Comments
faddx, faddsx        f_d = f_a + f_b            Floating-point operands in f_a and f_b are added and stored in f_d.
fsubx, fsubsx        f_d = f_a - f_b            Floating-point operands in f_a and f_b are subtracted and stored in f_d.
fmulx, fmulsx        f_d = f_a × f_b            Floating-point operands in f_a and f_b are multiplied and stored in f_d.
fdivx, fdivsx        f_d = f_a / f_b            Floating-point operand in f_a is divided by that in f_b and stored in f_d.
fmaddx, fmaddsx      f_d = f_a × f_b + f_c      f_a × f_b + f_c is stored in f_d.
fnmaddx, fnmaddsx    f_d = -(f_a × f_b + f_c)   -(f_a × f_b + f_c) is stored in f_d.
fmsubx, fmsubsx      f_d = f_a × f_b - f_c      f_a × f_b - f_c is stored in f_d.
fnmsubx, fnmsubsx    f_d = -(f_a × f_b - f_c)   -(f_a × f_b - f_c) is stored in f_d.
fabs                 f_d = |f_a|                Sign bit of f_a is cleared and the result is stored in f_d.
fnabs                f_d = -|f_a|               Sign bit of f_a is set and the result is stored in f_d.
fneg                 f_d = -f_a                 Sign bit of f_a is inverted and the result is stored in f_d.
fres                 f_d ≈ 1/f_a                Estimate of the reciprocal of f_a is stored in f_d.
fsqrtx, fsqrtsx      f_d = √f_a                 Square root of f_a is stored in f_d.
frsqrtex             f_d ≈ 1/√f_a               Estimate of the reciprocal of the square root of f_a is stored in f_d.

Motorola floating-point arithmetic instructions

Instruction   Operation                                        Comments
add           r/m_d = r/m_d + r/m_s or immediate operand       Operands are added and the result is stored in r/m_d.
adc           r/m_d = r/m_d + r/m_s or immediate + CF          Same as add except that CF (carry) is included in the addition.
sub           r/m_d = r/m_d - r/m_s or immediate operand       Operands are subtracted and the result is stored in r/m_d.
sbb           r/m_d = r/m_d - r/m_s or immediate - CF          Same as sub except that CF is included in the subtraction.
dec           r/m_d = r/m_d - 1                                Decrement the operand in r/m_d and store it in r/m_d.
inc           r/m_d = r/m_d + 1                                Increment the operand in r/m_d and store it in r/m_d.
neg           r/m_d = -r/m_d                                   Negate the operand in r/m_d and store it in r/m_d.
mul           rdx:rax = rax × r/m_s                            An unsigned 128-bit product of the 64-bit operands in rax and r/m_s is stored in the register pair rdx:rax.
imul          rdx:rax = rax × r/m_s; or                        A signed 128-bit product of the 64-bit operands in rax and r/m_s
              r_d = r_d × r/m_s or immediate operand; or       is stored in rdx:rax; the two- and three-operand forms store the
              r_d = r/m_s × immediate operand                  lower 64 bits of the signed product in r_d.
div           rax = quotient[rdx:rax / r/m_s];                 The unsigned 128-bit operand in the register pair rdx:rax is
              rdx = remainder[rdx:rax / r/m_s]                 divided by the 64-bit operand in r/m_s.
idiv          rax = quotient[rdx:rax / r/m_s];                 The signed 128-bit operand in the register pair rdx:rax is
              rdx = remainder[rdx:rax / r/m_s]                 divided by the 64-bit operand in r/m_s.

A subset of Intel 64 architecture integer arithmetic instructions

Like PowerPC processors, Intel 64 processors support both signed and unsigned multiplication and division. The multiplication instructions imul and mul handle signed and unsigned multiplication with a variety of operand combinations, producing 16-, 32-, 64- and 128-bit products of 8-, 16-, 32- and 64-bit operands. Likewise, the idiv and div instructions provide signed and unsigned division of 16-, 32-, 64- and 128-bit dividends by corresponding 8-, 16-, 32- and 64-bit divisors. Results of the multiplication and division instructions are stored in the specialized pair of 64-bit registers rax and rdx, except for some of the signed multiplication instructions.

Intel 64 architecture processors also perform decimal arithmetic using packed and unpacked decimal operands. A packed decimal operand contains 8 decimal digits in 32-bit mode and 16 decimal digits in 64-bit mode. An unpacked decimal operand uses only the lower four bits of each byte, so in 32-bit mode an unpacked decimal operand contains only four decimal digits, and in 64-bit mode it contains eight decimal digits.

Intel 64 architecture processors do not have decimal add or subtract instructions. Instead, they have instructions to convert binary values to packed and unpacked decimals. When two BCD (binary-coded decimal) digits u and v are added as 4-bit binary numbers, a correction is performed by adding 6 to the sum u + v when it exceeds 9. This is because x - 10 (mod 16) = x + 6 (mod 16): -10 and 6 are congruent mod 16, so adding 6 is the same as subtracting 10 in modulo 16. For example,

(7 + 8)_10 = (0111 + 1000 + 0110)_BCD = (1 0101)_BCD = (15)_10
(9 + 8)_10 = (1001 + 1000 + 0110)_BCD = (1 0111)_BCD = (17)_10

Similarly, when subtracting two decimal digits in binary, if the subtraction u - v produces a borrow (i.e., u < v), the result must be decreased by 6. For example,

(7 - 5)_10 = (0111 - 0101)_BCD = (0010)_BCD = (2)_10        (no correction)
(7 - 8)_10 = (0111 - 1000 - 0110)_BCD = (1001)_BCD, i.e., digit 9 with a borrow, representing (-1)_10
(5 - 7)_10 = (0101 - 0111 - 0110)_BCD = (1000)_BCD, i.e., digit 8 with a borrow, representing (-2)_10
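A one-digit BCD adder with the +6 correction can be sketched in C as follows (illustrative; the decimal carry is handled as an extra in/out bit):

    #include <stdio.h>

    /* Add two BCD digits as 4-bit binary values, applying the +6
       correction when the binary sum exceeds 9; *carry is the decimal
       carry in and out. */
    unsigned bcd_add_digit(unsigned u, unsigned v, unsigned *carry) {
        unsigned s = u + v + *carry;
        if (s > 9) { s += 6; *carry = 1; } else { *carry = 0; }
        return s & 0xFu;  /* corrected low digit */
    }

    int main(void) {
        unsigned c = 0, d;
        d = bcd_add_digit(7, 8, &c);
        printf("7 + 8 -> carry %u, digit %u\n", c, d);  /* carry 1, digit 5 */
        c = 0;
        d = bcd_add_digit(9, 8, &c);
        printf("9 + 8 -> carry %u, digit %u\n", c, d);  /* carry 1, digit 7 */
        return 0;
    }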

Instruction   Operation                                                              Comments
fadd          FPU stack register = FPU stack register + r/m_s operand
fsub          FPU stack register = FPU stack register - r/m_s operand
fsubr         FPU stack register = r/m_s operand - FPU stack register
fmul          FPU stack register = FPU stack register × r/m_s operand
fdiv          FPU stack register = FPU stack register / r/m_s operand
fdivr         FPU stack register = r/m_s operand / FPU stack register
fsin          stack register 0 = sine(stack register 0)                              Argument in radians.
fcos          stack register 0 = cosine(stack register 0)                            Argument in radians.
fsincos       sine and cosine of stack register 0 (the sine replaces register 0,     Argument in radians.
              then the cosine is pushed on top of it)
fptan         stack register 0 = tangent(stack register 0)                           Argument in radians.
fpatan        stack register 0 = arctangent(stack register 0)
fsqrt         stack register 0 = square root of stack register 0

A subset of Intel 64 architecture floating-point arithmetic instructions