
Module 2: Computer Arithmetic
Book: Computer Organization and Design, 3rd ed., David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers

Arithmetic Operations
- Complement
- Addition
- Subtraction
- Multiplication
- Division

Introduction
Numbers are represented by binary digits (bits). This raises several questions:
- How are negative numbers represented?
- What is the largest number that can be represented in a computer word?
- What happens if an operation creates a number bigger than can be represented?
- What about fractions and real numbers?
- A mystery: how does hardware really multiply or divide numbers?

Binary Numbers

Binary Number System
- Digits: 0 and 1
- Bit (short for binary digit): a single binary digit
- LSB (least significant bit): the rightmost bit
- MSB (most significant bit): the leftmost bit
- Upper byte (or nybble): the left-hand byte (or nybble) of a pair
- Lower byte (or nybble): the right-hand byte (or nybble) of a pair

Binary Equivalents
- 1 nybble (or nibble) = 4 bits
- 1 byte = 2 nybbles = 8 bits
- 1 kilobyte (KB) = 1024 bytes
- 1 megabyte (MB) = 1024 kilobytes = 1,048,576 bytes
- 1 gigabyte (GB) = 1024 megabytes = 1,073,741,824 bytes

Binary Numbers (base 2)

Binary Addition
Rules of binary addition:
0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 0, and carry 1 to the next more significant bit
Example: 26 + 12 = 38
    11010   (26)
  + 01100   (12)
  -------
   100110   (38)
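The addition rules above can be sketched in a few lines of Python. This is only an illustration: the function name `add_binary` and the use of bit strings are assumptions made for the example, not part of the slides.

```python
def add_binary(a: str, b: str) -> str:
    """Add two unsigned binary strings bit by bit, from the LSB to the MSB."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    bits = []
    for x, y in zip(reversed(a), reversed(b)):
        total = int(x) + int(y) + carry
        bits.append(str(total % 2))  # sum bit for this position
        carry = total // 2           # carry into the next more significant bit
    if carry:
        bits.append("1")             # final carry becomes a new MSB
    return "".join(reversed(bits))


print(add_binary("11010", "01100"))  # 26 + 12
```

The loop applies exactly the four rules: a column sum of 2 (or 3 with an incoming carry) leaves a sum bit and pushes a carry to the left.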

Binary Subtraction
Rules of binary subtraction:
0 - 0 = 0
0 - 1 = 1, and borrow 1 from the next more significant bit
1 - 0 = 1
1 - 1 = 0
Example: 37 - 17 = 20
    100101   (37)
  - 010001   (17)
  --------
    010100   (20)

Binary Multiplication
Rules of binary multiplication:
0 x 0 = 0
0 x 1 = 0
1 x 0 = 0
1 x 1 = 1, and no carry or borrow bits
Example: 23 x 3 = 69
      10111   (23)
    x    11   (3)
    -------
      10111
     10111
    -------
    1000101   (69)
Another method: binary multiplication is the same as repeated binary addition.

Binary Division

2's Complement
Two's complement representation allows the use of binary arithmetic operations on signed integers, yielding the correct 2's complement results.
Positive numbers: positive 2's complement numbers are represented as the simple binary.
Negative numbers: a negative 2's complement number is represented as the binary number that, when added to the positive number of the same magnitude, equals zero.

To represent positive and negative numbers, look at the MSB (the sign bit):
MSB = 0 means positive
MSB = 1 means negative

Calculation of 2's Complement
Step 1: Invert the binary equivalent of the number by changing all of the ones to zeroes and all of the zeroes to ones (also called the 1's complement).
Step 2: Then add one.
Example: -17
17 = 00010001
Step 1: 11101110
Step 2: 11101110 + 1 = 11101111
-17 = 11101111
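A minimal Python sketch of the two steps, assuming an 8-bit word by default. The helper name `twos_complement` is hypothetical, and the function does not check that the value fits in the chosen width.

```python
def twos_complement(value: int, bits: int = 8) -> str:
    """Return the two's complement bit pattern of value in the given width."""
    mask = (1 << bits) - 1
    magnitude = abs(value) & mask
    if value >= 0:
        return format(magnitude, f"0{bits}b")
    ones_complement = magnitude ^ mask                         # Step 1: invert every bit
    return format((ones_complement + 1) & mask, f"0{bits}b")   # Step 2: add one


print(twos_complement(17))   # 00010001
print(twos_complement(-17))  # 11101111
```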

2's Complement Addition
Two's complement addition follows the same rules as binary addition.
Example: 5 + (-3)
 5 = 00000101
-3 = 11111101
      00000101
    + 11111101
    ----------
    1 00000010   (2)
The carry out of the MSB (the leading 1) is ignored.

2's Complement Addition
Example: 3 + (-5)
 3 = 00000011
-5 = 11111011
      00000011
    + 11111011
    ----------
      11111110   (-2)

2's Complement Subtraction
Two's complement subtraction is the binary addition of the minuend to the 2's complement of the subtrahend (adding a negative number is the same as subtracting a positive one).
Example: 7 - 12 = 7 + (-12), where 7 is the minuend and 12 is the subtrahend.

2's Complement Subtraction
Example: 7 - 12 = 7 + (-12)
  7 = 00000111
-12 = 11110100
      00000111
    + 11110100
    ----------
      11111011   (-5)
Check: take the 2's complement of 11111011.
Step 1: 00000100
Step 2: 00000100 + 1 = 00000101 (5), so 11111011 = -5
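The subtraction procedure can be sketched as follows, assuming 8-bit operands given as non-negative magnitudes. The names `negate`, `subtract` and `to_signed` are illustrative helpers, not part of the slides.

```python
BITS = 8
MASK = (1 << BITS) - 1


def negate(pattern: int) -> int:
    """Two's complement of an 8-bit pattern: invert the bits, then add one."""
    return ((pattern ^ MASK) + 1) & MASK


def subtract(minuend: int, subtrahend: int) -> str:
    """Add the minuend to the two's complement of the subtrahend."""
    result = (minuend + negate(subtrahend)) & MASK  # carry out of the MSB is discarded
    return format(result, f"0{BITS}b")


def to_signed(bits: str) -> int:
    """Interpret a bit string as a two's complement number."""
    value = int(bits, 2)
    return value - (1 << len(bits)) if bits[0] == "1" else value


print(subtract(7, 12), to_signed(subtract(7, 12)))
```

The same `negate` helper covers the addition examples: 5 + (-3) is `(5 + negate(3)) & MASK`.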

2's Complement Multiplication
Two's complement multiplication follows the same rules as binary multiplication.
Example: (-4) x 4 = (-16)
 4 = 00000100
-4 = 11111100
       11111100
     x 00000100
    -----------
    11 11110000   (-16)
The bits beyond the low 8 bits (the leading 11) are ignored.
Check: take the 2's complement of 11110000.
Step 1: 00001111
Step 2: 00001111 + 1 = 00010000 (16), so 11110000 = -16

2's Complement Division
Two's complement division is repeated 2's complement subtraction.
The 2's complement of the divisor is calculated, then added to the dividend. For the next subtraction cycle, the result replaces the dividend. This repeats until the result is too small for another subtraction or is zero; what is left is the remainder. The final answer is the total number of subtraction cycles (the quotient) plus the remainder.
Example: 6 / 3 = 2, where 6 is the dividend, 3 is the divisor and 2 is the quotient.

2's Complement Division
Example: 6/3
Cycle 1: 6 + (-3) = 3
Cycle 2: 3 + (-3) = 0
The number of cycles is the quotient: 6 / 3 = 2
Example: 7/3
Cycle 1: 7 + (-3) = 4
Cycle 2: 4 + (-3) = 1
7 / 3 = 2 remainder 1

2's Complement Division
Example: 42/6
Cycle 1: 42 + (-6) = 36
Cycle 2: 36 + (-6) = 30
Cycle 3: 30 + (-6) = 24
Cycle 4: 24 + (-6) = 18
Cycle 5: 18 + (-6) = 12
Cycle 6: 12 + (-6) = 6
Cycle 7: 6 + (-6) = 0
42 / 6 = 7
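Division by repeated subtraction can be sketched as below. This is a minimal illustration that assumes a non-negative dividend and a positive divisor that both fit in the word; the function name and the 8-bit default are assumptions.

```python
def divide_by_repeated_subtraction(dividend: int, divisor: int, bits: int = 8):
    """Count how many times the divisor can be subtracted from the dividend."""
    mask = (1 << bits) - 1
    neg_divisor = (-divisor) & mask  # two's complement of the divisor
    remainder = dividend
    cycles = 0
    while remainder >= divisor:
        remainder = (remainder + neg_divisor) & mask  # one subtraction cycle
        cycles += 1
    return cycles, remainder  # (quotient, remainder)


print(divide_by_repeated_subtraction(42, 6))
print(divide_by_repeated_subtraction(7, 3))
```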

Sign Extension
Extending a number representation to a larger number of bits.
Example: 2 in 8-bit binary to 16-bit binary.
00000010 -> 00000000 00000010
In signed numbers, it is important to extend the sign bit to preserve the number (+ve or -ve).
Example: -2 in 8-bit binary to 16-bit binary.
11111110 -> 11111111 11111110
The leftmost bit (the sign bit) is copied into every new bit position.
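Sign extension amounts to copying the sign bit into the new high-order positions. A small sketch on bit strings (the helper name `sign_extend` is hypothetical):

```python
def sign_extend(pattern: str, new_width: int) -> str:
    """Widen a two's complement bit string by replicating its sign bit."""
    extra = new_width - len(pattern)
    return pattern[0] * extra + pattern  # pattern[0] is the sign bit


print(sign_extend("00000010", 16))  # +2
print(sign_extend("11111110", 16))  # -2
```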

Detecting Overflow in Two's Complement Numbers
Overflow occurs when adding two positive numbers and the sum is negative, or vice versa (adding two negative numbers and the sum is positive). In that case a carry has occurred into the sign bit.
There are overflow conditions for both addition and subtraction.

Overflow Rule for Addition
If 2 two's complement numbers are added, and they both have the same sign (both positive or both negative), then overflow occurs if and only if the result has the opposite sign.
- Adding two positive numbers must give a positive result.
- Adding two negative numbers must give a negative result.
- Overflow never occurs when adding operands with different signs.
Overflow occurs if:
(+A) + (+B) = -C
(-A) + (-B) = +C
Example: using 4-bit two's complement numbers (-8 <= x <= +7), (-7) + (-6) = (-13), but:
-7 = 1001
-6 = 1010
      1001
    + 1010
    ------
    1 0011   (3)
Overflow occurs: the largest negative number is -8, and the sign bit has changed to +ve.

Overflow Rule for Subtraction
If 2 two's complement numbers are subtracted, and their signs are different, then overflow occurs if and only if the result has the same sign as the subtrahend.
Overflow occurs if:
(+A) - (-B) = -C
(-A) - (+B) = +C
Example: using 4-bit two's complement numbers (-8 <= x <= +7), subtract -6 from +7, i.e. 7 - (-6). Here -6 is the subtrahend; if the result has the same sign as the subtrahend, overflow happens.

Overflow Rule for Subtraction (worked example)
Subtract -6 from +7, i.e. 7 - (-6), using 4-bit two's complement numbers (-8 <= x <= +7). Subtracting -6 is the same as adding its 2's complement, +6:
7 = 0111
6 = 0110
      0111
    + 0110
    ------
      1101   (-3)
Overflow occurs: the result has the same sign as the subtrahend (-6).

A Little Summary
Addition: add the values, discarding any carry-out bit.
Subtraction: negate the subtrahend and add, discarding any carry-out bit.
Overflow: occurs when adding two positive numbers produces a negative result, or when adding two negative numbers produces a positive result. Adding operands of unlike signs never produces an overflow.
Notice that discarding the carry out of the most significant bit during two's complement addition is a normal occurrence, and does not by itself indicate overflow.
As an example of overflow, consider adding 80 + 80 = 160 (decimal), which produces a result of -96 (decimal) in 8-bit two's complement:
      01010000
    + 01010000
    ----------
      10100000   (-96)
The result is not 160 because the sign bit is 1 (the largest +ve number in 8 bits is 127).
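The summary can be turned into a small check: add, discard the carry out, then compare signs. The sketch below assumes both operands already fit in the chosen width; `add_with_overflow` is an illustrative name.

```python
def add_with_overflow(a: int, b: int, bits: int = 8):
    """Two's complement addition; returns (result, overflow flag)."""
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    raw = ((a & mask) + (b & mask)) & mask               # discard any carry-out bit
    result = raw - (1 << bits) if raw & sign_bit else raw
    same_sign_inputs = (a < 0) == (b < 0)
    overflow = same_sign_inputs and ((result < 0) != (a < 0))
    return result, overflow


print(add_with_overflow(80, 80))          # 8-bit example from the summary
print(add_with_overflow(-7, -6, bits=4))  # 4-bit example from the addition rule
print(add_with_overflow(5, -3))           # unlike signs: never overflows
```

Subtraction can reuse the same function by negating the subtrahend first, for example `add_with_overflow(7, 6, bits=4)` for 7 - (-6).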

Binary Multiplication

Two's Complement Multiplication
There are a couple of ways of doing two's complement multiplication by hand. Remember that the result can require 2 times as many bits as the original operands.

"Sign Extend Method" for Two's Complement Multiplication
In 2's complement, sign extend both integers to twice as many bits. Then take the correct number of result bits from the least significant portion of the result.
A 4-bit example: 2 x (-3)
 2 = 0010, sign extended to 8 bits: 0000 0010
-3 = 1101, sign extended to 8 bits: 1111 1101
         00000010
       x 11111101
  ---------------
         00000010
        00000000
       00000010
      00000010
     00000010
    00000010
   00000010
  00000010
  ---------------
  000000111111010
The correct answer is the low 8 bits: 11111010 (-6).

"Partial Product Sign Extend Method" for Two's Complement Multiplication
Another way is to sign extend the partial products to the correct number of bits. Sometimes we do have to make some adjustments.
Example 1: (-4) x 3
-4 = 1100
 3 = 0011
Sign extend each partial product to the correct number of bits (8):
        1100
      x 0011
    --------
    11111100
    1111100
    000000
    00000
    --------
    11110100   (-12)

"Partial Product Sign Extend Method" for Two's Complement Multiplication
Sometimes we do have to make some adjustments:
- If (+ve) x (+ve): OK, normal stuff.
- If (+ve) x (-ve): get the additive inverse of both, and then sign extend the partial products.
- If (-ve) x (+ve): sign extend the partial products.
- If (-ve) x (-ve): get the additive inverse of both.
Example: 3 x (-4)
 3 = 0011, -4 = 1100
Additive inverse of both: -3 = 1101, 4 = 0100
        1101
      x 0100
    --------
    00000000
    0000000
    111101
    00000
    --------
    11110100   (-12), like the previous example, (-4) x 3

"Partial Product Sign Extend Method" for Two's Complement Multiplication
Example: (-3) x (-3)
Both operands are negative, so get the additive inverse of both: -3 = 1101, 3 = 0011
      0011
    x 0011
    ------
      0011
     0011
    ------
     01001   (9)

Signed Multiplication
Another way to deal with signed numbers:
- First convert the multiplier and multiplicand to positive numbers and remember the original signs.
- Leave the sign out of the calculation.
- Negate the product only if the original signs disagree.
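The hand methods can be cross-checked against each other. The sketch below compares the sign extend method (sign extend both operands, keep the low-order bits) with the convert-to-positive method, assuming 4-bit operands and an 8-bit product; both function names are illustrative.

```python
def multiply_sign_extend(a: int, b: int, bits: int = 4) -> str:
    """Sign extend both operands to 2*bits, multiply, keep the low 2*bits."""
    wide = 2 * bits
    mask = (1 << wide) - 1
    # In Python, masking a negative int yields its two's complement pattern,
    # which is the same as sign extending it to the wider width.
    product = ((a & mask) * (b & mask)) & mask
    return format(product, f"0{wide}b")


def multiply_sign_magnitude(a: int, b: int, bits: int = 4) -> str:
    """Multiply the magnitudes, then negate if the original signs disagree."""
    wide = 2 * bits
    mask = (1 << wide) - 1
    product = abs(a) * abs(b)
    if (a < 0) != (b < 0):
        product = -product
    return format(product & mask, f"0{wide}b")


for x, y in [(2, -3), (-4, 3), (3, -4), (-3, -3)]:
    print(x, y, multiply_sign_extend(x, y), multiply_sign_magnitude(x, y))
```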

1st Version of Multiplication Hardware
Actual implementations are far more complex, and use algorithms that generate more than one bit of product each clock cycle.
Hardware:
- Multiplicand register: 64 bits, shifts left; the 32-bit multiplicand starts at the right half of the register
- Multiplier register: 32 bits, shifts right
- Product register: 64 bits, initialized to 0
- 64-bit ALU and control test
Algorithm flow of the 1st version of multiplication:
Start
1. Test Multiplier0 (the least significant bit of the multiplier).
1a. If Multiplier0 = 1, add the multiplicand to the product and place the result in the Product register. If Multiplier0 = 0, do nothing.
2. Shift the Multiplicand register left 1 bit.
3. Shift the Multiplier register right 1 bit.
4. 32nd repetition? No (< 32 repetitions): go back to step 1. Yes (32 repetitions): done.

Example of Multiplication (4 bits)
Example: 2 x 3 = ?
Multiplicand (MC) = 2 = 0010, Multiplier (MP) = 3 = 0011, Product (P)
Steps:
1a. Test the multiplier LSB (0 or 1). If 1, then P = P + MC. If 0, then no operation.
2. Shift MC left.
3. Shift MP right.
All bits done? If still below the maximum number of bits, repeat. If the maximum number of bits is reached, stop.
First iteration:
1a. Multiplier0 = 1, so P = P + MC = 00000000 + 00000010 = 00000010
2. MC = 00000010 -> 00000100
3. MP = 0011 -> 0001
Max bit? No, so repeat.

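Example: 2 x 3
Iteration  Step                    Multiplier (MP)  Multiplicand (MC)  Product (P)
0          Initial value           0011             0000 0010          0000 0000
1          1a: 1, so P = P + MC    0011             0000 0010          0000 0010
           2: Shift MC left        0011             0000 0100          0000 0010
           3: Shift MP right       0001             0000 0100          0000 0010
2          1a: 1, so P = P + MC    0001             0000 0100          0000 0110
           2: Shift MC left        0001             0000 1000          0000 0110
           3: Shift MP right       0000             0000 1000          0000 0110
3          1a: 0, no operation     0000             0000 1000          0000 0110
           2: Shift MC left        0000             0001 0000          0000 0110
           3: Shift MP right       0000             0001 0000          0000 0110
4          1a: 0, no operation     0000             0001 0000          0000 0110
           2: Shift MC left        0000             0010 0000          0000 0110
           3: Shift MP right       0000             0010 0000          0000 0110
Result: P = 0000 0110 = 6
Try with 5 x 4.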

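Example: 5 x 4
Iteration  Step                    Multiplier (MP)  Multiplicand (MC)  Product (P)
0          Initial value           0100             0000 0101          0000 0000
1          1a: 0, no operation     0100             0000 0101          0000 0000
           2: Shift MC left        0100             0000 1010          0000 0000
           3: Shift MP right       0010             0000 1010          0000 0000
2          1a: 0, no operation     0010             0000 1010          0000 0000
           2: Shift MC left        0010             0001 0100          0000 0000
           3: Shift MP right       0001             0001 0100          0000 0000
3          1a: 1, so P = P + MC    0001             0001 0100          0001 0100
           2: Shift MC left        0001             0010 1000          0001 0100
           3: Shift MP right       0000             0010 1000          0001 0100
4          1a: 0, no operation     0000             0010 1000          0001 0100
           2: Shift MC left        0000             0101 0000          0001 0100
           3: Shift MP right       0000             0101 0000          0001 0100
Result: P = 0001 0100 = 20
Try with 2 x (-3).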

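Example: 2 x (-3)
Get the additive inverse of both, so the multiplication becomes (-2) x 3: MC = -2 = 1111 1110 (sign extended), MP = 3 = 0011.
Iteration  Step                    Multiplier (MP)  Multiplicand (MC)  Product (P)
0          Initial value           0011             1111 1110          0000 0000
1          1a: 1, so P = P + MC    0011             1111 1110          1111 1110
           2: Shift MC left        0011             1111 1100          1111 1110
           3: Shift MP right       0001             1111 1100          1111 1110
2          1a: 1, so P = P + MC    0001             1111 1100          1111 1010
           2: Shift MC left        0001             1111 1000          1111 1010
           3: Shift MP right       0000             1111 1000          1111 1010
3          1a: 0, no operation     0000             1111 1000          1111 1010
           2: Shift MC left        0000             1111 0000          1111 1010
           3: Shift MP right       0000             1111 0000          1111 1010
4          1a: 0, no operation     0000             1111 0000          1111 1010
           2: Shift MC left        0000             1110 0000          1111 1010
           3: Shift MP right       0000             1110 0000          1111 1010
Result: P = 1111 1010 = -6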
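The shift-and-add loop traced in the tables can be simulated directly. This sketch assumes 4-bit operands with an 8-bit multiplicand and product register, and a non-negative multiplier (a negative multiplicand is sign extended, as in the last table); `shift_add_multiply` is a hypothetical name.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, bits: int = 4) -> int:
    """Simulate the 1st version multiplier; returns the product register pattern."""
    wide_mask = (1 << (2 * bits)) - 1
    mc = multiplicand & wide_mask          # 2*bits-wide multiplicand register
    mp = multiplier & ((1 << bits) - 1)    # bits-wide multiplier register
    product = 0                            # product register starts at 0
    for _ in range(bits):
        if mp & 1:                         # 1a: Multiplier0 = 1, so P = P + MC
            product = (product + mc) & wide_mask
        mc = (mc << 1) & wide_mask         # 2: shift MC left
        mp >>= 1                           # 3: shift MP right
    return product


print(format(shift_add_multiply(2, 3), "08b"))
print(format(shift_add_multiply(5, 4), "08b"))
print(format(shift_add_multiply(-2, 3), "08b"))
```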

Binary Division

1st Version of Division Hardware
Hardware:
- Divisor register: 64 bits, shifts right; the 32-bit divisor starts at the left half of the register
- Quotient register: 32 bits, shifts left; initialized to 0
- Remainder register: 64 bits; initialized with the dividend at the right
- 64-bit ALU and control test
Algorithm flow of the 1st version of division:
Start
1. Subtract the Divisor register from the Remainder register and place the result in the Remainder register.
Test Remainder:
2a. Remainder >= 0: shift the Quotient register to the left, setting the new rightmost bit to 1.
2b. Remainder < 0: restore the original value by adding the Divisor register to the Remainder register and place the sum in the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0.
3. Shift the Divisor register right 1 bit.
4. 33rd repetition? No (< 33 repetitions): go back to step 1. Yes (33 repetitions): done.

Example of Division (4 bits)
Example: 7 / 2 = ?
Dividend (DD) = 7 = 0111, Divisor (D) = 2 = 0010, Quotient (Q), Remainder (R)
Steps:
1. R = R - D
2. Test the new R (>= 0 or < 0):
2a. If R >= 0: R is left unchanged; Q = shift left (add 1 at LSB).
2b. If R < 0: R = D + R (restore); Q = shift left (add 0 at LSB).
3. D = shift right.
4. All bits done? If still below (max bit + 1) repetitions, repeat from step 1. If (max bit + 1) repetitions are done, stop.

Example: 7/2
The divisor starts at the left half of the divisor register. In iteration 1, R = R - D = R + (-D): negate 0010 0000 to get 1110 0000, then 0000 0111 + 1110 0000 = 1110 0111.
Iteration  Step                                    Quotient (Q)  Divisor (D)  Remainder (R)
0          Initial value                           0000          0010 0000    0000 0111
1          1. R = R - D                            0000          0010 0000    1110 0111
           2b. R < 0: R = D + R, Q shift left (0)  0000          0010 0000    0000 0111
           3. D = shift right                      0000          0001 0000    0000 0111
2          1. R = R - D                            0000          0001 0000    1111 0111
           2b. R < 0: R = D + R, Q shift left (0)  0000          0001 0000    0000 0111
           3. D = shift right                      0000          0000 1000    0000 0111
3          1. R = R - D                            0000          0000 1000    1111 1111
           2b. R < 0: R = D + R, Q shift left (0)  0000          0000 1000    0000 0111
           3. D = shift right                      0000          0000 0100    0000 0111

Example: 7/2 (continued)
Iteration  Step                                    Quotient (Q)  Divisor (D)  Remainder (R)
4          1. R = R - D                            0000          0000 0100    0000 0011
           2a. R >= 0: Q shift left (1)            0001          0000 0100    0000 0011
           3. D = shift right                      0001          0000 0010    0000 0011
5          1. R = R - D                            0001          0000 0010    0000 0001
           2a. R >= 0: Q shift left (1)            0011          0000 0010    0000 0001
           3. D = shift right                      0011          0000 0001    0000 0001
Result: Q = 0011 = 3, R = 0000 0001 = 1, so 7/2 = 3 remainder 1

Example: 6/3
Iteration  Step                                    Quotient (Q)  Divisor (D)  Remainder (R)
0          Initial value                           0000          0011 0000    0000 0110
1          1. R = R - D                            0000          0011 0000    1101 0110
           2b. R < 0: R = D + R, Q shift left (0)  0000          0011 0000    0000 0110
           3. D = shift right                      0000          0001 1000    0000 0110
2          1. R = R - D                            0000          0001 1000    1110 1110
           2b. R < 0: R = D + R, Q shift left (0)  0000          0001 1000    0000 0110
           3. D = shift right                      0000          0000 1100    0000 0110
3          1. R = R - D                            0000          0000 1100    1111 1010
           2b. R < 0: R = D + R, Q shift left (0)  0000          0000 1100    0000 0110
           3. D = shift right                      0000          0000 0110    0000 0110

Example: 6/3 (continued)
Iteration  Step                                    Quotient (Q)  Divisor (D)  Remainder (R)
4          1. R = R - D                            0000          0000 0110    0000 0000
           2a. R >= 0: Q shift left (1)            0001          0000 0110    0000 0000
           3. D = shift right                      0001          0000 0011    0000 0000
5          1. R = R - D                            0001          0000 0011    1111 1101
           2b. R < 0: R = D + R, Q shift left (0)  0010          0000 0011    0000 0000
           3. D = shift right                      0010          0000 0001    0000 0000
Result: Q = 0010 = 2, R = 0000 0000 = 0, so 6/3 = 2
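The restoring division loop from the tables can be sketched as follows. For brevity this version uses ordinary signed Python integers for the remainder instead of 8-bit two's complement patterns, and assumes a non-negative dividend and a positive divisor that fit in `bits`; the function name is an assumption.

```python
def restoring_divide(dividend: int, divisor: int, bits: int = 4):
    """Simulate the 1st version divider; returns (quotient, remainder)."""
    remainder = dividend      # remainder register holds the dividend at the right
    d = divisor << bits       # divisor starts at the left half of its register
    quotient = 0
    for _ in range(bits + 1):                # max bit + 1 repetitions
        remainder -= d                       # 1. R = R - D
        if remainder >= 0:
            quotient = (quotient << 1) | 1   # 2a. shift left, new LSB = 1
        else:
            remainder += d                   # 2b. restore R
            quotient = quotient << 1         #     shift left, new LSB = 0
        d >>= 1                              # 3. shift D right
    return quotient, remainder


print(restoring_divide(7, 2))
print(restoring_divide(6, 3))
```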

Signed Division
Signed divide:
- Make both divisor and dividend positive and perform the division.
- Negate the quotient if divisor and dividend were of opposite signs.
- Make the sign of the remainder match that of the dividend.
This ensures that dividend = (quotient x divisor) + remainder always holds, and that -quotient(x/y) = quotient(-x/y).
e.g. 7 = 3 x 2 + 1 and -7 = -3 x 2 - 1
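A short sketch of the signed divide rule, using Python's `divmod` on the magnitudes as a stand-in for the unsigned divider (the name `signed_divide` is illustrative):

```python
def signed_divide(dividend: int, divisor: int):
    """Divide magnitudes, then fix the signs of quotient and remainder."""
    quotient, remainder = divmod(abs(dividend), abs(divisor))
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient     # opposite signs: negate the quotient
    if dividend < 0:
        remainder = -remainder   # remainder takes the sign of the dividend
    return quotient, remainder


for x, y in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
    q, r = signed_divide(x, y)
    print(f"{x} / {y} = {q} remainder {r}")
```

Note that Python's built-in `//` and `%` round toward negative infinity, so `-7 // 2` is -4 with remainder 1; the rule on the slide truncates toward zero instead.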

Module 2 Part 2: Floating Point

Floating Point Representation
Floating point aids in the representation of very big or very small fixed point numbers.
Fixed point            Floating point
10000000000            1.0 x 10^10
976,000,000,000,000    9.76 x 10^14
0.0000000000000976     9.76 x 10^-14

Floating Point Numbers
Example: 736.637 x 10^68
- Significand/fraction: 736.637
- Base/radix: 10
- Exponent: 68
The significand and exponent can be +ve or -ve. Decimal numbers use the radix 10; binary numbers use the radix 2.

Normalized and Unnormalized
In generalized normalization (like in mathematics), a floating point number is said to be normalized if the first digit after the radix point is a non-zero value. An unnormalized floating point number is one where the first digit after the radix point is 0.
Example:
0.1234 x 10^16     normalized
0.0011 x 10^15     unnormalized
11.0123 x 10^11    unnormalized
0.133 x 10^5       normalized

Normalization Process
Normalization is the process of deleting the zeroes until a non-zero value is detected.
Example:
Move the radix point to the right (in this case 2 places):
0.00234 x 10^4 = 0.234 x 10^(4-2) = 0.234 x 10^2
Move the radix point to the left (in this case 2 places):
12.0024 x 10^4 = 0.120024 x 10^(4+2) = 0.120024 x 10^6
A rule of thumb:
- moving the radix point to the right: subtract from the exponent
- moving the radix point to the left: add to the exponent

Floating Point Format for Binary Numbers
In the beginning, the general form of floating point is:
+/- 0.Mantissa x r^(+/- exponent)
In binary, 1 word holds four fields:
+/- (sign) | Exponent | +/- (sign) | Mantissa
Mantissa = significand.
The 2 sign bits are not good for design as they incur extra costs.

Biased Exponent
A new value to represent the exponent without the sign bit is introduced. This eliminates the sign for the exponent value; that is, the stored exponent will always be positive.
1 word then holds three fields:
+/- (sign, 1 bit) | Biased exponent E' (n bits) | Mantissa m

Biased Exponent
Word format: +/- (1 bit) | E' (n bits) | Mantissa m
Bias value, b = 2^(n-1)
Normalized exponent, e = E' - b
Biased exponent, E' = e + b
where E' = biased exponent and n = number of bits of the exponent field (i.e. from the word format).
This is used unless the IEEE standard is mentioned, in which case a different calculation applies.

Conversion to Floating Point Number
Step 1: Change to binary (if given a decimal number)
Step 2: Normalize the number
Step 3: Change the number to biased exponent
Step 4: Form the word (3 fields)

Example 3
Transform -33.625 to a floating point word using the following format (radix 2). This is the given word format:
Sign (1 bit) | Biased exponent (4 bits) | Significand (12 bits)
Step 1: Change 33.625 to binary
33 = 0100001
0.625 x 2 = 1.25 -> 1
0.25 x 2 = 0.5 -> 0
0.5 x 2 = 1.0 -> 1
0.625 = .101
33.625 = 0100001.101 = 0.100001101 x 2^0110

Example 3
Step 2: Normalize the number
0100001.101 = 0.100001101 x 2^0110, so the normalized exponent is e = 0110
Step 3: Change the number to biased exponent
Word format: 1 bit | 4 bits | 12 bits, so n = 4
Bias value, b = 2^(n-1) = 2^(4-1) = 8d = 1000b
Biased exponent, E' = e + b = 0110 + 1000 = 1110
0.100001101 x 2^0110 becomes 0.100001101 x 2^1110

Example 3
Step 4: Form the word (3 fields) from 0.100001101 x 2^1110
Sign (1 bit) | Biased exponent (4 bits) | Significand (12 bits)
1            | 1110                     | 100001101000
The last three zeroes of the significand are padding.
Rule of thumb:
- the biased exponent is always padded to the left
- the significand (or mantissa) is always padded to the right
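The four steps of Example 3 can be sketched with `math.frexp`, which returns a fraction in [0.5, 1) and an exponent, i.e. exactly the 0.1xxx normalization used here. The function name `encode_custom` is hypothetical, and the sketch truncates extra significand bits and does not check that the biased exponent fits in its field.

```python
import math


def encode_custom(value: float, exp_bits: int = 4, sig_bits: int = 12):
    """Encode value as (sign, biased exponent, significand) bit strings."""
    sign = "1" if value < 0 else "0"
    fraction, e = math.frexp(abs(value))         # abs(value) = fraction * 2**e
    biased = e + 2 ** (exp_bits - 1)             # E' = e + b, with b = 2^(n-1)
    significand = int(fraction * 2 ** sig_bits)  # bits after the radix point
    return (sign,
            format(biased, f"0{exp_bits}b"),
            format(significand, f"0{sig_bits}b"))


print(encode_custom(-33.625))
```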

Floating-Point Representation
Value in general form: (-1)^S x F x 2^e
With an 8-bit exponent, we can represent magnitudes from about 2 x 10^-38 to 2 x 10^38. This is called single precision (1 word).
If a result goes above or below this range, then overflow or underflow happens respectively.
One way to reduce this is to offer another format with a larger exponent: double precision (2 words), from about 2 x 10^-308 to 2 x 10^308.

IEEE 754 Floating-Point Standard
The leading 1 of the normalized significand is made implicit to pack more bits into the significand.

Normalized Scientific Notation
In IEEE standard normalization (used in computers), a floating point number is said to be normalized if there is only a single non-zero digit before the radix point.
Example:
123.456 normalized: 1.23456 x 10^2
1010.1011b normalized: 1.0101011 x 2^011

Challenge of Negative Exponents
Placing the exponent before the significand simplifies sorting of floating-point numbers using integer comparison instructions.
Sign (1 bit) | Exponent (8 bits) | Significand (23 bits)
However, using 2's complement in the exponent field makes a negative exponent look like a big number.

Biased Notation
Bias:
- in single precision: 127
- in double precision: 1023
Biased notation: (-1)^sign x (1 + Fraction) x 2^(exponent - bias)

IEEE 754 Conversion
To convert a decimal number to single (or double) precision floating point:
Step 1: Normalize
Step 2: Determine the sign bit
Step 3: Determine the exponent
Step 4: Determine the significand

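IEEE 754 Conversion: Example 1
Convert 10.4d to single precision floating point.
Step 1: Normalize
10 = 00001010
0.4 x 2 = 0.8 -> 0
0.8 x 2 = 1.6 -> 1
0.6 x 2 = 1.2 -> 1
0.2 x 2 = 0.4 -> 0
0.4 x 2 = 0.8 -> 0
0.8 x 2 = 1.6 -> 1
For continuous (repeating) results, take the 1st pattern before it repeats itself: 0.4 = .0110
10.4 = 1010.0110 x 2^0 = 1.0100110 x 2^3
Note that stopping after one pattern is an approximation: 1010.0110b is exactly 10.375. A full-precision conversion keeps repeating the pattern until all 23 fraction bits are filled.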

IEEE 754 Conversion: Example 1
Step 2: Determine the sign bit (S)
Because 10.4 is positive, S = 0
Step 3: Determine the exponent
Because it is single precision, bias = 127. The 3 is from 2^3.
Exponent = 3 + bias = 3 + 127 = 130d = 1000 0010b
Step 4: Determine the significand
Drop the leading 1 of the significand: 1.0100110 x 2^3 -> 0100110
Then expand (pad) to 23 bits: 01001100000000000000000
Sign | Exponent | Significand
0    | 10000010 | 01001100000000000000000

IEEE 754 Conversion: Example 2
Convert -0.75d to single precision floating point.
Step 1: Normalize
0.75 x 2 = 1.5 -> 1
0.5 x 2 = 1.0 -> 1
0.0 x 2 = 0 -> 0
-0.75 = -0.11b = -0.11 x 2^0 = -1.1 x 2^-1

IEEE 754 Conversion: Example 2
Step 2: Determine the sign bit (S)
Because -0.75 is negative, S = 1
Step 3: Determine the exponent
Because it is single precision, bias = 127
Exponent = -1 + bias = -1 + 127 = 126d = 01111110b
Step 4: Determine the significand
Drop the leading 1 of the significand: -1.1 x 2^-1 -> .1
Then expand (pad) to 23 bits: 10000000000000000000000
Sign | Exponent | Significand
1    | 01111110 | 10000000000000000000000

IEEE 754 Conversion: Example 3
Convert -0.75d to double precision floating point.
Step 1: Normalize
0.75 x 2 = 1.5 -> 1
0.5 x 2 = 1.0 -> 1
0.0 x 2 = 0 -> 0
-0.75 = -0.11b = -0.11 x 2^0 = -1.1 x 2^-1

IEEE 754 Conversion: Example 3
Step 2: Determine the sign bit (S)
Because -0.75 is negative, S = 1
Step 3: Determine the exponent
Because it is double precision, bias = 1023
Exponent = -1 + bias = -1 + 1023 = 1022d = 01111111110b
Step 4: Determine the significand
Drop the leading 1 of the significand: -1.1 x 2^-1 -> .1
Then expand (pad) to 52 bits: 1000000000000000000000...00
Sign | Exponent (11 bits) | Significand (52 bits)
1    | 01111111110        | 1000000000000000000...00
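The hand conversions can be checked against the machine encoding with Python's standard `struct` module. The helper names `float32_fields` and `float64_fields` are assumptions made for this sketch.

```python
import struct


def float32_fields(x: float):
    """Split the single precision encoding of x into sign, exponent, fraction."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]


def float64_fields(x: float):
    """Split the double precision encoding of x into sign, exponent, fraction."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = format(raw, "064b")
    return bits[0], bits[1:12], bits[12:]


print(float32_fields(-0.75))
print(float64_fields(-0.75))
print(float32_fields(10.375))  # the truncated pattern 1010.0110b
print(float32_fields(10.4))    # the repeating pattern fills all 23 bits
```

The last two lines show the difference between taking only the first pattern of 0.4 and the full-precision encoding of 10.4.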

Converting Binary to Decimal Floating-Point
What decimal number is represented by this single precision float?
Sign (1 bit) | Exponent (8 bits) | Significand (23 bits)
1            | 10000001          | 01000000000000000000000
Extract the values:
Sign = 1
Exponent = 10000001b = 129d
Fraction = (0 x 2^-1) + (1 x 2^-2) + (0 x 2^-3) + ... = 1/4 = 0.25
Remember the biased notation: (-1)^sign x (1 + Fraction) x 2^(exponent - bias)
The number = -(1 + 0.25) x 2^(exponent - bias) = -(1.25 x 2^(129 - 127)) = -(1.25 x 2^2) = -(1.25 x 4) = -5.0
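The biased-notation formula translates directly into code. This sketch handles normalized single precision numbers only (it ignores the reserved exponents 0 and 255), and `decode_float32` is an illustrative name.

```python
def decode_float32(sign: str, exponent: str, fraction: str) -> float:
    """Evaluate (-1)^sign x (1 + Fraction) x 2^(exponent - bias) with bias 127."""
    s = int(sign)
    e = int(exponent, 2)
    # Bit i of the fraction (counting from 0 at the left) has weight 2^-(i+1).
    f = sum(int(bit) * 2.0 ** -(i + 1) for i, bit in enumerate(fraction))
    return (-1) ** s * (1 + f) * 2.0 ** (e - 127)


print(decode_float32("1", "10000001", "01" + "0" * 21))
```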

Module 2 Part 3: Floating-Point Operations

Floating-Point Addition Flows (flowchart)

Decimal Floating-Point Addition
Assume 4 decimal digits for the significand and 2 decimal digits for the exponent.
Step 1: Align the decimal point of the number that has the smaller exponent.
Step 2: Add the significands.
Step 3: Normalize the sum. Check for overflow/underflow of the exponent after normalization.
Step 4: Round the significand. If the significand does not fit in the space reserved for it, it has to be rounded off.
Step 5: Normalize it (if need be).

Decimal Floating-Point Addition
Example: 9.999d x 10^1 + 1.610d x 10^-1
Step 1: Align the decimal point of the number that has the smaller exponent
Make 1.610d x 10^-1 into a multiple of 10^1: -1 + x = 1, so x = 2. Move the decimal point 2 places to the left: 0.0161d x 10^1
Step 2: Add the significands
     9.9990 x 10^1
  +  0.0161 x 10^1
  ----------------
    10.0151 x 10^1

Decimal Floating-Point Addition
Example: 9.999d x 10^1 + 1.610d x 10^-1
Step 3: Normalize the sum
10.0151 x 10^1 = 1.00151 x 10^2
Step 4: Round the significand (to 4 decimal digits for the significand)
1.00151 x 10^2 -> 1.0015 x 10^2
Step 5: Normalize it (if need be)
No need, as it is already normalized.
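The five steps can be sketched with Python's `decimal` module. The sketch assumes "4 decimal digits" means 4 digits after the decimal point, which is how the worked example rounds, and uses round-half-up; the function name `fp_add_decimal` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def fp_add_decimal(sig_a: str, exp_a: int, sig_b: str, exp_b: int, digits: int = 4):
    """Add sig_a x 10^exp_a and sig_b x 10^exp_b; returns (significand, exponent)."""
    a, b = Decimal(sig_a), Decimal(sig_b)
    if exp_a < exp_b:                        # make a the operand with the larger exponent
        a, exp_a, b, exp_b = b, exp_b, a, exp_a
    b = b.scaleb(exp_b - exp_a)              # Step 1: align the smaller exponent
    total, exp = a + b, exp_a                # Step 2: add the significands
    while abs(total) >= 10:                  # Step 3: normalize the sum
        total, exp = total.scaleb(-1), exp + 1
    while total != 0 and abs(total) < 1:
        total, exp = total.scaleb(1), exp - 1
    total = total.quantize(Decimal(1).scaleb(-digits), rounding=ROUND_HALF_UP)  # Step 4
    if abs(total) >= 10:                     # Step 5: renormalize if rounding carried out
        total, exp = total.scaleb(-1), exp + 1
    return total, exp


print(fp_add_decimal("9.999", 1, "1.610", -1))
```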

Binary Floating-Point Addition
Example: 0.5d + (-0.4375d)
Adjust the numbers:
0.5 = 0.1b x 2^0 = 1.0b x 2^-1
-0.4375 = -0.0111b x 2^0 = -1.11b x 2^-2
Step 1: Align the radix point of the number that has the smaller exponent
Make -1.11b x 2^-2 into a multiple of 2^-1: -2 + x = -1, so x = 1. Move the radix point 1 place to the left: -0.111b x 2^-1

Binary Floating-Point Addition
Example: 0.5d + (-0.4375d)
Step 2: Add the significands
     1.000 x 2^-1
  + -0.111 x 2^-1
  ---------------
     0.001 x 2^-1
Step 3: Normalize the sum
0.001 x 2^-1 = 1.000 x 2^-4
Step 4: Round the significand (to 4 digits for the significand)
It fits in the 4 digits.
Step 5: Normalize it (if need be)
No need, as it is already normalized.
Check: 1.000b x 2^-4 = 0.0625d = 0.5 - 0.4375.

Floating-Point Multiplication Flows (flowchart)

Floating-Point Multiplication
Assume 4 decimal digits for the significand and 2 decimal digits for the exponent.
Step 1: Add the exponents of the 2 numbers.
Step 2: Multiply the significands.
Step 3: Normalize the product. Check for overflow/underflow of the exponent after normalization.
Step 4: Round the significand. If the significand does not fit in the space reserved for it, it has to be rounded off.
Step 5: Normalize it (if need be).
Step 6: Set the sign of the product.

Floating-Point Multiplication
Example: (1.110d x 10^10) x (9.200d x 10^-5)
Assume 4 decimal digits for the significand and 2 decimal digits for the exponent.
Step 1: Add the exponents of the 2 numbers
10 + (-5) = 5
If a biased exponent is considered: 10 + (-5) + 127 = 132
Step 2: Multiply the significands
       9.200
     x 1.110
    --------
        0000
       9200
      9200
     9200
    --------
    10212000
Placing the decimal point (6 digits after the point) gives 10.212000, so the product is 10.2120 x 10^5

Floating-Point Multiplication
Example: (1.110d x 10^10) x (9.200d x 10^-5)
Step 3: Normalize the product
10.2120 x 10^5 = 1.02120 x 10^6
Step 4: Round the significand (4 decimal digits for the significand)
1.0212 x 10^6
Step 5: Normalize it (if need be)
Still normalized.
Step 6: Set the sign of the product
+1.0212 x 10^6

Floating-Point Multiplication
Example: (1.000b x 2^-1) x (-1.110b x 2^-2)
Assume 4 digits for the significand and 2 digits for the exponent.
Step 1: Add the exponents of the 2 numbers
-1 + (-2) = -3
If a biased exponent is considered: -1 + (-2) + 127 = 124
Step 2: Multiply the significands
      1.110
    x 1.000
    -------
    1110000
Placing the radix point (6 digits after the point) gives 1.110000, so the product is 1.110000 x 2^-3

Floating-Point Multiplication
Example: (1.000b x 2^-1) x (-1.110b x 2^-2)
Step 3: Normalize the product
1.110000 x 2^-3 is already normalized.
Step 4: Round the significand (4 digits for the significand)
1.1100 x 2^-3
Step 5: Normalize it (if need be)
Still normalized.
Step 6: Set the sign of the product
-1.1100b x 2^-3 = -7/32d
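A sketch of the multiplication steps for the binary example, using exact fractions for the significands. The rounding step is left out because the product already fits, and significands are assumed to be normalized (at least 1 and below 2); the function name `fp_multiply` is an assumption.

```python
from fractions import Fraction


def fp_multiply(sign_a: int, sig_a: Fraction, exp_a: int,
                sign_b: int, sig_b: Fraction, exp_b: int):
    """Multiply two binary floating-point numbers; returns (sign, significand, exponent)."""
    exp = exp_a + exp_b        # Step 1: add the exponents
    sig = sig_a * sig_b        # Step 2: multiply the significands
    while sig >= 2:            # Step 3: normalize the product
        sig, exp = sig / 2, exp + 1
    sign = sign_a ^ sign_b     # Step 6: negative only if the signs differ
    return sign, sig, exp


# (1.000b x 2^-1) x (-1.110b x 2^-2); 1.110b = 7/4
sign, sig, exp = fp_multiply(0, Fraction(1), -1, 1, Fraction(7, 4), -2)
value = (-1) ** sign * sig * Fraction(2) ** exp
print(sign, sig, exp, value)
```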

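Floating-Point ALU (block diagram)
Inputs: two operands, each with sign, exponent and significand fields.
- Small ALU: compares the exponents and produces the exponent difference.
- Control: uses the exponent difference to steer the multiplexers (0/1 selects).
- Shift right: shifts the smaller number right.
- Big ALU: adds the significands.
- Increment or decrement, and shift left or right: normalize.
- Rounding hardware: rounds.
Output: sign, exponent and significand of the result.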

Accurate Arithmetic
If a calculation exceeds the limits of the floating point scheme, then the CPU will flag this error (overflow).
If a number is too tiny to be represented, underflow occurs.

Accurate Arithmetic: Truncation & Rounding
Some numbers have infinitely many digits after the decimal point, for example 1/3d = 0.3333333333... (irrational numbers behave the same way).
Truncation is done to fit the digits into manageable units. Truncation is where decimal values beyond the truncation point are simply discarded, and it can cause errors in floating point calculations.
Rounding: if you have a number such as 3.456 and you have to round it to 3 significant digits, the number becomes 3.46. A small error, called the rounding error, has occurred.
Note: the CPU will not flag any error when truncation and rounding occur, as it is acting within its limits. Programmers must assess whether this will lead to significant errors.
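The difference between truncation and rounding can be seen with Python's `decimal` module. The helper names below are illustrative, and "places" here means digits after the decimal point.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def truncate(value: str, places: int) -> Decimal:
    """Discard every digit beyond the truncation point."""
    return Decimal(value).quantize(Decimal(1).scaleb(-places), rounding=ROUND_DOWN)


def round_half_up(value: str, places: int) -> Decimal:
    """Round to the nearest value, with halves rounded up."""
    return Decimal(value).quantize(Decimal(1).scaleb(-places), rounding=ROUND_HALF_UP)


print(truncate("3.456", 2))       # 3.45
print(round_half_up("3.456", 2))  # 3.46
print(Decimal("3.456") - truncate("3.456", 2))       # truncation error
print(round_half_up("3.456", 2) - Decimal("3.456"))  # rounding error
```

Neither call raises an error, which mirrors the note above: the loss of precision is silent.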