Representation of Numbers and Arithmetic in Signal Processors

1. General facts

Without any information about the convention used to represent binary numbers in a computer, no exact value can be attributed to a binary number, since the number is only a string of bits. For instance, what is the decimal value of 10111011B = X? Is it a positive or a negative value? It depends on the representation used.

When choosing a Digital Signal Processor (DSP) for a certain application, one of the important criteria in the decision process is the data (binary) representation used by the processor. From this point of view the commercial DSP market can be classified as shown in figure 1.

Fig. 1 Number representation in commercial DSPs

Fixed point arithmetic was used by the first DSPs and is still used in most DSPs today. Fixed point DSPs can represent numbers either as:
- integers (integer arithmetic), used by the DSP for control operations, address computation and other operations that do not concern signals, or
- fractional values (fractional arithmetic), between -1 and +1, useful in signal computations.

The algorithms and the hardware used to implement fractional arithmetic are virtually identical to those used for integer arithmetic. The main difference between the two types of arithmetic lies in how the results of multiplication operations are used. Most fixed point DSPs support both kinds of arithmetic.

Because numerical systems use a word of a given length to represent numbers, arithmetic operations are executed with a certain (limited) precision. Numbers are thus represented on a circle (ring) instead of on the infinite real axis, a situation in which arithmetic overflow may occur.
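Returning to the question above, the following minimal C sketch (purely illustrative, not part of any DSP vendor library) prints the value of the same bit string 10111011B under three common conventions, anticipating the sections that follow: unsigned integer, two's complement integer and Q7 fractional.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t raw = 0xBB;                      /* 10111011B, the bit string from the text */

    unsigned u  = raw;                       /* unsigned integer convention:  187       */
    int8_t   s  = (int8_t)raw;               /* two's complement convention:  -69       */
    double   q7 = (int8_t)raw / 128.0;       /* Q7 fractional convention: -0.5390625    */

    printf("unsigned : %u\n", u);
    printf("signed   : %d\n", s);
    printf("Q7       : %f\n", q7);
    return 0;
}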

In the case of floating point DSPs, values are represented by a mantissa and an exponent, according to the relation value = mantissa * 2^exponent. The mantissa is generally a fractional number between -1.0 and +1.0, while the exponent is an integer that indicates by how many positions the binary point (a term defined by analogy with the decimal point) must be shifted to obtain the represented value from the mantissa.

Floating point processors are easier to program than fixed point ones, but they are also more expensive. This is due to the increased complexity of the circuitry, which requires a larger chip area. The ease of programming comes from the fact that the programmer does not have to continually manage overflow of the accumulator, as is the case with fixed point processors (this management amounts to periodically scaling the result at different stages of the computation).

Most low cost applications require fixed point processors. In this case programming usually raises no difficulties, because the numerical simulation phase of the algorithm (the phase that precedes implementation) easily detects all the situations in which a correction of the result is necessary to avoid saturation.

For a better understanding of the overflow phenomenon, we present the two behaviours encountered in DSPs and in processors supporting the MMX technology: wraparound (modulo 2^n) and saturation. As an example, consider an image based application: starting from the original grayscale image (Fig. 2a), we can obtain the wraparound effect (Fig. 2b) and the saturation effect (Fig. 2c) with the following processing steps:
- if the addition between a pixel and the added value overflows, the result is truncated to its n least significant bits (the wraparound effect), due to the limitation to n bits;
- if the addition overflows, saturation occurs and the result is limited to the maximum value of the domain.

Fig. 2 a) Original grayscale image; b) Modulo 2^n operation; c) Saturation

Small numbers represent dark (black) areas of the grayscale, while large numbers represent light (white) areas. Using 8 bits to represent the pixel values of the image, we obtain the [0, 255] range, with 0 standing for black and 255 for white. In order to lighten the original image (Fig. 2a), we can add a positive integer (e.g. 64 = 40h) to each pixel of the image.
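The two behaviours can be sketched in C as below (a minimal illustration with hypothetical helper names and a tiny pixel buffer, not an actual image processing routine):

#include <stdio.h>
#include <stdint.h>

/* Wraparound (modulo 2^8): the carry out of the 8-bit result is simply lost. */
static uint8_t add_wrap(uint8_t pixel, uint8_t offset)
{
    return (uint8_t)(pixel + offset);            /* e.g. 250 + 64 -> 58        */
}

/* Saturation: the 9-bit intermediate sum is clamped to the [0, 255] range.   */
static uint8_t add_sat(uint8_t pixel, uint8_t offset)
{
    uint16_t sum = (uint16_t)pixel + offset;
    return (sum > 255u) ? 255u : (uint8_t)sum;   /* e.g. 250 + 64 -> 255       */
}

int main(void)
{
    uint8_t image[4] = { 0, 100, 200, 250 };     /* stand-in for the grayscale image */
    for (int i = 0; i < 4; i++)
        printf("%3u -> wraparound %3u, saturation %3u\n",
               image[i], add_wrap(image[i], 64), add_sat(image[i], 64));
    return 0;
}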

In the case of the wraparound effect (modulo 2^8), if an overflow occurs (the value of a pixel exceeds the maximum threshold of 255), the result is truncated so that only the least significant 8 bits are retained. For instance, adding 64 to 250 (almost white) gives:

      250 decimal      1111 1010 B
   +   64 decimal    + 0100 0000 B
   --------------------------------
    = 314 decimal    = 1 0011 1010 B   (an overflow occurs)
    =  58 decimal    =   0011 1010 B   (we keep only the least significant 8 bits)

The result is 58, producing a dark area (close to black) instead of the lighter one that was expected. The effect obtained is the reverse of the desired one (shade inversion), because light shades have turned into dark areas.

In the case of saturation, by adding a value to every pixel of the original image, light areas become purely white. The saturation threshold in this case is the maximum value representable on 8 bits, that is 255.

Fig. 3 a) Obtaining the pixel values in the modulo 2^8 case; b) Obtaining the pixel values in the saturation case

2. Integer representation

On an n-bit word we can represent 2^n numbers, equally spaced with the quantization step q = 1. The unsigned binary representation of an n-bit word is:

    B(2) = b(n-1) b(n-2) ... b(1) b(0)

with the decimal value

    B(10) = b(n-1)*2^(n-1) + ... + b(1)*2 + b(0)

placed in the [0, 2^n - 1] range. Figure 4 shows the wheel of numbers on which several properties of addition can be verified for values represented on a finite number of bits, in this case n = 4.

Fig. 4 The wheel of unsigned numbers represented on 4 bits

Example: Using the 4-bit unsigned representation, increment the greatest representable number:

      1111B = 15
    + 0001B =  1
    --------------
    1 0000B =  0

The number 1 0000B is no longer a 4-bit number: overflow has occurred, and the leading 1 is the carry (transport) bit. Unsigned binary numbers on n bits form a modulo 2^n system!

To represent negative (signed) binary numbers we use the two's complement representation. A two's complement number on n bits is:

    B(2) = b(n-1) b(n-2) ... b(1) b(0)

with the decimal value

    B(10) = -b(n-1)*2^(n-1) + ... + b(1)*2 + b(0)

placed in the [-2^(n-1), 2^(n-1) - 1] range.

Fig. 5 The wheel of signed numbers represented on 4 bits

Observation: In decimal, 100 and 0100 are identical, but in two's complement we cannot say the same about the numbers 1010B and 01010B. Why? In decimal we use a special character (-) for the sign, while in binary the sign is given by the sign bit. This is why the decimal value of the number 1010B depends on the word length on which it is represented: on 5 bits (01010B) the value is 10, whereas on 4 bits (1010B) it is -6 (see also the short check after the list of properties below).

Properties:
- The product of two fixed point integers represented on n bits requires a result on 2n bits.
- The addition of two numbers represented on n bits can give a result on n+1 bits.
- The multiplication of two fixed point integers therefore has a high probability of overflow when the result is kept on n bits.
- Over N successive additions, log2(N) supplementary (guard) bits are needed to avoid overflow.
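The word-length dependence noted in the observation can be checked with a small C helper (an illustrative decoder, not a library function) that interprets the low n bits of a word as an n-bit two's complement number:

#include <stdio.h>
#include <stdint.h>

/* Decode the low n bits of raw as an n-bit two's complement value:
 * B(10) = -b(n-1)*2^(n-1) + ... + b(1)*2 + b(0)                              */
static int decode_twos_complement(uint32_t raw, int n)
{
    uint32_t bits = raw & ((1u << n) - 1u);
    if (bits & (1u << (n - 1)))               /* sign bit set -> negative value */
        return (int)bits - (1 << n);
    return (int)bits;
}

int main(void)
{
    printf("1010B  on 4 bits: %d\n", decode_twos_complement(0xA, 4));   /* -6 */
    printf("01010B on 5 bits: %d\n", decode_twos_complement(0xA, 5));   /* 10 */
    return 0;
}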

Example: Using the 4-bit fixed point representation, perform the multiplication of -3 by 2.

Observation: The result is an 8-bit number, and the sign extension of the partial products to the length of the result word must not be forgotten.

      1101B = -3
    x 0010B =  2
    ----------------
      0000 0000
      1111 1010      (-3 sign-extended and shifted left by 1)
      0000 0000
      0000 0000
    ----------------
      1111 1010B = -6

Going back to the 4-bit representation, the result is 1010B = -6, so it is correct.

Example: Compute the 4-bit fixed point product of -3 and 6.

      1101B = -3
    x 0110B =  6
    ----------------
      0000 0000
      1111 1010      (-3 sign-extended and shifted left by 1)
      1111 0100      (-3 sign-extended and shifted left by 2)
      0000 0000
    ----------------
      1110 1110B = -18

Going back to the 4-bit representation, the result would be 1110B = -2, so it is not correct: -18 cannot be represented on 4 bits.

Property: For any sequence of operations whose final result can be correctly represented in the given range, the final result is computed correctly even if overflows appear in intermediate results.

Example: Using the 4-bit signed representation (range [-8, 7]), compute the expression 7 + 6 - 8 = 5:

      0111B     7
    + 0110B     6
    ----------------
      1101B    -3    (overflow)
    + 1000B    -8
    ----------------
      0101B     5    (correct)
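Both effects can be reproduced with the short C sketch below (an illustration only; the helper wrap4 emulates keeping a result on 4 signed bits):

#include <stdio.h>

/* Re-interpret the low 4 bits of x as a 4-bit two's complement value. */
static int wrap4(int x) { return ((x & 0xF) ^ 0x8) - 0x8; }

int main(void)
{
    /* Products of 4-bit operands need up to 8 bits; keeping only 4 bits is
     * valid only when the exact product fits in [-8, 7].                     */
    int p1 = -3 * 2;    /*  -6: wrap4(p1) == -6, still correct                */
    int p2 = -3 * 6;    /* -18: wrap4(p2) == -2, overflow, wrong              */
    printf("-3 * 2 = %d, kept on 4 bits: %d\n", p1, wrap4(p1));
    printf("-3 * 6 = %d, kept on 4 bits: %d\n", p2, wrap4(p2));

    /* 7 + 6 - 8 = 5: the intermediate sum 13 overflows 4 bits (it wraps to -3),
     * yet the final result is correct because it fits in the range.          */
    int t = wrap4(7 + 6);      /* -3, intermediate overflow                   */
    int r = wrap4(t + (-8));   /*  5, correct final result                    */
    printf("7 + 6 -> %d, then - 8 -> %d\n", t, r);
    return 0;
}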

3. Fractional representation

To avoid the overflow issue, the fixed point fractional representation limits (normalizes) the numbers to the [-1, +1) range.

Exception: (-1)*(-1) = +1, which does not belong to the [-1, +1) range! Overflow also remains possible when adding or subtracting.

Observations:
1. The result of a multiplication is truncated, which causes precision loss through the removal of the least significant bits.
2. The fractional representation on n bits is obtained by shifting the binary point n-1 positions to the left, so the representation contains a sign bit, the radix point and n-1 fractional bits:

    B(2) = b(0) . b(1) b(2) ... b(n-1)

with the decimal value

    B(10) = -b(0)*2^0 + b(1)*2^-1 + ... + b(n-1)*2^-(n-1)

3. The quantization step is q = 2^-(n-1).

The fractional representation is also known as the Qx representation, where x is the number of fractional bits; the total number of bits of the representation is x+1. The usual format for 16-bit fixed point DSPs is Q15. The location of the point is not actually stored anywhere; it is purely a programming convention.

Therefore, on a 16-bit representation, the range of numbers can be:

    [0, 65535]         - unsigned integer representation
    [-32768, +32767]   - signed integer representation
    [-1, +0.999969]    - fractional (Q15) representation

All these variants are used in DSP applications.

Fig. 6 The wheel of numbers for the 4-bit fractional representation
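Conversion between a real value and the Qx representation is just a scaling by 2^x. A minimal C sketch for Q15 (illustrative function names, assuming round-to-nearest and saturation at the format limits):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Convert a real value in [-1, 1) to Q15; the quantization step is 2^-15. */
static int16_t float_to_q15(double x)
{
    double scaled = round(x * 32768.0);          /* x * 2^15, rounded          */
    if (scaled >  32767.0) scaled =  32767.0;    /* saturate at the Q15 limits */
    if (scaled < -32768.0) scaled = -32768.0;
    return (int16_t)scaled;
}

static double q15_to_float(int16_t q)
{
    return q / 32768.0;                          /* q * 2^-15                  */
}

int main(void)
{
    double x = -8.9969653e-3;                    /* compare with example b) below */
    int16_t q = float_to_q15(x);
    printf("%f -> Q15 0x%04X -> %f\n", x, (uint16_t)q, q15_to_float(q));
    return 0;
}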

Observation: In order to add two binary numbers, they must be represented according to the same convention.

Examples: Represent in the following formats:

a) Q3, the numbers -0.375 and 0.75.

In fixed point DSPs the radix (binary) point does not physically exist; keeping track of it is the programmer's duty. Arithmetic operations on fractional or signed data use the same arithmetic and logic unit (ALU). The fractional representation is obtained by scaling the value by 2^x, where x is the number of fractional bits. Therefore:

    -0.375 = 1.101B    (-0.375 * 2^3 = -3 = 1101B)
     0.75  = 0.110B    ( 0.75  * 2^3 =  6 = 0110B)

b) Q15, the number X = -8.9969653E-003.

We compute X * 2^15 = -294.8 (~ -295), which written in two's complement gives X(Q15) = FED9h.

Multiply, using the 4-bit fractional representation:

c) -0.5 by 0.75.

Observation: Pay attention to the sign extension. The result has 6 fractional bits out of 8.

      1.100B = -0.5
    x 0.110B =  0.75
    -------------------
      0000 0000
      1111 1000
      1111 0000
      0000 0000
    -------------------
     11.101000B = -0.375    (note the supplementary sign bit)

Going back to the 4-bit representation (dropping the supplementary sign bit and the last 3 fractional bits):

    1 1.101 000B  ->  1.101B = -0.375, which is correct.

d) -0.5 by 0.625.

    1.100B * 0.101B = 11.101100B = -0.3125

Going back to the 4-bit representation:

    1 1.101 100B  ->  1.101B = -0.375, which is incorrect due to truncation (the exact product is -0.3125).

e) -0.25 by 0.5, in Q15 format.

      0.100 0000 0000 0000B =  0.5
    x 1.110 0000 0000 0000B = -0.25
    ------------------------------------------------
     11.11 1000 0000 0000 0000 0000 0000 0000B        (Q30 format)

We pass to the Q15 format by eliminating the supplementary sign bit and the 15 least significant bits, obtaining:

    1.111 0000 0000 0000B = -0.125, correct.

Advantage: Using binary fractions we obtain high speed in closed loop computations.

Disadvantage: The result may not be exact. In example d) above, the memory stores the value -6/16 (-0.375) instead of the exact product -5/16 (-0.3125): the bits with weights between 2^-4 and 2^-6 have been truncated.

It is worth mentioning that the 4-bit multiplication above is only an illustration of what a real C28x does on 32 bits; in that case truncation affects the bits with weights between 2^-32 and 2^-64, so in most cases only noise is truncated. Despite this, some feedback applications (such as IIR filters) can be affected by these errors and driven towards a certain level of instability. It is the programmer's duty to watch for this potential source of failure when using binary fractions.

4. Operations with numbers greater than 1

a) All coefficients of the algorithm are scaled so that they belong to the [-1, 1) range. The effect is an attenuated output signal that preserves the frequency response.

b) The property of multiplication with unity is used: A*B = (A-1)*B + B.

Example: Multiply the sample Xi = 0.625 by the coefficient 1.375, assuming the 4-bit (Q3) representation (see also the sketch at the end of this section).

    1.375 * 0.625 = (0.375 + 1) * 0.625 = 0.375 * 0.625 + 0.625 = 0.859375

       0.101B       0.625
     x 0.011B       0.375
    ----------------------------
      00.001111B    0.234375
    + 00.101000B  + 0.625
    ----------------------------
      00.110111B    0.859375

Going back to the 4-bit (Q3) representation: 0.110B = 0.75, rounded because of the limited number of bits.

c) The halving property of multiplication is used: A*B = (A/2)*B + (A/2)*B.

Example: 0.625 * 1.375 = 0.859375, represented in Q3:

    1.375 / 2 = 0.6875 = 0.1011B, which rounded to Q3 is 0.110B = 0.75
    0.625 * 0.75 = 0.46875
    0.46875 + 0.46875 = 0.9375 ~ 0.875 in Q3

    0.101B * 0.110B = 00.011110B
    00.011110B + 00.011110B = 00.111100B ~ 0.111B (in Q3 format) = 0.875
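The same trick scales directly to Q15. The C sketch below (illustrative only, not a vendor library API) multiplies x = 0.625 by the coefficient 1.375 using A*B = (A-1)*B + B; the Q15 product is taken as the 32-bit product shifted right by 15 bits, i.e. plain truncation as in the examples above.

#include <stdio.h>
#include <stdint.h>

/* Q15 * Q15 gives a Q30 product; dropping the redundant sign bit and the
 * 15 least significant bits brings the result back to Q15 (truncation).     */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t q30 = (int32_t)a * (int32_t)b;
    return (int16_t)(q30 >> 15);
}

/* Q15 addition with saturation at the format limits.                        */
static int16_t q15_add_sat(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s >  32767) s =  32767;
    if (s < -32768) s = -32768;
    return (int16_t)s;
}

int main(void)
{
    int16_t x       = (int16_t)(0.625 * 32768);   /* 0.625 in Q15             */
    int16_t coef_m1 = (int16_t)(0.375 * 32768);   /* (1.375 - 1) = 0.375      */

    /* x * 1.375 = x * 0.375 + x, every operand staying inside [-1, 1)        */
    int16_t y = q15_add_sat(q15_mul(coef_m1, x), x);
    printf("0.625 * 1.375 in Q15 = %f (exact: 0.859375)\n", y / 32768.0);
    return 0;
}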

Example: 0.625 * 1.375 = 0.859375, represent in Q3: 1.375/2 = 0.6875 10 = 0.1011 2 which in rounded Q3 is 0.110 2 = 0.75 10 0.625 10 * 0.75 10 = 0.46875 10 0.46875 10 + 0.46875 10 = 0.9375 10 ~ 0.875 10 0.101 2 * 0.110 2 = 00.011110 2 00.011110 2 + 00.011110 2 = 00.111100 2 ~ 0.111 2 (in Q3 format) 5. Floating point representation The core of a floating point processor is an arithmetical unit that supports floating point operations as stated in the IEEE 754/85 standard. A typical example for this class is the x86 family from Intel, starting with the 486 processor. Floating point processors are very efficient when operating with floating point data and allow a very large range of numerical computations. These processors are not that efficient in task control (bit manipulation, input/output control, interruption response) and besides they are pretty expensive. The IEEE 754/85 includes finite numbers of binary (2 base) or decimal (10 base) nature. The numerical value of the finite number will be given by the formula: (-1) s f b e, where b is the number base. Every number is described through 3 parameters: s-the sign (0 or 1), f-mantissa and e-exponent. For instance, if the sign is 1 (a negative number), the mantissa is 12345, the exponent is -3 and we consider base 10, then the number will be -12.345. Every possible finite value can be represented by a certain format are determined by the: numeration base, the maximum number of digits in the mantissa (that define the precision p) and the maximum value of the exponent, emax. The mantissa needs to be an integer number in the [0; b p -1] domain. The exponent has to be an integer so that 1-emax q+p-1 emax. The IEEE 754/85 standard defines 5 basic forms: three binary forms (that can code numbers on 32, 64 or 128 bits) and two decimals (can code numbers on 64 or 128 bits). The binary format for 32 bits simple precision in floating point describes floating point numbers as: 31 30 23 22 0 s e e e e e e e e f f f f f f f f f f f f f f f f f f f f f f f 1 sign bit, 8 bit exponent, 23 bit mantissa (fractional bits) S=sign bit e e=8 bits that represent the exponent f f=23 bits that represent the mantissa (fractional bits) Advantage: the exponent offers a wide dynamic range of representing numbers; Disadvantage: the precision of number representation depends on the exponent. 9

Floating point representation - obtaining the bits:

Sign bit:
    negative: bit 31 = 1
    positive: bit 31 = 0

Mantissa:
    M = 1 + m1*2^-1 + m2*2^-2 + ...,   where 1 <= M < 2

Exponent:
    an 8-bit value, stored with the offset (bias) +127

The value is then computed as:

    z = (-1)^S * M * 2^(E - OFFSET)

Examples:

1) Decode the number 0x3FE00000:

    0x3FE00000 = 0011 1111 1110 0000 0000 0000 0000 0000B
    S = 0
    E = 0111 1111B = 127
    M = (1).11000...B = 1 + 0.5 + 0.25 = 1.75
    Z = (-1)^0 * 1.75 * 2^(127-127) = 1.75

2) Represent Z = -2.5:

    S = 1
    2.5 = 1.25 * 2^1   =>   1 = E - OFFSET   =>   E = 128
    M = 1.25 = (1).01B = 1 + 0.25
    Binary: 1100 0000 0010 0000 0000 0000 0000 0000B = 0xC0200000

The floating point representation also has disadvantages:

Example *:
       x = 10.0          (0x41200000)
    +  y = 0.000000238   (0x347F8CF1)
    ------------------------------------
       z = 10.000000238      Wrong!

The value 10.000000238 cannot be represented in single precision floating point:

    0x41200000 = 10.000000000
                 10.000000238   <- cannot be represented
    0x41200001 = 10.000000950

Thus the number 10.000000238 is rounded to 10.000000000.
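The decoding rule used in example 1) can be written as a small C sketch (normalized numbers only; subnormals, infinities and NaNs are deliberately not handled here):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Decode a normalized IEEE 754 single precision pattern by hand:
 * z = (-1)^S * M * 2^(E - 127), with M = 1.f (implicit leading 1).           */
static double decode_float32(uint32_t w)
{
    int      s = (int)(w >> 31) & 0x1;
    int      e = (int)(w >> 23) & 0xFF;
    uint32_t f = w & 0x7FFFFFu;

    double m = 1.0 + f / 8388608.0;              /* 1 + f * 2^-23              */
    return (s ? -1.0 : 1.0) * ldexp(m, e - 127); /* apply the biased exponent  */
}

int main(void)
{
    printf("0x3FE00000 -> %f\n", decode_float32(0x3FE00000u));   /*  1.75      */
    printf("0xC0200000 -> %f\n", decode_float32(0xC0200000u));   /* -2.5       */
    return 0;
}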

Properties:
- To add two floating point numbers, they must first be brought to the same exponent; the obtained result is then rounded and normalized if needed.
- To multiply two floating point numbers, the exponent of the result is the sum of the exponents of the factors and the mantissas are multiplied; the obtained result is then rounded and normalized.

6. The IQ format

So far, fractional numbers have been represented with the binary point immediately after the MSB (the sign bit). In general, this point can be placed anywhere in the binary word, so the resolution and the range can be traded against each other.

    I - stands for the integer part
    Q - stands for the fractional part (quotient)

Advantage: the precision (resolution) is the same for all numbers in a given format.
Disadvantage: the dynamic range is limited compared with the floating point representation.

For a 32-bit word with 23 fractional bits the format is:

    bit 31                                                          bit 0
    s i i i i i i i i . q q q q q q q q q q q q q q q q q q q q q q q

with the bit weights

    -2^I, +2^(I-1), ..., +2^1, +2^0 . +2^-1, +2^-2, ..., +2^-Q

Example 1: Format I1Q3

    s.qqq   (4 bits)
    Most negative number:                -1.0   = 1.000B
    Most positive number:                +0.875 = 0.111B
    Negative IQ number closest to zero:  -2^-3  = 1.111B
    Positive IQ number closest to zero:  +2^-3  = 0.001B
    Range: [-1.0, +0.875]
    Resolution: 2^-3

Example 2: Format I3Q1

    sii.q   (4 bits)
    Most negative number:                -4.0  = 100.0B
    Most positive number:                +3.5  = 011.1B
    Negative IQ number closest to zero:  -2^-1 = 111.1B
    Positive IQ number closest to zero:  +2^-1 = 000.1B
    Range: [-4.0, +3.5]
    Resolution: 2^-1

Example 3: Format I1Q31

    s.qqq qqqq qqqq qqqq qqqq qqqq qqqq qqqq   (32 bits)
    Most negative number:                -1.0      = 1.000 0000 0000 0000 0000 0000 0000 0000B
    Most positive number:                1 - 2^-31 = 0.111 1111 1111 1111 1111 1111 1111 1111B
    Negative IQ number closest to zero:  -2^-31    = 1.111 1111 1111 1111 1111 1111 1111 1111B
    Positive IQ number closest to zero:  +2^-31    = 0.000 0000 0000 0000 0000 0000 0000 0001B
    Range: [-1.0, +1.0)
    Resolution: 2^-31

If we transpose Example * into the IQ format (e.g. I8Q24), we have:

       x = 10.0          (0x0A000000)
    +  y = 0.000000238   (0x00000004)
    ------------------------------------
       z = 10.000000238  (0x0A000004), a number that can be represented in the IQ format.

By using the IQ format instead of the floating point representation we can therefore obtain a higher precision in data representation.
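A minimal C sketch of this I8Q24 interpretation (a plain integer type plus scaling helpers with illustrative names; this is not the TI IQmath library API):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

typedef int32_t iq24_t;                        /* 1 sign + 7 integer + 24 fractional bits */

static iq24_t iq24_from_double(double x) { return (iq24_t)round(x * 16777216.0); } /* x * 2^24  */
static double iq24_to_double(iq24_t q)   { return q / 16777216.0; }                /* q * 2^-24 */

int main(void)
{
    iq24_t x = iq24_from_double(10.0);          /* 0x0A000000                   */
    iq24_t y = iq24_from_double(0.000000238);   /* rounds to 4 = 0x00000004     */
    iq24_t z = x + y;                           /* ordinary integer addition    */

    printf("x = 0x%08X, y = 0x%08X, z = 0x%08X (%.9f)\n",
           (uint32_t)x, (uint32_t)y, (uint32_t)z, iq24_to_double(z));
    return 0;
}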

7. Conclusions

Integer versus fractional numbers:
- Range: integer numbers have a maximum range determined by the number of bits on which they are represented; fractional numbers lie in the [-1, +1) range.
- Precision: integer numbers have a maximum precision of 1; fractional numbers have a precision determined by the number of bits.

Fixed point versus floating point arithmetic: floating point arithmetic is more flexible than fixed point arithmetic; its main advantage is access to a greater dynamic range for the represented data.

Examples of processors:
- Floating point: Intel Pentium series, Texas Instruments C67xx DSP
- Fixed point: Motorola HC68x, Infineon C166, Texas Instruments TMS430, TMS320C5000, C2000

Fixed point processors:
- can represent integer numbers (integer arithmetic), for control and address computation (signals not implied), and fractional numbers, for signal processing;
- are low priced;
- are easy to program: the simulation step can detect any situation in which a correction of the result is necessary to avoid overflow.

Floating point processors:
- represent numbers with a mantissa and an exponent: mantissa * 2^exponent;
- offer a highly flexible technology;
- have a large dynamic range for number representation;
- are easy to program: the programmer need not account for overflow of the accumulator;
- disadvantage: they are costly.

8. Homework

8.1 Below are the 15 coefficients of a high pass FIR filter. Convert the coefficients to the Q15 format.
(h1=h15=8.99696e-3; h2=h14=-7.86406e-3; h3=h13=-3.34992e-2; h4=h12=-6.53486e-2; h5=h11=-9.8897e-2; h6=h10=-0.12856; h7=h9=-0.14897; h8=0.84375)

8.2 Following the three examples given for the IQ format, specify the corresponding values for the I8Q24 format.

8.3 Write the value of the number represented on 32 bits in floating point format by 0xBFB00000.

8.4 More homework: study the following examples of addition, subtraction and multiplication of floating point numbers.

Example 1. Add the numbers 123456.7 and 101.7654:

    123456.7 = 1.234567 * 10^5
    101.7654 = 1.017654 * 10^2 = 0.001017654 * 10^5

Detailed:

       e = 5;  z = 1.234567          (123456.7)
    +  e = 2;  z = 1.017654          (101.7654)
    ------------------------------------------------
       e = 5;  z = 1.234567
    +  e = 5;  z = 0.001017654       (after shifting)
    ------------------------------------------------
       e = 5;  sum = 1.235584654     (the real sum is 123558.4654)

Going back to the 7-digit representation, the result is e = 5; sum = 1.235585 (i.e. the final sum is 123558.5), so the result was rounded and normalized. The last three digits (654) were lost; this is the rounding error.

Example 2. Subtract 123456.659 from 123457.1467:

    123457.1467 = 1.234571 * 10^5
    123456.659  = 1.234567 * 10^5

Detailed:

       e = 5;  z = 1.234571          (123457.1467)
    -  e = 5;  z = 1.234567          (123456.659)
    ------------------------------------------------
       e = 5;  d = 0.000004          (the real difference is 0.000004877 * 10^5 = 0.4877)

Going back to the 7-digit representation, the result is e = -1; d = 4.000000 (i.e. the final result is 0.4), that is, the rounded and normalized result. The rounding error in this case is approximately 20%.

Example 3. Multiply the numbers 4734.612 and 541724.2:

       e = 3;  z = 4.734612             (4734.612)
    x  e = 5;  z = 5.417242             (541724.2)
    ------------------------------------------------
       e = 8;  prod = 25.648538980104   (the real result is 2564853898.0104)
       e = 8;  prod = 25.64854          (after rounding)
       e = 9;  prod = 2.564854          (after normalization)
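The decimal examples above use a 7-digit mantissa; the same loss of significance appears with binary single precision, as the short C check below shows (the exact digit counts differ, but the cancellation effect is analogous):

#include <stdio.h>

int main(void)
{
    /* Both operands already carry representation roundoff, so the leading
     * digits cancel and the relative error of the difference is large.       */
    float a = 123457.1467f;
    float b = 123456.659f;
    double exact = 123457.1467 - 123456.659;     /* 0.4877                    */

    printf("single precision difference: %.7f\n", (double)(a - b));
    printf("exact difference           : %.7f\n", exact);
    return 0;
}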