
Quirks and SHARCs
When 1 plus 1 equals 2, but 2 times 2 does not always equal 4

Mike Smith, Electrical and Computer Engineering, University of Calgary, Calgary, Alberta, Canada T2N 1N4
Email: smithmr@ucalgary.ca  Phone: 403-220-6142  Fax: 403-282-6855
Developed for CCI, March 2000

When developing, or using, a new embedded system it is important for the development team to realize that Murphy's Law is always lurking just around the corner. What is really happening when things appear to be going well in the early stages? Simple -- the development team does not understand the project well enough to recognize that things are already going astray! This article is a personal story about one such situation.

Several years back I had to switch from an advanced RISC processor (AMD 29050) to the Analog Devices ADSP-21061 SHARC processor for DSP applications. There was an initial dramatic learning curve associated with the new SHARC architecture, particularly its super-scalar instruction capability and independent data address generating ALUs. However, many of the features that I particularly liked in the AMD 29050 processor were again present in the ADSP-21061.

The first year of teaching my applications-oriented advanced microprocessor course on the 21061 went well. The C-like assembly code syntax meant fewer simple errors were generated than with the 29K RISC assembly syntax. The SHARC EZ-Lite evaluation boards had on-board CODECs, which allowed the development of interesting audio projects. That first year, I made use of the 21K's floating point capability for development of DSP algorithms without having to worry about the overflow protection needed in integer applications.

The confidence was there to get more adventuresome by the start of my second year of teaching with the SHARC board. Certain items from the initial offering were recognized as not particularly useful and were dropped, leaving time for other concepts. In particular, the class started to explore using the 21061 processor's integer capabilities to demonstrate DSP programming techniques for processors without floating point instructions. Once headed down this route, it did not take a particularly difficult algorithm involving integers to reach a situation where we realized that things were not working the way we had expected!

Figure 1 is a screen capture from the White Mountain VisualDSP SHARC simulator environment. The results are shown for several very basic assembly-code operations. The bit patterns in the data registers have been interpreted using both integer and floating point formats.

Figure 1. Basic floating-point addition and multiplication operations on the SHARC processor work as expected. However, even the simplest integer addition and multiplication operations apparently give incorrect results.

The floating-point results shown in the figure make sense:

2.0 + 3.0 = 5.0
2.0 * 3.0 = 6.0

However, the integer addition 2 + 3 = 5 has, somehow, been scaled by 0x100 to become the operation

0x200 + 0x300 = 0x500

Although unexpected, the addition operation at least makes some sense, which is more than can be said for the 21K integer multiplication operation, which appears to imply that

0x200 * 0x300 = 0

As always happens, once you realize that there is one item about a processor you don't completely understand, the floodgates just open. The AMD 29K processor has a floating-point instruction for division calculations. It may be 11 times slower than other 29K instructions, but at least it exists, which is certainly not the case for a 21K floating point division instruction! The 21K instruction set does include a reciprocal instruction, but that does not appear to work as expected either!

A third issue involves doing some fancy footwork with the internal operations of the 21K processor to do something neat -- fast, single cycle floating point division. With any integer processor, you can get fast division by powers of 2 by using arithmetic right shift operations, as shown by the following 68K instruction:

ASR #4, D0    gives D0 / 16

By understanding the bit representation of floating point numbers you can perform comparable tricks to get fast division (scaling) with floating point powers of 2.
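In C, the same trick reads as follows (a minimal sketch; it assumes, as on virtually all current compilers, that >> on a signed integer is an arithmetic shift, and it highlights that shifting rounds toward negative infinity while / rounds toward zero):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int32_t d0  = 320;
        int32_t neg = -321;

        /* Arithmetic right shift by 4 divides by 16. */
        printf("%d >> 4 = %d\n", d0, d0 >> 4);             /* 20 */

        /* Negative values round toward -infinity, unlike /. */
        printf("%d >> 4 = %d, but %d / 16 = %d\n",
               neg, neg >> 4, neg, neg / 16);              /* -21 vs -20 */
        return 0;
    }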

This knowledge could be particularly useful for algorithms involving the FFT (fast Fourier transform). The output from an inverse N-point FFT will require scaling by a factor of N. This fast division operation requires checking and manipulating the binary component bits of the floating-point number. It is a little more complicated to achieve this on the 21K than it was on the AMD 29K and Motorola 96000 processors (see CCI-52, 1994). The deep 21K instruction pipeline means there are several cycle-stealing, delayed branch instructions associated with each program decision (conditional JUMP instruction). Some of the problems can be overcome using the 21K super-scalar and conditional compute instructions.

To get the most out of any processor, you have to properly understand the consequences of all aspects of that processor's architecture. To this point, the reader has been introduced to some apparently contradictory characteristics of the 21K architecture. In the next few sections we shall show that there is actually a valid design decision underlying every one of these apparent ADSP SHARC processor quirks!

The need to make 0x2 appear as 0x200

Because of the characteristics of the 68K processor architecture, a data register showing the bit pattern 0x800070FF can be interpreted in a number of different ways. For BYTE (8-bit) operations the register contains the negative value 0xFF. The register can also be considered as containing the positive value 0x70FF (16-bit WORD) or the negative value 0x800070FF (32-bit LONG WORD).

In Figure 1, the integer 0x2 appears as the hexadecimal number 0x200 as a direct consequence of the internal architecture of the SHARC's data registers and ALU. All data registers are 40 bits wide to allow storage of numbers in both the IEEE 754/854 standard 32-bit single-precision floating-point format and an extended-precision version of the same format. Unlike the 68K processor, the significant bits of 21K data values are at the high end of the register bit pattern. This means that the last two nibbles stored in a data register should be ignored during interpretation, or when the value is written out to 32-bit memory. The SHARC internal memory architecture can be software configured to handle both 32-bit and 40-bit data memory operations.
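As a toy model (plain C, not SHARC code), the display quirk amounts to nothing more than where the 32 significant bits sit inside the 40-bit register:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Model a 40-bit SHARC register in the low 40 bits of a
           64-bit integer, with the 32 significant bits occupying
           bits 39..8. */
        uint64_t reg = (uint64_t)0x2 << 8;              /* write integer 2 */

        printf("register displays 0x%llX\n",
               (unsigned long long)reg);                /* 0x200 */
        printf("value it holds    0x%llX\n",
               (unsigned long long)(reg >> 8));         /* 0x2 */
        return 0;
    }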

Getting 2 times 3 to equal 6

Since there are 32 significant bits hidden at the top of every 40-bit SHARC register, it is easy to understand why the operation

0x2 + 0x3 = 0x5

will appear as the operation on register values

0x200 + 0x300 = 0x500

when the register contents are examined. However, it does not explain why the operation

0x2 * 0x3

appears to give the result 0x0.

Figure 2 again shows the program from Figure 1. However, in this case the display mode for the VisualDSP development environment is set to mixed rather than source. Mixed mode means that you get to see both what you wanted to do (source), and the bit patterns (assembly code) associated with how the processor is actually interpreting those requests! The reason for the strange behaviour is revealed. We did not perform the intended standard multiplication operation

R0 * R1

but instead unintentionally activated an operation called

R0 * R1 (SSF)

A quick glance at the SHARC User's Manual indicates that SSF stands for the signed-signed fractional form of the integer multiplication instruction.

Figure 2. The result of basic SHARC operations can be found in the upper 32 bits of the 40-bit data register. Using the VisualDSP development environment's mixed mode display format, it can be seen that the default SHARC integer multiplication instruction expects a signed-signed fractional (ssf) number representation rather than the standard two's complement integer format.

The fact that an instruction comes in both signed and unsigned forms is a familiar concept on many processors. The 68K has two types of multiplication operations: MULS (signed) and MULU (unsigned). Many processors have three forms of ADD operations. For example, the 29K processor has ADDU and ADDS instructions, which cause exceptions to be thrown when the unsigned and signed number representations, respectively, overflow during the ADD operation.

There is a third, plain ADD instruction for when you don't need to worry about overflows.

However, the concept of a fractional representation within an integer format is probably something that will require many an overworked developer to head back to old course notes. For the moment we can simply say "whatever fractional means, it ain't what we want" and turn it off by using the explicit signed-signed integer form of the 21K multiplication instruction

R0 * R1 (SSI)

As can be seen in Figure 3, we now have the 21K behaving like a normal processor. The 32-bit operations 2 + 3 = 5 and 2 * 3 = 6 both work, even if the direct interpretation of the 32-bit values is distorted a little by their storage in the 40-bit SHARC data registers.

Figure 3. The SHARC processor starts acting like any other processor after activating the signed-signed integer (ssi) multiplication operation rather than the default signed-signed fractional (ssf) format.
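To see what "fractional" means here (and why 2 times 3 gave 0), the two multiplication flavours can be modelled in a few lines of C, assuming the usual 1.31 fixed-point convention in which the 32-bit pattern N represents the fraction N / 2^31 -- a sketch of the idea, not a bit-exact SHARC model:

    #include <stdio.h>
    #include <stdint.h>

    /* Signed-signed integer multiply (ssi): plain two's complement,
       keeping the low 32 bits of the product. */
    static int32_t mul_ssi(int32_t a, int32_t b)
    {
        return (int32_t)((int64_t)a * b);
    }

    /* Signed-signed fractional multiply (ssf): treat both operands
       as 1.31 fractions, so keep the high 32 bits of the shifted
       64-bit product instead. */
    static int32_t mul_ssf(int32_t a, int32_t b)
    {
        return (int32_t)(((int64_t)a * b) >> 31);
    }

    int main(void)
    {
        printf("2 * 3 (ssi) = %d\n", mul_ssi(2, 3));    /* 6 */

        /* As 1.31 fractions, 2 and 3 are around 1e-9; their
           product underflows to 0 -- the Figure 1 surprise. */
        printf("2 * 3 (ssf) = %d\n", mul_ssf(2, 3));    /* 0 */

        int32_t half = 1 << 30;                         /* 0.5 in 1.31 */
        printf("0.5 * 0.5 (ssf) = 0x%08X\n",
               (uint32_t)mul_ssf(half, half));          /* 0x20000000 = 0.25 */
        return 0;
    }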

Float operations via integer instructions

Another useful piece of information has become clear from the mixed mode displays of Figures 2 and 3. Floating point operations of the form F4 = 2.0 are actually implemented through integer assignment operations of the form R4 = bit pattern for the constant. The chip designers decided there was no need to use up precious op-code bits to describe a specific floating point assignment instruction when the assembler is perfectly capable of using an integer assignment in conjunction with generating the bit pattern needed to represent a floating point number.

This decision can have a nasty consequence for the developer who is in a hurry and not following a code review process. Suppose you write F4 = 2, with the 2 written as an integer rather than a float (2.0). A C language compiler would do the equivalent of automatically casting this expression as F4 = (float) 2 to give the programmer the intended result, F4 = 2.0. However, there is no equivalent checking of context in the White Mountain 21K assembler. The programmer gets R4 = bit pattern of integer 2, which leaves F4 holding a tiny floating point value around 10^-45, which was not what was intended. An example of this can be seen in the different interpretations of the contents of the R0 register in Figures 1 to 3.

This problem is even more insidious when initializing floating point arrays for filter coefficients with syntax of the form

.var array[3] = {1.0, 2.0, 3};

Two coefficients will be correctly initialized, but the third coefficient will be, unintentionally, far too small. It would be nice to see the assembler operation changed so that at least warning messages were issued. Perhaps a language extension could be added to allow automatic type casting, with

.var float array[3] = {1.0, 2.0, 3};

putting the values 1.0, 2.0 and 3.0 into the array.
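The pitfall is easy to reproduce on any IEEE 754 machine with a little type punning. This C sketch mimics what the assembler does with F4 = 2.0 versus F4 = 2:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void)
    {
        float f;
        uint32_t bits;

        /* F4 = 2.0 -- the assembler loads the float's bit pattern. */
        f = 2.0f;
        memcpy(&bits, &f, sizeof bits);
        printf("2.0 is stored as 0x%08X\n", bits);      /* 0x40000000 */

        /* F4 = 2 -- the raw integer 2 lands in the register and is
           later interpreted as a float: a denormal near 3e-45. */
        bits = 2;
        memcpy(&f, &bits, sizeof f);
        printf("integer 2 read back as a float = %g\n", f);
        return 0;
    }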

Experiments with fractions

We can't just ignore this fractional integer format. It must be particularly useful within the context of DSP applications, because otherwise why would it be made the default mode for 21K multiplication operations! The Analog Devices 2106x User's Manual does indicate that the processor can support two 32-bit fixed point formats. However, the terms used to explain these formats in the manual are rather terse. This suggests that the manual writers believe fractional fixed point is something we should all already know about, rather than something they need to explain in detail. Unless text-books are close by, this blatant lack of understanding of the basic need to provide detailed examples of anything non-obvious to the average developer is best solved by a little further experimentation.

In Figure 4, an attempt has been made to put fractional numbers into integer registers and perform basic addition and fractional multiplication operations. The results in the data registers are interpreted using the signed-fractional format available in the VisualDSP development tool.

Figure 4. Activating the signed fractional interpretation of the bit patterns stored in the integer data registers shows that signed-signed fractional operations are internally consistent, even if they don't give the anticipated results.

We can see that we are heading in the right direction. When compared to the initial signed fractional contents of the registers, the final contents of the registers correctly indicate that

0.4867 + 0.4890 = 0.9757 and 0.4867 * 0.4890 = 0.2380

The only problem is that we had actually been hoping to convince the processor to perform the fractional operations

0.2 + 0.3 = 0.5
0.2 * 0.3 = 0.06

Fractional Integers

From Figure 4 it can be seen that setting the integer registers to fractional values (R0 = 0.2 and R1 = 0.3) does not lead to the corresponding signed fractional integer values. It is strange that there is not a built-in assembler directive to generate the bit patterns for fractional integer values; miscalculating the necessary hexadecimal bit patterns for fractional values has been a constant source of errors in my group.

However, a sensible relationship between fractional integer values and their hexadecimal representations has appeared. Registers R5 to R7 contain the results of applying a series of arithmetic shifts (ASHIFT) to the largest negative 32-bit integer value (0x80000000) placed into register R4. Using a negative shift count with the 21K ASHIFT operator produces an arithmetic right shift, which is equivalent to a signed integer division by 2.

32-bit HEX VALUE    FRACTIONAL VALUE
0x80000000          -1.0
0xC0000000          -0.5
0xE0000000          -0.25
0xF0000000          -0.125
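Incidentally, the odd 0.4867 and 0.4890 values in Figure 4 appear to be just the IEEE 754 bit patterns of 0.2 and 0.3 being re-read as 1.31 fractions (0x3E4CCCCD / 2^31 is approximately 0.4867). What was wanted instead is a conversion in the style of the hypothetical helper below (scale by 2^31 and round):

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    /* Convert a value in [-1.0, 1.0) to its 1.31 fractional
       bit pattern: scale by 2^31 and round. */
    static int32_t float_to_q31(double x)
    {
        return (int32_t)lround(x * 2147483648.0);   /* x * 2^31 */
    }

    int main(void)
    {
        printf("0.2    -> 0x%08X\n", (uint32_t)float_to_q31(0.2));    /* 0x1999999A */
        printf("0.3    -> 0x%08X\n", (uint32_t)float_to_q31(0.3));    /* 0x26666666 */
        printf("-0.5   -> 0x%08X\n", (uint32_t)float_to_q31(-0.5));   /* 0xC0000000 */
        printf("-0.125 -> 0x%08X\n", (uint32_t)float_to_q31(-0.125)); /* 0xF0000000 */
        return 0;
    }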

A similar pattern of bits will be familiar to developers who have spent time hooking up a 12-bit A/D converter to the data bus of a processor:

12-bit HEX VALUE    FRACTION OF FULL-SCALE A/D VOLTAGE
0x800               -1.0
0xC00               -0.5
0xE00               -0.25
0xF00               -0.125
0x000               0.0
0x100               0.125
etc.

This equivalence suggests that one way of looking at fractional-signed integers on the 21K is to interpret the bit pattern as representing a fraction of the maximum magnitude that can be represented in a signed, two's complement, number representation. Figure 5 shows that this is a self-consistent interpretation, with

-0.5 * -0.5 (ssf) = 0.25, -0.5 * 0.5 (ssf) = -0.25 and -1.0 * 0.5 (ssf) = -0.5

However, Figure 5 also hints that a deeper understanding of the fractional integer representation is needed for proper algorithm development, since

-1.0 * -1.0 (ssf) = -1.0

Fixing this problem in an algorithm by using the 80-bit accumulator associated with the SHARC integer multiplier is the subject of some future article.

Figure 5. Most signed-signed fractional multiplication operations lead to the anticipated result. However, the signed-signed fractional multiplication of -1 and -1 leads to the invalid result of -1, a problem that can often be solved in an algorithm by using the SHARC 80-bit accumulator associated with the integer multiplier.
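The -1.0 * -1.0 quirk falls straight out of the 1.31 arithmetic: +1.0 is the one value a 1.31 fraction cannot represent, so the product wraps back around to -1.0. A small standalone check (same mul_ssf() convention as the earlier sketch; the final wrap is what two's complement hardware does):

    #include <stdio.h>
    #include <stdint.h>

    static int32_t mul_ssf(int32_t a, int32_t b)
    {
        return (int32_t)(((int64_t)a * b) >> 31);
    }

    int main(void)
    {
        int32_t minus_one = INT32_MIN;      /* 0x80000000 = -1.0 in 1.31 */

        /* (-2^31 * -2^31) >> 31 = +2^31, which does not fit in 32
           bits and wraps to -2^31: so -1.0 * -1.0 "equals" -1.0. */
        printf("-1.0 * -1.0 (ssf) = 0x%08X\n",
               (uint32_t)mul_ssf(minus_one, minus_one));
        return 0;
    }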

21K Floating Point Division

There is an 11-cycle floating-point division instruction, FDIV, present on the AMD 29050 processor. However, that instruction complicates 29K assembly coding, as it is far slower than other 29K floating point operations and much more difficult to pipeline efficiently. A design feature of the 21K processor is that the majority of its instructions complete in a single cycle, leaving no place for a slow floating-point division operation.

The presence of the 21K reciprocal instruction RECIPS suggests a two-stage division operation. First the reciprocal of the denominator is calculated in one cycle, and then the numerator is multiplied by this reciprocal in a second cycle. However, as can be seen from Figure 6, this approach just does not seem to work right using the RECIPS instruction. The approach does, however, work if the reciprocal is directly evaluated by hand.

This strange behaviour is a result of the fact that a ROM look-up table is needed for reciprocals to be calculated in a single cycle. High accuracy reciprocals would require an enormous amount of silicon to implement. Instead, a limited accuracy approximation (seed) of the reciprocal is calculated -- for more information see the SHARC user manual, page B-39. Comparing the hexadecimal representations of the reciprocal seed (F2) and the true reciprocal (F4) reveals the limited accuracy of the result from the RECIPS operation.

Figure 6. The SHARC RECIPS instruction provides a limited accuracy reciprocal seed value in a single cycle from a ROM look-up table.

Figure 7 shows how a floating-point division can be obtained in 8 cycles using the super-scalar capability of the SHARC processor and an iterative convergence algorithm (see the reference by Cavanagh). A +/-1 LSB accurate single precision quotient can be obtained after only 6 cycles. The strange choice of data registers in the algorithm is a direct consequence of the SHARC architecture, which only allows super-scalar operations between certain banks of data registers.
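A classic convergence scheme of the kind Cavanagh describes is the Newton-Raphson reciprocal iteration x' = x * (2 - d*x), in which each pass roughly doubles the number of correct bits. Here is a C sketch that simulates a low-accuracy seed (standing in for the RECIPS look-up table) by keeping only the top few fraction bits of the true reciprocal:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* Simulate a RECIPS-style low-accuracy seed by keeping only
       the top 8 fraction bits of the true reciprocal. */
    static float seed_recip(float d)
    {
        float r = 1.0f / d;
        uint32_t bits;
        memcpy(&bits, &r, sizeof bits);
        bits &= 0xFFFF8000u;           /* discard the low 15 fraction bits */
        memcpy(&r, &bits, sizeof r);
        return r;
    }

    int main(void)
    {
        float n = 2.0f, d = 3.0f;      /* compute n / d */
        float x = seed_recip(d);

        /* Newton-Raphson: the error is squared on every pass, so
           two passes recover full single precision. */
        for (int i = 0; i < 3; i++) {
            printf("pass %d: n/d = %.9f\n", i, n * x);
            x = x * (2.0f - d * x);
        }
        return 0;
    }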

Figure 7. A convergence algorithm is used to calculate a floating-point division in 8 cycles using the super-scalar SHARC instructions.

Custom Division -- Integer and Floating Point

As discussed in the previous section, a fast, accurate division instruction would require considerable silicon. Since most algorithms involve only a few divisions, a reasonable compromise is to have a not-so-fast instruction (AMD 29K) or an iterative procedure (ADSP 21K) available. One exception to this rule is that divisions by powers of 2, e.g. 4, 8, 16 etc., happen frequently. Such operations are needed to scale integer inputs to ensure that the algorithm does not overflow its number representation (see CCI-???, December 2000). Floating point scaling by powers of 2.0 is necessary for outputs from algorithms such as the inverse Fourier transform.

On the integer side, single-cycle 21K arithmetic left and right shifts can handle scaling by powers of 2, as was demonstrated in Figure 4.

// Fast integer division R0 = R1 / 16
R0 = ASHIFT R1 BY -4;

However, floating-point numbers are represented in a far more complicated manner, using three different bit-fields within a 32-bit register. This means that the shift approach used for integers must be changed to an equivalent, but very different, operation to achieve floating-point scaling. This operation requires a detailed understanding of the IEEE floating point number representation.

Figure 8 shows the 32-bit representation of the three fields of a floating point (FP) number:

s    -- the sign field
bexp -- the biased exponent field
frac -- the fractional field

Figure 8. The representation of an IEEE standard floating point number takes 33 bits: one bit for the sign (s, bit 31), 8 bits for the biased exponent (bexp, bits 30-23), and 23 bits for the fractional field (frac, bits 22-0). The 33rd normalization bit is James Bonded -- hidden rather than stored.

There is a 33rd normalization bit that is James Bonded -- hidden but not stored. Every valid FP number can be represented using this format:

value = (-1)^s x 1.frac x 2^(bexp - 127)

Figure 9 illustrates the transformation of the decimal number 34.0 through to its storage as an FP value in hexadecimal format. Table 1 shows the IEEE standard representation of pairs of floating point numbers that differ by a factor of 16.0. When broken into the three floating point fields, it is easy to see that pairs of floating point numbers scaled by a factor of 16.0 differ by a fixed value of 4 in their biased exponents. With this information, a fast floating-point scaling operation can be handled through a single cycle integer subtraction that directly adjusts the bexp bits of a floating-point number.

// Setup of BEXP adjustment factor
R0 = 4;
R0 = ASHIFT R0 BY 23;
F4 = 1023.4;

// Integer operation to perform
// a single cycle FP division by 16.0
R4 = R4 - R0;

Floating point scaling via integer operations can, in principle, be implemented on any processor. BUT does it really work? Figure 10 shows a series of floating point numbers divided by 16.0 in a single cycle, rather than the eight cycles needed for a standard SHARC division. As can be seen in the figure, the operation works perfectly well for scaling the numbers 4.0, -2.0 and 1.0, but suffers a significant problem when scaling 0.0.
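The same trick can be sketched in C with type punning (a hypothetical scale_by_16() helper; note that it deliberately omits the guard tests discussed below, which is exactly the flaw Figure 10 exposes):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* Divide a float by 16.0 by subtracting 4 from its biased
       exponent field -- a single integer subtraction. */
    static float scale_by_16(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        bits -= 4u << 23;                   /* bexp = bexp - 4 */
        memcpy(&f, &bits, sizeof f);
        return f;
    }

    int main(void)
    {
        printf("%g %g %g\n", scale_by_16(4.0f), scale_by_16(-2.0f),
               scale_by_16(1023.4f));       /* 0.25 -0.125 63.9625 */
        printf("0.0 scales to %g (broken!)\n", scale_by_16(0.0f));
        return 0;
    }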

Conversion to binary value
34.0 = %100010

Conversion to 1.frac binary format
%100010 = %1.00010 * 2^5

Conversion to biased exponent format
%1.00010 * 2^5 = %1.00010 * 2^(132 - 127)

Identification of the 3 IEEE FP fields
s = %0 = 0x0
bexp = 132 = 0x84 = %1000 0100
frac = %000 1000 0000 0000 0000 0000 = 0x080000

Representation of 34.0 in 32 bits
= %0 1000 0100 000 1000 0000 0000 0000 0000
= %0100 0010 0000 1000 0000 0000 0000 0000
= 0x42080000

Figure 9. The decimal number 34.0 goes through a series of stages to identify the three floating point bit fields before being stored as the 32-bit value 0x42080000.

Number     Internal Hex         FP fields
           FP Representation    s    bexp    frac
1.0        0x3F80 0000          0    0x7F    0x00 00 00
16.0       0x4180 0000          0    0x83    0x00 00 00
-1.0       0xBF80 0000          1    0x7F    0x00 00 00
-16.0      0xC180 0000          1    0x83    0x00 00 00
63.9625    0x427F D99A          0    0x84    0x7F D9 9A
1023.4     0x447F D99A          0    0x88    0x7F D9 9A

Table 1. Floating point numbers that differ by a scaling factor of 16.0 have 32-bit representations with biased exponent fields, bexp, that differ by 4. All the other fields remain the same.
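The Figure 9 walk-through is easy to double-check on any IEEE 754 machine:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void)
    {
        float f = 34.0f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);

        printf("34.0f = 0x%08X\n", bits);   /* 0x42080000 */
        printf("s = %u, bexp = 0x%02X, frac = 0x%06X\n",
               bits >> 31,                  /* sign bit        */
               (bits >> 23) & 0xFFu,        /* biased exponent */
               bits & 0x7FFFFFu);           /* fraction field  */
        return 0;
    }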

Figure 10. Fast floating point division by a power of 2 can be implemented via a single cycle integer subtraction rather than an 8-cycle iterative division. However, additional checks must be added for accuracy.

The problem with the number 0.0 is that its biased exponent is too small to allow a valid FP number to be generated after 4 is subtracted to perform the fast scaling operation. This problem did not occur with the integer scaling operation using arithmetic shifts: if the scaling factor was too large, then all the significant bits were shifted out of the value, automatically leaving 0. Something equivalent needs to happen for the fast floating point operations. The scaling approach works all the time if you can guarantee that every number you use has a magnitude greater than 2^(p-127) (where p is the power of 2 by which you are scaling).

Figure 11 shows two versions of a modified scaling operation incorporating the necessary tests. The first algorithm is slow, as the SHARC's instruction pipeline is exposed by the conditional jumps; it takes 6 cycles if the value is large and 8 cycles if the value is small. The second algorithm makes use of the SHARC's conditional compute and super-scalar statements to avoid pipeline stalls (3 cycles).

Figure 11. Tests coded into the first scaling algorithm expose the SHARC's instruction pipeline, taking 6 cycles if the value is large and 8 cycles if the value is small. The second algorithm makes use of the SHARC's conditional compute and super-scalar statements to avoid pipeline stalls (3 cycles).
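In C, the equivalent guard is a single test on the biased exponent field before the subtraction (a hypothetical safe_scale_by_16() extending the earlier sketch; the Figure 11 SHARC version replaces the branch with conditional compute instructions):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* Guarded version: if the biased exponent is too small to
       survive the subtraction (bexp <= 4 for division by 16),
       flush the result to a signed zero -- just as an arithmetic
       shift flushes a small integer to 0. */
    static float safe_scale_by_16(float f)
    {
        uint32_t bits, bexp;
        memcpy(&bits, &f, sizeof bits);

        bexp = (bits >> 23) & 0xFFu;
        if (bexp <= 4u)
            bits &= 0x80000000u;            /* keep only the sign */
        else
            bits -= 4u << 23;               /* bexp = bexp - 4 */

        memcpy(&f, &bits, sizeof f);
        return f;
    }

    int main(void)
    {
        printf("4.0 / 16 = %g\n", safe_scale_by_16(4.0f));  /* 0.25 */
        printf("0.0 / 16 = %g\n", safe_scale_by_16(0.0f));  /* 0    */
        return 0;
    }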

We have now implemented an accurate floating-point scaling operation which, at 3 cycles, works faster than the standard 8-cycle SHARC division. Was it worth the effort? Nah! If you really want to scale floating point registers F0, F2, F4 and F6 by 16.0, then the simplest algorithm involving single cycle operations is

// Multiply by the reciprocal of 16.0
F8 = 0.0625;
F1 = F0 * F8;
F3 = F2 * F8;
F5 = F4 * F8;
F7 = F6 * F8;

But you've got to admit it is a neat party trick to know how to change bexp!

Conclusion

In this article a number of the characteristics of the Analog Devices 21061 SHARC processor were discussed. These included the internal representation of integers, an unexpected default setting for integer multiplication operations, and a brief introduction to fractional integers. On the floating-point side, a technique was discussed for performing a floating-point division in the absence of a fast FDIV instruction. Details of a faster, custom, floating-point scaling operation were demonstrated using conditional compute and super-scalar instructions.

Notes

The James Bond pun should be read while thinking in an English accent! Quirks and Quarks is a long-running CBC science program, now on the Web.

Acknowledgments

Thanks go out to Con Korikis (Analog Devices University Support) and Tony Moosey (SHARC DSP Tools Support).

References

ADSP-2106x SHARC User's Manual, 2nd Edition, Analog Devices, 1996.
J. Cavanagh, Digital Computer Arithmetic, McGraw-Hill, page 284, 1984.

About the Author

Mike is a professor at the University of Calgary, Canada, where he teaches, and does research in, introductory and advanced microprocessor topics. He can be reached at smithmr@ucalgary.ca.