
Engineer To Engineer Note EE-186

Technical Notes on using Analog Devices' DSP components and development tools
Contact our technical support by phone: (800) ANALOG-D or e-mail: dsp.support@analog.com
Or visit our on-line resources http://www.analog.com/dsp and http://www.analog.com/dsp/ezanswers

Extended-Precision Fixed-Point Arithmetic on the Blackfin Processor Platform

Contributed by DSP Apps, May 13, 2003

Introduction

The Blackfin Processor platform was designed to efficiently perform 16-bit fixed-point arithmetic operations. There are times, however, when it may become necessary to increase accuracy by extending precision up to 32 bits. The first part of this document describes an extended-precision, fixed-point arithmetic technique that can be emulated on Blackfin Processors using the native 16-bit ALU instructions. The second part illustrates Blackfin Processor assembly implementations of the 31- and 32-bit-accurate FIR filters. An accompanying source code package contains full FIR and IIR assembly programs.

Background

Extended-precision arithmetic is a natural software extension for 16-bit fixed-point processors. In machines with 16-bit register files, two registers can be used to represent one 31-bit or 32-bit fixed-point number. Blackfin Processors are ideally suited for extended-precision arithmetic, because the register file is based on 32-bit registers, which can be treated either as 32-bit entities or as two 16-bit halves. Before getting into specific DSP algorithms, it is important to see how basic arithmetic operations can be implemented with extended precision.

Addition

The Blackfin Processor instruction set contains a single-cycle 32-bit addition of the form R0 = R1 + R2. Therefore, no emulation is necessary for adding two 32-bit numbers. Subtraction of 32-bit numbers is also natively supported in the same form as addition: R0 = R1 - R2. Note that, in these addition and subtraction instructions, any combination of data registers can be used.
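The register view described above can be sketched in Python as a host-side model. This is not part of the original note: the helper names (`to_q31`, `halves`, `join`, `add32`) are hypothetical, the signed 1.31 fractional format is the one used later in this note, and the wrap-around behaviour of the plain addition is an assumption about the non-saturating form of the instruction.

```python
Q31_ONE = 1 << 31  # scale factor of the signed 1.31 fractional format


def to_q31(value):
    """Convert a float in [-1, 1) to a signed 1.31 integer, clamping at the ends."""
    raw = int(round(value * Q31_ONE))
    return max(-Q31_ONE, min(Q31_ONE - 1, raw))


def from_q31(raw):
    """Convert a signed 1.31 integer back to a float."""
    return raw / Q31_ONE


def halves(raw):
    """Split a 32-bit value into (signed high half, unsigned low half)."""
    return raw >> 16, raw & 0xFFFF


def join(high, low):
    """Rebuild the 32-bit value from its two 16-bit halves."""
    return (high << 16) + low


def add32(a, b, saturate=False):
    """Model a 32-bit addition; wrap-around is assumed unless saturate is set."""
    total = a + b
    if saturate:
        return max(-Q31_ONE, min(Q31_ONE - 1, total))
    wrapped = total & 0xFFFFFFFF
    return wrapped - (1 << 32) if wrapped & 0x80000000 else wrapped


x = to_q31(0.25)
y = to_q31(0.5)
print(from_q31(add32(x, y)))   # the sum of two 1.31 values is a plain integer add
print(halves(y))               # the same register seen as two 16-bit halves
```

The point of the sketch is that addition needs no special handling in 1.31: the integer sum of two scaled values is the scaled sum, so only multiplication requires the half-word decomposition developed in the rest of the note.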
More detailed information on the native Blackfin Processor operations can be found in the Blackfin Processor Instruction Set Reference.

Multiplication

In order to introduce the concept of extended-precision multiplication, it is useful to review the already familiar decimal multiplication.

Two-Digit Decimal Multiplication

Let's start by recalling how any decimal multiplication can be performed by knowing how to multiply single-digit numbers. As an example, consider this two-digit by two-digit decimal multiplication:

23 x 98 = 2254

Figure 1 illustrates how this particular operation can be broken down into smaller operations. This is basically multiplication by hand.

Copyright 2003, Analog Devices, Inc. All rights reserved. Analog Devices assumes no responsibility for customer product design or the use or application of customers' products or for any infringements of patents or rights of others which may result from Analog Devices assistance. All trademarks and logos are property of their respective holders. Information furnished by Analog Devices Applications and Development Tools Engineers is believed to be accurate and reliable, however no responsibility is assumed by Analog Devices regarding technical accuracy and topicality of the content provided in Analog Devices' Engineer-to-Engineer Notes.

Figure 1. Decimal multiplication in detail

             1000's   100's   10's   1's
             place    place   place  place
                                2      3
                          x     9      8
    ---------------------------------------
    {a}    8 x 3 = 24
    {b}  + 8 x 2 = 16 x 10^1
    {c}  + 9 x 3 = 27 x 10^1
    {d}  + 9 x 2 = 18 x 10^2
    ---------------------------------------
    {e}  18 x 10^2 + 27 x 10^1 + 16 x 10^1 + 24 x 10^0 = 2254

To compute the final result, the following operations are necessary:

- Four single-digit multiplications (lines {a}, {b}, {c}, {d} in Figure 1):
  8 x 3 = 24, 8 x 2 = 16, 9 x 3 = 27, 9 x 2 = 18
- Three operations to shift the sub-products into the correct digit-significant slot (lines {b}, {c}, {d} in Figure 1):
  18 x 10^2, 27 x 10^1, 16 x 10^1
- Three additions (line {e} in Figure 1):
  18 x 10^2 + 27 x 10^1, 16 x 10^1 + 24, (18 x 10^2 + 27 x 10^1) + (16 x 10^1 + 24)

Two-Digit Hexadecimal Multiplication

Hexadecimal multiplication is not much different from its decimal counterpart. Let's consider multiplication of two 32-bit fractional numbers, where the operands are stored in the 32-bit general-purpose data registers R0 and R1. Blackfin Processors actually have a built-in 32-bit multiply operation of the form: R1 *= R0. It is a multi-cycle instruction that takes 5 cycles to execute from L1 memory. It is possible to improve this performance with the 16-bit multiplication technique that follows.

32-Bit Accuracy with 16-Bit Multiplication

Instead of relying on this instruction, one can use elementary arithmetic to achieve a 32-bit multiplication result with single-cycle 16-bit multiplications. Each of the two 32-bit operands (R0 and R1) can be broken up into two 16-bit halves (R0.H, R0.L, R1.H, and R1.L), as shown in Figure 2.
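The bookkeeping of Figure 1 can be sketched in a few lines of Python. This is an illustration added here, not code from the original note, and the function name `partial_products` is hypothetical. The same routine works for any base, so running it with base 2^16 on two unsigned 32-bit values previews the half-word decomposition that follows.

```python
def partial_products(x, y, base):
    """Multiply two two-digit numbers using only single-digit products.

    Returns the four (product, digit_shift) terms and their weighted sum.
    """
    x_hi, x_lo = divmod(x, base)
    y_hi, y_lo = divmod(y, base)
    terms = [
        (y_lo * x_lo, 0),  # line {a}: no shift
        (y_lo * x_hi, 1),  # line {b}: shifted one digit
        (y_hi * x_lo, 1),  # line {c}: shifted one digit
        (y_hi * x_hi, 2),  # line {d}: shifted two digits
    ]
    total = sum(product * base ** shift for product, shift in terms)  # line {e}
    return terms, total


terms, total = partial_products(23, 98, 10)
print(terms)   # [(24, 0), (16, 1), (27, 1), (18, 2)]
print(total)   # 2254

# The same decomposition with 16-bit "digits" (unsigned operands assumed here).
a, b = 0x12345678, 0x9ABCDEF0
print(partial_products(a, b, 1 << 16)[1] == a * b)   # True
```

Signed operands need one extra rule, namely that the high half carries the sign while the low half is unsigned. The sketch leaves this out.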
Figure 2. Hexadecimal multiplication in detail

    bits       63:48    47:32    31:16    15:0
                                 R0.H     R0.L
                            x    R1.H     R1.L
    ---------------------------------------------
    {a}    (R1.L x R0.L) >> 32
    {b}  + (R1.L x R0.H) >> 16
    {c}  + (R1.H x R0.L) >> 16
    {d}  +  R1.H x R0.H
    ---------------------------------------------
    {e}  (R1.H x R0.H) + ((R1.L x R0.H) >> 16) + ((R1.H x R0.L) >> 16) + ((R1.L x R0.L) >> 32) = R1 x R0

From this figure, it is easy to see the operations required to emulate the 32-bit multiplication R0 x R1 with a combination of instructions using 16-bit multipliers:

- Four 16-bit multiplications to yield four 32-bit results (lines {a}, {b}, {c}, {d} in Figure 2):
  R1.L x R0.L, R1.L x R0.H, R1.H x R0.L, R1.H x R0.H
- Three operations to shift the sub-products into the correct digit-significant slot (lines {a}, {b}, {c} in Figure 2):
  (R1.L x R0.L) >> 32, (R1.L x R0.H) >> 16, (R1.H x R0.L) >> 16
- Three additions to preserve bit place in the final answer (line {e} in Figure 2):
  ((R1.L x R0.L) >> 32) + ((R1.L x R0.H) >> 16),
  ((R1.H x R0.L) >> 16) + (R1.H x R0.H),
  (((R1.L x R0.L) >> 32) + ((R1.L x R0.H) >> 16)) + (((R1.H x R0.L) >> 16) + (R1.H x R0.H))

Note: Since we are performing fractional arithmetic, the result is 1.63 (1.31 x 1.31 = 2.62 with a redundant sign bit). Most of the time, the result can be truncated to 1.31 in order to fit in a native 32-bit data register. Therefore, the result of the multiplication should be in reference to the sign bit, or the most significant bit. In this way, the rightmost least significant bits can be safely discarded in truncation.

The final expression for 32-bit multiplication is:

R1 x R0 = (((R1.L x R0.L) >> 32) + ((R1.L x R0.H) >> 16)) + (((R1.H x R0.L) >> 16) + (R1.H x R0.H))

31-Bit Accuracy with 16-Bit Multiplication

From Figure 2, it is easy to see that the multiplication of the least significant half-words, R1.L x R0.L, does not contribute much to the final result. In fact, if the final result is ultimately truncated to 1.31 anyway, then this multiplication can only have an effect on the least significant bit of the 1.31 result. For many applications, the loss of accuracy due to losing this bit is balanced by the performance increase over the 32-bit multiplication. Three operations can be eliminated if 31-bit accuracy is acceptable in the final design: the 16-bit multiplication R1.L x R0.L, its shift by 32, and the addition of that term to the sum.

The remaining instructions necessary to get a 31-bit-accurate 1.31 answer are three 16-bit multiplications, two additions, and a shift:

R1 x R0 = ((R1.L x R0.H) >> 16) + (((R1.H x R0.L) >> 16) + (R1.H x R0.H))

Further rearrangement of terms, which combines the two shifts into one, yields the final form of the 31-bit-accurate multiplication:

R1 x R0 = (((R1.L x R0.H) + (R1.H x R0.L)) >> 16) + (R1.H x R0.H)

Double-Precision FIR Filter Implementation

32-Bit-Accurate FIR Filter

If we consider R0 to be the data value and R1 to be a coefficient value, then each multiplication in the FIR will be of the form described earlier:

R1 x R0 = (((R1.L x R0.L) >> 32) + ((R1.L x R0.H) >> 16)) + (((R1.H x R0.L) >> 16) + (R1.H x R0.H))

The kernel for the 32-bit-accurate FIR implementation is shown in Listing 1. The number of cycles needed to execute the full implementation is 28 + N*(3*T+5) cycles, where N is the size of the input buffer and T is the number of filter taps. Complete source code for 31- and 32-bit-accurate FIR and IIR filters is contained in the compressed package accompanying this document.
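The two multiplication forms above can be checked against a full-width product with a small Python model. This is a host-side sketch added for illustration, not the note's assembly; the function names are hypothetical. One interpretation is assumed: the shifts of 16 and 32 in the text are relative to the R1.H x R0.H term, and a 1.31 fractional result carries one extra left shift to drop the redundant sign bit, so in integer terms the three weights become << 1, >> 15 and >> 31. That reading is consistent with the R3=R3>>>15 shift in the listings. Saturation (for example, -1 x -1) is not modelled.

```python
import random


def split(value):
    """Split a signed 32-bit integer into (signed high half, unsigned low half)."""
    return value >> 16, value & 0xFFFF


def mul_exact(r0, r1):
    """Reference: full 64-bit product truncated to a 1.31 result."""
    return (r0 * r1) >> 31


def mul32(r0, r1):
    """32-bit-accurate product built from four 16-bit multiplications."""
    r0h, r0l = split(r0)
    r1h, r1l = split(r1)
    high = r1h * r0h                  # line {d}
    cross = r1l * r0h + r1h * r0l     # lines {b} and {c}
    low = r1l * r0l                   # line {a}
    return ((high << 32) + (cross << 16) + low) >> 31


def mul31(r0, r1):
    """31-bit-accurate product: the low x low term is dropped."""
    r0h, r0l = split(r0)
    r1h, r1l = split(r1)
    high = r1h * r0h
    cross = r1l * r0h + r1h * r0l
    return (high << 1) + (cross >> 15)


random.seed(186)
worst = 0
for _ in range(10000):
    a = random.randrange(-2**31, 2**31)
    b = random.randrange(-2**31, 2**31)
    assert mul32(a, b) == mul_exact(a, b)
    worst = max(worst, mul_exact(a, b) - mul31(a, b))
print("largest shortfall of mul31 in 1.31 LSBs:", worst)
```

In this model the 31-bit form never overshoots the exact result. It can fall short by at most two least significant bits of the 1.31 result, because the dropped low x low term is non-negative and smaller than two LSBs, and truncating the cross term loses less than one more.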
Listing 1. Kernel of 32-bit-accurate FIR

    // I0 = address of the delay line buffer
    // I1 = address of the input array
    // I2 = address of the coefficient array
    // I3 = address of the output array
    // P0 = number of input samples
    // P2 = number of coefficients

    // The outer loop iterates over all the data samples
    LSETUP(FIR_START, FIR_END) LC0=P0;
    FIR_START:
    // The first section performs multiply-accumulate on the least
    // significant halves of the data and coefficients (R0.L*R1.L), and
    // implicitly shifts the result >> 32 by placing it in accumulator A1
    LSETUP(M_ST, M_ST) LC1=P2;
    A0=R0.L*R1.L (FU) || R0=[I1--] || R1=[I2++];
    M_ST: R3.L=(A0+=R0.L*R1.L) (FU) || R0=[I1--] || R1=[I2++];
    A1=R3;
    // In this section, the product of the most significant words
    // (R0.H*R1.H) gets accumulated to A1, and the products R0.L*R1.H and
    // R1.L*R0.H get accumulated into A0 onto the running sum from the
    // first section. The bit placement shift is explicit in the
    // R3=R3>>>15 instruction
    A0=R0.H*R1.H, A1+=R0.H*R1.L (M) || [I3++]=R2;
    LSETUP(MAC_ST,MAC_END) LC1=P2;
    MAC_ST: A1+=R1.H*R0.L (M) || R0=[I1--] || R1=[I2++];

    MAC_END: R2=(A0+=R0.H*R1.H), A1+=R0.H*R1.L (M);
    R3=(A1+=R1.H*R0.L) (M) || I4+=4 || R0=[I0++];
    R3=R3>>>15 || [I1--]=R0 || R1=[I2++];
    // The final sum gives the answer
    FIR_END: R2=R2+R3 (S);

31-Bit-Accurate FIR Filter

A 31-bit-accurate FIR filter can be useful for extended precision in audio algorithms. The 31-bit-accurate multiplication (illustrated above) can be used for the FIR kernel computation:

R1 x R0 = (((R1.L x R0.H) + (R1.H x R0.L)) >> 16) + (R1.H x R0.H)

The Blackfin Processor source code for the 31-bit-accurate FIR filter is shown in Listing 2. The number of cycles needed to execute the full implementation is 23 + N*(2*T+4) cycles, where N is the size of the input buffer and T is the number of filter taps.

Listing 2. Kernel of 31-bit-accurate FIR

    // I0 = address of the delay line buffer
    // I1 = address of the input array
    // I2 = address of the coefficient array
    // I3 = address of the output array
    // P0 = number of input samples
    // P2 = number of coefficients
    // M0 = 8

    // The outer loop iterates over all the data samples
    A1=A0=0 || R0=[I1--] || R1=[I2++];
    LSETUP(FIR_START, FIR_END) LC0=P0;
    FIR_START:
    // Compared to the first section in the 32-bit-accurate FIR, this
    // implementation omits the least significant halves (R0.L and R1.L)
    // of the operands. The product of the most significant words
    // (R0.H*R1.H) gets accumulated to A0, and the products R0.L*R1.H and
    // R1.L*R0.H get accumulated into A1. The bit placement shift is
    // explicit in the R3=R3>>>15 instruction
    LSETUP(MAC_ST,MAC_END) LC1=P2;
    MAC_ST: R2=(A0+=R0.H*R1.H), A1+=R0.H*R1.L (M);
    MAC_END: R3=(A1+=R1.H*R0.L) (M) || R0=[I1--] || R1=[I2++];
    R3=R3>>>15 || I1+=M0 || R0=[I0++];
    // R3 holds the final answer
    R3=R2+R3 (S) || [I1--]=R0;
    FIR_END: A1=A0=0 || [I3++]=R3;

Summary

This application note described an effective method for implementing extended-precision arithmetic on Blackfin Processors. The discussion about the tradeoffs between 31-bit accuracy and 32-bit accuracy was supported by code segments for an FIR filter.
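When porting or modifying the 31-bit-accurate kernel, it can help to have a host-side reference to compare output samples against. The following Python sketch is one such model, written under stated assumptions rather than derived from the accompanying package: samples and coefficients are signed 1.31 integers, the delay line starts at zero, the cross products are summed across all taps before a single shift by 15 (as the R3=R3>>>15 placement in Listing 2 suggests), and saturation is not modelled. The names `fir31` and `fir_exact` are hypothetical.

```python
def split(value):
    """Split a signed 32-bit integer into (signed high half, unsigned low half)."""
    return value >> 16, value & 0xFFFF


def fir31(samples, coeffs):
    """31-bit-accurate FIR: y[n] = sum over k of coeffs[k] * x[n-k], in 1.31."""
    delay = [0] * len(coeffs)           # delay line, newest sample first
    output = []
    for sample in samples:
        delay = [sample] + delay[:-1]
        acc_high = 0                    # sum of the H x H products
        acc_cross = 0                   # sum of the H x L and L x H products
        for data, coeff in zip(delay, coeffs):
            data_h, data_l = split(data)
            coeff_h, coeff_l = split(coeff)
            acc_high += data_h * coeff_h
            acc_cross += data_h * coeff_l + coeff_h * data_l
        output.append((acc_high << 1) + (acc_cross >> 15))
    return output


def fir_exact(samples, coeffs):
    """Reference FIR using full-width products, truncated once to 1.31."""
    delay = [0] * len(coeffs)
    output = []
    for sample in samples:
        delay = [sample] + delay[:-1]
        output.append(sum(d * c for d, c in zip(delay, coeffs)) >> 31)
    return output


# Impulse of amplitude 0.5: the output is the coefficient set scaled by 0.5.
coeffs = [0x12345678, -12345678]
print(fir31([1 << 30, 0, 0], coeffs))
print(fir_exact([1 << 30, 0, 0], coeffs))
```

In this model the 31-bit result never exceeds the full-width reference and falls short of it by at most 2*T least significant bits for T taps, since each dropped low x low product is smaller than two LSBs of the 1.31 result.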
Table 1 summarizes the performance of the FIR and IIR filters found in the compressed package supplied with this document.

Table 1. Computation time for 31-bit and 32-bit filter implementations on Blackfin Processor

           32-bit accuracy           31-bit accuracy
    FIR    28+N*(3*T+5) cycles       23+N*(2*T+4) cycles
    IIR    23+18*N cycles            23+12*N cycles
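To get a feel for the tradeoff, the formulas in Table 1 can be evaluated for a concrete block size. The sketch below is an added illustration with hypothetical function names. It assumes that N is the number of input samples and T the number of taps, as defined for the FIR formulas, and that the first IIR figure in the table belongs to the 32-bit column.

```python
def fir_cycles(n_samples, n_taps, accuracy_bits):
    """Cycle count of the FIR implementations, per Table 1."""
    if accuracy_bits == 32:
        return 28 + n_samples * (3 * n_taps + 5)
    if accuracy_bits == 31:
        return 23 + n_samples * (2 * n_taps + 4)
    raise ValueError("accuracy_bits must be 31 or 32")


def iir_cycles(n_samples, accuracy_bits):
    """Cycle count of the IIR implementations, per Table 1."""
    if accuracy_bits == 32:
        return 23 + 18 * n_samples
    if accuracy_bits == 31:
        return 23 + 12 * n_samples
    raise ValueError("accuracy_bits must be 31 or 32")


n, t = 256, 32
print(fir_cycles(n, t, 32))   # 25884
print(fir_cycles(n, t, 31))   # 17431
print(iir_cycles(n, 32))      # 4631
print(iir_cycles(n, 31))      # 3095
```

For long filters the per-sample cost is dominated by the 3*T versus 2*T term, so the 31-bit FIR approaches two thirds of the 32-bit cycle count.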

Document History

    Version                                Description
    May 13, 2003 by T. Lukasiak.           Updated according to new naming conventions
    April 1, 2003 by T. Lukasiak.          Revision to source code snippets and accompanying source code
    February 26, 2003 by T. Lukasiak.      Initial release