Implementing the Fast Fourier Transform for the Xtensa Processor


Application Note

Tensilica, Inc.
Scott Blvd. Santa Clara, CA
(408) Fax (408)
November 2005
Doc Number: AN

2005 Tensilica, Inc. Printed in the United States of America. All Rights Reserved.

This publication is provided "AS IS." Tensilica, Inc. (hereafter "Tensilica") does not make any warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Information in this document is provided solely to enable system and software developers to use Tensilica processors. Unless specifically set forth herein, there are no express or implied patent, copyright or any other intellectual property rights or licenses granted hereunder to design or fabricate Tensilica integrated circuits or integrated circuits based on the information in this document. Tensilica does not warrant that the contents of this publication, whether individually or as one or more groups, meets your requirements or that the publication is error-free. This publication could include technical inaccuracies or typographical errors. Changes may be made to the information herein, and these changes may be incorporated in new editions of this publication.

Tensilica is a registered trademark of Tensilica, Inc. The following terms are trademarks of Tensilica, Inc.: OSKit, Vectra, and Xtensa. All other trademarks and registered trademarks are the property of their respective companies.

NO LIABILITY FOR CONSEQUENTIAL DAMAGES. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT SHALL Tensilica BE LIABLE FOR ANY DAMAGES WHATSOEVER (INCLUDING WITHOUT LIMITATION, SPECIAL, INCIDENTAL, CONSEQUENTIAL, OR INDIRECT DAMAGES FOR PERSONAL INJURY, LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION, LOSS OF BUSINESS INFORMATION, OR ANY OTHER PECUNIARY LOSS) ARISING OUT OF THE USE OF OR INABILITY TO USE THIS PRODUCT CODE, EVEN IF MANUFACTURER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Document Change History:
First Published, February 2001, revised 3/2005
Updated for LX/Xtensa 6, 11/2005

Contents

1 Introduction
2 Basic C Implementation
3 Development of TIE Language Instructions from the C Algorithm
    General Cases for Loop Pipelining
        Case 1
        Case 2
    TIE Development Steps
4 Complete C Implementation with TIE
Performance
Conclusion
Glossary of Terms
Appendix A Complete Application Code
    radix_2.c
    radix_2_hand_code.s
    bfly.tie
    util.h
    util.c
Appendix B Application Output
Appendix C Import and use of workspace

Figures

Figure 1. Definition of the Discrete Fourier Transform
Figure 2. Decimation-in-Frequency DFT in Terms of Even- and Odd-numbered Frequency Samples
Figure 3. FFT Decimation-in-Frequency Algorithm
Figure 4. Basic Implementation of Radix-2 Decimation-in-Frequency FFT Algorithm
Figure 5. Complex Butterfly Computation
Figure 6. First Step in Converting FFT Algorithm into TIE Implementation
Figure 7. Showing Direct Correspondence of C Code to TIE Language Instructions
Figure 8. TIE Language Instructions for the FFT Algorithm

Tables

Table 1. Performance Results of Xtensa FFT Implementation
Table 2. Code Size and Performance Results

Abstract

The goal of this application note is to show the results and design methodology for a high-performance DSP sample application on the Xtensa microprocessor using a widely known example, the Fast Fourier Transform (FFT). This note first explains the basic algorithm and how several TIE language instructions were created to implement the FFT algorithm. Performance results follow, with a comparison of implementations of the radix-2 decimation-in-frequency FFT with and without additional TIE language extensions. This application note assumes the reader has a basic understanding of the ASIC design methodology used with Xtensa processors and some familiarity with digital signal processing.

Preface

The Implementing the Fast Fourier Transform for the Xtensa Processor application note was updated for the LX and Xtensa 6 release of the Xtensa processor.

Location        Old:        New:

Appendix A radix_2.c

    /* Add this "# include" statement below, to include
     * C stub simulation of TIE instructions.
     * Used when when
     * Use this when confirming TIE instruction functionality
     * in native environment.
     *
     * #include "tdk/cstub-bfly.c"
     */
    #include "tdk/cstub-bfly.c"

Location: Appendix A util.c, end of source
Old: return 0;
New: return 0;

Additional Notes: Section 4 Complete C Implementation with TIE (Addendum)

The testbench requires an input file that describes the time-domain complex samples for the FFT algorithm to process. The input file is an ASCII file. The first line of this file states the number of floating-point data values on the following lines of the file. Subsequent lines contain the floating-point data samples. The testbench automatically converts each floating-point sample to 16-bit fixed-point format.

For example, a 256-point complex data set has 256 real and 256 imaginary samples. Therefore, 512 data points are listed. The samples are given as floating point numbers from 1 to Complex pairs are listed as real sample first and imaginary sample second. Each sample is listed on a new line.

An example of the ASCII input file for a square wave is shown below:

(pattern repeats 63 more times)

Build and exercise the FFT testbench for the Xtensa LX/6 processor. Import the workspace into Xplorer and select the Run or Profile button. Alternatively, enter the following commands after the configuration is installed and the environment is properly set up:

>tc -d tdk bfly.tie
>set XTENSA_PARAMS=.\tdk (or setenv XTENSA_PARAMS ./tdk, for Unix/Linux)
>xt-gcc -O3 radix_2.c -lm util.c radix_2_hand_code.s -o fft
>xt-run --pipe fft <input_file>

1 Introduction

One of the most widely used algorithms in digital signal processing applications is the Fast Fourier Transform (FFT), which covers a family of techniques for computing the Discrete Fourier Transform (DFT), as defined in the equation in Figure 1. The FFT is a large class of algorithms applicable to a variety of signal processing applications such as audio, video, and speech. For the purpose of this application note, one of these algorithms is highlighted for its simplicity, and is used to show how to achieve performance improvements on Xtensa processors.

$$X[k] = \frac{1}{N} \sum_{n=0}^{N-1} x[n]\, e^{-j\left(\frac{2\pi}{N}\right) nk}, \qquad k = 0, 1, \ldots, N-1$$

FIGURE 1. DEFINITION OF THE DISCRETE FOURIER TRANSFORM.

In Figure 1, $X[k]$ are the N frequency samples being computed, $x[n]$ are the time-domain input samples, n is the index of the time-domain samples, and k is the index of the frequency-domain samples. Also, j is defined such that $j^2 = -1$, following a common convention. Note that for convenience, the $1/N$ scaling factor is omitted in the Xtensa implementation described in this document.

The purpose of this application note is to explain an optimized implementation of the FFT for the Xtensa processor using Tensilica Instruction Extension (TIE) language constructs. The TIE constructs are optimized instructions that are integrated into the Xtensa processor pipeline. Using the TIE language involves considering what operations can benefit most from hardware optimization. This application note will show that optimizing the FFT butterfly operations with TIE language hardware extensions attains results competitive with high-performance DSP implementations of FFT.

The following table summarizes the performance obtained by using the C code and hand-optimized assembly code, both using TIE language instructions. This application note focuses on TIE language development for the case of the FFT algorithm implemented in C. The development of the assembly implementation is not discussed in detail in this application note, as it incorporates the same TIE instructions as the C implementation. The performance improvement of the assembly code is attributable to its special structure, which differs from that of the C implementation. In addition to the TIE optimization, we applied IPA (Interprocedural Analysis) and Profile Directed Compilation. Both are Xtensa-specific optimizations provided by the Xtensa tool chain.

TABLE 1. PERFORMANCE RESULTS OF XTENSA FFT IMPLEMENTATION

With TIE                            C        Assembly
Code Size (Bytes)
Performance (cycles)   128-point
                       256-point
                       point
                       point

For a complete background and detailed derivations of the various FFT algorithms, refer to one of the many textbooks on the subject (for example, Discrete-Time Signal Processing by Oppenheim, et al., Prentice-Hall, 1999).

Also, please note that Tensilica offers a set of FFT software libraries for the Vectra DSP Engine, available as a coprocessor option to the Xtensa processor. The Vectra engine is optimized for a broad array of fixed-point arithmetic functions. The techniques described in this

application note offer a dedicated solution for acceleration of the FFT algorithm with TIE instructions. To understand which approach is most appropriate for your application, please refer to the Vectra DSP Engine User's Guide, or consult an Applications Engineer of Tensilica, Inc.

2 Basic C Implementation

When N is divisible by 2, the N-point DFT can be recursively decomposed into two $N/2$-point DFTs. This decomposition is the basis for what is commonly described as a radix-2 FFT algorithm. Additionally, there are two canonical ways to perform a decomposition into two DFTs. One way, known as decimation-in-time, uses the even and odd time-domain samples (decimated sequences) as input to the two DFTs. The second approach is to decompose into two DFTs such that their results are the even and odd frequency-domain sample sequences, or decimated output sequences; this is known as decimation-in-frequency.

The decimation-in-frequency decomposition of the DFT simplifies the DFT summation, defined previously in Figure 1, by decimating the frequency-domain samples by 2. The summation is split into two summations across the even-numbered frequency samples $X[2k]$ for $k = 0, 1, \ldots, N/2 - 1$, as well as the odd-numbered samples $X[2k+1]$, for $k = 0, 1, \ldots, N/2 - 1$. The decomposition is simplified into two equations that represent the DFT results in the even and odd cases; the two equations (with the $1/N$ scaling factor omitted) are given in Figure 2:

$$X[2r] = \sum_{n=0}^{(N/2)-1} \left( x[n] + x[n + (N/2)] \right) e^{-j\left(\frac{2\pi}{N/2}\right) nr}, \qquad r = 0, 1, \ldots, (N/2) - 1$$

$$X[2r+1] = \sum_{n=0}^{(N/2)-1} \left( x[n] - x[n + (N/2)] \right) e^{-j\left(\frac{2\pi}{N}\right) n}\, e^{-j\left(\frac{2\pi}{N/2}\right) nr}, \qquad r = 0, 1, \ldots, (N/2) - 1$$

FIGURE 2. DECIMATION-IN-FREQUENCY DFT IN TERMS OF EVEN- AND ODD-NUMBERED FREQUENCY SAMPLES

It is convenient to define a set of factors $W_N^n = e^{-j\left(\frac{2\pi}{N}\right) n}$ over the set $n \in \{0, 1, \ldots, N/2 - 1\}$ for the computation of the odd-numbered samples in the DFT decomposition. As shown in Figure 2, performing the decomposition involves additions, subtractions, and multiplications by the factors $W_N^n$. Additional decomposition stages require the factors $W_{N/2}^n$, $W_{N/4}^n$, $W_{N/8}^n$, and so on.

When N is a power of 2, the DFT can be implemented by $\log_2(N)$ such recursive decompositions. The quantities $W_m^n = e^{-j\left(\frac{2\pi}{m}\right) n}$, with $m = N, N/2, \ldots, 2$ and $n = 0, 1, \ldots, m/2 - 1$, required in the FFT stages are referred to as twiddle factors. Any stage can be implemented by the original definition (in Figure 1) or by a decomposition. In the case that all $\log_2(N)$ stages are implemented by the decimation-in-frequency decomposition, each stage requires only adds, subtracts, and multiplies by twiddle factors. Figure 3 depicts the radix-2 decimation-in-frequency FFT. This paper will focus on implementing an N-point decimation-in-frequency FFT computation for complex input values.

[Figure 3: signal-flow diagram in which input butterflies feed two N/2-point DFTs; the lower-wing butterfly outputs are multiplied by the twiddle factors $W_N^0, W_N^1, W_N^2, W_N^3, \ldots, W_N^{N/2-2}, W_N^{N/2-1}$.]

FIGURE 3. FFT DECIMATION-IN-FREQUENCY ALGORITHM

Our FFT algorithm implementations use the following conventions:

The complex N-point input to the FFT is organized as an array of 2N components, with elements alternating in correspondence to real and imaginary components of the input N-point sequence.

The twiddle factors are stored as a set of arrays used as input to the FFT algorithm:

$$\left\{ W_N^0 \ldots W_N^{N/2-1};\; W_{N/2}^0 \ldots W_{N/2}^{N/4-1};\; W_{N/4}^0 \ldots W_{N/4}^{N/8-1};\; \ldots \right\}$$

The real and imaginary components of the twiddle factors alternate in the sequence. Each successive sequence contains half as many twiddle factors as the previous one.

The computation assumes a 24-bit fixed-point representation with 16 places to the right of the binary point.

The basic algorithm implemented in C is shown in Figure 4.

#define DESCALE(x, r) ((int) (((x) + (r)) >> 16))

const int round = (1 << (15));

void r2_fft_basic(int *data, int n, int *twiddles)
{
    register int i, d0r, d0i, d1r, d1i, r0r, r0i, r1r, r1i;
    register int *p, *q, wr, wi;
    register long long tr, ti;
    register int m = n;
    register int *t = twiddles;

    while (m > 1) {                         /* Outer loop (Decomposition into stages)*/
        for (i = 0; i < m; i += 2) {        /* Middle loop */
            p = data + i;
            q = p;
            wr = *(t++);
            wi = *(t++);
            while (q < &data[n << 1]) {     /* Inner loop */
                d0r = *(p + 0);
                d0i = *(p + 1);
                p += m;
                d1r = *(p + 0);
                d1i = *(p + 1);
                p += m;
                r0r = d0r + d1r;
                r0i = d0i + d1i;
                tr = d0r - d1r;
                ti = d0i - d1i;
                r1r = DESCALE(tr * wr, round) - DESCALE(ti * wi, round);
                r1i = DESCALE(tr * wi, round) + DESCALE(ti * wr, round);
                *(q + 0) = r0r;
                *(q + 1) = r0i;
                q += m;
                *(q + 0) = r1r;
                *(q + 1) = r1i;
                q += m;
            }
        }
        m >>= 1;
    }
}

FIGURE 4. BASIC IMPLEMENTATION OF RADIX-2 DECIMATION-IN-FREQUENCY FFT ALGORITHM

Here, the focus is on the series of radix-2 decompositions of an N-point DFT in which N is a power of 2; there are $\log_2(N)$ stages of butterfly operations being performed on the input. At the top of the algorithm, the DESCALE macro is defined both to keep 24 bits of accuracy in the result of a multiply operation and to maintain consistency in precision between the C implementation and the Xtensa implementation that uses hardware defined in the TIE language, so that both give equally accurate results. By adding the round value to a 64-bit integer and then scaling the sum by $2^{-16}$, the 64-bit integer is effectively converted into a fixed-point value, rounded to the least-significant bit.

The outer while() loop counts the $\log_2(N)$ successive stages of DFT computations. The for() loop, called the middle loop, initializes both p to point to the first data element of the current stage and q to point to the same data element, which will be overwritten with the output data.

The innermost while() loop, called the inner loop, traverses the input elements for the current butterfly stage. For each stage h, where $h \in \{0, 1, 2, \ldots, \log_2(N) - 1\}$, there are $2^h$ subproblems, that is, DFTs of length $N/2^h$ each. Each subproblem has a set of $N/2^{h+1}$ butterflies that use two values each as input and compute two output values. The real and imaginary parts of the butterfly computation are shown in Figure 5.

[Figure 5: butterfly diagram with inputs ar, ai, br, and bi. The upper wing outputs are the sums ar+br and ai+bi; the lower wing outputs are the differences ar-br and ai-bi multiplied by the twiddle factor c.]

FIGURE 5. COMPLEX BUTTERFLY COMPUTATION

In each stage, however, each pass through the inner loop computes one set of butterfly results per subproblem. This order of computation allows for reuse of the same twiddle factor for all subproblems in the same FFT stage. Once the final group of butterflies is computed for all $2^h$ subproblems, the next stage is h + 1, as m is further divided by 2 to prepare for another trip through the outer while() loop.

3 Development of TIE Language Instructions from the C Algorithm

This section explains the function of the TIE language implementation. TIE instructions encapsulate functionality of the algorithm in hardware. For more detail on the TIE language basics, refer to the Tensilica Instruction Extension (TIE) Language User's Guide. The goal of this document is to show the functionality of the TIE implementation of the FFT algorithm by converting the simple C implementation, step by step, into an efficient C routine that uses TIE instructions.

The goal for the TIE implementation of the radix-2 FFT is to achieve maximal performance with a reasonable hardware cost. TIE allows for performing an arbitrary number of parallel operations in a single instruction. However, it is not possible to perform more than one load or store per instruction or per cycle. Therefore, one general TIE programming technique is to fold computation operations into TIE load or store instructions. If every computational instruction is combined with a load or store instruction, a load or store instruction will be issued on each cycle. Without algorithmic changes, such an approach will lead to an optimal number of issued instructions. For many algorithms, including the radix-2 FFT, it is possible to fold all computational instructions into load or store instructions with reasonable hardware costs.

Even when the number of instructions is minimized, folded instructions might have large latencies that are either infeasible to build or lead to stalls in the pipeline. Adding hardware resources can eliminate this problem. New data can be loaded in parallel with computations done on old data, and even older results can be stored in parallel with the current computation. As long as different iterations of a loop do not depend on each other, this technique comes only at the cost of additional hardware to buffer the results. This technique is popular in software and, in that context, is called software pipelining; here the pipeline is being managed in hardware.

General Cases for Loop Pipelining

To begin optimizing the FFT inner loop with TIE instructions, it makes sense to look at ways to optimize load-operation-store loops in general. This loop structure is applicable because the

inner loop of the general FFT algorithm in Figure 4 consists of loads, then computation, then stores. The loads bring in four new input values: ar, ai, br, and bi. The computation creates four output values, rr, ri, sr, and si, and the outputs are stored at the end of the loop body. While the Xtensa architecture issues at most one load or store per processor cycle, the TIE instructions can be written with computations in parallel, limited by the amount of compute hardware available. Unrolling the load-operation-store loop offers the advantage of storing a previous result and computing the next result in parallel. The following two cases find the maximum latency allowed for computation.

Case 1

Case 1 presents a load-operation-store pipeline loop with two inputs and one output and a computation latency of n cycles.

load/store:  load2 A   load2 B   store1 C   load3 A   load3 B   store2 C   load4 A   load4 B   store3 C
compute:     compute1 C          compute2 C                     compute3 C
                                 (1, 2, or 3 cycles)            (1, 2, or 3 cycles)

In this diagram, the loop is software pipelined and the loop iteration is indicated by the number following each operation. The first two loads of values A and B are not shown, but the second two loads (load2) are followed by a store instruction for the previous result (store1). Two more loads follow, which gives up to three cycles during which the latency of a compute operation can be hidden. Hence, for n=1, n=2, or n=3 cycles, there is no cycle-count penalty for performing a computation in parallel with the loads and stores. Thus, TIE instructions with appropriate loop pipelining can be used to perform computations in parallel with the loads and stores.

Case 2

Case 2 presents a load-operation-store pipeline loop with two inputs, two outputs, and a computation latency of n cycles.

load/store:  load2 A   load2 B   store1 C   store1 D   load3 A   load3 B   store2 C   store2 D
compute:     compute1 D          compute2 C            compute2 D          compute3 C
             (2 cycles)          (2 cycles)            (2 cycles)          (2 cycles)

In this case the goal is to compute two results, which requires two store instructions. Software pipelining this loop requires two cycles for the loads and two cycles for the stores; hence, it is possible to perform in parallel two computations whose latency is two cycles each. If the computation is on more than one set of input data, efficient use of the hardware can be made while hiding latency. This case is similar to the FFT implementation of this application note

which uses a 2-cycle multiplier in each cycle on two sets of real and imaginary input (for 4 results) in the 4 slots available. The loop optimization described sets the tone for the way chosen to convert the general C algorithm into an efficient FFT implementation for an Xtensa processor with TIE.

TIE Development Steps

There are other optimization techniques applied to the FFT implementation. A technique called SIMD (Single Instruction Multiple Data) can accommodate simultaneous pipelining of independent computations. As Xtensa provides for 128-bit memory accesses, wide loads of four integer values at a time can address the need for two sets of two inputs for each butterfly. Two 128-bit loads best utilize the available hardware to perform two butterfly computations. The structure of loads and stores is SIMD in that the real and imaginary components of two consecutive inputs can be loaded in one cycle. Similarly, the components of two complex outputs can be stored in one cycle, allowing the latency of two butterfly computations to be hidden.

Figure 6 shows the result of several changes made to the basic C algorithm to create an optimized algorithm using TIE. First, the middle loop is unrolled and the bodies of the two resulting inner loop copies are combined. Duplicated variables perform two complex butterfly computations in the inner loop body. The following list summarizes the copies of variables:

Variable                Meaning
c0r, c0i, c1r, c1i      Twiddle factors (two sets of real and imaginary twiddle factors from adjacent elements of the twiddle factor array twiddles)
a0r, a0i, a1r, a1i      Input data
b0r, b0i, b1r, b1i      Input data at the alternate butterfly wing. b0 and b1 are each m samples away from the corresponding input data a0 and a1
r0r, r0i, r1r, r1i      Results of butterfly computations (upper wing, which gives the sum of the butterfly inputs)
t0r, t0i, t1r, t1i      Intermediate values used to find outputs to current butterfly stage (lower wing)
s0r, s0i, s1r, s1i      Results of butterfly computations (lower wing, which gives the product of the twiddle factor and the difference of inputs)

It is necessary to peel off the last trip through the outer loop because when m = 2 (the last stage of radix-2 decompositions), the butterfly computations are performed on consecutive input data. In prior butterfly computations, it is sufficient to use one address to load a0r + j·a0i and a1r + j·a1i, whose real and imaginary components are all consecutive in memory, and another address to load the values corresponding to b0r + j·b0i and b1r + j·b1i. But in the last trip, four consecutive array elements give a0r + j·a0i and b0r + j·b0i, since the two inputs to each butterfly correspond to consecutive output values from the previous stage. This special characteristic of the last trip explains why the values of a0r, a0i, b0r, and b0i are loaded in this different order in the last trip.

#define DESCALE(x, r) ((int) (((x) + (r)) >> 16))

const int round = (1 << 15);

void r2_fft_step1(int *data, int n, int *twiddles)
{
    register int i, c0r, c0i, c1r, c1i, a0r, a0i, a1r, a1i;
    register int b0r, b0i, b1r, b1i, r0r, r0i, r1r, r1i, s0r, s0i, s1r, s1i;
    register long long t0r, t0i, t1r, t1i;
    register int m = n;
    register int *t = twiddles;
    register int *p, *q;

    while (m > 2) {                         /* Outer loop (Decomposition into stages)*/
        for (i = 0; i < m; i += 4) {        /* Middle loop */
            p = data + i;
            q = p;
            c0r = *(t++);                   /* SIMD: c[0:3] = t[0:3] */
            c0i = *(t++);
            c1r = *(t++);
            c1i = *(t++);
            while (q < &data[n << 1]) {     /* Inner loop */
                a0r = *(p + 0);             /* SIMD: a[0:3] = p[0:3] */
                a0i = *(p + 1);
                a1r = *(p + 2);
                a1i = *(p + 3);
                p += m;
                b0r = *(p + 0);             /* SIMD: b[0:3] = p[0:3] */
                b0i = *(p + 1);
                b1r = *(p + 2);
                b1i = *(p + 3);
                p += m;
                r0r = a0r + b0r;
                r0i = a0i + b0i;
                t0r = a0r - b0r;
                t0i = a0i - b0i;
                s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round);
                s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round);
                r1r = a1r + b1r;
                r1i = a1i + b1i;
                t1r = a1r - b1r;
                t1i = a1i - b1i;
                s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round);
                s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round);
                *(q + 0) = r0r;             /* SIMD: q[0:3] = r[0:3] */
                *(q + 1) = r0i;
                *(q + 2) = r1r;
                *(q + 3) = r1i;
                q += m;
                *(q + 0) = s0r;             /* SIMD: q[0:3] = s[0:3] */
                *(q + 1) = s0i;
                *(q + 2) = s1r;
                *(q + 3) = s1i;
                q += m;
            }
        }
        m >>= 1;
    }

    /* Final trip through Middle loop (case m = 2) */
    p = data;
    q = p;
    c0r = *(t++);
    c0i = *(t++);
    c1r = *(t++);
    c1i = *(t++);
    while (q < &data[n << 1]) {             /* Inner loop */
        a0r = *(p + 0);
        a0i = *(p + 1);
        b0r = *(p + 2);                     /* b0r,b0i swapped with a1r,a1i */
        b0i = *(p + 3);
        p += 4;
        a1r = *(p + 0);
        a1i = *(p + 1);
        b1r = *(p + 2);
        b1i = *(p + 3);
        p += 4;
        r0r = a0r + b0r;
        r0i = a0i + b0i;
        t0r = a0r - b0r;
        t0i = a0i - b0i;
        s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round);
        s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round);
        r1r = a1r + b1r;
        r1i = a1i + b1i;
        t1r = a1r - b1r;
        t1i = a1i - b1i;
        s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round);
        s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round);
        *(q + 0) = r0r;
        *(q + 1) = r0i;
        *(q + 2) = s0r;
        *(q + 3) = s0i;
        q += 4;
        *(q + 0) = r1r;
        *(q + 1) = r1i;
        *(q + 2) = s1r;
        *(q + 3) = s1i;
        q += 4;
    }
}

FIGURE 6. FIRST STEP IN CONVERTING FFT ALGORITHM INTO TIE IMPLEMENTATION

With two sets of real and imaginary results being computed in each trip through the inner loop, the computation can now be pipelined. The goal is an implementation with loads or stores in every cycle and compute operations in parallel:

load/store:  load     load     store    store    load     load     store    store    load     load     store    store
compute:     compute  compute  compute  compute  compute  compute  compute  compute  compute  compute  compute  compute

When pipelining computations with load/store operations, TIE state registers are required to buffer data retrieved from memory and data to be stored to memory. Meaningful computations can be performed on state registers only after values have been loaded into them; hence, while new data is loaded into TIE state registers (input buffers), computations are performed on older data previously latched from those input buffers into different state registers.

With these TIE considerations in mind, variables are needed that will act as load buffers into which the load first brings input data. Additionally, variables are created to act as store buffers that receive the results of the compute operations before they are stored to memory. Figure 7 shows an implementation in which additional buffer variables have been added.

NOTE: These variables will directly correspond to TIE state registers; loads will move data directly from memory into TIE state, compute operations acting on TIE state can act independently of loads and stores, and stores move values from TIE state back into memory.

In each iteration of the inner loop, the load buffers can hold the next four input values, or in other words, the real and imaginary components of the FFT input values a0r + j·a0i and a1r + j·a1i. This implementation will use 24-bit TIE state registers, and with a processor interface (PIF) that is 128 bits wide, loads are issued in two consecutive cycles to retrieve the values corresponding to two butterfly inputs. In our C implementation the variables a0r_load, a0i_load, a1r_load, a1i_load and also b0r_load, b0i_load, b1r_load, and b1i_load are used to represent the load buffers. In another two processor cycles, the already-computed FFT butterfly-stage outputs r0r, r0i, s0r, s0i, r1r, r1i, s1r, and s1i can be stored. There are also assignment instructions (after the outputs are computed) that latch the load buffer registers into the variables required for the computation. In the load-compute-store pipeline being implemented, the latching prepares for the next computation cycle.

The direct correspondence between the C code given and the TIE implementation is shown as comments in Figure 7. The four basic instructions, BFLY0.LU, BFLY1.LU, BFLY2.SU, and BFLY3.SU, have been created with variants used to perform the first two trips through the inner loop to prime the pipeline. In Figure 7 the inner loop is peeled to prime the pipeline. Also, the last trip through the middle loop for the case m=2 is peeled because the inputs to the butterfly computations on this pass are consecutive in memory, as discussed above.

Before explaining the details of each TIE instruction, this paper examines how the C code for the FFT is written so that it is ready to have a direct correspondence with TIE; this is the essence of TIE development. It is useful to understand the TIE definition and processor restrictions that characterize how an instruction extension can be defined for the Xtensa processor. The Xtensa processor restricts the number of memory operations per cycle to one, but the width of the Processor Interface (PIF) determines the throughput. With a PIF width limit of 128 bits, a throughput of two sets of complex butterfly inputs in two cycles is achievable; the throughput of stores is the same. In turn, the data throughput determines how much computation must be accomplished, and moving the data into load buffers defined as TIE state registers allows as much computation as is necessary.

The limitation of one load or store per cycle determines the number of cycles required to do one iteration of the computation. In turn, the number of cycles affects the amount of other hardware resources required. For example, in the FFT implementation each SIMD iteration (two original iterations) requires 8 multiplies and will execute in 4 cycles. Hence two multipliers (one SIMD multiplier) are required. In turn, the latency of the operations determines how many loop iterations ahead of the computation the loads should happen. In this FFT implementation, the two loads are placed one iteration ahead of the computation, and the two stores happen one iteration later than the computation.

void r2_fft_step3(int *data, int n, int *twiddles)
{
    register int i, c0r, c0i, c1r, c1i, a0r, a0i, a1r, a1i;
    register int b0r, b0i, b1r, b1i, r0r, r0i, r1r, r1i;
    register long long t0r, t0i, t1r, t1i;
    register int s0r, s0i, s1r, s1i;
    register int a0r_load, a0i_load, a1r_load, a1i_load;
    register int b0r_load, b0i_load, b1r_load, b1i_load;
    register int r0r_store, r0i_store, r1r_store, r1i_store;
    register int s0r_store, s0i_store, s1r_store, s1i_store;
    register int m = n;
    register int *t = twiddles;
    register int *p, *q;

    while (m > 2) {                     /* Outer loop (Decomposition into stages) */
        for (i = 0; i < m; i += 4) {    /* Middle loop */
            p = data + i;
            q = p;
            c0r = *(t++); /* LDC_LU */
            c0i = *(t++); /* LDC_LU */
            c1r = *(t++); /* LDC_LU */
            c1i = *(t++); /* LDC_LU */

            /* While loop below has been peeled so that the first 2 trips through
             * the loop are simplified.
             */

            /* The first peeled trip through the loop is simplified so that
             * it doesn't perform arithmetic or store results */
            a0r_load = *(p + 0); /* BFLY0_LU */
            a0i_load = *(p + 1); /* BFLY0_LU */
            a1r_load = *(p + 2); /* BFLY0_LU */
            a1i_load = *(p + 3); /* BFLY0_LU */
            p += m; /* BFLY0_LU */
            b0r_load = *(p + 0); /* BFLY1_LU */
            b0i_load = *(p + 1); /* BFLY1_LU */
            b1r_load = *(p + 2); /* BFLY1_LU */
            b1i_load = *(p + 3); /* BFLY1_LU */
            p += m; /* BFLY1_LU */
            a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load; /* OPLCH */

            b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load; /* OPLCH */

            /* The second peeled trip through the loop is simplified so that
             * it doesn't store results */
            a0r_load = *(p + 0); /* BFLY0_LU */
            a0i_load = *(p + 1); /* BFLY0_LU */
            a1r_load = *(p + 2); /* BFLY0_LU */
            a1i_load = *(p + 3); /* BFLY0_LU */
            p += m;
            b0r_load = *(p + 0); /* BFLY1_LU */
            b0i_load = *(p + 1); /* BFLY1_LU */
            b1r_load = *(p + 2); /* BFLY1_LU */
            b1i_load = *(p + 3); /* BFLY1_LU */
            p += m;
            r0r = a0r + b0r; /* BFLY0 */
            r0i = a0i + b0i; /* BFLY1 */
            t0r = a0r - b0r; t0i = a0i - b0i; /* BFLY0, BFLY1 */
            s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round); /* BFLY0 */
            s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round); /* BFLY1 */
            r1r = a1r + b1r; /* BFLY2 */
            r1i = a1i + b1i; /* BFLY3 */
            t1r = a1r - b1r; t1i = a1i - b1i; /* BFLY2, BFLY3 */
            s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round); /* BFLY2 */
            s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round); /* BFLY3 */
            a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load; /* BFLY3 */
            b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load; /* BFLY3 */

            while (q < &data[n << 1]) { /* Inner loop */
                a0r_load = *(p + 0); /* BFLY0_LU */
                a0i_load = *(p + 1); /* BFLY0_LU */
                a1r_load = *(p + 2); /* BFLY0_LU */
                a1i_load = *(p + 3); /* BFLY0_LU */
                p += m; /* BFLY0_LU */
                b0r_load = *(p + 0); /* BFLY1_LU */
                b0i_load = *(p + 1); /* BFLY1_LU */
                b1r_load = *(p + 2); /* BFLY1_LU */
                b1i_load = *(p + 3); /* BFLY1_LU */
                p += m; /* BFLY1_LU */
                r0r_store = r0r; s0r_store = s0r; r1r_store = r1r; s1r_store = s1r; /* BFLY0_LU */
                r0i_store = r0i; s0i_store = s0i; /* BFLY1_LU */
                r1i_store = r1i; s1i_store = s1i; /* BFLY3_SU */
                r0r = a0r + b0r; /* BFLY0_LU */
                r0i = a0i + b0i; /* BFLY1_LU */
                t0r = a0r - b0r; t0i = a0i - b0i; /* BFLY0_LU, BFLY1_LU */
                s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round); /* BFLY0_LU */
                s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round); /* BFLY1_LU */

    r1r = a1r + b1r;  /* BFLY2_SU */
    r1i = a1i + b1i;  /* BFLY3_SU */
    t1r = a1r - b1r; t1i = a1i - b1i;  /* BFLY2_SU, BFLY3_SU */
    s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round);  /* BFLY2_SU */
    s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round);  /* BFLY3_SU */
    a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load;  /* BFLY3_SU */
    b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load;  /* BFLY3_SU */
    *(q + 0) = r0r_store;  /* BFLY2_SU */
    *(q + 1) = r0i_store;  /* BFLY2_SU */
    *(q + 2) = r1r_store;  /* BFLY2_SU */
    *(q + 3) = r1i_store;  /* BFLY2_SU */
    q += m;                /* BFLY2_SU */
    *(q + 0) = s0r_store;  /* BFLY3_SU */
    *(q + 1) = s0i_store;  /* BFLY3_SU */
    *(q + 2) = s1r_store;  /* BFLY3_SU */
    *(q + 3) = s1i_store;  /* BFLY3_SU */
    q += m;                /* BFLY3_SU */
}
m >>= 1;

/* Final trip through Middle Loop (m=2) */
p = data; q = p;
c0r = *(t++);  /* LDC_LU */
c0i = *(t++);  /* LDC_LU */
c1r = *(t++);  /* LDC_LU */
c1i = *(t++);  /* LDC_LU */

/* While loop below has been peeled so that the first 2 trips through
 * the loop are simplified. */

/* The first peeled trip through the loop is simplified so that
 * it doesn't perform arithmetic or store results */
a0r_load = *(p + 0);  /* BFLY0_LU_SWAP */
a0i_load = *(p + 1);  /* BFLY0_LU_SWAP */
b0r_load = *(p + 2);  /* BFLY0_LU_SWAP */
b0i_load = *(p + 3);  /* BFLY0_LU_SWAP */
p += 4;               /* BFLY0_LU_SWAP */
a1r_load = *(p + 0);  /* BFLY1_LU_SWAP */
a1i_load = *(p + 1);  /* BFLY1_LU_SWAP */
b1r_load = *(p + 2);  /* BFLY1_LU_SWAP */
b1i_load = *(p + 3);  /* BFLY1_LU_SWAP */
p += 4;               /* BFLY1_LU_SWAP */
a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load;  /* OPLCH */
b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load;  /* OPLCH */

/* The second peeled trip through the loop is simplified so that
 * it doesn't store results */
a0r_load = *(p + 0);  /* BFLY0_LU_SWAP */
a0i_load = *(p + 1);  /* BFLY0_LU_SWAP */
b0r_load = *(p + 2);  /* BFLY0_LU_SWAP */
b0i_load = *(p + 3);  /* BFLY0_LU_SWAP */
p += 4;               /* BFLY0_LU_SWAP */
a1r_load = *(p + 0);  /* BFLY1_LU_SWAP */
a1i_load = *(p + 1);  /* BFLY1_LU_SWAP */
b1r_load = *(p + 2);  /* BFLY1_LU_SWAP */
b1i_load = *(p + 3);  /* BFLY1_LU_SWAP */
p += 4;               /* BFLY1_LU_SWAP */
r0r = a0r + b0r;  /* BFLY0_LU_SWAP */
r0i = a0i + b0i;  /* BFLY1_LU_SWAP */
t0r = a0r - b0r; t0i = a0i - b0i;  /* BFLY0_LU_SWAP, BFLY1_LU_SWAP */
s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round);  /* BFLY0_LU_SWAP */
s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round);  /* BFLY1_LU_SWAP */
r1r = a1r + b1r;  /* BFLY2_LU_SWAP */
r1i = a1i + b1i;  /* BFLY3_LU_SWAP */
t1r = a1r - b1r; t1i = a1i - b1i;  /* BFLY2_LU_SWAP, BFLY3_LU_SWAP */
s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round);  /* BFLY2_LU_SWAP */
s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round);  /* BFLY3_LU_SWAP */
a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load;  /* BFLY3_LU_SWAP */
b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load;  /* BFLY3_LU_SWAP */

while (q < &data[n << 1]) {  /* Inner loop */
    a0r_load = *(p + 0);  /* BFLY0_LU_SWAP */
    a0i_load = *(p + 1);  /* BFLY0_LU_SWAP */
    b0r_load = *(p + 2);  /* BFLY0_LU_SWAP */
    b0i_load = *(p + 3);  /* BFLY0_LU_SWAP */
    p += 4;               /* BFLY0_LU_SWAP */
    a1r_load = *(p + 0);  /* BFLY1_LU_SWAP */
    a1i_load = *(p + 1);  /* BFLY1_LU_SWAP */
    b1r_load = *(p + 2);  /* BFLY1_LU_SWAP */
    b1i_load = *(p + 3);  /* BFLY1_LU_SWAP */
    p += 4;               /* BFLY1_LU_SWAP */
    r0r_store = r0r; s0r_store = s0r; r1r_store = r1r; s1r_store = s1r;  /* BFLY0_LU_SWAP */
    r0i_store = r0i; s0i_store = s0i;  /* BFLY1_LU_SWAP */
    r1i_store = r1i; s1i_store = s1i;  /* BFLY3_SU_SWAP */
    r0r = a0r + b0r;  /* BFLY0_LU_SWAP */
    r0i = a0i + b0i;  /* BFLY1_LU_SWAP */
    t0r = a0r - b0r; t0i = a0i - b0i;  /* BFLY0_LU_SWAP, BFLY1_LU_SWAP */
    s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round);  /* BFLY0_LU_SWAP */
    s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round);  /* BFLY1_LU_SWAP */
    r1r = a1r + b1r;  /* BFLY2_SU_SWAP */
    r1i = a1i + b1i;  /* BFLY3_SU_SWAP */
    t1r = a1r - b1r; t1i = a1i - b1i;  /* BFLY2_SU_SWAP, BFLY3_SU_SWAP */

    s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round);  /* BFLY2_SU_SWAP */
    s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round);  /* BFLY3_SU_SWAP */
    a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load;  /* BFLY3_SU_SWAP */
    b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load;  /* BFLY3_SU_SWAP */
    *(q + 0) = r0r_store;  /* BFLY2_SU_SWAP */
    *(q + 1) = r0i_store;  /* BFLY2_SU_SWAP */
    *(q + 2) = s0r_store;  /* BFLY2_SU_SWAP */
    *(q + 3) = s0i_store;  /* BFLY2_SU_SWAP */
    q += 4;                /* BFLY2_SU_SWAP */
    *(q + 0) = r1r_store;  /* BFLY3_SU_SWAP */
    *(q + 1) = r1i_store;  /* BFLY3_SU_SWAP */
    *(q + 2) = s1r_store;  /* BFLY3_SU_SWAP */
    *(q + 3) = s1i_store;  /* BFLY3_SU_SWAP */
    q += 4;                /* BFLY3_SU_SWAP */
}

FIGURE 7. SHOWING DIRECT CORRESPONDENCE OF C CODE TO TIE LANGUAGE INSTRUCTIONS

Figure 8 shows the TIE instructions in their order of execution. The diagram shows the pipeline of loads, arithmetic instructions, and stores. Each of the four main TIE instructions that form the core butterfly computation, in the bracket labeled LOOP, performs a load or a store concurrently with the computation of both an r and an s result.

To understand the tasks in each TIE instruction, refer to the LOOP label in Figure 8. The loads of new values for a butterfly computation are handled by BFLY0.LU and BFLY1.LU in two consecutive processor cycles. The computation is spread over successive cycles across all four instructions; BFLY0.LU and BFLY1.LU perform their part of the computation concurrently with the loads. BFLY0.LU and BFLY1.LU also move previously computed butterfly results into the various store buffers, and BFLY2.SU and BFLY3.SU write those store buffers to memory. In the fourth cycle of the LOOP, the BFLY3.SU instruction latches the previously loaded load buffers in preparation for computing the next set of results.
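The load, compute, and store overlap described above is an instance of software pipelining with peeled start-up trips. The sketch below is not the application note's code: it replaces the butterfly with a stand-in operation (3x + 1), and the names scale_direct, scale_pipelined, loaded, operand, result, and to_store are hypothetical. It only illustrates why the first trip loads without computing and the second trip computes without storing. It also drains the pipeline explicitly after the loop, whereas the listing in Figure 7 appears to fold the drain into the loop by testing the store pointer q.

```c
#include <assert.h>

/* Reference version: each trip loads, computes, and stores one element. */
static void scale_direct(const int *in, int *out, int n)
{
    int i;
    for (i = 0; i < n; ++i)
        out[i] = 3 * in[i] + 1;
}

/* Pipelined version: in the steady state, one trip loads element i,
 * computes element i - 1, and stores element i - 2. */
static void scale_pipelined(const int *in, int *out, int n)
{
    int loaded, operand, result, to_store;
    int i;

    assert(n >= 2);

    /* First peeled trip: load only, then latch the load buffer
     * (the role OPLCH plays in the note). */
    loaded = in[0];
    operand = loaded;

    /* Second peeled trip: load and compute; nothing to store yet. */
    loaded = in[1];
    result = 3 * operand + 1;
    operand = loaded;

    /* Steady state: all three stages are busy on every trip. */
    for (i = 2; i < n; ++i) {
        loaded = in[i];            /* load stage */
        to_store = result;         /* move previous result to store buffer */
        result = 3 * operand + 1;  /* compute stage */
        operand = loaded;          /* latch for the next trip */
        out[i - 2] = to_store;     /* store stage */
    }

    /* Drain: two results are still in flight when the loads run out. */
    out[n - 2] = result;
    out[n - 1] = 3 * operand + 1;
}
```

Calling both functions on the same input array should fill the two output arrays identically; the pipelined form simply reorders when each element is loaded, computed, and stored.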

Achieves about one complex butterfly computation per cycle.

Peeled trips (before LOOP):
  BFLY0.LU   Load buffers a0, a1
  BFLY1.LU   Load buffers b0, b1
  (STALL)
  OPLCH      Latch load buffers to a0r, a0i, b0r, b0i, a1r, a1i, b1r, b1i
  BFLY0.LU   Load buffers a0, a1; compute r0r, s0r;
             move r0r -> r0r_st, r1r -> r1r_st, s0r -> s0r_st, s1r -> s1r_st
  BFLY1.LU   Load buffers b0, b1; compute r0i, s0i;
             move r0i -> r0i_st, s0i -> s0i_st
  BFLY2      Compute r1r, s1r
  BFLY3      Latch load buffers to a0r, a0i, b0r, b0i, a1r, a1i, b1r, b1i;
             compute and move r1i -> r1i_st, s1i -> s1i_st

LOOP:
  BFLY0.LU   Load buffers a0, a1; compute r0r, s0r;
             move r0r -> r0r_st, r1r -> r1r_st, s0r -> s0r_st, s1r -> s1r_st
  BFLY1.LU   Load buffers b0, b1; compute r0i, s0i;
             move r0i -> r0i_st, s0i -> s0i_st
  BFLY2.SU   Compute r1r, s1r;
             store r0r_st, r0i_st, r1r_st, r1i_st
  BFLY3.SU   Latch load buffers to a0r, a0i, b0r, b0i, a1r, a1i, b1r, b1i;
             compute and move r1i -> r1i_st, s1i -> s1i_st;
             store s0r_st, s0i_st, s1r_st, s1i_st

FIGURE 8. TIE LANGUAGE INSTRUCTIONS FOR THE FFT ALGORITHM

The first seven stages, before the LOOP section of Figure 8, show the two peeled trips through the while() loops in the C code shown above. The first BFLY0.LU and BFLY1.LU instructions compute meaningless results, so these computations are not shown. OPLCH is a TIE instruction that performs the latching of BFLY3.SU without the store to memory; it only latches the values needed in the next four processor cycles. In those four cycles, BFLY0.LU and BFLY1.LU are executed again, followed by variants of the other TIE instructions (BFLY2, BFLY3) that do not perform stores, as there is no meaningful data yet to be stored.

Note that the way the operations are scheduled affects the number of functional units (e.g. multipliers) and buffers needed. The decision-making involved in scheduling computations and choosing hardware resources is more an art than a procedure.

4 Complete C Implementation with TIE

Finally, the C code in Appendix A (the function r2_c_tie_fft) incorporates all of the TIE instructions used. The complete TIE description is also given in Appendix A. Additional instructions, such as BFLY0_LU_SWAP, are necessary because the final 2-point FFTs can be performed on individual values contained within a single 128-bit load. Unlike the prior FFT stages, in which values obtained from separate loads were used to compute the results, the final stage must be performed using values obtained from single loads.

5 Performance

For N-point FFTs of length N = 128, 256, 512, and 1024, four different implementations are compared:

a. the C implementation compiled for a base Xtensa processor using software multiplies and -O3 compiler optimization, without use of any TIE or additional hardware functional units;
b. the C implementation compiled using the -O3 optimization, for a base Xtensa processor with the MUL32 functional unit (and no TIE extensions in use);
c. the C implementation compiled for an Xtensa processor using the -O3 optimization as well as the TIE instructions developed for this FFT;
d. a hand-coded assembly language routine, structured differently from the C code implementations, assembled for an Xtensa processor using the TIE instructions developed for this FFT.

Table 2 shows the resulting code size and cycle counts of each implementation for different FFT input lengths, and also the performance improvement factor comparing (d), the hand-optimized assembly implementation, with (b), the C code compiled for the Xtensa processor including the MUL32 unit. Note that the code size given for (a), the C implementation using software multiplies, does not include the multiplication libraries required to execute the program. The code size of (b) is reduced to 430 bytes because a 32-bit multiplier in hardware is used.

TABLE 2. CODE SIZE AND PERFORMANCE RESULTS

Columns: a = C (with software multiplies); b = C (using MUL32); c = C (with TIE); d = Assembly (with TIE); E = b/d = Performance Improvement.

                          a                  b      c      d      E = b/d
Code Size (Bytes)         433 + libraries    430
Performance (cycles)
  128-point
  256-point
  512-point
  1024-point

6 Conclusion

Algorithms such as the FFT can be implemented using TIE instructions for the Xtensa processor. This allows for efficient execution using hardware optimizations encapsulated in instruction extensions and for efficient software development using those instructions. This application note describes an efficient implementation of an N-point decimation-in-frequency FFT for Xtensa processors. The code found in Appendix A can be compiled and run to demonstrate the performance claimed.

IMPLEMENTATION NOTE: To build an Xtensa processor configuration to simulate or synthesize the FFT implementation, there are two requirements. First, as a fundamental requirement, the configuration must use a 128-bit wide Processor Interface (PIF). Second, the TIE and software included in this implementation support a Little Endian configuration. To use a Big Endian configuration, the TIE and code can be adjusted accordingly.
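The Appendix A code forces its twiddle array onto a 16-byte boundary, presumably to match the 128-bit loads, by casting the pointer to unsigned. That cast assumes 32-bit pointers, which may not hold when the C stubs are run in a native environment. The following is a minimal sketch of a pointer-width-independent variant, assuming a C99 toolchain that provides uintptr_t; the names align16 and alloc_aligned_ints are hypothetical and do not appear in the application note.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Round an address up to the next multiple of 16 bytes (128 bits). */
static int *align16(void *raw)
{
    uintptr_t addr = (uintptr_t) raw;
    addr = (addr + 0xf) & ~(uintptr_t) 0xf;
    return (int *) addr;
}

/* Allocate room for `count` ints starting on a 16-byte boundary.
 * *raw_out receives the pointer that must later be passed to free(). */
static int *alloc_aligned_ints(size_t count, void **raw_out)
{
    void *raw;

    assert(raw_out != NULL);
    raw = malloc(count * sizeof(int) + 0xf);  /* up to 15 bytes of slack */
    *raw_out = raw;
    return (raw != NULL) ? align16(raw) : NULL;
}
```

Keeping the original malloc pointer separately is a deliberate choice in this sketch: the aligned pointer may differ from the one malloc returned, so only the original can be handed back to free().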

7 Glossary of Terms

Butterfly
The term used to describe the computation that comprises the inner loop of a radix-k FFT algorithm that implements an N-point Discrete Fourier Transform. In the radix-2 implementation, the butterfly computation sums input points N/2 samples apart to obtain one output; the second output is obtained by differencing the inputs and multiplying by a twiddle factor. The term was derived from the shape of the pictorial representation, in which inputs a and b produce the outputs a + b and c x (a - b), where c is the twiddle factor.

Decimation-in-frequency
Of the two main types of FFT implementations of the DFT that recursively divide the problem in half, this method involves decimating the output (frequency-domain) sequence to simplify the DFT computation.

DFT
The Discrete Fourier Transform member of the Fourier Transform family; the DFT operates on discrete-time input and produces discrete frequency samples as output. For a common definition, see the equation in Figure 1.

MUL32
The Xtensa processor option for 32-bit fixed-point multiplication, integrated into the CPU and Xtensa software toolset as a set of optional instructions.

N
The number of input time-domain samples used in computing the FFT; also the number of resulting output frequency-domain samples.

Radix-2
The method that allows expression of an N-point Discrete Fourier Transform as two N/2-point Discrete Fourier Transforms.

TIE (Tensilica Instruction Extension) Language
The language that defines custom instructions that are added to the basic instruction set of an Xtensa processor.

Twiddle factor
In the Discrete Fourier Transform equation, these are the exponential factors which become multipliers of the x[n] input sequence to compute the output X[k] sequence. For an N-point DFT there are $N/2 + N/4 + \cdots + 2$ twiddle factors, whose values are $W_m^n = e^{-j(2\pi/m)n}$ for $m = N, N/2, \ldots, 2$ and $n = 0, 1, \ldots, m/2$. Figure 2 shows how the twiddle factors are used in a DFT.
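The Butterfly entry can be made concrete with a small fixed-point sketch. It reuses the DESCALE macro and the 16 fractional bits of the Appendix A listing; the function name dif_butterfly and the constant name round_q16 are hypothetical (the appendix calls the rounding constant round). Like the appendix code, it relies on right shifts of negative values being arithmetic, which C leaves implementation-defined.

```c
#include <assert.h>

/* Fixed-point helper as in Appendix A: add the rounding term, then drop
 * the 16 fractional bits of the twiddle factor. */
#define DESCALE(x, r) ((int) (((x) + (r)) >> 16))

static const int round_q16 = (1 << 15);  /* one half in the 16-bit fraction */

/* One radix-2 decimation-in-frequency butterfly on (real, imaginary)
 * pairs: r = a + b and s = c * (a - b), where c is the twiddle factor
 * scaled by 2^16. */
static void dif_butterfly(const int a[2], const int b[2], const int c[2],
                          int r[2], int s[2])
{
    long long tr, ti;

    assert(a && b && c && r && s);

    /* Differences are widened so the products below use 64 bits. */
    tr = (long long) a[0] - b[0];
    ti = (long long) a[1] - b[1];

    r[0] = a[0] + b[0];
    r[1] = a[1] + b[1];
    s[0] = DESCALE(tr * c[0], round_q16) - DESCALE(ti * c[1], round_q16);
    s[1] = DESCALE(tr * c[1], round_q16) + DESCALE(ti * c[0], round_q16);
}
```

For example, with a = 3, b = 1 (both purely real) and a twiddle factor of 1.0, stored as 65536 with a zero imaginary part, the outputs are r = 4 and s = 2.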

Appendix A Complete Application Code

This appendix includes the code described in this application note: radix_2.c, radix_2_hand_code.s, bfly.tie, and util.h.

radix_2.c

/*
 * FFT implementation for Xtensa Processors
 * Radix-2 Decimation-in-frequency
 *
 * Copyright (c) Tensilica Inc.
 *
 * These coded instructions, statements, and computer programs are
 * Confidential Proprietary Information of Tensilica Inc. and may not be
 * disclosed to third parties or copied in any form, in whole or in part,
 * without the prior written consent of Tensilica Inc.
 */

#include <malloc.h>
#include <math.h>
#include <assert.h>
#include <sys/types.h>
#include <sys/times.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "util.h"

/* Add the "#include" statement below to include the
 * C stub simulation of the TIE instructions.
 * Use this when confirming TIE instruction functionality
 * in a native environment.
 *
 * #include "tdk/cstub-bfly.c"
 */

int **dif_twiddles_2;
int *dif_first_twiddle;

#define DEFAULT_BPT 16
static unsigned int BPT = DEFAULT_BPT;

/* Convert a floating-point value to fixed point with bpt fractional bits */
#define FIX(x, bpt) ((int) ((x) * (1 << (bpt)) + 0.5))

const double pi = 3.14159265358979323846;

extern void r2_tie_fft(int *data, int n, int *first_twiddle);

/* Initialize Twiddle factors
 *
 * n is twice the number of complex numbers in the input.
 */
void init_r2_twiddles(int n, int log_2_n)
{
    int td, i;

    dif_twiddles_2 = (int **) malloc((log_2_n - 1) * sizeof(int *));
    dif_twiddles_2[0] = (int *) malloc((n + 4) * sizeof(int) - 1);

    /* Align the array by force */
    if ((((unsigned) dif_twiddles_2[0]) & 0xf) != 0) {
        *((unsigned *) &dif_twiddles_2[0]) += 0x10;
        *((unsigned *) &dif_twiddles_2[0]) &= ~0xf;
    }

    n >>= 1;
    for (i = 0; i < n; i += 2) {
        double angle = -(pi * i) / n;

        dif_twiddles_2[0][i + 0] = FIX(cos(angle), BPT);
        dif_twiddles_2[0][i + 1] = FIX(sin(angle), BPT);
    }

    n >>= 1;
    /* Now n is the number of (data, not twiddle) complex numbers
       at this subproblem level. */
    for (td = 1; td < log_2_n - 2; td += 1, n >>= 1) {
        dif_twiddles_2[td] = dif_twiddles_2[td - 1] + (n << 1);
        for (i = 0; i < n; i += 2) {
            dif_twiddles_2[td][i + 0] = dif_twiddles_2[td - 1][2 * i + 0];
            dif_twiddles_2[td][i + 1] = dif_twiddles_2[td - 1][2 * i + 1];
        }
    }

    /* Last level is special; we duplicate the twiddle factor 1
       because of the SIMD load. */
    dif_twiddles_2[td] = dif_twiddles_2[td - 1] + (n << 1);
    dif_twiddles_2[td][0] = dif_twiddles_2[td][2] = dif_twiddles_2[td - 1][0];
    dif_twiddles_2[td][1] = dif_twiddles_2[td][3] = dif_twiddles_2[td - 1][1];

    dif_first_twiddle = dif_twiddles_2[0];
}

/* Sample bit reversal computation for the
 * bit_reverse_complex_array() routine below */
int bit_reverse(int i, int p)
{
    int a = 0;
    int j;

    for (j = 0; j < p; ++j) {
        a = (a << 1) + (i & 0x01);
        i = i >> 1;
    }
    assert(i == 0);
    return a;
}

/* Sample bit reversal routine to rearrange elements of an array 'a'
 * It must be the case that n == 2^p. n is the number of
 * (real, imaginary) complex pairs in the array a. */
void bit_reverse_complex_array(int a[], int n, int p)
{
    int i, j;
    int t;

    for (i = 0; i < n; ++i) {
        j = bit_reverse(i, p);
        if (j > i) {
            /* real part */
            t = a[2 * i + 0];
            a[2 * i + 0] = a[2 * j + 0];
            a[2 * j + 0] = t;
            /* imaginary part */
            t = a[2 * i + 1];
            a[2 * i + 1] = a[2 * j + 1];

            a[2 * j + 1] = t;
        }
    }
}

static void usage(const char *const pname)
{
    printf("usage: %s [-p <bits_of_precision>] infile\n", pname);
    exit(-1);
}

/* Add the rounding term r, then drop the 16 fractional bits */
#define DESCALE(x, r) ((int) (((x) + (r)) >> 16))
const int round = (1 << 15);

/* C routine, unoptimized */
void r2_fft(int *data, int n, int *twiddles)
{
    register int i;
    register int l = n;
    register int *t = twiddles;

    while (l > 1) {
        for (i = 0; i < l; i += 2) {
            register int *p = data + i;
            register int *q = p;
            register int wr = *(t++);
            register int wi = *(t++);

            while (p < &data[n << 1]) {
                register int d0r, d0i;
                register int d1r, d1i;
                register int r0r, r0i;
                register int r1r, r1i;
                register long long tr, ti;

                /* Load the two butterfly inputs, l ints apart */
                d0r = *(p + 0);
                d0i = *(p + 1);
                p += l;
                d1r = *(p + 0);
                d1i = *(p + 1);
                p += l;

                /* Sum output, and difference multiplied by the twiddle */
                r0r = d0r + d1r;
                r0i = d0i + d1i;
                tr = d0r - d1r;
                ti = d0i - d1i;
                r1r = DESCALE(tr * wr, round) - DESCALE(ti * wi, round);
                r1i = DESCALE(tr * wi, round) + DESCALE(ti * wr, round);

                /* Store the results in place */
                *(q + 0) = r0r;
                *(q + 1) = r0i;
                q += l;
                *(q + 0) = r1r;
                *(q + 1) = r1i;
                q += l;
            }
        }
        l >>= 1;
    }
}
