Implementing the Fast Fourier Transform for the Xtensa Processor
- Kelly Phillips
Application Note
Tensilica, Inc
Scott Blvd. Santa Clara, CA (408) Fax (408)
November 2005
Doc Number: AN
2005 Tensilica, Inc. Printed in the United States of America. All Rights Reserved.

This publication is provided "AS IS." Tensilica, Inc. (hereafter "Tensilica") does not make any warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Information in this document is provided solely to enable system and software developers to use Tensilica processors. Unless specifically set forth herein, there are no express or implied patent, copyright or any other intellectual property rights or licenses granted hereunder to design or fabricate Tensilica integrated circuits or integrated circuits based on the information in this document. Tensilica does not warrant that the contents of this publication, whether individually or as one or more groups, meets your requirements or that the publication is error-free. This publication could include technical inaccuracies or typographical errors. Changes may be made to the information herein, and these changes may be incorporated in new editions of this publication.

Tensilica is a registered trademark of Tensilica, Inc. The following terms are trademarks of Tensilica, Inc.: OSKit, Vectra, and Xtensa. All other trademarks and registered trademarks are the property of their respective companies.

NO LIABILITY FOR CONSEQUENTIAL DAMAGES. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT SHALL Tensilica BE LIABLE FOR ANY DAMAGES WHATSOEVER (INCLUDING WITHOUT LIMITATION, SPECIAL, INCIDENTAL, CONSEQUENTIAL, OR INDIRECT DAMAGES FOR PERSONAL INJURY, LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION, LOSS OF BUSINESS INFORMATION, OR ANY OTHER PECUNIARY LOSS) ARISING OUT OF THE USE OF OR INABILITY TO USE THIS PRODUCT CODE, EVEN IF MANUFACTURER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Document Change History:
First Published, February 2001, revised 3/2005
Updated for LX/Xtensa 6, 11/2005
Contents
1 Introduction
2 Basic C Implementation
3 Development of TIE Language Instructions from the C Algorithm
    General Cases for Loop Pipelining
        Case 1
        Case 2
    TIE Development Steps
4 Complete C Implementation with TIE
Performance
Conclusion
Glossary of Terms
Appendix A Complete Application Code
    radix_2.c
    radix_2_hand_code.s
    bfly.tie
    util.h
    util.c
Appendix B Application Output
Appendix C Import and use of workspace
Figures
Figure 1. Definition of the Discrete Fourier Transform
Figure 2. Decimation-in-Frequency DFT in Terms of Even- and Odd-numbered Frequency Samples
Figure 3. FFT Decimation-in-Frequency Algorithm
Figure 4. Basic Implementation of Radix-2 Decimation-in-Frequency FFT Algorithm
Figure 5. Complex Butterfly Computation
Figure 6. First Step in Converting FFT Algorithm into TIE Implementation
Figure 7. Showing Direct Correspondence of C Code to TIE Language Instructions
Figure 8. TIE Language Instructions for the FFT Algorithm

Tables
Table 1. Performance Results of Xtensa FFT Implementation
Table 2. Code Size and Performance Results
Abstract
The goal of this application note is to show the results and design methodology for a high-performance DSP sample application on the Xtensa microprocessor using a widely known example, the Fast Fourier Transform (FFT). This note first explains the basic algorithm and how several TIE language instructions were created to implement the FFT algorithm. Performance results follow, with a comparison of implementations of the radix-2 decimation-in-frequency FFT with and without additional TIE language extensions. This application note assumes the reader has a basic understanding of the ASIC design methodology used with Xtensa processors and some familiarity with digital signal processing.

Preface
The Implementing the Fast Fourier Transform for the Xtensa Processor application note was updated for the LX and Xtensa 6 release of the Xtensa processor.

Location Old: New:
Appendix A radix_2.c
    /* Add this "# include" statement below, to include
     * C stub simulation of TIE instructions.
     * Used when when
     * Use this when confirming TIE instruction functionality
     * in native environment.
     *
     * #include "tdk/cstub-bfly.c" */
    #include "tdk/cstub-bfly.c"

Location: Appendix A util.c, end of source
Old: return 0;
New: return 0;

Additional Notes: Section 4 Complete C Implementation with TIE (Addendum)
The testbench requires an input file that describes the time-domain complex samples for the FFT algorithm to process. The input file is an ASCII file. The first line of this file states the number of floating point values on the following lines of the file. Subsequent lines contain the floating point data samples. The testbench automatically converts each floating point sample to 16-bit fixed point format. For example, a 256-point complex data set has 256 real and 256 imaginary samples. Therefore, 512 data points are listed. The samples are given as floating point numbers from 1 to [...]. Complex pairs are listed with the real sample first and the imaginary sample second.
Each sample is listed on a new line.
An example of the ASCII input file for a square wave is shown below:
(pattern repeats 63 more times)

Build and exercise the FFT testbench for the Xtensa LX/6 processor. Import the workspace into Xplorer and select the Run or Profile button. Alternatively, enter the following commands after the configuration is installed and the environment is properly set up:

>tc -d tdk bfly.tie
>set XTENSA_PARAMS=.\tdk (or setenv XTENSA_PARAMS ./tdk, for Unix/Linux)
>xt-gcc -O3 radix_2.c -lm util.c radix_2_hand_code.s -o fft
>xt-run --pipe fft <input_file>
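The input file layout described in the addendum can be sketched in C as follows. This is only an illustration: `format_fft_input` is a hypothetical helper that does not appear in the application code, and the `%f` number format is an assumption, since the note does not state how the testbench parses each value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of the note's util.c): formats n complex
 * samples into buf using the layout described above. The first line holds
 * the number of floating point values (2 * n); every following line holds
 * one value, real part first, then imaginary part.
 * Returns the number of characters written, or -1 if buf is too small. */
int format_fft_input(char *buf, size_t size,
                     const double *re, const double *im, int n)
{
    size_t used;
    int i, w;

    assert(buf != NULL && re != NULL && im != NULL && n > 0);
    w = snprintf(buf, size, "%d\n", 2 * n);
    if (w < 0 || (size_t)w >= size)
        return -1;
    used = (size_t)w;
    for (i = 0; i < n; i++) {
        /* One value per line: real sample, then imaginary sample. */
        w = snprintf(buf + used, size - used, "%f\n%f\n", re[i], im[i]);
        if (w < 0 || (size_t)w >= size - used)
            return -1;
        used += (size_t)w;
    }
    return (int)used;
}
```

The resulting buffer can be written to a file with `fputs` and passed to the testbench as `<input_file>`.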
1 Introduction
One of the most widely used algorithms in digital signal processing applications is the Fast Fourier Transform (FFT), which covers a family of techniques for computing the Discrete Fourier Transform (DFT), as defined in the equation in Figure 1. The FFT is a large class of algorithms applicable to a variety of signal processing applications such as audio, video, and speech. For the purpose of this application note, one of these algorithms is highlighted for its simplicity, and is used to show how to achieve performance improvements on Xtensa processors.

$$X[k] = \frac{1}{N} \sum_{n=0}^{N-1} x[n]\, e^{-j\left(\frac{2\pi}{N}\right) n k}, \qquad k = 0, 1, \ldots, N-1$$

FIGURE 1. DEFINITION OF THE DISCRETE FOURIER TRANSFORM.

In Figure 1, X[k] are the N frequency samples being computed, x[n] are the time-domain input samples, n is the index of the time-domain samples, and k is the index of the frequency-domain samples. Also, j is defined such that $j^2 = -1$, following a common convention. Note that for convenience, the 1/N scaling factor is omitted in the Xtensa implementation described in this document.

The purpose of this application note is to explain an optimized implementation of the FFT for the Xtensa processor using Tensilica Instruction Extension (TIE) language constructs. The TIE constructs are optimized instructions that are integrated into the Xtensa processor pipeline. Using the TIE language involves considering which operations can benefit most from hardware optimization. This application note will show that optimizing the FFT butterfly operations with TIE language hardware extensions attains results competitive with high-performance DSP implementations of the FFT.

The following table summarizes the performance obtained by using the C code and hand-optimized assembly code, both using TIE language instructions. This application note focuses on TIE language development for the case of the FFT algorithm implemented in C.
The development of the assembly implementation is not discussed in detail in this application note, as it incorporates the same TIE instructions as the C implementation. The performance improvement of the assembly code is attributable to its special structure, which differs from that of the C implementation. In addition to the TIE optimization, we applied IPA (Interprocedural Analysis) and Profile Directed Compilation. Both are Xtensa-specific optimizations provided by the Xtensa tool chain.

TABLE 1. PERFORMANCE RESULTS OF XTENSA FFT IMPLEMENTATION
With TIE: C, Assembly
Code Size (Bytes)
Performance (cycles): 128-point, 256-point, point, point

For a complete background and detailed derivations of the various FFT algorithms, refer to one of the many textbooks on the subject (for example, Discrete-Time Signal Processing by Oppenheim, et al., Prentice-Hall, 1999).

Also, please note that Tensilica offers a set of FFT software libraries for the Vectra™ DSP Engine, available as a coprocessor option to the Xtensa processor. The Vectra engine is optimized for a broad array of fixed-point arithmetic functions. The techniques described in this
application note offer a dedicated solution for acceleration of the FFT algorithm with TIE instructions. To understand which approach is most appropriate for your application, please refer to the Vectra DSP Engine User's Guide, or consult an Applications Engineer of Tensilica, Inc.

2 Basic C Implementation
When N is divisible by 2, the N-point DFT can be recursively decomposed into two N/2-point DFTs. This decomposition is the basis for what is commonly described as a radix-2 FFT algorithm. There are two canonical ways to perform a decomposition into two DFTs. One way, known as decimation-in-time, uses the even and odd time-domain samples (decimated sequences) as input to the two DFTs. The second approach is to decompose into two DFTs such that their results are the even and odd frequency-domain sample sequences, or decimated output sequences; this is known as decimation-in-frequency.

The decimation-in-frequency decomposition of the DFT simplifies the DFT summation, defined previously in Figure 1, by decimating the frequency-domain samples by 2. The summation is split into two summations across the even-numbered frequency samples X[2k] for k = 0, 1, ..., N/2 - 1, as well as the odd-numbered samples X[2k + 1] for k = 0, 1, ..., N/2 - 1. The decomposition is simplified into two equations that represent the DFT results in the even and odd cases; the two equations, written without the 1/N scaling factor, are given in Figure 2:

$$X[2r] = \sum_{n=0}^{(N/2)-1} \left( x[n] + x[n + N/2] \right) e^{-j\left(\frac{2\pi}{N/2}\right) n r}, \qquad r = 0, 1, \ldots, (N/2) - 1$$

$$X[2r+1] = \sum_{n=0}^{(N/2)-1} \left( x[n] - x[n + N/2] \right) e^{-j\left(\frac{2\pi}{N}\right) n}\, e^{-j\left(\frac{2\pi}{N/2}\right) n r}, \qquad r = 0, 1, \ldots, (N/2) - 1$$

FIGURE 2. DECIMATION-IN-FREQUENCY DFT IN TERMS OF EVEN- AND ODD-NUMBERED FREQUENCY SAMPLES

It is convenient to define a set of factors $W_N^n = e^{-j\left(\frac{2\pi}{N}\right) n}$ over the set $n \in \{0, 1, \ldots, N/2 - 1\}$ for the computation of the odd-numbered samples in the DFT decomposition. As shown in Figure 2, performing the decomposition involves additions, subtractions, and multiplications by the factors $W_N^n$.
Additional decomposition stages require the factors $W_{N/2}^n$, $W_{N/4}^n$, $W_{N/8}^n$, and so on. When N is a power of 2, the DFT can be implemented by $\log_2(N)$ such recursive decompositions. The quantities $W_m^n = e^{-j\left(\frac{2\pi}{m}\right) n}$, for $m = N, N/2, \ldots, 2$ and $n = 0, 1, \ldots, m/2 - 1$, required in the FFT stages are referred to as twiddle factors. Any stage can be implemented by the original definition (in Figure 1) or by a decomposition. If all $\log_2(N)$ stages are implemented by the decimation-in-frequency decomposition, then each stage requires only adds, subtracts, and multiplies by twiddle factors. Figure 3 depicts the radix-2 decimation-in-frequency FFT. This paper will focus on implementing an N-point decimation-in-frequency FFT computation for complex input values.
[Figure 3: signal-flow diagram. The butterfly outputs feed two N/2-point DFTs; the inputs to the second DFT are multiplied by the twiddle factors $W_N^0, W_N^1, W_N^2, W_N^3, \ldots, W_N^{N/2-2}, W_N^{N/2-1}$.]
FIGURE 3. FFT DECIMATION-IN-FREQUENCY ALGORITHM

Our FFT algorithm implementations use the following conventions:
- The complex N-point input to the FFT is organized as an array of 2N components, with elements alternating between the real and imaginary components of the N-point input sequence.
- The twiddle factors are stored as a set of arrays used as input to the FFT algorithm: $\{ W_N^0 \ldots W_N^{N/2-1};\; W_{N/2}^0 \ldots W_{N/2}^{N/4-1};\; W_{N/4}^0 \ldots W_{N/4}^{N/8-1};\; \ldots \}$. The real and imaginary components of the twiddle factors alternate in the sequence. Each successive sequence contains half as many twiddle factors as the previous one.
- The computation assumes a 24-bit fixed-point representation with 16 places to the right of the binary point.

The basic algorithm implemented in C is shown in Figure 4.
    #define DESCALE(x, r) ((int) (((x) + (r)) >> 16))

    const int round = (1 << (15));

    void r2_fft_basic(int *data, int n, int *twiddles)
    {
        register int i, d0r, d0i, d1r, d1i, r0r, r0i, r1r, r1i;
        register int *p, *q, wr, wi;
        register long long tr, ti;
        register int m = n;
        register int *t = twiddles;

        while (m > 1) {                        /* Outer loop (Decomposition into stages)*/
            for (i = 0; i < m; i += 2) {       /* Middle loop */
                p = data + i;
                q = p;
                wr = *(t++);
                wi = *(t++);
                while (q < &data[n << 1]) {    /* Inner loop */
                    d0r = *(p + 0);
                    d0i = *(p + 1);
                    p += m;
                    d1r = *(p + 0);
                    d1i = *(p + 1);
                    p += m;
                    r0r = d0r + d1r;
                    r0i = d0i + d1i;
                    tr = d0r - d1r;
                    ti = d0i - d1i;
                    r1r = DESCALE(tr * wr, round) - DESCALE(ti * wi, round);
                    r1i = DESCALE(tr * wi, round) + DESCALE(ti * wr, round);
                    *(q + 0) = r0r;
                    *(q + 1) = r0i;
                    q += m;
                    *(q + 0) = r1r;
                    *(q + 1) = r1i;
                    q += m;
                }
            }
            m >>= 1;
        }
    }

FIGURE 4. BASIC IMPLEMENTATION OF RADIX-2 DECIMATION-IN-FREQUENCY FFT ALGORITHM

Here, the focus is on the series of radix-2 decompositions of an N-point DFT in which N is a power of 2; there are $\log_2(N)$ stages of butterfly operations performed on the input. At the top of the algorithm, the DESCALE macro is defined both to keep 24 bits of accuracy in the result of a multiply operation and to maintain consistent precision between the C implementation and the Xtensa implementation that uses hardware defined in the TIE language, so that both give equally accurate results. Adding the round value to a 64-bit integer and then scaling the sum by $2^{-16}$ effectively converts the 64-bit integer into a fixed-point value, rounded to the least-significant bit.

The outer while() loop counts the $\log_2(N)$ successive stages of DFT computations. The for() loop, called the middle loop, initializes both p, to point to the first data element of the current stage, and q, to point to the same data element, which will be overwritten with the output data.
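The effect of the DESCALE macro can be seen on a single fixed-point multiply. The sketch below reuses the macro from Figure 4; `q16_mul` and `round_half` are hypothetical names introduced only for this illustration. Note that right-shifting a negative value is implementation-defined in C; the sketch assumes an arithmetic shift, as performed by GCC-based compilers.

```c
#include <assert.h>

#define DESCALE(x, r) ((int) (((x) + (r)) >> 16))

/* Half of one least-significant bit of the 16-fractional-bit result. */
static const int round_half = 1 << 15;

/* Hypothetical helper: multiplies two values that each carry 16 fractional
 * bits. The 64-bit product carries 32 fractional bits; DESCALE adds half an
 * LSB and shifts right by 16 to return to 16 fractional bits with rounding. */
int q16_mul(int a, int b)
{
    long long product;

    assert(sizeof(long long) * 8 >= 64);
    product = (long long)a * (long long)b;
    return DESCALE(product, round_half);
}
```

For example, 1.5 is stored as 98304 and 0.5 as 32768; their product is returned as 49152, which is 0.75 in the same format. A product that falls exactly halfway between two representable values is rounded upward.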
The innermost while() loop, called the inner loop, traverses the input elements for the current butterfly stage. For each stage h, where $h \in \{1, 2, \ldots, \log_2(N)\}$, there are $2^{h-1}$ subproblems, that is, DFTs of length $N/2^{h-1}$ each. Each subproblem has a set of $N/2^{h}$ butterflies that use two values each as input and compute two output values. The real and imaginary parts of the butterfly computation are shown in Figure 5.

[Figure 5: butterfly diagram. Inputs ar, ai, br, bi. Upper outputs: ar + br and ai + bi. Lower outputs: the difference (ar - br) + j(ai - bi) multiplied by the complex twiddle factor cr + j ci.]
FIGURE 5. COMPLEX BUTTERFLY COMPUTATION

In each stage, however, each pass through the inner loop computes one set of butterfly results per subproblem. This order of computation allows the same twiddle factor to be reused for all subproblems in the same FFT stage. Once the final group of butterflies has been computed for all $2^{h-1}$ subproblems, the next stage is h + 1, as m is further divided by 2 to prepare for another trip through the outer while() loop.

3 Development of TIE Language Instructions from the C Algorithm
This section explains the function of the TIE language implementation. TIE instructions encapsulate functionality of the algorithm in hardware. For more detail on the TIE language basics, refer to the Tensilica Instruction Extension (TIE) Language User's Guide. The goal of this document is to show the functionality of the TIE implementation of the FFT algorithm by converting the simple C implementation, step by step, into an efficient C routine that uses TIE instructions.

The goal for the TIE implementation of the radix-2 FFT is to achieve maximal performance with a reasonable hardware cost. TIE allows an arbitrary number of parallel operations to be performed in a single instruction. However, it is not possible to perform more than one load or store per instruction or per cycle. Therefore, one general TIE programming technique is to fold computation operations into TIE load or store instructions.
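The butterfly of Figure 5 can be written as a stand-alone function that mirrors the arithmetic in the inner loop of Figure 4. This is an illustrative sketch: the `cq16` type and the function names are hypothetical, and an arithmetic right shift of negative values is assumed, as in the DESCALE macro.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type with 16 fractional bits per component. */
typedef struct { int re; int im; } cq16;

/* Rounded fixed-point multiply, equivalent to DESCALE(a * b, 1 << 15). */
static int mul_round(long long a, int b)
{
    return (int)((a * (long long)b + (1 << 15)) >> 16);
}

/* One radix-2 decimation-in-frequency butterfly:
 *   r = a + b            (upper wing)
 *   s = w * (a - b)      (lower wing, w is the twiddle factor) */
void dif_butterfly(cq16 a, cq16 b, cq16 w, cq16 *r, cq16 *s)
{
    long long tr = (long long)a.re - b.re;   /* real part of a - b */
    long long ti = (long long)a.im - b.im;   /* imaginary part of a - b */

    assert(r != NULL && s != NULL);
    r->re = a.re + b.re;
    r->im = a.im + b.im;
    /* Complex multiply by the twiddle factor: four real multiplies. */
    s->re = mul_round(tr, w.re) - mul_round(ti, w.im);
    s->im = mul_round(tr, w.im) + mul_round(ti, w.re);
}
```

Each butterfly therefore costs four multiplies, which is the figure used later when sizing the multiplier hardware.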
If every computational instruction is combined with a load or store instruction, a load or store instruction will be issued on each cycle. Without algorithmic changes, such an approach leads to an optimal number of issued instructions. For many algorithms, including the radix-2 FFT, it is possible to fold all computational instructions into load or store instructions with reasonable hardware costs.

Even when the number of instructions is minimized, folded instructions might have large latencies that are either infeasible to build or lead to stalls in the pipeline. Adding hardware resources can eliminate this problem. New data can be loaded in parallel with computations done on old data, and even older results can be stored in parallel with the current computation. This technique applies as long as different iterations of a loop do not depend on each other, and it comes at the cost of additional hardware to buffer the results. The technique is popular in software and, in that context, is called software pipelining; here the pipeline is managed in hardware.

General Cases for Loop Pipelining
To begin optimizing the FFT inner loop with TIE instructions, it makes sense to look at ways to optimize load-operation-store loops in general. This loop structure is applicable because the
inner loop of the general FFT algorithm in Figure 4 consists of loads, then computation, then stores. The loads bring in four input values: the new ar, ai, br, and bi. The computation creates four output values, rr, ri, sr, and si, and the outputs are stored at the end of the loop body. While the Xtensa architecture issues at most one load or store per processor cycle, the TIE instructions can be written with computations in parallel, limited by the amount of compute hardware available. Unrolling the load-operation-store loop offers the advantage of storing a previous result and computing the next result in parallel. The following two cases find the maximum latency allowed for computation.

Case 1
Case 1 presents a load-operation-store pipeline loop with two inputs and one output and a computation latency of n cycles.

[Diagram: software-pipelined schedule for Case 1. The memory slots issue load A, load B, and the store of the previous result C in sequence, one per cycle, while the computation of C for the current iteration runs in parallel; bars mark compute latencies of 1, 2, and 3 cycles.]

In this diagram, the loop is software pipelined and the loop iteration is indicated by subscripts. The first two loads of values A and B are not shown, but the second two loads (load2) are followed by a store instruction for the previous result (store1). Two more loads follow, which gives up to three cycles during which the latency of a compute operation can be hidden. Hence, for n=1, n=2, or n=3 cycles, there is no cycle-count penalty for performing a computation in parallel with the loads and stores. Thus, TIE instructions with appropriate loop pipelining can be used to perform computations in parallel with the loads and stores.

Case 2
Case 2 presents a load-operation-store pipeline loop with two inputs, two outputs, and a computation latency of n cycles.
[Diagram: software-pipelined schedule for Case 2. The memory slots alternate between two loads (A and B) and two stores (C and D) of an earlier iteration, one per cycle, while the computation of C and D runs in parallel; two bars mark compute latencies of 2 cycles each.]

In this case the goal is to compute two results, which requires two store instructions. Software pipelining this loop requires two cycles for the loads and two cycles for the stores; hence, it is possible to perform in parallel two computations whose latency is two cycles each. If the computation is on more than one set of input data, efficient use of the hardware can be made while hiding latency. This case is similar to the FFT implementation of this application note
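The Case 2 schedule can be mimicked in plain C to show where the buffering comes from. The sketch below is an illustration only: the function names and the sum/difference computation are hypothetical stand-ins for the butterfly, and the separate input and output arrays are a simplification. In the pipelined version, each trip loads element i, computes element i - 1, and stores element i - 2, so two peeled trips prime the pipeline and an epilogue drains it.

```c
#include <assert.h>

/* Straightforward load-compute-store loop: two inputs, two outputs. */
void sum_diff_plain(const int *a, const int *b, int *c, int *d, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        d[i] = a[i] - b[i];
    }
}

/* Same computation, software pipelined with explicit buffers. */
void sum_diff_pipelined(const int *a, const int *b, int *c, int *d, int n)
{
    int a_load, b_load;      /* load buffers */
    int a_op, b_op;          /* operands latched from the load buffers */
    int c_res, d_res;        /* results of the current computation */
    int c_store, d_store;    /* store buffers */
    int i;

    assert(n >= 2);
    /* Peeled trip 1: loads only. */
    a_load = a[0]; b_load = b[0];
    a_op = a_load; b_op = b_load;
    /* Peeled trip 2: loads and compute, nothing to store yet. */
    a_load = a[1]; b_load = b[1];
    c_res = a_op + b_op; d_res = a_op - b_op;
    a_op = a_load; b_op = b_load;
    /* Steady state: load element i, compute i - 1, store i - 2. */
    for (i = 2; i < n; i++) {
        a_load = a[i]; b_load = b[i];
        c_store = c_res; d_store = d_res;
        c_res = a_op + b_op; d_res = a_op - b_op;
        a_op = a_load; b_op = b_load;
        c[i - 2] = c_store; d[i - 2] = d_store;
    }
    /* Epilogue: drain the last two results. */
    c[n - 2] = c_res; d[n - 2] = d_res;
    c[n - 1] = a_op + b_op; d[n - 1] = a_op - b_op;
}
```

On a conventional processor the two versions do the same work; the benefit appears only when the loads, computation, and stores of different iterations are issued in the same cycle, which is what the TIE instructions provide.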
which uses a 2-cycle multiplier in each cycle on two sets of real and imaginary input (for 4 results) in the 4 slots available. The loop optimization described sets the tone for the way chosen to convert the general C algorithm into an efficient FFT implementation for an Xtensa processor with TIE.

TIE Development Steps
There are other optimization techniques applied to the FFT implementation. A technique called SIMD (Single Instruction Multiple Data) can accommodate simultaneous pipelining of independent computations. As Xtensa provides for 128-bit memory accesses, wide loads of four integer values at a time can address the need for two sets of two inputs for each butterfly. Two 128-bit loads best utilize the available hardware to perform two butterfly computations. The structure of loads and stores is SIMD in that the real and imaginary components of two consecutive inputs can be loaded in one cycle. Similarly, the components of two complex outputs can be stored in one cycle, allowing the latency of two butterfly computations to be hidden.

Figure 6 shows the result of several changes made to the basic C algorithm to create an optimized algorithm using TIE. First, the middle loop is unrolled and the bodies of the two resulting inner loop copies are combined. Duplicated variables perform two complex butterfly computations in the inner loop body. The following list summarizes the copies of variables:

Variable | Meaning
c0r, c0i, c1r, c1i | Twiddle factors (two sets of real and imaginary twiddle factors from adjacent elements of the twiddle factor array twiddles)
a0r, a0i, a1r, a1i | Input data
b0r, b0i, b1r, b1i | Input data at the alternate butterfly wing. b0 and b1 are each m array elements (m/2 complex samples) away from the corresponding input data a0 and a1
r0r, r0i, r1r, r1i | Results of butterfly computations (upper wing, which gives the sum of the butterfly inputs)
t0r, t0i, t1r, t1i | Intermediate values used to find outputs to current butterfly stage (lower wing)
s0r, s0i, s1r, s1i | Results of butterfly computations (lower wing, which gives the product of the twiddle factor and the difference of the inputs)

It is necessary to peel off the last trip through the outer loop because when m = 2 (the last stage of radix-2 decompositions), the butterfly computations are performed on consecutive input data. In prior butterfly computations, it is sufficient to use one address to load a0r + j a0i and a1r + j a1i, whose real and imaginary components are all consecutive in memory, and another address to load the values corresponding to b0r + j b0i and b1r + j b1i. But in the last trip, four consecutive values give a0r + j a0i and b0r + j b0i, since the two inputs to each butterfly correspond to consecutive output values from the previous stage. This special characteristic of the last trip explains why the values of a0r, a0i, b0r, and b0i are loaded in a different order in the last trip.
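The difference between the regular stages and the last stage can be made concrete with a small model of the two wide loads. This is a hypothetical sketch (the function `load_two_butterflies` does not appear in the application code): it gathers the inputs of two butterflies from two four-element loads, using the ordinary layout when m > 2 and the swapped layout when m = 2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the two wide loads that feed two butterflies.
 * On return, a = {a0r, a0i, a1r, a1i} and b = {b0r, b0i, b1r, b1i}.
 * p points at a0r; m is the stage stride in array elements. */
void load_two_butterflies(const int *p, int m, int a[4], int b[4])
{
    int first[4], second[4];
    int stride = (m > 2) ? m : 4;   /* the last stage steps by one wide load */
    int i;

    assert(p != NULL && m >= 2);
    for (i = 0; i < 4; i++) {
        first[i] = p[i];             /* first wide load */
        second[i] = p[stride + i];   /* second wide load */
    }
    if (m > 2) {
        /* Regular stage: the first load holds a0 and a1, the second b0 and b1. */
        for (i = 0; i < 4; i++) {
            a[i] = first[i];
            b[i] = second[i];
        }
    } else {
        /* Last stage: each load holds one complete butterfly (a, then b). */
        a[0] = first[0];  a[1] = first[1];
        b[0] = first[2];  b[1] = first[3];
        a[2] = second[0]; a[3] = second[1];
        b[2] = second[2]; b[3] = second[3];
    }
}
```

With the array {0, 1, ..., 7} and m = 4, a receives elements 0 to 3 and b receives elements 4 to 7; with m = 2, a receives elements 0, 1, 4, 5 and b receives elements 2, 3, 6, 7.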
    #define DESCALE(x, r) ((int) (((x) + (r)) >> 16))

    const int round = (1 << 15);

    void r2_fft_step1(int *data, int n, int *twiddles)
    {
        register int i, c0r, c0i, c1r, c1i, a0r, a0i, a1r, a1i;
        register int b0r, b0i, b1r, b1i, r0r, r0i, r1r, r1i, s0r, s0i, s1r, s1i;
        register long long t0r, t0i, t1r, t1i;
        register int m = n;
        register int *t = twiddles;
        register int *p, *q;

        while (m > 2) {                        /* Outer loop (Decomposition into stages)*/
            for (i = 0; i < m; i += 4) {       /* Middle loop */
                p = data + i;
                q = p;
                c0r = *(t++);                  /* SIMD: c[0:3] = t[0:3] */
                c0i = *(t++);
                c1r = *(t++);
                c1i = *(t++);
                while (q < &data[n << 1]) {    /* Inner loop */
                    a0r = *(p + 0);            /* SIMD: a[0:3] = p[0:3] */
                    a0i = *(p + 1);
                    a1r = *(p + 2);
                    a1i = *(p + 3);
                    p += m;
                    b0r = *(p + 0);            /* SIMD: b[0:3] = p[0:3] */
                    b0i = *(p + 1);
                    b1r = *(p + 2);
                    b1i = *(p + 3);
                    p += m;
                    r0r = a0r + b0r;
                    r0i = a0i + b0i;
                    t0r = a0r - b0r;
                    t0i = a0i - b0i;
                    s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round);
                    s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round);
                    r1r = a1r + b1r;
                    r1i = a1i + b1i;
                    t1r = a1r - b1r;
                    t1i = a1i - b1i;
                    s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round);
                    s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round);
                    *(q + 0) = r0r;            /* SIMD: q[0:3] = r[0:3] */
                    *(q + 1) = r0i;
                    *(q + 2) = r1r;
                    *(q + 3) = r1i;
                    q += m;
                    *(q + 0) = s0r;            /* SIMD: q[0:3] = s[0:3] */
                    *(q + 1) = s0i;
                    *(q + 2) = s1r;
                    *(q + 3) = s1i;
                    q += m;
                }
            }
            m >>= 1;
        }

        /* Final trip through Middle loop (case m = 2) */
        p = data;
        q = p;
        c0r = *(t++);
        c0i = *(t++);
        c1r = *(t++);
        c1i = *(t++);
        while (q < &data[n << 1]) {            /* Inner loop */
            a0r = *(p + 0);
            a0i = *(p + 1);
            b0r = *(p + 2);                    /* b0r,b0i swapped with a1r,a1i */
            b0i = *(p + 3);
            p += 4;
            a1r = *(p + 0);
            a1i = *(p + 1);
            b1r = *(p + 2);
            b1i = *(p + 3);
            p += 4;
            r0r = a0r + b0r;
            r0i = a0i + b0i;
            t0r = a0r - b0r;
            t0i = a0i - b0i;
            s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round);
            s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round);
            r1r = a1r + b1r;
            r1i = a1i + b1i;
            t1r = a1r - b1r;
            t1i = a1i - b1i;
            s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round);
            s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round);
            *(q + 0) = r0r;
            *(q + 1) = r0i;
            *(q + 2) = s0r;
            *(q + 3) = s0i;
            q += 4;
            *(q + 0) = r1r;
            *(q + 1) = r1i;
            *(q + 2) = s1r;
            *(q + 3) = s1i;
            q += 4;
        }
    }

FIGURE 6. FIRST STEP IN CONVERTING FFT ALGORITHM INTO TIE IMPLEMENTATION

With two sets of real and imaginary results being computed in each trip through the inner loop, the computation can now be pipelined. The goal is an implementation with loads or stores in every cycle and compute operations in parallel:
[Diagram: pipelined inner loop. The memory slots alternate between pairs of stores and pairs of loads, one operation per cycle, while four compute operations per iteration run in parallel.]

When pipelining computations with load/store operations, TIE state registers are required to buffer data retrieved from memory and data to be stored to memory. Meaningful computations can be performed on state registers only after values have been loaded into them; hence, new data is loaded into TIE state while computations are performed on older data previously latched from that state (the input buffers) into different state registers.

With these TIE considerations in mind, variables are needed that will act as load buffers into which the load first brings input data. Additionally, variables are created to act as store buffers that receive the results of the compute operations before they are stored to memory. Figure 7 shows an implementation in which additional buffer variables have been added.

NOTE: These variables will directly correspond to TIE state registers; loads will move data directly from memory into TIE state, compute operations acting on TIE state can act independently of loads and stores, and stores move values from TIE state back into memory.

In each iteration of the inner loop, the load buffers can hold the next four input values, or in other words, the real and imaginary components of the FFT input values a0r + j a0i and a1r + j a1i. This implementation will use 24-bit TIE state registers, and with a processor interface (PIF) that is 128 bits wide, loads are issued in two consecutive cycles to retrieve the values corresponding to two butterfly inputs. In our C implementation the variables a0r_load, a0i_load, a1r_load, a1i_load and also b0r_load, b0i_load, b1r_load, and b1i_load are used to represent the load buffers. In another two processor cycles, the already-computed FFT butterfly-stage outputs r0r, r0i, s0r, s0i, r1r, r1i, s1r, and s1i can be stored.
There are also assignment instructions (after the outputs are computed) that latch the load buffer registers into the variables required for the computation. In the load-compute-store pipeline being implemented, the latching prepares for the next computation cycle.

The direct correspondence between the C code given and the TIE implementation is shown as comments in Figure 7. The four basic instructions, BFLY0.LU, BFLY1.LU, BFLY2.SU, and BFLY3.SU, have been created with variants used to perform the first two trips through the inner loop to prime the pipeline. In Figure 7 the inner loop is peeled to prime the pipeline. Also, the last trip through the middle loop for the case m=2 is peeled because the inputs to the butterfly computations on this pass are consecutive in memory, as discussed above.

Before explaining the details of each TIE instruction, this paper examines how the C code for the FFT is written so that it is ready to have a direct correspondence with TIE; this is the essence of TIE development. It is useful to understand the TIE definition and processor restrictions that characterize how an instruction extension can be defined for the Xtensa processor. The Xtensa processor restricts the number of memory operations per cycle to one, but the width of the Processor Interface (PIF) determines the throughput. With a PIF width limit of 128 bits, a throughput of two sets of complex butterfly inputs in two cycles is achievable; the throughput of stores is the same. In turn, the data throughput determines how much computation must be accomplished, and moving the data into load buffers defined as TIE state registers allows as much computation as is necessary.
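The throughput figures above lead to a rough estimate of the cycles spent on data movement. The sketch below is a back-of-the-envelope model, not a measurement: `fft_memory_cycles` is a hypothetical helper, and it counts only the two wide loads and two wide stores needed per pair of butterflies, ignoring twiddle-factor loads, loop overhead, and the trips used to prime the pipeline.

```c
#include <assert.h>

/* Hypothetical estimate of the cycles spent on wide loads and stores for an
 * n-point radix-2 FFT, assuming one memory operation per cycle and one
 * group of two butterflies per 2 loads + 2 stores. */
long fft_memory_cycles(int n)
{
    long stages = 0;
    long butterflies_per_stage;
    int m;

    assert(n >= 4 && (n & (n - 1)) == 0);
    for (m = n; m > 1; m >>= 1)
        stages++;                      /* log2(n) stages */
    butterflies_per_stage = n / 2;
    /* Two butterflies share 4 memory cycles (2 loads and 2 stores). */
    return stages * (butterflies_per_stage / 2) * 4;
}
```

Under these assumptions a 256-point transform needs about 8 x 64 x 4 = 2048 memory cycles, which is simply n log2(n). Measured cycle counts will be higher because of the overheads left out of the model.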
The limitation of one load or store per cycle determines the number of cycles required to do one iteration of the computation. In turn, the number of cycles affects the amount of other hardware resources required. For example, in the FFT implementation each SIMD iteration (two original iterations) requires 8 multiplies and will execute in 4 cycles. Hence two multipliers (one SIMD multiplier) are required. In turn, the latency of the operations determines how many loop iterations ahead of the computation the loads should happen. In this FFT implementation, the two loads are placed one iteration ahead of the computation, and the two stores happen one iteration later than the computation.

    void r2_fft_step3(int *data, int n, int *twiddles)
    {
        register int i, c0r, c0i, c1r, c1i, a0r, a0i, a1r, a1i;
        register int b0r, b0i, b1r, b1i, r0r, r0i, r1r, r1i;
        register long long t0r, t0i, t1r, t1i;
        register int s0r, s0i, s1r, s1i;
        register int a0r_load, a0i_load, a1r_load, a1i_load;
        register int b0r_load, b0i_load, b1r_load, b1i_load;
        register int r0r_store, r0i_store, r1r_store, r1i_store;
        register int s0r_store, s0i_store, s1r_store, s1i_store;
        register int m = n;
        register int *t = twiddles;
        register int *p, *q;

        while (m > 2) {                        /* Outer loop (Decomposition into stages) */
            for (i = 0; i < m; i += 4) {       /* Middle loop */
                p = data + i;
                q = p;
                c0r = *(t++);                  /* LDC_LU */
                c0i = *(t++);                  /* LDC_LU */
                c1r = *(t++);                  /* LDC_LU */
                c1i = *(t++);                  /* LDC_LU */

                /* While loop below has been peeled so that the first 2 trips through
                 * the loop are simplified.
                 */

                /* The first peeled trip through the loop is simplified so that
                 * it doesn't perform arithmetic or store results */
                a0r_load = *(p + 0);           /* BFLY0_LU */
                a0i_load = *(p + 1);           /* BFLY0_LU */
                a1r_load = *(p + 2);           /* BFLY0_LU */
                a1i_load = *(p + 3);           /* BFLY0_LU */
                p += m;                        /* BFLY0_LU */
                b0r_load = *(p + 0);           /* BFLY1_LU */
                b0i_load = *(p + 1);           /* BFLY1_LU */
                b1r_load = *(p + 2);           /* BFLY1_LU */
                b1i_load = *(p + 3);           /* BFLY1_LU */
                p += m;                        /* BFLY1_LU */
                a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load; /* OPLCH */
                b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load; /* OPLCH */

                /* The second peeled trip through the loop is simplified so that
                 * it doesn't store results */
                a0r_load = *(p + 0);           /* BFLY0_LU */
                a0i_load = *(p + 1);           /* BFLY0_LU */
                a1r_load = *(p + 2);           /* BFLY0_LU */
                a1i_load = *(p + 3);           /* BFLY0_LU */
                p += m;
                b0r_load = *(p + 0);           /* BFLY1_LU */
                b0i_load = *(p + 1);           /* BFLY1_LU */
                b1r_load = *(p + 2);           /* BFLY1_LU */
                b1i_load = *(p + 3);           /* BFLY1_LU */
                p += m;
                r0r = a0r + b0r;               /* BFLY0 */
                r0i = a0i + b0i;               /* BFLY1 */
                t0r = a0r - b0r; t0i = a0i - b0i; /* BFLY0, BFLY1 */
                s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round); /* BFLY0 */
                s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round); /* BFLY1 */
                r1r = a1r + b1r;               /* BFLY2 */
                r1i = a1i + b1i;               /* BFLY3 */
                t1r = a1r - b1r; t1i = a1i - b1i; /* BFLY2, BFLY3 */
                s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round); /* BFLY2 */
                s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round); /* BFLY3 */
                a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load; /* BFLY3 */
                b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load; /* BFLY3 */

                while (q < &data[n << 1]) {    /* Inner loop */
                    a0r_load = *(p + 0);       /* BFLY0_LU */
                    a0i_load = *(p + 1);       /* BFLY0_LU */
                    a1r_load = *(p + 2);       /* BFLY0_LU */
                    a1i_load = *(p + 3);       /* BFLY0_LU */
                    p += m;                    /* BFLY0_LU */
                    b0r_load = *(p + 0);       /* BFLY1_LU */
                    b0i_load = *(p + 1);       /* BFLY1_LU */
                    b1r_load = *(p + 2);       /* BFLY1_LU */
                    b1i_load = *(p + 3);       /* BFLY1_LU */
                    p += m;                    /* BFLY1_LU */
                    r0r_store = r0r; s0r_store = s0r; r1r_store = r1r; s1r_store = s1r; /* BFLY0_LU */
                    r0i_store = r0i; s0i_store = s0i; /* BFLY1_LU */
                    r1i_store = r1i; s1i_store = s1i; /* BFLY3_SU */
                    r0r = a0r + b0r;           /* BFLY0_LU */
                    r0i = a0i + b0i;           /* BFLY1_LU */
                    t0r = a0r - b0r; t0i = a0i - b0i; /* BFLY0_LU, BFLY1_LU */
                    s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round); /* BFLY0_LU */
                    s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round); /* BFLY1_LU */
19 r1r = a1r + b1r; /* BFLY2_SU */ r1i = a1i + b1i; /* BFLY3_SU */ t1r = a1r - b1r; t1i = a1i - b1i; /* BFLY2_SU, BFLY3_SU */ s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round); /* BFLY2_SU */ s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round); /* BFLY3_SU */ a0r =a0r_load; a0i =a0i_load; a1r = a1r_load; a1i = a1i_load; /* BFLY3_SU */ b0r =b0r_load; b0i =b0i_load; b1r = b1r_load; b1i = b1i_load; /* BFLY3_SU */ *(q + 0) = r0r_store; /* BFLY2_SU */ *(q + 1) = r0i_store; /* BFLY2_SU */ *(q + 2) = r1r_store; /* BFLY2_SU */ *(q + 3) = r1i_store; /* BFLY2_SU */ q += m; /* BFLY2_SU */ *(q + 0) = s0r_store; /* BFLY3_SU */ *(q + 1) = s0i_store; /* BFLY3_SU */ *(q + 2) = s1r_store; /* BFLY3_SU */ *(q + 3) = s1i_store; /* BFLY3_SU */ q += m; /* BFLY3_SU */ m >>= 1; /* Final trip through Middle Loop (m=2) */ p = data; q = p; c0r = *(t++); /* LDC_LU */ c0i = *(t++); /* LDC_LU */ c1r = *(t++); /* LDC_LU */ c1i = *(t++); /* LDC_LU */ /* While loop below has been peeled so that the first 2 trips through * the loop are simplified. */ /* The first peeled trip through the loop is simplified so that * it doesn't perform arithmetic or store results */ a0r_load = *(p + 0); /* BFLY0_LU_SWAP */ a0i_load = *(p + 1); /* BFLY0_LU_SWAP */ b0r_load = *(p + 2); /* BFLY0_LU_SWAP */ b0i_load = *(p + 3); /* BFLY0_LU_SWAP */ p += 4; /* BFLY0_LU_SWAP */ a1r_load = *(p + 0); /* BFLY1_LU_SWAP */ a1i_load = *(p + 1); /* BFLY1_LU_SWAP */ b1r_load = *(p + 2); /* BFLY1_LU_SWAP */ b1i_load = *(p + 3); /* BFLY1_LU_SWAP */ p += 4; /* BFLY1_LU_SWAP */ a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load; /* OPLCH */ b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load; /* OPLCH */ 13

/* The second peeled trip through the loop is simplified so that
 * it doesn't store results */
a0r_load = *(p + 0); /* BFLY0_LU_SWAP */
a0i_load = *(p + 1); /* BFLY0_LU_SWAP */
b0r_load = *(p + 2); /* BFLY0_LU_SWAP */
b0i_load = *(p + 3); /* BFLY0_LU_SWAP */
p += 4; /* BFLY0_LU_SWAP */
a1r_load = *(p + 0); /* BFLY1_LU_SWAP */
a1i_load = *(p + 1); /* BFLY1_LU_SWAP */
b1r_load = *(p + 2); /* BFLY1_LU_SWAP */
b1i_load = *(p + 3); /* BFLY1_LU_SWAP */
p += 4; /* BFLY1_LU_SWAP */
r0r = a0r + b0r; /* BFLY0_LU_SWAP */
r0i = a0i + b0i; /* BFLY1_LU_SWAP */
t0r = a0r - b0r; t0i = a0i - b0i; /* BFLY0_LU_SWAP, BFLY1_LU_SWAP */
s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round); /* BFLY0_LU_SWAP */
s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round); /* BFLY1_LU_SWAP */
r1r = a1r + b1r; /* BFLY2_LU_SWAP */
r1i = a1i + b1i; /* BFLY3_LU_SWAP */
t1r = a1r - b1r; t1i = a1i - b1i; /* BFLY2_LU_SWAP, BFLY3_LU_SWAP */
s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round); /* BFLY2_LU_SWAP */
s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round); /* BFLY3_LU_SWAP */
a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load; /* BFLY3_LU_SWAP */
b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load; /* BFLY3_LU_SWAP */

while (q < &data[n << 1]) { /* Inner loop */
    a0r_load = *(p + 0); /* BFLY0_LU_SWAP */
    a0i_load = *(p + 1); /* BFLY0_LU_SWAP */
    b0r_load = *(p + 2); /* BFLY0_LU_SWAP */
    b0i_load = *(p + 3); /* BFLY0_LU_SWAP */
    p += 4; /* BFLY0_LU_SWAP */
    a1r_load = *(p + 0); /* BFLY1_LU_SWAP */
    a1i_load = *(p + 1); /* BFLY1_LU_SWAP */
    b1r_load = *(p + 2); /* BFLY1_LU_SWAP */
    b1i_load = *(p + 3); /* BFLY1_LU_SWAP */
    p += 4; /* BFLY1_LU_SWAP */
    r0r_store = r0r; s0r_store = s0r; r1r_store = r1r; s1r_store = s1r; /* BFLY0_LU_SWAP */
    r0i_store = r0i; s0i_store = s0i; /* BFLY1_LU_SWAP */
    r1i_store = r1i; s1i_store = s1i; /* BFLY3_SU_SWAP */
    r0r = a0r + b0r; /* BFLY0_LU_SWAP */
    r0i = a0i + b0i; /* BFLY1_LU_SWAP */
    t0r = a0r - b0r; t0i = a0i - b0i; /* BFLY0_LU_SWAP, BFLY1_LU_SWAP */
    s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round); /* BFLY0_LU_SWAP */
    s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round); /* BFLY1_LU_SWAP */
    r1r = a1r + b1r; /* BFLY2_SU_SWAP */
    r1i = a1i + b1i; /* BFLY3_SU_SWAP */
    t1r = a1r - b1r; t1i = a1i - b1i; /* BFLY2_SU_SWAP, BFLY3_SU_SWAP */
    s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round); /* BFLY2_SU_SWAP */
    s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round); /* BFLY3_SU_SWAP */
    a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load; /* BFLY3_SU_SWAP */
    b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load; /* BFLY3_SU_SWAP */
    *(q + 0) = r0r_store; /* BFLY2_SU_SWAP */
    *(q + 1) = r0i_store; /* BFLY2_SU_SWAP */
    *(q + 2) = s0r_store; /* BFLY2_SU_SWAP */
    *(q + 3) = s0i_store; /* BFLY2_SU_SWAP */
    q += 4; /* BFLY2_SU_SWAP */
    *(q + 0) = r1r_store; /* BFLY3_SU_SWAP */
    *(q + 1) = r1i_store; /* BFLY3_SU_SWAP */
    *(q + 2) = s1r_store; /* BFLY3_SU_SWAP */
    *(q + 3) = s1i_store; /* BFLY3_SU_SWAP */
    q += 4; /* BFLY3_SU_SWAP */
}

FIGURE 7. SHOWING DIRECT CORRESPONDENCE OF C CODE TO TIE LANGUAGE INSTRUCTIONS

Figure 8 shows the TIE instructions in their order of execution. The diagram shows the pipeline of loads, arithmetic instructions, and stores. Each of the four main TIE instructions that form the core butterfly instructions, in the bracket labeled LOOP, performs a load or store concurrently with the computation of both an r and an s result.

To understand the tasks in each TIE instruction, refer to the LOOP label in Figure 8. The loads of new values into a butterfly computation are handled by BFLY0.LU and BFLY1.LU, in two consecutive processor cycles. The computation is handled in successive cycles by all four instructions; BFLY0.LU and BFLY1.LU perform their computation concurrently with the loads. BFLY0.LU, BFLY1.LU, and BFLY3.SU also move previously computed butterfly results into the store buffers, and BFLY2.SU and BFLY3.SU write those buffers to memory. In the fourth cycle of the LOOP, the BFLY3.SU instruction latches the previously loaded load buffers in preparation for computing the next set of results.
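As a rough illustration of the r and s results mentioned above, the sketch below restates the butterfly arithmetic from Figure 7 as one C function. The names `cplx_q16`, `descale_q16`, and `butterfly_q16` are hypothetical and do not appear in the application code. The sketch assumes the twiddle factor is in Q16 format and that rounding follows the DESCALE macro (add 1 << 15, then shift right by 16).

```c
#include <assert.h>

/* Hypothetical complex type for illustration: integer real and imaginary parts. */
typedef struct { int r, i; } cplx_q16;

/* Round and drop 16 fractional bits, like DESCALE(x, round) with round = 1 << 15.
 * Right-shifting a negative value is implementation-defined in C; an
 * arithmetic shift is assumed here. */
static int descale_q16(long long x)
{
    return (int) ((x + (1 << 15)) >> 16);
}

/* One radix-2 decimation-in-frequency butterfly:
 *   r = a + b
 *   s = (a - b) * c, where c is a Q16 twiddle factor */
void butterfly_q16(cplx_q16 a, cplx_q16 b, cplx_q16 c, cplx_q16 *r, cplx_q16 *s)
{
    long long tr = (long long) a.r - b.r;
    long long ti = (long long) a.i - b.i;

    assert(r != 0 && s != 0);
    r->r = a.r + b.r;
    r->i = a.i + b.i;
    s->r = descale_q16(tr * c.r) - descale_q16(ti * c.i);
    s->i = descale_q16(tr * c.i) + descale_q16(ti * c.r);
}
```

Figure 7 splits this arithmetic by component rather than by butterfly: the real parts (r0r, s0r) are annotated with one instruction and the imaginary parts (r0i, s0i) with the next.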
[Figure 8: TIE instructions in order of execution. Figure annotation: "Achieves about one complex butterfly computation per cycle."]

Peeled trips (before LOOP):
  BFLY0.LU   Load buffers a0, a1
  BFLY1.LU   Load buffers b0, b1
  (STALL)
  OPLCH      Latch load buffers to a0r, a0i, b0r, b0i, a1r, a1i, b1r, b1i
  BFLY0.LU   Load buffers a0, a1. Compute r0r, s0r. Move r0r -> r0r_st, r1r -> r1r_st, s0r -> s0r_st, s1r -> s1r_st
  BFLY1.LU   Load buffers b0, b1. Compute r0i, s0i. Move r0i -> r0i_st, s0i -> s0i_st
  BFLY2      Compute r1r, s1r
  BFLY3      Latch load buffers to a0r, a0i, b0r, b0i, a1r, a1i, b1r, b1i. Compute, move r1i -> r1i_st, s1i -> s1i_st

LOOP:
  BFLY0.LU   Load buffers a0, a1. Compute r0r, s0r. Move r0r -> r0r_st, r1r -> r1r_st, s0r -> s0r_st, s1r -> s1r_st
  BFLY1.LU   Load buffers b0, b1. Compute r0i, s0i. Move r0i -> r0i_st, s0i -> s0i_st
  BFLY2.SU   Compute r1r, s1r. Store r0r_st, r0i_st, r1r_st, r1i_st
  BFLY3.SU   Latch load buffers to a0r, a0i, b0r, b0i, a1r, a1i, b1r, b1i. Compute, move r1i -> r1i_st, s1i -> s1i_st. Store s0r_st, s0i_st, s1r_st, s1i_st

FIGURE 8. TIE LANGUAGE INSTRUCTIONS FOR THE FFT ALGORITHM

The first seven stages, before the LOOP section of Figure 8, show the two peeled trips through the while() loops in the C code shown above. The first BFLY0.LU and BFLY1.LU instructions compute meaningless results, so these computations are not shown. OPLCH is a TIE instruction that performs the latching of BFLY3.SU without the store to memory; it latches only the values needed in the next four processor cycles. In those four cycles, BFLY0.LU and BFLY1.LU are executed again, followed by variants of the other TIE instructions (BFLY2, BFLY3) that do not perform stores, as there is no meaningful data yet to be stored.

Note that the way the operations are scheduled affects the number of functional units (e.g. multipliers) and buffers needed. The decision-making involved in scheduling computations and choosing hardware resources is more an art than a procedure.
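The peeled structure described above can be sketched on a much simpler operation. The following is a minimal illustration, not the application code: `compute` and `pipelined_apply` are hypothetical names, doubling a value stands in for the butterfly arithmetic, and the explicit drain step at the end is a choice made for this sketch so that all reads stay within the input array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the butterfly arithmetic. */
static int compute(int x)
{
    return 2 * x;
}

/* out[k] = compute(in[k]) for k < n, with the load, compute, and store
 * of three different elements overlapped in each loop trip. */
void pipelined_apply(const int *in, int *out, size_t n)
{
    int loaded, latched, result, store_buf;
    size_t k;

    assert(n >= 2);

    /* First peeled trip: load only, then latch (compare OPLCH). */
    loaded = in[0];
    latched = loaded;

    /* Second peeled trip: load and compute, but nothing to store yet. */
    loaded = in[1];
    result = compute(latched);
    latched = loaded;

    /* Steady state: load element k, compute element k - 1,
     * store the result for element k - 2. */
    for (k = 2; k < n; ++k) {
        loaded = in[k];
        store_buf = result;        /* move the previous result to the store buffer */
        result = compute(latched);
        latched = loaded;          /* latch the load buffer for the next trip */
        out[k - 2] = store_buf;
    }

    /* Drain: the last two results are still in flight. */
    out[n - 2] = result;
    out[n - 1] = compute(latched);
}
```

The two statements before the loop play the role of the two peeled trips: the first has nothing to compute, and the second has nothing to store.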
4 Complete C Implementation with TIE

Finally, the C code in Appendix A (the function r2_c_tie_fft) incorporates all of the TIE instructions used. The complete TIE description is also given in Appendix A. Additional instructions, such as BFLY0_LU_swap, are necessary because the final 2-point FFTs can be performed on individual values contained within a single 128-bit load. In the prior FFT stages, the values used to compute each result come from separate loads; in the final stage, they come from a single load.

5 Performance

For N-point FFTs of length N = 128, 256, 512, and 1024, four different implementations are compared:

a. the C implementation compiled for a base Xtensa processor using software multiplies and -O3 compiler optimization, without use of any TIE or additional hardware functional units;
b. the C implementation compiled using the -O3 optimization, for a base Xtensa processor with the MUL32 functional unit (and no TIE extensions in use);
c. the C implementation compiled for an Xtensa processor using the -O3 optimization as well as the TIE instructions developed for this FFT;
d. a hand-coded assembly language routine, structured differently from the C code implementations, assembled for an Xtensa processor using the TIE instructions developed for this FFT.

Table 2 shows the resulting code size and cycle counts of each implementation for different FFT input lengths, and also the performance improvement factor comparing (d), the hand-optimized assembly implementation, with (b), the C code compiled for the Xtensa processor including the MUL32 unit. Note that the code size given for (a), the C implementation using software multiplies, does not include the multiplication libraries required to execute the program. The code size of (b) is reduced to 430 bytes because a 32-bit multiplier in hardware is used.

TABLE 2. CODE SIZE AND PERFORMANCE RESULTS

                                 | a: C (with software multiplies) | b: C (using MUL32) | c: C (with TIE) | d: Assembly (with TIE) | E = b/d: Performance Improvement
Code Size (Bytes)                | 433+libraries                   | 430                |                 |                        |
Performance (cycles), 128-point  |                                 |                    |                 |                        | X
Performance (cycles), 256-point  |                                 |                    |                 |                        | X
Performance (cycles), 512-point  |                                 |                    |                 |                        | X
Performance (cycles), 1024-point |                                 |                    |                 |                        | X
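To put measured cycle counts in context, it can help to normalize them by the amount of work in the transform. An N-point radix-2 FFT performs (N/2) log2 N butterflies, and Table 2 defines the improvement factor as E = b/d. The helper names below (`butterfly_count`, `improvement_factor`, `cycles_per_butterfly`) are hypothetical, and no cycle counts are supplied here; the caller is assumed to pass in its own measurements.

```c
#include <assert.h>

/* Number of radix-2 butterflies in an n-point FFT: (n / 2) * log2(n).
 * n is assumed to be a power of two. */
unsigned long butterfly_count(unsigned long n)
{
    unsigned long log2n = 0;
    unsigned long m;

    assert(n >= 2 && (n & (n - 1)) == 0);
    for (m = n; m > 1; m >>= 1)
        ++log2n;
    return (n / 2) * log2n;
}

/* Improvement factor E = b / d, as defined in the Table 2 header. */
double improvement_factor(unsigned long cycles_b, unsigned long cycles_d)
{
    assert(cycles_d != 0);
    return (double) cycles_b / (double) cycles_d;
}

/* Average number of cycles spent per butterfly for a measured cycle count. */
double cycles_per_butterfly(unsigned long cycles, unsigned long n)
{
    return (double) cycles / (double) butterfly_count(n);
}
```

For example, a 1024-point transform contains 512 x 10 = 5120 butterflies, so dividing a measured 1024-point cycle count by 5120 gives the average cost per butterfly, including loop overhead.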
6 Conclusion

Algorithms such as the FFT can be implemented using TIE instructions for the Xtensa processor. This allows for efficient execution using hardware optimizations encapsulated in instruction extensions and for efficient software development using those instructions. This application note describes an efficient implementation of an N-point decimation-in-frequency FFT for Xtensa processors. The code found in Appendix A can be compiled and run to demonstrate the performance claimed.

IMPLEMENTATION NOTE: To build an Xtensa processor configuration to simulate or synthesize the FFT implementation, there are two requirements. First, as a fundamental requirement, the configuration must use a 128-bit wide Processor Interface (PIF). Second, the TIE and software included in this implementation support a Little Endian configuration. To use a Big Endian configuration, the TIE and code can be adjusted accordingly.
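Section 4 notes that the final 2-point FFTs operate on values held within a single 128-bit load. A minimal sketch of that data layout follows. The function name `final_stage_2pt` is hypothetical, `int` is assumed to be 32 bits wide, and the twiddle factor is taken to be 1 for this stage, so the multiply is omitted; the application code instead loads a twiddle factor of 1 from the table.

```c
#include <assert.h>

/* data holds n complex values as interleaved (real, imaginary) ints.
 * Each group of four ints (16 bytes, the size of one 128-bit load) holds
 * the pair a = g[0..1], b = g[2..3] that forms one final-stage butterfly. */
void final_stage_2pt(int *data, int n)
{
    int k;

    assert(n >= 2 && (n % 2) == 0);
    for (k = 0; k < n / 2; ++k) {
        int *g = data + 4 * k;
        int ar = g[0], ai = g[1];
        int br = g[2], bi = g[3];

        g[0] = ar + br;   /* r = a + b */
        g[1] = ai + bi;
        g[2] = ar - br;   /* s = a - b (twiddle factor 1) */
        g[3] = ai - bi;
    }
}
```

Because a and b sit side by side in memory, both inputs of a butterfly arrive in one load and both outputs leave in one store, which matches the r0, s0 store order shown for the final stage in Figure 7.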
7 Glossary of Terms

Butterfly
The term used to describe the computation that comprises the inner loop of a radix-k FFT algorithm that implements an N-point Discrete Fourier Transform. In the radix-2 implementation, the butterfly computation sums input points N/2 samples apart to obtain one output; the second output is obtained by differencing the inputs and multiplying by a twiddle factor. The term was derived from the shape of the pictorial representation.
[Butterfly diagram: inputs a and b; outputs a + b and c x (a - b), where c is the twiddle factor.]

Decimation-in-frequency
Of the two main types of FFT implementations of the DFT that recursively divide the problem in half, this method involves decimating the output (frequency-domain) sequence to simplify the DFT computation.

DFT
The Discrete Fourier Transform member of the Fourier Transform family; the DFT operates on discrete-time input and produces discrete frequency samples as output. For a common definition, see the equation in Figure 1.

MUL32
The Xtensa processor option for 32-bit fixed-point multiplication integrated into the CPU and Xtensa software toolset as a set of optional instructions.

N
The number of input time-domain samples used in computing the FFT; also the number of resulting output frequency-domain samples.

Radix-2
The method that allows expression of an N-point Discrete Fourier Transform as two N/2-point Discrete Fourier Transforms.

TIE (Tensilica Instruction Extension) Language
The language that defines custom instructions that are added to the basic instruction set of an Xtensa processor.

Twiddle factor
In the Discrete Fourier Transform equation, these are the exponential factors which become multipliers of the x[n] input sequence to compute the output X[k] sequence. For an N-point DFT there are $N/2 + N/4 + \cdots + 2$ twiddle factors whose values are $W_m^n = e^{-j(2\pi/m)n}$, $m = N, N/2, \ldots, 2$; $n = 0, 1, \ldots, m/2$. Figure 2 shows how the twiddle factors are used in a DFT.
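As a quick check on the twiddle-factor count quoted in the glossary, the series can be summed in closed form. This is a worked sketch that assumes N is a power of two with N of at least 4:

```latex
\frac{N}{2} + \frac{N}{4} + \cdots + 2
  = \sum_{k=1}^{\log_2 N - 1} \frac{N}{2^{k}}
  = N\left(1 - \frac{2}{N}\right)
  = N - 2
```

For N = 1024, this gives 512 + 256 + ... + 2 = 1022 twiddle factors.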
Appendix A Complete Application Code

This appendix includes the code described in this application note. The code comprises radix_2.c, radix_2_hand_code.s, bfly.tie, and util.h:
radix_2.c

/*
 * FFT implementation for Xtensa Processors
 * Radix-2 Decimation-in-frequency
 * Copyright (c) Tensilica Inc. These coded instructions, statements, and
 * computer programs are Confidential Proprietary Information of Tensilica
 * Inc. and may not be disclosed to third parties or copied in any form,
 * in whole or in part, without the prior written consent of Tensilica Inc
 */
#include <malloc.h>
#include <math.h>
#include <assert.h>
#include <sys/types.h>
#include <sys/times.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "util.h"

/* Add the "#include" statement below to include the
 * C stub simulation of the TIE instructions.
 * Use this when confirming TIE instruction functionality
 * in a native environment.
 *
 * #include "tdk/cstub-bfly.c"
 */

int **dif_twiddles_2;
int *dif_first_twiddle;

#define DEFAULT_BPT 16
static unsigned int BPT = DEFAULT_BPT;

#define FIX(x, bpt) ((int) ((x) * (1 << (bpt)) + 0.5))

const double pi = 3.14159265358979323846;

extern void r2_tie_fft(int *data, int n, int *first_twiddle);

/* Initialize Twiddle factors
 *
 * n is twice the number of complex numbers in the input.
 */
void init_r2_twiddles(int n, int log_2_n)
{
    int td, i;

    dif_twiddles_2 = (int **) malloc((log_2_n - 1) * sizeof(int *));
    dif_twiddles_2[0] = (int *) malloc((n + 4) * sizeof(int) - 1);

    /* Align the array by force (to a 16-byte boundary) */
    if ((((unsigned) dif_twiddles_2[0]) & 0xf) != 0) {
        *((unsigned *) &dif_twiddles_2[0]) += 0x10;
        *((unsigned *) &dif_twiddles_2[0]) &= ~0xf;
    }

    n >>= 1;
    for (i = 0; i < n; i += 2) {
        double angle = -(pi * i) / n;
        dif_twiddles_2[0][i + 0] = FIX(cos(angle), BPT);
        dif_twiddles_2[0][i + 1] = FIX(sin(angle), BPT);
    }

    n >>= 1;
    /* Now n is the number of (data, not twiddle) complex numbers at
       this subproblem level. */
    for (td = 1; td < log_2_n - 2; td += 1, n >>= 1) {
        dif_twiddles_2[td] = dif_twiddles_2[td - 1] + (n << 1);
        for (i = 0; i < n; i += 2) {
            dif_twiddles_2[td][i + 0] = dif_twiddles_2[td - 1][2 * i + 0];
            dif_twiddles_2[td][i + 1] = dif_twiddles_2[td - 1][2 * i + 1];
        }
    }

    /* Last level is special; we duplicate the twiddle factor 1
       because of the SIMD load. */
    dif_twiddles_2[td] = dif_twiddles_2[td - 1] + (n << 1);
    dif_twiddles_2[td][0] = dif_twiddles_2[td][2] = dif_twiddles_2[td - 1][0];
    dif_twiddles_2[td][1] = dif_twiddles_2[td][3] = dif_twiddles_2[td - 1][1];

    dif_first_twiddle = dif_twiddles_2[0];
}

/* Sample bit reversal computation for the
 * bit_reverse_complex_array() routine below */
int bit_reverse(int i, int p)
{
    int a = 0;
    int j;

    for (j = 0; j < p; ++j) {
        a = (a << 1) + (i & 0x01);
        i = i >> 1;
    }
    assert(i == 0);
    return a;
}

/* Sample bit reversal routine to rearrange elements of an array 'a'
 * It must be the case that n == 2^p. n is the number of
 * (real, imaginary) complex pairs in the array a. */
void bit_reverse_complex_array(int a[], int n, int p)
{
    int i, j;
    int t;

    for (i = 0; i < n; ++i) {
        j = bit_reverse(i, p);
        if (j > i) {
            /* real part */
            t = a[2 * i + 0];
            a[2 * i + 0] = a[2 * j + 0];
            a[2 * j + 0] = t;
            /* imaginary part */
            t = a[2 * i + 1];
            a[2 * i + 1] = a[2 * j + 1];
            a[2 * j + 1] = t;
        }
    }
}

static void usage(const char *const pname)
{
    printf("usage: %s [-p <bits_of_precision>] infile\n", pname);
    exit(-1);
}

#define DESCALE(x, r) ((int) (((x) + (r)) >> 16))
const int round = (1 << 15);

/* C routine, unoptimized */
void r2_fft(int *data, int n, int *twiddles)
{
    register int i;
    register int l = n;
    register int *t = twiddles;

    while (l > 1) {
        for (i = 0; i < l; i += 2) {
            register int *p = data + i;
            register int *q = p;
            register int wr = *(t++);
            register int wi = *(t++);

            while (p < &data[n << 1]) {
                register int d0r, d0i;
                register int d1r, d1i;
                register int r0r, r0i;
                register int r1r, r1i;
                register long long tr, ti;

                d0r = *(p + 0);
                d0i = *(p + 1);
                p += l;
                d1r = *(p + 0);
                d1i = *(p + 1);
                p += l;
                r0r = d0r + d1r;
                r0i = d0i + d1i;
                tr = d0r - d1r;
                ti = d0i - d1i;
                r1r = DESCALE(tr * wr, round) - DESCALE(ti * wi, round);
                r1i = DESCALE(tr * wi, round) + DESCALE(ti * wr, round);
                *(q + 0) = r0r;
                *(q + 1) = r0i;
                q += l;
                *(q + 0) = r1r;
                *(q + 1) = r1i;
                q += l;
            }
        }
        l >>= 1;
    }
}
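One way to sanity-check the unoptimized routine on a host machine is to run a tiny transform whose answer is known in advance. The sketch below is illustrative only: `r2_fft_sketch` is a compact restatement of the routine above under a different name, and `DESCALE_Q16`, `Q16_ONE`, and `fft4_dc_check` are hypothetical. The 4-point twiddle table is written out by hand in the order the routine consumes it, assuming Q16 factors, and the input is a constant (DC) signal, for which only bin 0 should be nonzero.

```c
#include <assert.h>

#define Q16_ONE (1 << 16)

/* Same rounding as the DESCALE macro in the listing, under a different name. */
#define DESCALE_Q16(x) ((int) (((x) + (1 << 15)) >> 16))

/* Compact restatement of the unoptimized routine, for illustration only.
 * data holds n complex values as interleaved (real, imaginary) ints. */
static void r2_fft_sketch(int *data, int n, const int *t)
{
    int l, i;

    assert(n >= 2);
    for (l = n; l > 1; l >>= 1) {
        for (i = 0; i < l; i += 2) {
            int *p = data + i;
            int wr = *(t++);
            int wi = *(t++);

            for (; p < &data[n << 1]; p += 2 * l) {
                int *q = p + l;   /* partner value, l ints further on */
                long long tr = (long long) p[0] - q[0];
                long long ti = (long long) p[1] - q[1];

                p[0] += q[0];
                p[1] += q[1];
                q[0] = DESCALE_Q16(tr * wr) - DESCALE_Q16(ti * wi);
                q[1] = DESCALE_Q16(tr * wi) + DESCALE_Q16(ti * wr);
            }
        }
    }
}

/* 4-point DC input: every sample is (100, 0).
 * Returns 1 if bin 0 is (400, 0) and every other value is zero. */
int fft4_dc_check(void)
{
    /* Q16 twiddles in consumption order:
     * W4^0 = 1 and W4^1 = -j for the first level, then W2^0 = 1. */
    static const int twiddles[6] = {
        Q16_ONE, 0,   0, -Q16_ONE,   Q16_ONE, 0
    };
    int data[8] = { 100, 0, 100, 0, 100, 0, 100, 0 };
    int k;

    r2_fft_sketch(data, 4, twiddles);
    if (data[0] != 400 || data[1] != 0)
        return 0;
    for (k = 2; k < 8; ++k)
        if (data[k] != 0)
            return 0;
    return 1;
}
```

A DC input keeps the check independent of output ordering, so it does not exercise bit_reverse_complex_array(); a fuller test would compare every bin against a reference DFT.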
Double-Precision Floating Point Emulation Acceleration
Double-Precision Floating Point Emulation Acceleration Application Note Tensilica, Inc. 3255-6 Scott Blvd. Santa Clara, CA 95054 (408) 986-8000 Fax (408) 986-8919 www.tensilica.com December 2007 Doc Number:
More informationConnX D2 DSP Engine. A Flexible 2-MAC DSP. Dual-MAC, 16-bit Fixed-Point Communications DSP PRODUCT BRIEF FEATURES BENEFITS. ConnX D2 DSP Engine
PRODUCT BRIEF ConnX D2 DSP Engine Dual-MAC, 16-bit Fixed-Point Communications DSP FEATURES BENEFITS Both SIMD and 2-way FLIX (parallel VLIW) operations Optimized, vectorizing XCC Compiler High-performance
More informationHigh-Performance 16-Point Complex FFT Features 1 Functional Description 2 Theory of Operation
High-Performance 16-Point Complex FFT April 8, 1999 Application Note This document is (c) Xilinx, Inc. 1999. No part of this file may be modified, transmitted to any third party (other than as intended
More informationRadix-4 FFT Algorithms *
OpenStax-CNX module: m107 1 Radix-4 FFT Algorithms * Douglas L Jones This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 10 The radix-4 decimation-in-time
More informationDESIGN METHODOLOGY. 5.1 General
87 5 FFT DESIGN METHODOLOGY 5.1 General The fast Fourier transform is used to deliver a fast approach for the processing of data in the wireless transmission. The Fast Fourier Transform is one of the methods
More informationDecimation-in-time (DIT) Radix-2 FFT *
OpenStax-CNX module: m1016 1 Decimation-in-time (DIT) Radix- FFT * Douglas L. Jones This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 1.0 The radix- decimation-in-time
More informationAN10913 DSP library for LPC1700 and LPC1300
Rev. 3 11 June 2010 Application note Document information Info Content Keywords LPC1700, LPC1300, DSP library Abstract This application note describes how to use the DSP library with the LPC1700 and LPC1300
More informationFast Fourier Transform (FFT)
EEO Prof. Fowler Fast Fourier Transform (FFT). Background The FFT is a computationally efficient algorithm for computing the discrete Fourier transform (DFT). The DFT is the mathematical entity that is
More informationAVR42789: Writing to Flash on the New tinyavr Platform Using Assembly
AVR 8-bit Microcontrollers AVR42789: Writing to Flash on the New tinyavr Platform Using Assembly APPLICATION NOTE Table of Contents 1. What has Changed...3 1.1. What This Means and How to Adapt...4 2.
More informationDigital Signal Processing. Soma Biswas
Digital Signal Processing Soma Biswas 2017 Partial credit for slides: Dr. Manojit Pramanik Outline What is FFT? Types of FFT covered in this lecture Decimation in Time (DIT) Decimation in Frequency (DIF)
More informationCHAPTER 5. Software Implementation of FFT Using the SC3850 Core
CHAPTER 5 Software Implementation of FFT Using the SC3850 Core 1 Fast Fourier Transform (FFT) Discrete Fourier Transform (DFT) is defined by: 1 nk X k x n W, k 0,1,, 1, W e n0 Theoretical arithmetic complexity:
More informationTOPICS PIPELINE IMPLEMENTATIONS OF THE FAST FOURIER TRANSFORM (FFT) DISCRETE FOURIER TRANSFORM (DFT) INVERSE DFT (IDFT) Consulted work:
1 PIPELINE IMPLEMENTATIONS OF THE FAST FOURIER TRANSFORM (FFT) Consulted work: Chiueh, T.D. and P.Y. Tsai, OFDM Baseband Receiver Design for Wireless Communications, John Wiley and Sons Asia, (2007). Second
More information6. Fast Fourier Transform
x[] X[] x[] x[] x[6] X[] X[] X[3] x[] x[5] x[3] x[7] 3 X[] X[5] X[6] X[7] A Historical Perspective The Cooley and Tukey Fast Fourier Transform (FFT) algorithm is a turning point to the computation of DFT
More informationDecimation-in-Frequency (DIF) Radix-2 FFT *
OpenStax-CX module: m1018 1 Decimation-in-Frequency (DIF) Radix- FFT * Douglas L. Jones This work is produced by OpenStax-CX and licensed under the Creative Commons Attribution License 1.0 The radix- decimation-in-frequency
More informationFR FAMILY FR60 FAMILY ISR DOUBLE EXECUTION 32-BIT MICROCONTROLLER APPLICATION NOTE. Fujitsu Microelectronics Europe Application Note
Fujitsu Microelectronics Europe Application Note MCU-AN-300025-E-V12 FR FAMILY 32-BIT MICROCONTROLLER FR60 FAMILY ISR DOUBLE EXECUTION APPLICATION NOTE Revision History Revision History Date Issue 2006-03-14
More informationThe Serial Commutator FFT
The Serial Commutator FFT Mario Garrido Gálvez, Shen-Jui Huang, Sau-Gee Chen and Oscar Gustafsson Journal Article N.B.: When citing this work, cite the original article. 2016 IEEE. Personal use of this
More informationTwiddle Factor Transformation for Pipelined FFT Processing
Twiddle Factor Transformation for Pipelined FFT Processing In-Cheol Park, WonHee Son, and Ji-Hoon Kim School of EECS, Korea Advanced Institute of Science and Technology, Daejeon, Korea icpark@ee.kaist.ac.kr,
More informationAPPLICATION NOTE. AT6486: Using DIVAS on SAMC Microcontroller. SMART ARM-Based Microcontroller. Introduction. Features
APPLICATION NOTE AT6486: Using DIVAS on SAMC Microcontroller SMART ARM-Based Microcontroller Introduction DIVAS stands for Division and Square Root Accelerator. DIVAS is a brand new peripheral introduced
More informationOptimizations of BLIS Library for AMD ZEN Core
Optimizations of BLIS Library for AMD ZEN Core 1 Introduction BLIS [1] is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries [2] The framework was
More informationAbstract. Literature Survey. Introduction. A.Radix-2/8 FFT algorithm for length qx2 m DFTs
Implementation of Split Radix algorithm for length 6 m DFT using VLSI J.Nancy, PG Scholar,PSNA College of Engineering and Technology; S.Bharath,Assistant Professor,PSNA College of Engineering and Technology;J.Wilson,Assistant
More informationThe Fast Fourier Transform
Chapter 7 7.1 INTRODUCTION The Fast Fourier Transform In Chap. 6 we saw that the discrete Fourier transform (DFT) could be used to perform convolutions. In this chapter we look at the computational requirements
More informationCut DSP Development Time Use C for High Performance, No Assembly Required
Cut DSP Development Time Use C for High Performance, No Assembly Required Digital signal processing (DSP) IP is increasingly required to take on complex processing tasks in signal processing-intensive
More informationWhite Paper. Floating-Point FFT Processor (IEEE 754 Single Precision) Radix 2 Core. Introduction. Parameters & Ports
White Paper Introduction Floating-Point FFT Processor (IEEE 754 Single Precision) Radix 2 Core The floating-point fast fourier transform (FFT) processor calculates FFTs with IEEE 754 single precision (1
More informationAN 464: DFT/IDFT Reference Design
Subscribe Send Feedback Latest document on the web: PDF HTML Contents Contents About the DFT/IDFT Reference Design... 3 Functional Description for the DFT/IDFT Reference Design... 4 Parameters for the
More informationAn efficient multiplierless approximation of the fast Fourier transform using sum-of-powers-of-two (SOPOT) coefficients
Title An efficient multiplierless approximation of the fast Fourier transm using sum-of-powers-of-two (SOPOT) coefficients Author(s) Chan, SC; Yiu, PM Citation Ieee Signal Processing Letters, 2002, v.
More informationISim Hardware Co-Simulation Tutorial: Accelerating Floating Point Fast Fourier Transform Simulation
ISim Hardware Co-Simulation Tutorial: Accelerating Floating Point Fast Fourier Transform Simulation UG817 (v 13.2) July 28, 2011 Xilinx is disclosing this user guide, manual, release note, and/or specification
More informationLow Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm
Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm 1 A.Malashri, 2 C.Paramasivam 1 PG Student, Department of Electronics and Communication K S Rangasamy College Of Technology,
More informationAT40K FPGA IP Core AT40K-FFT. Features. Description
Features Decimation in frequency radix-2 FFT algorithm. 256-point transform. -bit fixed point arithmetic. Fixed scaling to avoid numeric overflow. Requires no external memory, i.e. uses on chip RAM and
More information24K FFT for 3GPP LTE RACH Detection
24K FFT for GPP LTE RACH Detection ovember 2008, version 1.0 Application ote 515 Introduction In GPP Long Term Evolution (LTE), the user equipment (UE) transmits a random access channel (RACH) on the uplink
More informationImage Compression System on an FPGA
Image Compression System on an FPGA Group 1 Megan Fuller, Ezzeldin Hamed 6.375 Contents 1 Objective 2 2 Background 2 2.1 The DFT........................................ 3 2.2 The DCT........................................
More informationLatest Innovation For FFT implementation using RCBNS
Latest Innovation For FFT implementation using SADAF SAEED, USMAN ALI, SHAHID A. KHAN Department of Electrical Engineering COMSATS Institute of Information Technology, Abbottabad (Pakistan) Abstract: -
More informationLow-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units
Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units Abstract: Split-radix fast Fourier transform (SRFFT) is an ideal candidate for the implementation of a lowpower FFT processor, because
More informationA Pipelined Fused Processing Unit for DSP Applications
A Pipelined Fused Processing Unit for DSP Applications Vinay Reddy N PG student Dept of ECE, PSG College of Technology, Coimbatore, Abstract Hema Chitra S Assistant professor Dept of ECE, PSG College of
More informationOverview of ROCCC 2.0
Overview of ROCCC 2.0 Walid Najjar and Jason Villarreal SUMMARY FPGAs have been shown to be powerful platforms for hardware code acceleration. However, their poor programmability is the main impediment
More informationHW/SW Co-Design Lab. Seminar 2 WS 2018/2019. chair. Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G.
Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G. Fettweis HW/SW Co-Design Lab Seminar WS 8/9 TU Dresden, Slide CORE FEATURES Slide corelx_hwswcd Xtensa LX ALU -bit MUL Load/Store
More informationAN FFT PROCESSOR BASED ON 16-POINT MODULE
AN FFT PROCESSOR BASED ON 6-POINT MODULE Weidong Li, Mark Vesterbacka and Lars Wanhammar Electronics Systems, Dept. of EE., Linköping University SE-58 8 LINKÖPING, SWEDEN E-mail: {weidongl, markv, larsw}@isy.liu.se,
More informationAnalysis of Radix- SDF Pipeline FFT Architecture in VLSI Using Chip Scope
Analysis of Radix- SDF Pipeline FFT Architecture in VLSI Using Chip Scope G. Mohana Durga 1, D.V.R. Mohan 2 1 M.Tech Student, 2 Professor, Department of ECE, SRKR Engineering College, Bhimavaram, Andhra
More informationATAES132A Firmware Development Library. Introduction. Features. Atmel CryptoAuthentication USER GUIDE
Atmel CryptoAuthentication ATAES132A Firmware Development Library USER GUIDE Introduction This user guide describes how to use the Atmel CryptoAuthentication ATAES132A Firmware Development Library with
More informationParallel-computing approach for FFT implementation on digital signal processor (DSP)
Parallel-computing approach for FFT implementation on digital signal processor (DSP) Yi-Pin Hsu and Shin-Yu Lin Abstract An efficient parallel form in digital signal processor can improve the algorithm
More informationHW/SW-Codesign Lab. Seminar 2 WS 2016/2017. chair. Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G.
Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G. Fettweis HW/SW-Codesign Lab Seminar WS / TU Dresden, Slide CORE FEATURES TU Dresden HW/SW-Codesign Lab Slide corelx_hwswcd Xtensa
More informationCapital. Capital Logic Generative. v Student Workbook
Capital Capital Logic Generative v2016.1 Student Workbook 2017 Mentor Graphics Corporation All rights reserved. This document contains information that is trade secret and proprietary to Mentor Graphics
More informationFused Floating Point Arithmetic Unit for Radix 2 FFT Implementation
IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 6, Issue 2, Ver. I (Mar. -Apr. 2016), PP 58-65 e-issn: 2319 4200, p-issn No. : 2319 4197 www.iosrjournals.org Fused Floating Point Arithmetic
More informationShimadzu LabSolutions Connector Plugin
Diablo EZReporter 4.0 Shimadzu LabSolutions Connector Plugin Copyright 2016, Diablo Analytical, Inc. Diablo Analytical EZReporter Software EZReporter 4.0 Shimadzu LabSolutions Connector Plugin Copyright
More informationBits, Words, and Integers
Computer Science 52 Bits, Words, and Integers Spring Semester, 2017 In this document, we look at how bits are organized into meaningful data. In particular, we will see the details of how integers are
More informationA Genetic Algorithm for the Optimisation of a Reconfigurable Pipelined FFT Processor
A Genetic Algorithm for the Optimisation of a Reconfigurable Pipelined FFT Processor Nasri Sulaiman and Tughrul Arslan Department of Electronics and Electrical Engineering The University of Edinburgh Scotland
More informationGemBuilder for Java Release Notes
GemStone GemBuilder for Java Release Notes Version 3.1.3 November 2016 SYSTEMS INTELLECTUAL PROPERTY OWNERSHIP This documentation is furnished for informational use only and is subject to change without
More informationThe objective of this presentation is to describe you the architectural changes of the new C66 DSP Core.
PRESENTER: Hello. The objective of this presentation is to describe you the architectural changes of the new C66 DSP Core. During this presentation, we are assuming that you're familiar with the C6000
More informationHigh Performance Pipelined Design for FFT Processor based on FPGA
High Performance Pipelined Design for FFT Processor based on FPGA A.A. Raut 1, S. M. Kate 2 1 Sinhgad Institute of Technology, Lonavala, Pune University, India 2 Sinhgad Institute of Technology, Lonavala,
More informationTeam 1. Common Questions to all Teams. Team 2. Team 3. CO200-Computer Organization and Architecture - Assignment One
CO200-Computer Organization and Architecture - Assignment One Note: A team may contain not more than 2 members. Format the assignment solutions in a L A TEX document. E-mail the assignment solutions PDF
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationMULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION
MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION Maheshwari.U 1, Josephine Sugan Priya. 2, 1 PG Student, Dept Of Communication Systems Engg, Idhaya Engg. College For Women, 2 Asst Prof, Dept Of Communication
More informationCS6303 COMPUTER ARCHITECTURE LESSION NOTES UNIT II ARITHMETIC OPERATIONS ALU In computing an arithmetic logic unit (ALU) is a digital circuit that performs arithmetic and logical operations. The ALU is
More informationISim Hardware Co-Simulation Tutorial: Accelerating Floating Point FFT Simulation
ISim Hardware Co-Simulation Tutorial: Accelerating Floating Point FFT Simulation UG817 (v13.3) November 11, 2011 Xilinx is disclosing this user guide, manual, release note, and/or specification (the Documentation
LOW-POWER SPLIT-RADIX FFT PROCESSORS
LOW-POWER SPLIT-RADIX FFT PROCESSORS Avinash 1, Manjunath Managuli 2, Suresh Babu D 3 ABSTRACT To design a split radix fast Fourier transform is an ideal person for the implementing of a low-power FFT
End User License Agreement
End User License Agreement Kyocera International, Inc. ( Kyocera ) End User License Agreement. CAREFULLY READ THE FOLLOWING TERMS AND CONDITIONS ( AGREEMENT ) BEFORE USING OR OTHERWISE ACCESSING THE SOFTWARE
Intel HLS Compiler: Fast Design, Coding, and Hardware
white paper Intel HLS Compiler Intel HLS Compiler: Fast Design, Coding, and Hardware The Modern FPGA Workflow Authors Melissa Sussmann HLS Product Manager Intel Corporation Tom Hill OpenCL Product Manager
ATECC108/ATSHA204 USER GUIDE. Atmel Firmware Library. Features. Introduction
ATECC108/ATSHA204 Atmel Firmware Library USER GUIDE Features Layered and Modular Design Compact and Optimized for 8-bit Microcontrollers Easy to Port Supports I 2 C and Single-Wire Communication Distributed
UNIT-II. Part-2: CENTRAL PROCESSING UNIT
UNIT-II Part-2: CENTRAL PROCESSING UNIT Stack Organization Instruction Formats Addressing Modes Data Transfer And Manipulation Program Control Reduced Instruction Set Computer (RISC) Introduction:
Microprocessor Theory
Microprocessor Theory and Applications with 68000/68020 and Pentium M. RAFIQUZZAMAN, Ph.D. Professor California State Polytechnic University Pomona, California and President Rafi Systems, Inc. WILEY A
Using Streaming SIMD Extensions in a Fast DCT Algorithm for MPEG Encoding
Using Streaming SIMD Extensions in a Fast DCT Algorithm for MPEG Encoding Version 1.2 01/99 Order Number: 243651-002 02/04/99 Information in this document is provided in connection with Intel products.
FFT/IFFTProcessor IP Core Datasheet
System-on-Chip engineering FFT/IFFTProcessor IP Core Datasheet - Released - Core:120801 Doc: 130107 This page has been intentionally left blank ii Copyright reminder Copyright c 2012 by System-on-Chip
REAL TIME DIGITAL SIGNAL PROCESSING
REAL TIME DIGITAL SIGNAL PROCESSING UTN-FRBA www.electron.frba.utn.edu.ar/dplab UTN-FRBA Frequency Analysis Fast Fourier Transform (FFT) Fast Fourier Transform DFT: complex multiplications (-) complex additions
Accelerating Nios II Systems with the C2H Compiler Tutorial
Accelerating Nios II Systems with the C2H Compiler Tutorial August 2008, Version 8.0 Tutorial Introduction The Nios II C2H Compiler is a powerful tool that generates hardware accelerators for software
ISim Hardware Co-Simulation Tutorial: Accelerating Floating Point FFT Simulation
ISim Hardware Co-Simulation Tutorial: Accelerating Floating Point FFT Simulation UG817 (v 14.3) October 16, 2012 This tutorial document was last validated using the following software version: ISE Design
COMPUTER ARCHITECTURE AND ORGANIZATION Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital
Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital hardware modules that accomplish a specific information-processing task. Digital systems vary in
Implementing FIR Filters
Implementing FIR Filters in FLEX Devices February 199, ver. 1.01 Application Note 73 FIR Filter Architecture This section describes a conventional FIR filter design and how the design can be optimized
Table of contents 2 / 42
NFL Prediction Model Table of contents Program Setup... 3 End User License Agreement...
TUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
FM300 Network Server
FM300 Network Server User s Manual March 2005 MEDA, Inc Macintyre Electronic Design Associates, Inc 43676 Trade Center Place, Suite 145 Dulles, VA 20166 Disclaimer of Warranty FM300 Network Server NO WARRANTIES
GPA Migration Guide
Diablo BTU Calculator 2.0 GPA 2145-09 Migration Guide Copyright 2008, Diablo Analytical, Inc. Diablo Analytical BTU Calculator 2.0 Software GPA 2145-09 Migration Guide Copyright 2008, Diablo Analytical,
Linköping University Post Print. Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs
Linköping University Post Print Analysis of Twiddle Factor Complexity of Radix-2^i Pipelined FFTs Fahad Qureshi and Oscar Gustafsson N.B.: When citing this work, cite the original article. 200 IEEE. Personal
CPE300: Digital System Architecture and Design
CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Arithmetic Unit 10032011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Recap Chapter 3 Number Systems Fixed Point
Parallel FIR Filters. Chapter 5
Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. Because the Virtex-4 architecture
APPLICATION NOTE. Atmel AT03261: SAM D20 System Interrupt Driver (SYSTEM INTERRUPT) SAM D20 System Interrupt Driver (SYSTEM INTERRUPT)
APPLICATION NOTE Atmel AT03261: SAM D20 System Interrupt Driver (SYSTEM INTERRUPT) ASF PROGRAMMERS MANUAL SAM D20 System Interrupt Driver (SYSTEM INTERRUPT) This driver for SAM D20 devices provides an
Using LPC11Axx EEPROM (with IAP)
Rev. 2 1 July 2012 Application note Document information Info Content Keywords LPC11A02UK ; LPC11A04UK; LPC11A11FHN33; LPC11A12FHN33; LPC11A12FBD48; LPC11A13FHI33; LPC11A14FHN33; LPC11A14FBD48; LPC11Axx,
Schematic Capture Lab 1
Schematic Capture Lab 1 PADS Schematic Design Environment and Workspace Schematic Capture Lab 1: PADS Schematic Design Environment and Workspace Your PADS Schematic Design environment starts when you select
1. NUMBER SYSTEMS USED IN COMPUTING: THE BINARY NUMBER SYSTEM
1. NUMBER SYSTEMS USED IN COMPUTING: THE BINARY NUMBER SYSTEM 1.1 Introduction Given that digital logic and memory devices are based on two electrical states (on and off), it is natural to use a number
Fujitsu Microelectronics Europe Application Note MCU-AN E-V10 FR FAMILY 32-BIT MICROCONTROLLER MB91460 REAL TIME CLOCK APPLICATION NOTE
Fujitsu Microelectronics Europe Application Note MCU-AN-300075-E-V10 FR FAMILY 32-BIT MICROCONTROLLER MB91460 REAL TIME CLOCK APPLICATION NOTE Revision History Revision History Date 2008-06-05 First Version;
Mile Terms of Use. Effective Date: February, Version 1.1 Feb 2018 [ Mile ] Mileico.com
Mile Terms of Use Effective Date: February, 2018 Version 1.1 Feb 2018 [ Mile ] Overview The following are the terms of an agreement between you and MILE. By accessing, or using this Web site, you acknowledge
1. License Grant; Related Provisions.
IMPORTANT: READ THIS AGREEMENT CAREFULLY. THIS IS A LEGAL AGREEMENT BETWEEN AVG TECHNOLOGIES CY, Ltd. ( AVG TECHNOLOGIES ) AND YOU (ACTING AS AN INDIVIDUAL OR, IF APPLICABLE, ON BEHALF OF THE INDIVIDUAL
MyCreditChain Terms of Use
MyCreditChain Terms of Use Date: February 1, 2018 Overview The following are the terms of an agreement between you and MYCREDITCHAIN. By accessing, or using this Web site, you acknowledge that you have
AccelDSP Synthesis Tool
AccelDSP Synthesis Tool Release Notes Xilinx is disclosing this Document and Intellectual Property (hereinafter the Design ) to you for use in the development of designs to operate on, or interface
INTRODUCTION TO THE FAST FOURIER TRANSFORM ALGORITHM
Course Outline Course Outline INTRODUCTION TO THE FAST FOURIER TRANSFORM ALGORITHM Introduction Fast Fourier Transforms have revolutionized digital signal processing What is the FFT? A collection of tricks
Ethernet1 Xplained Pro
Ethernet1 Xplained Pro Part Number: ATETHERNET1-XPRO The Atmel Ethernet1 Xplained Pro is an extension board to the Atmel Xplained Pro evaluation platform. The board enables the user to experiment with
Setting up the DR Series System on Acronis Backup & Recovery v11.5. Technical White Paper
Setting up the DR Series System on Acronis Backup & Recovery v11.5 Technical White Paper Quest Engineering November 2017 2017 Quest Software Inc. ALL RIGHTS RESERVED. THIS WHITE PAPER IS FOR INFORMATIONAL
Capital. Capital Logic Aero. v Student Workbook
Capital v2018.1 Student Workbook 2019 Mentor Graphics Corporation All rights reserved. This document contains information that is trade secret and proprietary to Mentor Graphics Corporation or its licensors
Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997
Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997 APPENDIX A.1 Number systems and codes Since ten-fingered humans are addicted to the decimal system, and since computers
FFT. There are many ways to decompose an FFT [Rabiner and Gold] The simplest ones are radix-2 Computation made up of radix-2 butterflies X = A + BW
FFT There are many ways to decompose an FFT [Rabiner and Gold] The simplest ones are radix-2 Computation made up of radix-2 butterflies: X = A + BW, Y = A - BW. B. Baas 442 FFT Dataflow Diagram Dataflow
Design of Delay Efficient Distributed Arithmetic Based Split Radix FFT
Design of Delay Efficient Arithmetic Based Split Radix FFT Nisha Laguri #1, K. Anusudha *2 #1 M.Tech Student, Electronics, Department of Electronics Engineering, Pondicherry University, Puducherry, India
RapidIO TM Interconnect Specification Part 7: System and Device Inter-operability Specification
RapidIO TM Interconnect Specification Part 7: System and Device Inter-operability Specification Rev. 1.3, 06/2005 Copyright RapidIO Trade Association RapidIO Trade Association Revision History Revision
RapidIO Interconnect Specification Part 3: Common Transport Specification
RapidIO Interconnect Specification Part 3: Common Transport Specification Rev. 1.3, 06/2005 Copyright RapidIO Trade Association RapidIO Trade Association Revision History Revision Description Date 1.1
AT03262: SAM D/R/L/C System Pin Multiplexer (SYSTEM PINMUX) Driver. Introduction. SMART ARM-based Microcontrollers APPLICATION NOTE
SMART ARM-based Microcontrollers AT03262: SAM D/R/L/C System Pin Multiplexer (SYSTEM PINMUX) Driver APPLICATION NOTE Introduction This driver for Atmel SMART ARM-based microcontrollers provides an interface
SensView User Guide. Version 1.0 February 8, Copyright 2010 SENSR LLC. All Rights Reserved. R V1.0
SensView User Guide Version 1.0 February 8, 2010 Copyright 2010 SENSR LLC. All Rights Reserved. R001-419-V1.0 TABLE OF CONTENTS 1 PREAMBLE 3 1.1 Software License Agreement 3 2 INSTALLING SENSVIEW 5 2.1
Using MMX Instructions to Implement a Synthesis Sub-Band Filter for MPEG Audio Decoding
Using MMX Instructions to Implement a Synthesis Sub-Band Filter for MPEG Audio Information for Developers and ISVs From Intel Developer Services www.intel.com/ids Information in this document is provided
Performance Analysis of Line Echo Cancellation Implementation Using TMS320C6201
Performance Analysis of Line Echo Cancellation Implementation Using TMS320C6201 Application Report: SPRA421 Zhaohong Zhang and Gunter Schmer Digital Signal Processing Solutions March 1998 IMPORTANT NOTICE
Automator (Standard)
Automator (Standard) DLL Users Guide Available exclusively from PC Control Ltd. www.pc-control.co.uk 2017 Copyright PC Control Ltd. Revision 1.2 Contents 1. Introduction 2. DLL Reference 3. Using the DLL
Using MMX Instructions to implement 2X 8-bit Image Scaling
Using MMX Instructions to implement 2X 8-bit Image Scaling Information for Developers and ISVs From Intel Developer Services www.intel.com/ids Information in this document is provided in connection with
An Efficient Vector/Matrix Multiply Routine using MMX Technology
An Efficient Vector/Matrix Multiply Routine using MMX Technology Information for Developers and ISVs From Intel Developer Services www.intel.com/ids Information in this document is provided in connection
Using a Scalable Parallel 2D FFT for Image Enhancement
Introduction Using a Scalable Parallel 2D FFT for Image Enhancement Yaniv Sapir Adapteva, Inc. Email: yaniv@adapteva.com Frequency domain operations on spatial or time data are often used as a means for
Arithmetic Processing
CS/EE 5830/6830 VLSI ARCHITECTURE Chapter 1 Basic Number Representations and Arithmetic Algorithms Arithmetic Processing AP = (operands, operation, results, conditions, singularities) Operands are: Set