Implementing the Fast Fourier Transform for the Xtensa Processor


Application Note
Tensilica, Inc.
Scott Blvd. Santa Clara, CA (408) Fax (408)
November 2005
Doc Number: AN

2 2005 Tensilica, Inc. Printed in the United States of America All Rights Reserved This publication is provided AS IS. Tensilica, Inc. (hereafter Tensilica ) does not make any warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Information in this document is provided solely to enable system and software developers to use Tensilica processors. Unless specifically set forth herein, there are no express or implied patent, copyright or any other intellectual property rights or licenses granted hereunder to design or fabricate Tensilica integrated circuits or integrated circuits based on the information in this document. Tensilica does not warrant that the contents of this publication, whether individually or as one or more groups, meets your requirements or that the publication is error-free. This publication could include technical inaccuracies or typographical errors. Changes may be made to the information herein, and these changes may be incorporated in new editions of this publication. Tensilica is a registered trademark of Tensilica, Inc. The following terms are trademarks of Tensilica, Inc.: OSKit, Vectra, and Xtensa. All other trademarks and registered trademarks are the property of their respective companies. NO LIABILITY FOR CONSEQUENTIAL DAMAGES.TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT SHALL Tensilica BE LIABLE FOR ANY DAMAGES WHATSOEVER (INCLUDING WITHOUT LIMITATION, SPECIAL, INCIDENTAL, CONSEQUENTIAL, OR INDIRECT DAMAGES FOR PERSONAL INJURY, LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION, LOSS OF BUSINESS INFORMATION, OR ANY OTHER PECUNIARY LOSS) ARISING OUT OF THE USE OF OR INABILITY TO USE THIS PRODUCT CODE, EVEN IF MANUFACTURER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Document Change History: First Published, February 2001, revised 3/2005 Updated for LX/Xtensa 6, 11/2005 ii

Contents

1 Introduction
2 Basic C Implementation
3 Development of TIE Language Instructions from the C Algorithm
    General Cases for Loop Pipelining
        Case 1
        Case 2
    TIE Development Steps
4 Complete C Implementation with TIE
Performance
Conclusion
Glossary of Terms
Appendix A Complete Application Code
    radix_2.c
    radix_2_hand_code.s
    bfly.tie
    util.h
    util.c
Appendix B Application Output
Appendix C Import and use of workspace

Figures

Figure 1. Definition of the Discrete Fourier Transform
Figure 2. Decimation-in-Frequency DFT in Terms of Even- and Odd-numbered Frequency Samples
Figure 3. FFT Decimation-in-Frequency Algorithm
Figure 4. Basic Implementation of Radix-2 Decimation-in-Frequency FFT Algorithm
Figure 5. Complex Butterfly Computation
Figure 6. First Step in Converting FFT Algorithm into TIE Implementation
Figure 7. Showing Direct Correspondence of C Code to TIE Language Instructions
Figure 8. TIE Language Instructions for the FFT Algorithm

Tables

Table 1. Performance Results of Xtensa FFT Implementation
Table 2. Code Size and Performance Results

Abstract

The goal of this application note is to show the results and design methodology for a high-performance DSP sample application on the Xtensa microprocessor using a widely known example, the Fast Fourier Transform (FFT). This note first explains the basic algorithm and how several TIE language instructions were created to implement it. Performance results follow, with a comparison of implementations of the radix-2 decimation-in-frequency FFT with and without additional TIE language extensions. This application note assumes the reader has a basic understanding of the ASIC design methodology used with Xtensa processors and some familiarity with digital signal processing.

Preface

The Implementing the Fast Fourier Transform for the Xtensa Processor application note was updated for the LX and Xtensa 6 release of the Xtensa processor.

Location: Appendix A, radix_2.c
Old:
    /* Add this "# include" statement below, to include
     * C stub simulation of TIE instructions.
     * Used when when
     * Use this when confirming TIE instruction functionality
     * in native environment.
     *
     * #include "tdk/cstub-bfly.c"
     */
New:
    #include "tdk/cstub-bfly.c"

Location: Appendix A, util.c, end of source
Old:
    return 0;
New:
    return 0;

Additional Notes: Section 4, Complete C Implementation with TIE (Addendum)

The testbench requires an input file that describes the time-domain complex samples for the FFT algorithm to process. The input file is an ASCII file. The first line of this file states the number of floating-point data values on the following lines. The subsequent lines contain the floating-point data samples, one per line. The testbench automatically converts each floating-point sample to 16-bit fixed-point format.

For example, a 256-point complex data set has 256 real and 256 imaginary samples, so 512 data points are listed. The samples are given as floating point numbers from 1 to [...]. Complex pairs are listed with the real sample first and the imaginary sample second.

An example of the ASCII input file for a square wave is shown below: (pattern repeats 63 more times)

Build and exercise the FFT testbench for the Xtensa LX/6 processor. Import the workspace into Xplorer and select the Run or Profile button. Alternatively, enter the following commands after the configuration is installed and the environment is properly set up:

>tc -d tdk bfly.tie
>set XTENSA_PARAMS=.\tdk (or setenv XTENSA_PARAMS ./tdk, for Unix/Linux)
>xt-gcc -O3 radix_2.c -lm util.c radix_2_hand_code.s -o fft
>xt-run --pipe fft <input_file>
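The input file layout described in the preface (a count on the first line, then one sample per line with the real part before the imaginary part) can be produced with a few lines of C. This is only a sketch: the function name format_fft_input and the buffer-based interface are assumptions for illustration and are not part of the testbench, and the caller is responsible for keeping the sample values inside the range the testbench expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats n complex samples into buf using the layout
 * described in the preface. The first line holds the number of values that
 * follow (2 * n); each later line holds one value, real part first.
 * Returns the number of characters written, or -1 if buf is too small. */
int format_fft_input(char *buf, size_t size,
                     const double *re, const double *im, int n)
{
    size_t used;
    int w;

    assert(buf != NULL && re != NULL && im != NULL && n > 0);

    w = snprintf(buf, size, "%d\n", 2 * n);
    if (w < 0 || (size_t)w >= size)
        return -1;
    used = (size_t)w;

    for (int i = 0; i < n; i++) {
        w = snprintf(buf + used, size - used, "%f\n%f\n", re[i], im[i]);
        if (w < 0 || (size_t)w >= size - used)
            return -1;
        used += (size_t)w;
    }
    return (int)used;
}
```

The resulting buffer can be written to a file with fputs() and passed to the testbench as <input_file>.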

1 Introduction

One of the most widely used algorithms in digital signal processing applications is the Fast Fourier Transform (FFT), which covers a family of techniques for computing the Discrete Fourier Transform (DFT), as defined in the equation in Figure 1. The FFT is a large class of algorithms applicable to a variety of signal processing applications such as audio, video, and speech. For the purpose of this application note, one of these algorithms is highlighted for its simplicity and is used to show how to achieve performance improvements on Xtensa processors.

$$X[k] = \frac{1}{N} \sum_{n=0}^{N-1} x[n]\, e^{-j(2\pi/N)nk}, \qquad k = 0, 1, \ldots, N-1$$

FIGURE 1. DEFINITION OF THE DISCRETE FOURIER TRANSFORM

In Figure 1, $X[k]$ are the $N$ frequency samples being computed, $x[n]$ are the time-domain input samples, $n$ is the index of the time-domain samples, and $k$ is the index of the frequency-domain samples. Also, $j$ is defined such that $j^2 = -1$, following a common convention. Note that for convenience, the $1/N$ scaling factor is omitted in the Xtensa implementation described in this document.

The purpose of this application note is to explain an optimized implementation of the FFT for the Xtensa processor using Tensilica Instruction Extension (TIE) language constructs. The TIE constructs are optimized instructions that are integrated into the Xtensa processor pipeline. Using the TIE language involves considering which operations can benefit most from hardware optimization. This application note shows that optimizing the FFT butterfly operations with TIE language hardware extensions attains results competitive with high-performance DSP implementations of the FFT.

Table 1 summarizes the performance obtained with the C code and with hand-optimized assembly code, both using TIE language instructions. This application note focuses on TIE language development for the case of the FFT algorithm implemented in C. The development of the assembly implementation is not discussed in detail, as it incorporates the same TIE instructions as the C implementation. The performance improvement of the assembly code is attributable to its special structure, which differs from that of the C implementation. In addition to the TIE optimization, we applied IPA (Interprocedural Analysis) and Profile Directed Compilation. Both are Xtensa-specific optimizations provided by the Xtensa tool chain.

TABLE 1. PERFORMANCE RESULTS OF XTENSA FFT IMPLEMENTATION

                              With TIE
                              C          Assembly
    Code Size (Bytes)
    Performance (cycles)
        128-point
        256-point
        point
        point

For a complete background and detailed derivations of the various FFT algorithms, refer to one of the many textbooks on the subject (for example, Discrete-Time Signal Processing by Oppenheim, et al., Prentice-Hall, 1999).

Also, please note that Tensilica offers a set of FFT software libraries for the Vectra(TM) DSP Engine, available as a coprocessor option to the Xtensa processor. The Vectra engine is optimized for a broad array of fixed-point arithmetic functions. The techniques described in this application note offer a dedicated solution for acceleration of the FFT algorithm with TIE instructions. To understand which approach is most appropriate for your application, please refer to the Vectra DSP Engine User's Guide, or consult an Applications Engineer of Tensilica, Inc.

2 Basic C Implementation

When $N$ is divisible by 2, the $N$-point DFT can be recursively decomposed into two $N/2$-point DFTs. This decomposition is the basis for what is commonly described as a radix-2 FFT algorithm. There are two canonical ways to perform a decomposition into two DFTs. One way, known as decimation-in-time, uses the even and odd time-domain samples (decimated sequences) as input to the two DFTs. The second approach decomposes into two DFTs such that their results are the even and odd frequency-domain sample sequences, or decimated output sequences; this is known as decimation-in-frequency.

The decimation-in-frequency decomposition simplifies the DFT summation, defined previously in Figure 1, by decimating the frequency-domain samples by 2. The summation is split into two summations: one across the even-numbered frequency samples $X[2k]$ for $k = 0, 1, \ldots, N/2 - 1$, and one across the odd-numbered samples $X[2k+1]$ for $k = 0, 1, \ldots, N/2 - 1$. The decomposition is simplified into two equations that represent the DFT results in the even and odd cases; the two equations are given in Figure 2 (with the $1/N$ scaling factor omitted):

$$X[2r] = \sum_{n=0}^{(N/2)-1} \bigl(x[n] + x[n + (N/2)]\bigr)\, e^{-j\left(\frac{2\pi}{N/2}\right)nr}, \qquad r = 0, 1, \ldots, (N/2)-1$$

$$X[2r+1] = \sum_{n=0}^{(N/2)-1} \bigl(x[n] - x[n + (N/2)]\bigr)\, e^{-j\left(\frac{2\pi}{N}\right)n}\, e^{-j\left(\frac{2\pi}{N/2}\right)nr}, \qquad r = 0, 1, \ldots, (N/2)-1$$

FIGURE 2. DECIMATION-IN-FREQUENCY DFT IN TERMS OF EVEN- AND ODD-NUMBERED FREQUENCY SAMPLES

It is convenient to define a set of factors $W_N^n = e^{-j(2\pi/N)n}$ over the set $n \in \{0, 1, \ldots, N/2 - 1\}$ for the computation of the odd-numbered samples in the DFT decomposition. As shown in Figure 2, performing the decomposition involves additions, subtractions, and multiplications by the factors $W_N^n$. Additional decomposition stages require the factors $W_{N/2}^n$, $W_{N/4}^n$, $W_{N/8}^n$, and so on.

When $N$ is a power of 2, the DFT can be implemented by $\log_2(N)$ such recursive decompositions. The quantities $W_m^n = e^{-j(2\pi/m)n}$, for $m = N, N/2, \ldots, 2$ and $n = 0, 1, \ldots, m/2 - 1$, required in the FFT stages are referred to as twiddle factors. Any stage can be implemented by the original definition (in Figure 1) or by a decomposition. When all $\log_2(N)$ stages are implemented by the decimation-in-frequency decomposition, each stage requires only adds, subtracts, and multiplies by twiddle factors. Figure 3 depicts the radix-2 decimation-in-frequency FFT. This paper focuses on implementing an $N$-point decimation-in-frequency FFT computation for complex input values.

[Figure 3: flow graph containing two N/2-point DFT blocks and multiplications by the twiddle factors $W_N^1, W_N^2, W_N^3, \ldots, W_N^{N/2-2}, W_N^{N/2-1}$]

FIGURE 3. FFT DECIMATION-IN-FREQUENCY ALGORITHM

Our FFT algorithm implementations use the following conventions:

The complex $N$-point input to the FFT is organized as an array of $2N$ components, with elements alternating between the real and imaginary components of the input $N$-point sequence.

The twiddle factors are stored as a set of arrays used as input to the FFT algorithm:

$$\{\, W_N^0 \cdots W_N^{N/2-1};\; W_{N/2}^0 \cdots W_{N/2}^{N/4-1};\; W_{N/4}^0 \cdots W_{N/4}^{N/8-1};\; \ldots \,\}$$

The real and imaginary components of the twiddle factors alternate in the sequence. Each successive sequence contains half as many twiddle factors as the previous one.

The computation assumes a 24-bit fixed-point representation with 16 places to the right of the binary point.

The basic algorithm implemented in C is shown in Figure 4.

    #define DESCALE(x, r) ((int) (((x) + (r)) >> 16))
    const int round = (1 << (15));

    void r2_fft_basic(int *data, int n, int *twiddles)
    {
        register int i, d0r, d0i, d1r, d1i, r0r, r0i, r1r, r1i;
        register int *p, *q, wr, wi;
        register long long tr, ti;
        register int m = n;
        register int *t = twiddles;

        while (m > 1) {                       /* Outer loop (Decomposition into stages)*/
            for (i = 0; i < m; i += 2) {      /* Middle loop */
                p = data + i; q = p;
                wr = *(t++); wi = *(t++);
                while (q < &data[n << 1]) {   /* Inner loop */
                    d0r = *(p + 0); d0i = *(p + 1); p += m;
                    d1r = *(p + 0); d1i = *(p + 1); p += m;
                    r0r = d0r + d1r; r0i = d0i + d1i;
                    tr = d0r - d1r; ti = d0i - d1i;
                    r1r = DESCALE(tr * wr, round) - DESCALE(ti * wi, round);
                    r1i = DESCALE(tr * wi, round) + DESCALE(ti * wr, round);
                    *(q + 0) = r0r; *(q + 1) = r0i; q += m;
                    *(q + 0) = r1r; *(q + 1) = r1i; q += m;
                }
            }
            m >>= 1;
        }
    }

FIGURE 4. BASIC IMPLEMENTATION OF RADIX-2 DECIMATION-IN-FREQUENCY FFT ALGORITHM

Here, the focus is on the series of radix-2 decompositions of an $N$-point DFT in which $N$ is a power of 2; there are $\log_2(N)$ stages of butterfly operations performed on the input. At the top of the algorithm, the DESCALE macro is defined both to keep 24 bits of accuracy in the result of a multiply operation and to maintain consistency in precision between the C implementation and the Xtensa implementation that uses hardware defined in the TIE language, so that both give equally accurate results. By adding the round value to a 64-bit integer and then scaling the sum by $2^{-16}$, the 64-bit integer is effectively converted into a fixed-point value, rounded to the least-significant bit.

The outer while() loop counts the $\log_2(N)$ successive stages of DFT computations. The for() loop, called the middle loop, initializes both p to point to the first data element of the current stage and q to point to the same data element, which will be overwritten with the output data.
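The effect of the rounding constant in DESCALE can be seen on a single fixed-point multiply. The sketch below applies the same add-then-shift step to the 64-bit product of two values with 16 fractional bits; the function names are assumptions for illustration, and the code relies on the right shift of a negative value being an arithmetic shift, which is implementation-defined in C but is the behavior of common compilers.

```c
#include <assert.h>

#define Q16_FRAC_BITS 16

/* Multiply two values with 16 fractional bits and round the result to the
 * nearest least-significant bit, as DESCALE does with round = 1 << 15. */
int q16_mul(int a, int b)
{
    const long long half_lsb = 1LL << (Q16_FRAC_BITS - 1);
    long long product = (long long)a * b;    /* 32 fractional bits */

    return (int)((product + half_lsb) >> Q16_FRAC_BITS);
}

/* Same multiply without the rounding constant, for comparison only. */
int q16_mul_truncating(int a, int b)
{
    return (int)(((long long)a * b) >> Q16_FRAC_BITS);
}
```

With 1.5 represented as 98304 and 0.5 as 32768, q16_mul returns 49152, which is 0.75. For the operands 3 and 32768 the exact result is 1.5 least-significant bits: the rounding version yields 2, while the truncating version yields 1.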

The innermost while() loop, called the inner loop, traverses the input elements for the current butterfly stage. For each stage $h$, where $h \in \{1, 2, \ldots, \log_2(N)\}$, there are $2^{h-1}$ subproblems, that is, DFTs of length $N/2^{h-1}$ each. Each subproblem has a set of $N/2^{h}$ butterflies that use two values each as input and compute two output values. The real and imaginary parts of the butterfly computation are shown in Figure 5.

[Figure 5: butterfly diagram with inputs ar, ai, br, bi and twiddle factor components cr, ci; the upper wing produces the sums ar+br and ai+bi, and the lower wing multiplies the differences (ar-br) and (ai-bi) by the twiddle factor]

FIGURE 5. COMPLEX BUTTERFLY COMPUTATION

In each stage, however, each pass through the inner loop computes one set of butterfly results per subproblem. This order of computation allows the same twiddle factor to be reused for all subproblems in the same FFT stage. Once the final group of butterflies has been computed for all subproblems of stage $h$, the next stage is $h + 1$, as m is divided by 2 again to prepare for another trip through the outer while() loop.

3 Development of TIE Language Instructions from the C Algorithm

This section explains the function of the TIE language implementation. TIE instructions encapsulate functionality of the algorithm in hardware. For more detail on the TIE language basics, refer to the Tensilica Instruction Extension (TIE) Language User's Guide. The goal of this document is to show the functionality of the TIE implementation of the FFT algorithm by converting the simple C implementation, step by step, into an efficient C routine that uses TIE instructions.

The goal for the TIE implementation of the radix-2 FFT is to achieve maximal performance with a reasonable hardware cost. TIE allows an arbitrary number of parallel operations to be performed in a single instruction. However, it is not possible to perform more than one load or store per instruction or per cycle. Therefore, one general TIE programming technique is to fold computation operations into TIE load or store instructions. If every computational instruction is combined with a load or store instruction, a load or store instruction will be issued on each cycle. Without algorithmic changes, such an approach leads to an optimal number of issued instructions. For many algorithms, including the radix-2 FFT, it is possible to fold all computational instructions into load or store instructions with reasonable hardware costs.

Even when the number of instructions is minimized, folded instructions might have large latencies that are either infeasible to build or lead to stalls in the pipeline. Adding hardware resources can eliminate this problem. New data can be loaded in parallel with computations done on old data, and even older results can be stored in parallel with the current computation. This technique applies as long as different iterations of a loop do not depend on each other, and it comes at the cost of additional hardware to buffer the results. The technique is popular in software and, in that context, is called software pipelining; here the pipeline is managed in hardware.

General Cases for Loop Pipelining

To begin optimizing the FFT inner loop with TIE instructions, it makes sense to look at ways to optimize load-operation-store loops in general. This loop structure is applicable because the inner loop of the general FFT algorithm in Figure 4 consists of loads, then computation, then stores. The loads bring in the four new input values ar, ai, br, and bi. The computation creates four output values, rr, ri, sr, and si, and the outputs are stored at the end of the loop body. While the Xtensa architecture issues at most one load or store per processor cycle, the TIE instructions can be written with computations in parallel, limited by the amount of compute hardware available. Unrolling the load-operation-store loop offers the advantage of storing a previous result and computing the next result in parallel. The following two cases find the maximum latency allowed for computation.

Case 1

Case 1 presents a load-operation-store pipeline loop with two inputs and one output and a computation latency of n cycles.

[Diagram: memory operations issued in the order load2 A, load2 B, store1 C, load3 A, load3 B, store2 C, load4 A, load4 B, store3 C, with the computation of C for each iteration overlapped; compute latencies of 1 cycle, 2 cycles, and 3 cycles are shown]

In this diagram, the loop is software pipelined and the loop iteration is indicated by subscripts. The first two loads of values A and B are not shown, but the second two loads (load2) are followed by a store instruction for the previous result (store1). Two more loads follow, which gives up to three cycles during which the latency of a compute operation can be hidden. Hence, for n=1, n=2, or n=3 cycles, there is no cycle-count penalty for performing a computation in parallel with the loads and stores. Thus, TIE instructions with appropriate loop pipelining can be used to perform computations in parallel with the loads and stores.

Case 2

Case 2 presents a load-operation-store pipeline loop with two inputs, two outputs, and a computation latency of n cycles.

[Diagram: memory operations issued in the order store1 C, store1 D, load2 A, load2 B, store2 C, store2 D, load3 A, load3 B, ..., with the computation of C and D for each iteration overlapped; two 2-cycle compute slots are shown]

In this case the goal is to compute two results, which requires two store instructions. Software pipelining this loop requires two cycles for the loads and two cycles for the stores; hence, it is possible to perform in parallel two computations whose latency is two cycles each. If the computation is on more than one set of input data, efficient use of the hardware can be made while hiding latency. This case is similar to the FFT implementation of this application note, which uses a 2-cycle multiplier in each cycle on two sets of real and imaginary input (for 4 results) in the 4 slots available.

The loop optimization described sets the tone for the way chosen to convert the general C algorithm into an efficient FFT implementation for an Xtensa processor with TIE.

TIE Development Steps

There are other optimization techniques applied to the FFT implementation. A technique called SIMD (Single Instruction Multiple Data) can accommodate simultaneous pipelining of independent computations. As Xtensa provides for 128-bit memory accesses, wide loads of four integer values at a time can address the need for two sets of two inputs for each butterfly. Two 128-bit loads best utilize the available hardware to perform two butterfly computations. The structure of loads and stores is SIMD in that the real and imaginary components of two consecutive inputs can be loaded in one cycle. Similarly, the components of two complex outputs can be stored in one cycle, allowing the latency of two butterfly computations to be hidden.

Figure 6 shows the result of several changes made to the basic C algorithm to create an optimized algorithm using TIE. First, the middle loop is unrolled and the bodies of the two resulting inner loop copies are combined. Duplicated variables perform two complex butterfly computations in the inner loop body. The following list summarizes the copies of variables:

    Variable                  Meaning
    c0r, c0i, c1r, c1i        Twiddle factors (two sets of real and imaginary twiddle factors from adjacent elements of the twiddle factor array twiddles)
    a0r, a0i, a1r, a1i        Input data
    b0r, b0i, b1r, b1i        Input data at the alternate butterfly wing. b0 and b1 are each m samples away from the corresponding input data a0 and a1
    r0r, r0i, r1r, r1i        Results of butterfly computations (upper wing, which gives the sum of the butterfly inputs)
    t0r, t0i, t1r, t1i        Intermediate values used to find outputs of the current butterfly stage (lower wing)
    s0r, s0i, s1r, s1i        Results of butterfly computations (lower wing, which gives the product of the twiddle factor and the difference of the inputs)

It is necessary to peel off the last trip through the outer loop because when m = 2 (the last stage of radix-2 decompositions), the butterfly computations are performed on consecutive input data. In prior butterfly computations, it is sufficient to use one address to load $a0_r + j\,a0_i$ and $a1_r + j\,a1_i$, whose real and imaginary components are all consecutive in memory, and another address to load the values corresponding to $b0_r + j\,b0_i$ and $b1_r + j\,b1_i$. But in the last trip, looking at four consecutive values gives $a0_r + j\,a0_i$ and $b0_r + j\,b0_i$, since the two inputs to each butterfly correspond to consecutive output values from the previous stage. This special characteristic of the last trip explains why the values of a0r, a0i, b0r, and b0i are loaded in a different order in the last trip.
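The complex butterfly of Figure 5, which the unrolled loop evaluates twice per trip (once for each set of variables in the list above), can be sketched as a single C function. The struct type cq16 and the function names are assumptions introduced for illustration; the arithmetic follows the inner loop of Figure 4. As with DESCALE, the right shift of a negative value is assumed to be arithmetic.

```c
#include <assert.h>

/* Hypothetical complex value with 16 fractional bits per component. */
typedef struct {
    int re;
    int im;
} cq16;

/* Round a 64-bit product back to 16 fractional bits (same as DESCALE). */
static int descale(long long x)
{
    return (int)((x + (1LL << 15)) >> 16);
}

/* One radix-2 decimation-in-frequency butterfly:
 *   r = a + b          (upper wing)
 *   s = (a - b) * c    (lower wing, c is the twiddle factor) */
void butterfly_q16(cq16 a, cq16 b, cq16 c, cq16 *r, cq16 *s)
{
    long long tr = (long long)a.re - b.re;
    long long ti = (long long)a.im - b.im;

    r->re = a.re + b.re;
    r->im = a.im + b.im;
    s->re = descale(tr * c.re) - descale(ti * c.im);
    s->im = descale(tr * c.im) + descale(ti * c.re);
}
```

With a = 1, b = 0 and twiddle factor c = -j, the upper wing returns 1 and the lower wing returns -j. Each butterfly needs four multiplies, so the two butterflies per trip account for the eight multiplies per SIMD iteration mentioned later in this note.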

    #define DESCALE(x, r) ((int) (((x) + (r)) >> 16))
    const int round = (1 << 15);

    void r2_fft_step1(int *data, int n, int *twiddles)
    {
        register int i, c0r, c0i, c1r, c1i, a0r, a0i, a1r, a1i;
        register int b0r, b0i, b1r, b1i, r0r, r0i, r1r, r1i, s0r, s0i, s1r, s1i;
        register long long t0r, t0i, t1r, t1i;
        register int m = n;
        register int *t = twiddles;
        register int *p, *q;

        while (m > 2) {                       /* Outer loop (Decomposition into stages)*/
            for (i = 0; i < m; i += 4) {      /* Middle loop */
                p = data + i; q = p;
                c0r = *(t++);                 /* SIMD: c[0:3] = t[0:3] */
                c0i = *(t++);
                c1r = *(t++);
                c1i = *(t++);
                while (q < &data[n << 1]) {   /* Inner loop */
                    a0r = *(p + 0);           /* SIMD: a[0:3] = p[0:3] */
                    a0i = *(p + 1);
                    a1r = *(p + 2);
                    a1i = *(p + 3);
                    p += m;
                    b0r = *(p + 0);           /* SIMD: b[0:3] = p[0:3] */
                    b0i = *(p + 1);
                    b1r = *(p + 2);
                    b1i = *(p + 3);
                    p += m;
                    r0r = a0r + b0r; r0i = a0i + b0i;
                    t0r = a0r - b0r; t0i = a0i - b0i;
                    s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round);
                    s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round);
                    r1r = a1r + b1r; r1i = a1i + b1i;
                    t1r = a1r - b1r; t1i = a1i - b1i;
                    s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round);
                    s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round);
                    *(q + 0) = r0r;           /* SIMD: q[0:3] = r[0:3] */
                    *(q + 1) = r0i;
                    *(q + 2) = r1r;
                    *(q + 3) = r1i;
                    q += m;
                    *(q + 0) = s0r;           /* SIMD: q[0:3] = s[0:3] */
                    *(q + 1) = s0i;
                    *(q + 2) = s1r;
                    *(q + 3) = s1i;
                    q += m;
                }
            }
            m >>= 1;
        }

        /* Final trip through Middle loop (case m = 2) */
        p = data; q = p;
        c0r = *(t++);
        c0i = *(t++);
        c1r = *(t++);
        c1i = *(t++);
        while (q < &data[n << 1]) {           /* Inner loop */
            a0r = *(p + 0);
            a0i = *(p + 1);
            b0r = *(p + 2);                   /* b0r,b0i swapped with a1r,a1i */
            b0i = *(p + 3);
            p += 4;
            a1r = *(p + 0);
            a1i = *(p + 1);
            b1r = *(p + 2);
            b1i = *(p + 3);
            p += 4;
            r0r = a0r + b0r; r0i = a0i + b0i;
            t0r = a0r - b0r; t0i = a0i - b0i;
            s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round);
            s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round);
            r1r = a1r + b1r; r1i = a1i + b1i;
            t1r = a1r - b1r; t1i = a1i - b1i;
            s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round);
            s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round);
            *(q + 0) = r0r;
            *(q + 1) = r0i;
            *(q + 2) = s0r;
            *(q + 3) = s0i;
            q += 4;
            *(q + 0) = r1r;
            *(q + 1) = r1i;
            *(q + 2) = s1r;
            *(q + 3) = s1i;
            q += 4;
        }
    }

FIGURE 6. FIRST STEP IN CONVERTING FFT ALGORITHM INTO TIE IMPLEMENTATION

With two sets of real and imaginary results being computed in each trip through the inner loop, the computation can now be pipelined. The goal is an implementation with loads or stores in every cycle and compute operations in parallel:

[Diagram: pipelined schedule in which each group of two loads and two stores, one memory operation per cycle, overlaps four compute operations]

When pipelining computations with load/store operations, TIE state registers are required to buffer data retrieved from memory and data to be stored to memory. Meaningful computations can be performed on state registers only after values have been loaded into them. Hence, new data is loaded into TIE state acting as input buffers, while computations proceed on older data that was previously latched from those input buffers into different state registers.

With these TIE considerations in mind, variables are needed that act as load buffers into which the load first brings input data. Additionally, variables are created to act as store buffers that receive the results of the compute operations before they are stored to memory. Figure 7 shows an implementation in which additional buffer variables have been added.

NOTE: These variables correspond directly to TIE state registers; loads move data directly from memory into TIE state, compute operations acting on TIE state can act independently of loads and stores, and stores move values from TIE state back into memory.

In each iteration of the inner loop, the load buffers can hold the next four input values, in other words, the real and imaginary components of the FFT input values $a0_r + j\,a0_i$ and $a1_r + j\,a1_i$. This implementation uses 24-bit TIE state registers, and with a processor interface (PIF) that is 128 bits wide, loads are issued in two consecutive cycles to retrieve the values corresponding to two butterfly inputs. In our C implementation the variables a0r_load, a0i_load, a1r_load, a1i_load and also b0r_load, b0i_load, b1r_load, and b1i_load are used to represent the load buffers. In another two processor cycles, the already-computed FFT butterfly-stage outputs r0r, r0i, s0r, s0i, r1r, r1i, s1r, and s1i can be stored.

There are also assignment instructions (after the outputs are computed) that latch the load buffer registers into the variables required for the computation. In the load-compute-store pipeline being implemented, the latching prepares for the next computation cycle. The direct correspondence between the C code given and the TIE implementation is shown as comments in Figure 7. The four basic instructions, BFLY0.LU, BFLY1.LU, BFLY2.SU, and BFLY3.SU, have been created with variants used to perform the first two trips through the inner loop to prime the pipeline. In Figure 7 the inner loop is peeled to prime the pipeline. Also, the last trip through the middle loop for the case m=2 is peeled because the inputs to the butterfly computations on this pass are consecutive in memory, as discussed above.

Before explaining the details of each TIE instruction, this paper examines how the C code for the FFT is written so that it has a direct correspondence with TIE; this is the essence of TIE development. It is useful to understand the TIE definition and processor restrictions that characterize how an instruction extension can be defined for the Xtensa processor. The Xtensa processor restricts the number of memory operations per cycle to one, but the width of the Processor Interface (PIF) determines the throughput. With a PIF width limit of 128 bits, a throughput of two sets of complex butterfly inputs in two cycles is achievable; the throughput of stores is the same. In turn, the data throughput determines how much computation must be accomplished, and moving the data into load buffers defined as TIE state registers allows as much computation as is necessary.
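The load buffer and store buffer arrangement described above can be illustrated on a much simpler loop. In the sketch below, loads run one iteration ahead of the computation and stores run one iteration behind it, with the first two trips peeled to prime the pipeline. The function names and the placeholder computation are assumptions for illustration only; unlike the FFT code, this sketch guards the loads so that it never reads past the end of the input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the butterfly arithmetic. */
static int compute(int x)
{
    return 3 * x + 1;
}

/* Reference: plain load-compute-store loop. */
void plain_loop(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = compute(in[i]);
}

/* Same result, with a load buffer, an operand latch, and a store buffer. */
void pipelined_loop(const int *in, int *out, size_t n)
{
    int load_buf, operand, result, store_buf;

    if (n == 0)
        return;

    /* First peeled trip: load only, then latch the operand. */
    load_buf = in[0];
    operand = load_buf;

    /* Second peeled trip: load the next input and compute, no store. */
    if (n > 1)
        load_buf = in[1];
    result = compute(operand);
    operand = load_buf;

    for (size_t i = 0; i < n; i++) {
        if (i + 2 < n)
            load_buf = in[i + 2];    /* load for element i + 2 */
        store_buf = result;          /* latch the result for element i */
        result = compute(operand);   /* compute element i + 1; on the last
                                        trip this value is never stored */
        operand = load_buf;          /* latch the operand for the next trip */
        out[i] = store_buf;          /* store element i */
    }
}
```

Because the load, compute, and store in one trip of the loop body touch three different elements, hardware that executes them in the same cycles does not have to wait for the compute latency of any single element.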

17 void The limitation of one load or store per cycle determines the number of cycles required to do one iteration of the computation. In turn, the number of cycles affects the amount of other hardware resources required. For example, in the FFT implementation each SIMD iteration (two original iterations) requires 8 multiplies and will execute in 4 cycles. Hence two multipliers (one SIMD multiplier) are required. In turn, the latency of the operations determining how many loop iterations ahead loads should happen ahead of the computation. In this FFT implementation, the two loads are placed one iteration ahead of the computation, and the two stores happen one iteration later than the computation. r2_fft_step3(int *data, int n, int *twiddles) { register int i, c0r, c0i, c1r, c1i, a0r, a0i, a1r, a1i; register int b0r, b0i, b1r, b1i, r0r, r0i, r1r, r1i; register long long t0r, t0i, t1r, t1i; register int s0r, s0i, s1r, s1i; register int a0r_load, a0i_load, a1r_load, a1i_load; register int b0r_load, b0i_load, b1r_load, b1i_load; register int r0r_store, r0i_store, r1r_store, r1i_store; register int s0r_store, s0i_store, s1r_store, s1i_store; register int m = n; register int *t = twiddles; register int *p, *q; while (m > 2) { /* Outer loop (Decomposition into stages) */ for (i = 0; i < m; i += 4) { /* Middle loop */ p = data + i; q = p; c0r = *(t++); /* LDC_LU */ c0i = *(t++); /* LDC_LU */ c1r = *(t++); /* LDC_LU */ c1i = *(t++); /* LDC_LU */ /* While loop below has been peeled so that the first 2 trips through * the loop are simplified. 
*/ /* The first peeled trip through the loop is simplified so that * it doesn't perform arithmetic or store results */ a0r_load = *(p + 0); /* BFLY0_LU */ a0i_load = *(p + 1); /* BFLY0_LU */ a1r_load = *(p + 2); /* BFLY0_LU */ a1i_load = *(p + 3); /* BFLY0_LU */ p += m; /* BFLY0_LU */ b0r_load = *(p + 0); /* BFLY1_LU */ b0i_load = *(p + 1); /* BFLY1_LU */ b1r_load = *(p + 2); /* BFLY1_LU */ b1i_load = *(p + 3); /* BFLY1_LU */ p += m; /* BFLY1_LU */ a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load; /* OPLCH */ 11

18 b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load; /* OPLCH */ /* The second peeled trip through the loop is simplified so that * it doesn't store results */ a0r_load = *(p + 0); /* BFLY0_LU */ a0i_load = *(p + 1); /* BFLY0_LU */ a1r_load = *(p + 2); /* BFLY0_LU */ a1i_load = *(p + 3); /* BFLY0_LU */ p += m; b0r_load = *(p + 0); /* BFLY1_LU */ b0i_load = *(p + 1); /* BFLY1_LU */ b1r_load = *(p + 2); /* BFLY1_LU */ b1i_load = *(p + 3); /* BFLY1_LU */ p += m; r0r = a0r + b0r; /* BFLY0 */ r0i = a0i + b0i; /* BFLY1 */ t0r = a0r - b0r; t0i = a0i - b0i; /* BFLY0, BFLY1 */ s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round); /* BFLY0 */ s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round); /* BFLY1 */ r1r = a1r + b1r; /* BFLY2 */ r1i = a1i + b1i; /* BFLY3 */ t1r = a1r - b1r; t1i = a1i - b1i; /* BFLY2, BFLY3 */ s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round); /* BFLY2 */ s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round); /* BFLY3 */ a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load; /* BFLY3 */ b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load; /* BFLY3 */ while (q < &data[n << 1]) { /* Inner loop */ a0r_load = *(p + 0); /* BFLY0_LU */ a0i_load = *(p + 1); /* BFLY0_LU */ a1r_load = *(p + 2); /* BFLY0_LU */ a1i_load = *(p + 3); /* BFLY0_LU */ p += m; /* BFLY0_LU */ b0r_load = *(p + 0); /* BFLY1_LU */ b0i_load = *(p + 1); /* BFLY1_LU */ b1r_load = *(p + 2); /* BFLY1_LU */ b1i_load = *(p + 3); /* BFLY1_LU */ p += m; /* BFLY1_LU */ r0r_store = r0r; s0r_store = s0r; r1r_store = r1r; s1r_store = s1r; /* BFLY0_LU */ r0i_store = r0i; s0i_store = s0i; /* BFLY1_LU */ r1i_store = r1i; s1i_store = s1i; /* BFLY3_SU */ r0r = a0r + b0r; /* BFLY0_LU */ r0i = a0i + b0i; /* BFLY1_LU */ t0r = a0r - b0r; t0i = a0i - b0i; /* BFLY0_LU, BFLY1_LU */ s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round); /* BFLY0_LU */ s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round); /* BFLY1_LU */ 12

    r1r = a1r + b1r;  /* BFLY2_SU */
    r1i = a1i + b1i;  /* BFLY3_SU */
    t1r = a1r - b1r; t1i = a1i - b1i;  /* BFLY2_SU, BFLY3_SU */
    s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round);  /* BFLY2_SU */
    s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round);  /* BFLY3_SU */
    a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load;  /* BFLY3_SU */
    b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load;  /* BFLY3_SU */
    *(q + 0) = r0r_store;  /* BFLY2_SU */
    *(q + 1) = r0i_store;  /* BFLY2_SU */
    *(q + 2) = r1r_store;  /* BFLY2_SU */
    *(q + 3) = r1i_store;  /* BFLY2_SU */
    q += m;                /* BFLY2_SU */
    *(q + 0) = s0r_store;  /* BFLY3_SU */
    *(q + 1) = s0i_store;  /* BFLY3_SU */
    *(q + 2) = s1r_store;  /* BFLY3_SU */
    *(q + 3) = s1i_store;  /* BFLY3_SU */
    q += m;                /* BFLY3_SU */
}
m >>= 1;

/* Final trip through Middle Loop (m=2) */
p = data;
q = p;
c0r = *(t++);  /* LDC_LU */
c0i = *(t++);  /* LDC_LU */
c1r = *(t++);  /* LDC_LU */
c1i = *(t++);  /* LDC_LU */

/* While loop below has been peeled so that the first 2 trips through
 * the loop are simplified. */

/* The first peeled trip through the loop is simplified so that
 * it doesn't perform arithmetic or store results */
a0r_load = *(p + 0);  /* BFLY0_LU_SWAP */
a0i_load = *(p + 1);  /* BFLY0_LU_SWAP */
b0r_load = *(p + 2);  /* BFLY0_LU_SWAP */
b0i_load = *(p + 3);  /* BFLY0_LU_SWAP */
p += 4;               /* BFLY0_LU_SWAP */
a1r_load = *(p + 0);  /* BFLY1_LU_SWAP */
a1i_load = *(p + 1);  /* BFLY1_LU_SWAP */
b1r_load = *(p + 2);  /* BFLY1_LU_SWAP */
b1i_load = *(p + 3);  /* BFLY1_LU_SWAP */
p += 4;               /* BFLY1_LU_SWAP */
a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load;  /* OPLCH */
b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load;  /* OPLCH */

/* The second peeled trip through the loop is simplified so that
 * it doesn't store results */
a0r_load = *(p + 0);  /* BFLY0_LU_SWAP */
a0i_load = *(p + 1);  /* BFLY0_LU_SWAP */
b0r_load = *(p + 2);  /* BFLY0_LU_SWAP */
b0i_load = *(p + 3);  /* BFLY0_LU_SWAP */
p += 4;               /* BFLY0_LU_SWAP */
a1r_load = *(p + 0);  /* BFLY1_LU_SWAP */
a1i_load = *(p + 1);  /* BFLY1_LU_SWAP */
b1r_load = *(p + 2);  /* BFLY1_LU_SWAP */
b1i_load = *(p + 3);  /* BFLY1_LU_SWAP */
p += 4;               /* BFLY1_LU_SWAP */
r0r = a0r + b0r;  /* BFLY0_LU_SWAP */
r0i = a0i + b0i;  /* BFLY1_LU_SWAP */
t0r = a0r - b0r; t0i = a0i - b0i;  /* BFLY0_LU_SWAP, BFLY1_LU_SWAP */
s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round);  /* BFLY0_LU_SWAP */
s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round);  /* BFLY1_LU_SWAP */
r1r = a1r + b1r;  /* BFLY2_LU_SWAP */
r1i = a1i + b1i;  /* BFLY3_LU_SWAP */
t1r = a1r - b1r; t1i = a1i - b1i;  /* BFLY2_LU_SWAP, BFLY3_LU_SWAP */
s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round);  /* BFLY2_LU_SWAP */
s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round);  /* BFLY3_LU_SWAP */
a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load;  /* BFLY3_LU_SWAP */
b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load;  /* BFLY3_LU_SWAP */

while (q < &data[n << 1]) {  /* Inner loop */
    a0r_load = *(p + 0);  /* BFLY0_LU_SWAP */
    a0i_load = *(p + 1);  /* BFLY0_LU_SWAP */
    b0r_load = *(p + 2);  /* BFLY0_LU_SWAP */
    b0i_load = *(p + 3);  /* BFLY0_LU_SWAP */
    p += 4;               /* BFLY0_LU_SWAP */
    a1r_load = *(p + 0);  /* BFLY1_LU_SWAP */
    a1i_load = *(p + 1);  /* BFLY1_LU_SWAP */
    b1r_load = *(p + 2);  /* BFLY1_LU_SWAP */
    b1i_load = *(p + 3);  /* BFLY1_LU_SWAP */
    p += 4;               /* BFLY1_LU_SWAP */
    r0r_store = r0r; s0r_store = s0r; r1r_store = r1r; s1r_store = s1r;  /* BFLY0_LU_SWAP */
    r0i_store = r0i; s0i_store = s0i;  /* BFLY1_LU_SWAP */
    r1i_store = r1i; s1i_store = s1i;  /* BFLY3_SU_SWAP */
    r0r = a0r + b0r;  /* BFLY0_LU_SWAP */
    r0i = a0i + b0i;  /* BFLY1_LU_SWAP */
    t0r = a0r - b0r; t0i = a0i - b0i;  /* BFLY0_LU_SWAP, BFLY1_LU_SWAP */
    s0r = DESCALE(t0r * c0r, round) - DESCALE(t0i * c0i, round);  /* BFLY0_LU_SWAP */
    s0i = DESCALE(t0r * c0i, round) + DESCALE(t0i * c0r, round);  /* BFLY1_LU_SWAP */
    r1r = a1r + b1r;  /* BFLY2_SU_SWAP */
    r1i = a1i + b1i;  /* BFLY3_SU_SWAP */
    t1r = a1r - b1r; t1i = a1i - b1i;  /* BFLY2_SU_SWAP, BFLY3_SU_SWAP */

    s1r = DESCALE(t1r * c1r, round) - DESCALE(t1i * c1i, round);  /* BFLY2_SU_SWAP */
    s1i = DESCALE(t1r * c1i, round) + DESCALE(t1i * c1r, round);  /* BFLY3_SU_SWAP */
    a0r = a0r_load; a0i = a0i_load; a1r = a1r_load; a1i = a1i_load;  /* BFLY3_SU_SWAP */
    b0r = b0r_load; b0i = b0i_load; b1r = b1r_load; b1i = b1i_load;  /* BFLY3_SU_SWAP */
    *(q + 0) = r0r_store;  /* BFLY2_SU_SWAP */
    *(q + 1) = r0i_store;  /* BFLY2_SU_SWAP */
    *(q + 2) = s0r_store;  /* BFLY2_SU_SWAP */
    *(q + 3) = s0i_store;  /* BFLY2_SU_SWAP */
    q += 4;                /* BFLY2_SU_SWAP */
    *(q + 0) = r1r_store;  /* BFLY3_SU_SWAP */
    *(q + 1) = r1i_store;  /* BFLY3_SU_SWAP */
    *(q + 2) = s1r_store;  /* BFLY3_SU_SWAP */
    *(q + 3) = s1i_store;  /* BFLY3_SU_SWAP */
    q += 4;                /* BFLY3_SU_SWAP */
}

FIGURE 7. SHOWING DIRECT CORRESPONDENCE OF C CODE TO TIE LANGUAGE INSTRUCTIONS

Figure 8 shows the TIE instructions in order of execution, as a pipeline of loads, arithmetic instructions, and stores. Each of the four main TIE instructions that form the core butterfly, in the bracket labeled LOOP, performs a load or a store concurrently with the computation of both an r and an s result.

To understand the tasks in each TIE instruction, refer to the LOOP label in Figure 8. BFLY0.LU and BFLY1.LU load new values for a butterfly computation in two consecutive processor cycles. All four instructions compute in successive cycles; BFLY0.LU and BFLY1.LU perform their computation concurrently with their loads. BFLY0.LU and BFLY1.LU also move previously computed butterfly results into the various store buffers. In the fourth cycle of the LOOP, the BFLY3.SU instruction latches the previously loaded load buffers in preparation for computing the next set of results.

Achieves about one complex butterfly computation per cycle

  BFLY0.LU   Load buffers a0, a1.
  BFLY1.LU   Load buffers b0, b1.
  (STALL)
  OPLCH      Latch load buffers to a0r, a0i, b0r, b0i, a1r, a1i, b1r, b1i.
  BFLY0.LU   Load buffers a0, a1. Compute r0r, s0r.
             Move r0r -> r0r_st, r1r -> r1r_st, s0r -> s0r_st, s1r -> s1r_st.
  BFLY1.LU   Load buffers b0, b1. Compute r0i, s0i.
             Move r0i -> r0i_st, s0i -> s0i_st.
  BFLY2      Compute r1r, s1r.
  BFLY3      Latch load buffers to a0r, a0i, b0r, b0i, a1r, a1i, b1r, b1i.
             Compute, move r1i -> r1i_st, s1i -> s1i_st.

LOOP:
  BFLY0.LU   Load buffers a0, a1. Compute r0r, s0r.
             Move r0r -> r0r_st, r1r -> r1r_st, s0r -> s0r_st, s1r -> s1r_st.
  BFLY1.LU   Load buffers b0, b1. Compute r0i, s0i.
             Move r0i -> r0i_st, s0i -> s0i_st.
  BFLY2.SU   Compute r1r, s1r.
             Store r0r_st, r0i_st, r1r_st, r1i_st.
  BFLY3.SU   Latch load buffers to a0r, a0i, b0r, b0i, a1r, a1i, b1r, b1i.
             Compute, move r1i -> r1i_st, s1i -> s1i_st.
             Store s0r_st, s0i_st, s1r_st, s1i_st.

FIGURE 8. TIE LANGUAGE INSTRUCTIONS FOR THE FFT ALGORITHM

The first seven stages, before the LOOP section of Figure 8, show the two peeled trips through the while() loops in the C code shown above. The first BFLY0.LU and BFLY1.LU instructions compute meaningless results, so these computations are not shown. OPLCH is a TIE instruction that performs the latching of BFLY3.SU without the store to memory; it latches only the values needed in the next four processor cycles. In those next four cycles, BFLY0.LU and BFLY1.LU are executed again, followed by variants of the other TIE instructions (BFLY2, BFLY3) that do not perform stores, as there is no meaningful data yet to be stored.

Note that the way the operations are scheduled affects the number of functional units (e.g., multipliers) and buffers needed. Scheduling computations and deciding on hardware resources is better described as an art and cannot easily be reduced to a procedure.
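The overlap of load, latch, compute, and store described above, including the two peeled trips, can be illustrated in portable C without any TIE instructions. The following is a minimal sketch, not the application note's code. The function names `sumdiff_plain` and `sumdiff_pipelined` are hypothetical, each "butterfly" is reduced to the sum and difference of two adjacent integers with no twiddle multiply, and the pipeline is drained in an explicit epilogue (a choice made for this sketch) so that it never reads past the end of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: load, compute, and store within a single trip. */
void sumdiff_plain(int *data, size_t pairs)
{
    for (size_t k = 0; k < pairs; k++) {
        int a = data[2 * k], b = data[2 * k + 1];
        data[2 * k]     = a + b;
        data[2 * k + 1] = a - b;
    }
}

/* Pipelined version: in each steady-state trip the load is two pairs
 * ahead of the store and one pair ahead of the computation. */
void sumdiff_pipelined(int *data, size_t pairs)
{
    int *p = data;                 /* load pointer  */
    int *q = data;                 /* store pointer */
    int *end = data + 2 * pairs;
    int a_load, b_load, a, b, r, s, r_store, s_store;

    assert(pairs >= 2);            /* two peeled trips need two pairs */

    /* Peeled trip 1: load and latch only. */
    a_load = p[0]; b_load = p[1]; p += 2;
    a = a_load; b = b_load;

    /* Peeled trip 2: load and compute; nothing to store yet. */
    a_load = p[0]; b_load = p[1]; p += 2;
    r = a + b; s = a - b;
    a = a_load; b = b_load;

    /* Steady state. */
    while (p < end) {
        a_load = p[0]; b_load = p[1]; p += 2;    /* load pair k+2      */
        r_store = r; s_store = s;                /* results of pair k  */
        r = a + b; s = a - b;                    /* compute pair k+1   */
        a = a_load; b = b_load;                  /* latch pair k+2     */
        q[0] = r_store; q[1] = s_store; q += 2;  /* store pair k       */
    }

    /* Epilogue: drain the two pairs still in flight. */
    q[0] = r;     q[1] = s;     q += 2;
    q[0] = a + b; q[1] = a - b;
}
```

Both functions leave the same values in the array. The pipelined form changes only when each load, computation, and store is issued, which is what allows a single instruction to pair a load or a store with a computation.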

4 Complete C Implementation with TIE

Finally, the C code in Appendix A (the function r2_c_tie_fft) incorporates all of the TIE instructions used. The complete TIE description is also given in Appendix A. Additional instructions, such as BFLY0_LU_swap, are necessary because the final 2-point FFTs operate on pairs of values contained within a single 128-bit load. In the prior FFT stages, the two inputs of each butterfly come from separate loads; in the final stage, both inputs come from the same load.

5 Performance

For N-point FFTs of length N = 128, 256, 512, and 1024, four different implementations are compared:

a. the C implementation compiled for a base Xtensa processor using software multiplies and -O3 compiler optimization, without use of any TIE or additional hardware functional units;
b. the C implementation compiled using the -O3 optimization for a base Xtensa processor with the MUL32 functional unit (and no TIE extensions in use);
c. the C implementation compiled using the -O3 optimization for an Xtensa processor with the TIE instructions developed for this FFT;
d. a hand-coded assembly language routine, structured differently from the C code implementations, assembled for an Xtensa processor using the TIE instructions developed for this FFT.

Table 2 shows the resulting code size and cycle counts of each implementation for the different FFT input lengths. It also shows the performance improvement factor of (d), the hand-optimized assembly implementation, over (b), the C code compiled for the Xtensa processor including the MUL32 unit. Note that the code size given for (a), the C implementation using software multiplies, does not include the multiplication libraries required to execute the program. The code size of (b) is reduced to 430 bytes because a 32-bit hardware multiplier is used.

TABLE 2. CODE SIZE AND PERFORMANCE RESULTS

                        a               b          c         d            E = b/d
                        C (with         C (using   C (with   Assembly     Performance
                        software        MUL32)     TIE)      (with TIE)   Improvement
                        multiplies)
Code Size (Bytes)       433+libraries   430
Performance (cycles)
  128-point
  256-point
  512-point
  1024-point
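The difference in operand pairing that Section 4 gives as the reason for the swap instructions can be sketched in plain C. This is only an illustration under stated assumptions: the type `vec128` and the functions `pair_across_loads` and `pair_within_load` are hypothetical names that model one 128-bit load as four 32-bit lanes, and the twiddle multiply is left out of both functions (the final-stage twiddle factor is 1).

```c
#include <assert.h>

/* One 128-bit load modeled as four 32-bit lanes holding two complex
 * values: lane[0] = re0, lane[1] = im0, lane[2] = re1, lane[3] = im1. */
typedef struct { int lane[4]; } vec128;

/* Earlier stages: each complex value in `lo` is paired with the value in
 * the same position of `hi`, which comes from a separate load. */
void pair_across_loads(const vec128 *lo, const vec128 *hi,
                       vec128 *sum, vec128 *diff)
{
    assert(lo && hi && sum && diff);
    for (int i = 0; i < 4; i++) {
        int a = lo->lane[i], b = hi->lane[i];
        sum->lane[i]  = a + b;
        diff->lane[i] = a - b;
    }
}

/* Final stage: the two partners sit side by side in one load, so the
 * result holds the sum in lanes 0..1 and the difference in lanes 2..3. */
void pair_within_load(const vec128 *v, vec128 *out)
{
    assert(v && out);
    int ar = v->lane[0], ai = v->lane[1];
    int br = v->lane[2], bi = v->lane[3];
    out->lane[0] = ar + br;
    out->lane[1] = ai + bi;
    out->lane[2] = ar - br;
    out->lane[3] = ai - bi;
}
```

In the first function every lane is combined with the matching lane of another vector. In the second, lanes are combined within one vector, which is why the final stage needs its own load and store variants.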

6 Conclusion

Algorithms such as the FFT can be implemented using TIE instructions for the Xtensa processor. This allows for efficient execution using hardware optimizations encapsulated in instruction extensions, and for efficient software development using those instructions. This application note describes an efficient implementation of an N-point decimation-in-frequency FFT for Xtensa processors. The code found in Appendix A can be compiled and run to demonstrate the performance claimed.

IMPLEMENTATION NOTE: To build an Xtensa processor configuration to simulate or synthesize the FFT implementation, there are two requirements. First, as a fundamental requirement, the configuration must use a 128-bit wide Processor Interface (PIF). Second, the TIE and software included in this implementation support a Little Endian configuration. To use a Big Endian configuration, the TIE and code can be adjusted accordingly.

7 Glossary of Terms

Butterfly
The term used to describe the computation that comprises the inner loop of a radix-k FFT algorithm that implements an N-point Discrete Fourier Transform. In the radix-2 implementation, the butterfly computation sums input points N/2 samples apart to obtain one output; the second output is obtained by differencing the inputs and multiplying by a twiddle factor. The term was derived from the shape of the pictorial representation.

[Butterfly diagram: inputs a and b; outputs a + b and c × (a − b), where c is the twiddle factor.]

Decimation-in-frequency
Of the two main types of FFT implementations of the DFT that recursively divide the problem in half, this method decimates the output (frequency-domain) sequence to simplify the DFT computation.

DFT
The Discrete Fourier Transform, a member of the Fourier Transform family. The DFT operates on discrete-time input and produces discrete frequency samples as output. For a common definition, see the equation in Figure 1.

MUL32
The Xtensa processor option for 32-bit fixed-point multiplication, integrated into the CPU and Xtensa software toolset as a set of optional instructions.

N
The number of input time-domain samples used in computing the FFT; also the number of resulting output frequency-domain samples.

Radix-2
The method that allows expression of an N-point Discrete Fourier Transform as two N/2-point Discrete Fourier Transforms.

TIE (Tensilica Instruction Extension) Language
The language that defines custom instructions that are added to the basic instruction set of an Xtensa processor.

Twiddle factor
In the Discrete Fourier Transform equation, these are the exponential factors that multiply the x[n] input sequence to compute the X[k] output sequence. For an N-point DFT there are $N/2 + N/4 + \cdots + 2$ twiddle factors, whose values are $W_m^n = e^{-j(2\pi/m)n}$, with $m = N, N/2, \ldots, 2$ and $n = 0, 1, \ldots, m/2$. Figure 2 shows how the twiddle factors are used in a DFT.
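The Butterfly, Radix-2, and Twiddle factor entries above fit together in the standard radix-2 decimation-in-frequency split, restated here as a worked equation. This is the textbook identity rather than a formula taken from this note. It assumes the forward-transform sign convention $W_N = e^{-j2\pi/N}$, which matches the negative angle used when the twiddle table is built in radix_2.c.

```latex
X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad W_N = e^{-j 2\pi / N}

X[2k]   = \sum_{n=0}^{N/2-1} \bigl( x[n] + x[n + N/2] \bigr)\, W_{N/2}^{nk}

X[2k+1] = \sum_{n=0}^{N/2-1} \bigl( x[n] - x[n + N/2] \bigr)\, W_N^{n}\, W_{N/2}^{nk},
\qquad k = 0, 1, \ldots, N/2 - 1
```

With a = x[n], b = x[n + N/2], and c = $W_N^n$, the two summands are a + b and c × (a − b), which are the two butterfly outputs. Each sum is itself an N/2-point DFT, so the same split is applied again at every stage.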

Appendix A Complete Application Code

This appendix includes the code described in this application note: radix_2.c, radix_2_hand_code.s, bfly.tie, and util.h.

radix_2.c

/*
 * FFT implementation for Xtensa Processors
 * Radix-2 Decimation-in-frequency
 *
 * Copyright (c) Tensilica Inc.
 *
 * These coded instructions, statements, and computer programs are
 * Confidential Proprietary Information of Tensilica Inc. and may not be
 * disclosed to third parties or copied in any form, in whole or in part,
 * without the prior written consent of Tensilica Inc.
 */

#include <malloc.h>
#include <math.h>
#include <assert.h>
#include <sys/types.h>
#include <sys/times.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "util.h"

/* Add the "#include" statement below to include the
 * C stub simulation of the TIE instructions.
 * Use this when confirming TIE instruction functionality
 * in a native environment.
 *
 * #include "tdk/cstub-bfly.c"
 */

int **dif_twiddles_2;
int *dif_first_twiddle;

#define DEFAULT_BPT 16
static unsigned int BPT = DEFAULT_BPT;
#define FIX(x, bpt) ((int) ((x) * (1 << (bpt)) + 0.5))
const double pi = 3.14159265358979323846;

extern void r2_tie_fft(int *data, int n, int *first_twiddle);

/* Initialize Twiddle factors
 *
 * n is twice the number of complex numbers in the input.
 */
void init_r2_twiddles(int n, int log_2_n)
{
    int td, i;

    dif_twiddles_2 = (int **) malloc((log_2_n - 1) * sizeof(int *));
    dif_twiddles_2[0] = (int *) malloc((n + 4) * sizeof(int) - 1);

    /* Align the array by force */
    if ((((unsigned) dif_twiddles_2[0]) & 0xf) != 0) {
        *((unsigned *) &dif_twiddles_2[0]) += 0x10;
        *((unsigned *) &dif_twiddles_2[0]) &= ~0xf;
    }

    n >>= 1;
    for (i = 0; i < n; i += 2) {
        double angle = -(pi * i) / n;

        dif_twiddles_2[0][i + 0] = FIX(cos(angle), BPT);
        dif_twiddles_2[0][i + 1] = FIX(sin(angle), BPT);
    }

    n >>= 1;
    /* Now n is the number of (data, not twiddle) complex numbers
       at this subproblem level. */
    for (td = 1; td < log_2_n - 2; td += 1, n >>= 1) {
        dif_twiddles_2[td] = dif_twiddles_2[td - 1] + (n << 1);
        for (i = 0; i < n; i += 2) {
            dif_twiddles_2[td][i + 0] = dif_twiddles_2[td - 1][2 * i + 0];
            dif_twiddles_2[td][i + 1] = dif_twiddles_2[td - 1][2 * i + 1];
        }
    }

    /* Last level is special; we duplicate the twiddle factor 1
       because of the SIMD load. */
    dif_twiddles_2[td] = dif_twiddles_2[td - 1] + (n << 1);
    dif_twiddles_2[td][0] = dif_twiddles_2[td][2] = dif_twiddles_2[td - 1][0];
    dif_twiddles_2[td][1] = dif_twiddles_2[td][3] = dif_twiddles_2[td - 1][1];

    dif_first_twiddle = dif_twiddles_2[0];
}

/* Sample bit reversal computation for the
 * bit_reverse_complex_array() routine below */
int bit_reverse(int i, int p)
{
    int a = 0;
    int j;

    for (j = 0; j < p; ++j) {
        a = (a << 1) + (i & 0x01);
        i = i >> 1;
    }
    assert(i == 0);
    return a;
}

/* Sample bit reversal routine to rearrange elements of an array 'a'
 * It must be the case that n == 2^p. n is the number of
 * (real, imaginary) complex pairs in the array a. */
void bit_reverse_complex_array(int a[], int n, int p)
{
    int i, j;
    int t;

    for (i = 0; i < n; ++i) {
        j = bit_reverse(i, p);
        if (j > i) {
            /* real part */
            t = a[2 * i + 0];
            a[2 * i + 0] = a[2 * j + 0];
            a[2 * j + 0] = t;
            /* imaginary part */
            t = a[2 * i + 1];
            a[2 * i + 1] = a[2 * j + 1];

            a[2 * j + 1] = t;
        }
    }
}

static void usage(const char *const pname)
{
    printf("usage: %s [-p <bits_of_precision>] infile\n", pname);
    exit(-1);
}

#define DESCALE(x, r) ((int) (((x) + (r)) >> 16))
const int round = (1 << 15);

/* C routine, unoptimized */
void r2_fft(int *data, int n, int *twiddles)
{
    register int i;
    register int l = n;
    register int *t = twiddles;

    while (l > 1) {
        for (i = 0; i < l; i += 2) {
            register int *p = data + i;
            register int *q = p;
            register int wr = *(t++);
            register int wi = *(t++);

            while (p < &data[n << 1]) {
                register int d0r, d0i;
                register int d1r, d1i;
                register int r0r, r0i;
                register int r1r, r1i;
                register long long tr, ti;

                d0r = *(p + 0);
                d0i = *(p + 1);
                p += l;
                d1r = *(p + 0);
                d1i = *(p + 1);
                p += l;
                r0r = d0r + d1r;
                r0i = d0i + d1i;
                tr = d0r - d1r;
                ti = d0i - d1i;
                r1r = DESCALE(tr * wr, round) - DESCALE(ti * wi, round);
                r1i = DESCALE(tr * wi, round) + DESCALE(ti * wr, round);
                *(q + 0) = r0r;
                *(q + 1) = r0i;
                q += l;
                *(q + 0) = r1r;
                *(q + 1) = r1i;
                q += l;
            }
        }
        l >>= 1;
    }
}
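A small worked case can make the calling convention of the unoptimized routine concrete. The following is a hedged sketch, not part of radix_2.c. `r2_fft_sketch` is a compact restatement of r2_fft so that the example compiles on its own, `fft4_example` and `q16_round` are hypothetical names, and the Q16 twiddle table is written out by hand for N = 4 in the order the stages consume it (N/2 factors for the first stage, then N/4, and so on). It also assumes that right-shifting a negative value is an arithmetic shift, as it is on common compilers.

```c
#include <assert.h>

#define DESCALE(x, r) ((int) (((x) + (r)) >> 16))
static const int q16_round = 1 << 15;  /* plays the role of `round` above */

/* Compact restatement of r2_fft: n complex points stored as 2*n ints,
 * twiddles in Q16, results left in place in bit-reversed order. */
void r2_fft_sketch(int *data, int n, const int *twiddles)
{
    const int *t = twiddles;

    assert(n >= 2 && (n & (n - 1)) == 0);  /* n must be a power of two */
    for (int l = n; l > 1; l >>= 1) {
        for (int i = 0; i < l; i += 2) {
            int wr = *t++, wi = *t++;
            for (int *p = data + i; p < data + 2 * n; p += 2 * l) {
                int *d0 = p, *d1 = p + l;
                long long tr = (long long) d0[0] - d1[0];
                long long ti = (long long) d0[1] - d1[1];
                d0[0] += d1[0];
                d0[1] += d1[1];
                d1[0] = DESCALE(tr * wr, q16_round) - DESCALE(ti * wi, q16_round);
                d1[1] = DESCALE(tr * wi, q16_round) + DESCALE(ti * wr, q16_round);
            }
        }
    }
}

/* 4-point transform of the interleaved (re, im) values in data[0..7]. */
void fft4_example(int data[8])
{
    static const int twiddles[] = {
        65536, 0,    /* first stage:  W_4^0 = 1  */
        0, -65536,   /* first stage:  W_4^1 = -j */
        65536, 0     /* second stage: W_2^0 = 1  */
    };
    r2_fft_sketch(data, 4, twiddles);
}
```

For the real input 1, 2, 3, 4 the DFT is 10, -2+2j, -2, -2-2j. The routine leaves these in bit-reversed order (X[0], X[2], X[1], X[3]), which is why a pass such as bit_reverse_complex_array() is needed to restore natural order.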


More information

Team 1. Common Questions to all Teams. Team 2. Team 3. CO200-Computer Organization and Architecture - Assignment One

Team 1. Common Questions to all Teams. Team 2. Team 3. CO200-Computer Organization and Architecture - Assignment One CO200-Computer Organization and Architecture - Assignment One Note: A team may contain not more than 2 members. Format the assignment solutions in a L A TEX document. E-mail the assignment solutions PDF

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION

MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION MULTIPLIERLESS HIGH PERFORMANCE FFT COMPUTATION Maheshwari.U 1, Josephine Sugan Priya. 2, 1 PG Student, Dept Of Communication Systems Engg, Idhaya Engg. College For Women, 2 Asst Prof, Dept Of Communication

More information
