
Effective Implementations of Multi-Dimensional Radix-2 FFT

Susumu Yamamoto

Department of Applied Physics, University of Tokyo, Hongo, Bunkyo-ku, Tokyo, JAPAN

Abstract

The Fast Fourier Transform (FFT) is well known as a fast method for the Discrete Fourier Transform (DFT): its calculation time is proportional to N log N, where N is the system size. Unfortunately, an inappropriate implementation that takes no care of the structure of the target machine can increase the proportionality coefficient by a factor of 10. We propose effective implementations of the multi-dimensional radix-2 FFT for the recent RISC workstation and for the vector-type supercomputer, respectively.

Key words: FFT, multi-dimension, radix-2, RISC, vector supercomputer
PACS: 02.70.-c Computational techniques

Preprint submitted to Elsevier Preprint, 8 June 1999

1 Introduction

By using the Fast Fourier Transform (FFT) [1] algorithm, we can compute the Discrete Fourier Transform (DFT) in a time theoretically proportional to N log N, where N is the system size. In practice, however, the proportionality coefficient depends strongly on the implementation, and sometimes even on the system size N. For example, a recent RISC workstation has a fast, small cache memory and a slow, large main memory, so the performance of a small calculation is definitely better than that of a large one. When one intends to do a multi-dimensional FFT, the system size grows very rapidly with the length scale, and one suffers from long calculation times. Our motivation is to find out how massive multi-dimensional FFTs can be done effectively.

Because there are so many detailed articles on the FFT, we begin by giving just a few definitions and explanations specific to the multi-dimensional radix-2 FFT. Suppose the data to transform are stored in the array X, whose dimensions are as follows (Fortran-like representation):

      DIMENSION X(0:N1-1, 0:N2-1, ..., 0:Nm-1)

where each N_j is a power of 2, j = 1, 2, ..., m. Next we define the DFT of X as

\tilde X(k_1, k_2, \dots, k_m) = \sum_{n_1, n_2, \dots, n_m} \omega_{N_1}^{k_1 n_1} \, \omega_{N_2}^{k_2 n_2} \cdots \omega_{N_m}^{k_m n_m} \, X(n_1, n_2, \dots, n_m),   (1)

where

\omega_{N_j} \overset{\mathrm{def}}{=} \exp\!\left(-\frac{2\pi i}{N_j}\right).   (2)

This operation is easily decomposed as follows: define N = N_1 N_2 \cdots N_m and Q_j = N / N_j, and

i) do a 1-dimensional FFT on the j-th index n_j for each of the Q_j combinations (n_1, ..., n_{j-1}, n_{j+1}, ..., n_m);
ii) do operation i) for each j in {1, ..., m}.

So one important point for fast calculation is an effective implementation of the 1-dimensional FFT on multiple data sets.
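To make the decomposition concrete, the sketch below spells out the 2-dimensional case (m = 2). The routine fft1d, assumed to perform one length-n complex FFT on the elements data(1), data(1+stride), ..., is a hypothetical helper for illustration, not a routine from the present work.

      subroutine fft2d(X, N1, N2)
c     2-D DFT via the decomposition i)-ii): 1-D FFTs along each index.
c     fft1d(data, n, stride) is a hypothetical 1-D complex FFT acting
c     on data(1), data(1+stride), ..., data(1+(n-1)*stride).
      integer N1, N2, i1, i2
      complex X(0:N1-1, 0:N2-1)
c     step i) for j = 1: transform the first index for every n_2
      do i2 = 0, N2-1
         call fft1d(X(0,i2), N1, 1)
      end do
c     step i) for j = 2: transform the second index for every n_1
      do i1 = 0, N1-1
         call fft1d(X(i1,0), N2, N1)
      end do
      end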

2 Strategy for Effective Calculation

2.1 On RISC Workstation

As written in the previous section, recent RISC workstations have a fast but small cache memory (primary, secondary and sometimes third level) and a slow but large main memory. So it is well known that locality of the operation is the key point of effective calculation.

The approach of the famous FFT library "FFTW" [2,3] to this problem is quite smart: recursive calls are employed to access main memory hierarchically. This technique is very effective when the total amount of data is not much greater than the cache size. But in the multi-dimensional case the data size grows so rapidly that the performance of FFTW on a 64x64x64 system is only 1/4 of its performance on a 32x32x32 system (see Figure 5).

Our strategy is based on a simpler idea, one that everyone might try at first: just copy the 1-dimensional data to a small work space, do the 1-dimensional FFT there, and put the result back into the original array. This ought to use the cache memory effectively, because 1-dimensional data is usually small enough to be stored in the cache; in other words, a 1-dimensional FFT is local enough compared with the size of the cache memory. Unfortunately, this simple idea fails in most cases because of incredibly slow memory copies between main memory and cache memory. We considered why the copying is so slow, and found a way to reduce the copying time.

To understand this slow-copy phenomenon, one must recall the architecture of recent RISC micro processing units (hereafter MPUs). They perform burst transfers in order to compensate for the latency of main memory access, so a non-contiguous block transfer takes a long time compared with a fast contiguous one. To illustrate the situation, suppose 2-D data is given as follows:

      DIMENSION A(M,N)

Fig. 1. Illustration of burst transfer in the worst case: the block size of a burst transfer is equivalent to the size of 4 data, and only the first datum in each block is required.

For the first dimension, the 1-dimensional data {A(i,j) | i = 1, ..., M} for any j are located in a contiguous block. But for the second dimension, the data {A(i,j) | j = 1, ..., N} for any i are aligned with stride M. Suppose the block size of a burst transfer (the size of a cache line) is equivalent to 4 data; then copying {A(i,j) | j = 1, ..., N} to the work space means that 3/4 of the transferred data will be discarded (Figure 1), so the effective transfer rate is degraded to 1/4 of the theoretical maximum. This is a fatal penalty.

In order to avoid this penalty, we have re-organized the loop structure as explained below and in Figure 2, here for the case of the second dimension:

1) Determine L, the size of the work space (hereafter we also call L the "window size"), so that L is not greater than the size of the cache memory.
2) Determine P, the number of sets of 1-dimensional data treated at once by the 1-dimensional FFT subroutine for multiple data sets. Given the size M of the 1-dimensional data, P = L/M.
3) Copy multiple 1-dimensional data sets to the work space and do multiple 1-dimensional FFTs as follows:

      do i=1, N, P
c        gather P rows into the work space W, contiguous per column
         do j=1, M
            do k=1, P
               W((j-1)*P+k) = A(i+k-1,j)
            end do
         end do
c        ( call the subroutine doing P 1-dimensional FFTs on W )
c        scatter the transformed data back to the original array
         do j=1, M
            do k=1, P
               A(i+k-1,j) = W((j-1)*P+k)
            end do
         end do
      end do

Following this method, one data set for a 1-dimensional FFT is no longer allocated in a contiguous region but is aligned with stride P. Instead, P sets of 1-dimensional data are stored in the cache entirely, so the MPU need not access the slow main memory until P 1-dimensional FFTs are complete. Because recent MPUs have only a limited number of general-purpose registers, the number of temporary variables must be kept correspondingly small; we have used up-to-16-point butterfly operations on multiple data sets.
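As a worked example of step 2), in the situation depicted in Figure 2 the 1-dimensional length is M = 2^7 = 128 and the window holds L = 2^16 complex data, so P = L/M = 2^9 = 512 one-dimensional transforms are processed per pass through the window.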

Fig. 2. Illustration of the cache scheme: the x, y and z axis indices together span the address bits from MSB (most significant bit) to LSB (least significant bit); an FFT size of 2^7 with multiplicity 2^9 is merged into one work space of 2^16 data, matching the cache size (2^16 data).

We did not use a self-sorting in-place algorithm such as GPFA [5-8], in spite of its good property that memory access is minimal. We once tried this sophisticated algorithm and found that it needs 2 N_p^2 temporary variables (the factor 2 is for complex calculation) to do an N_p-point butterfly operation, because GPFA exchanges the lowest log_2 N_p bits and the highest log_2 N_p bits of the array index during the N_p-point butterfly operation. Instead of introducing a self-sorting in-place algorithm, we perform the reversed-bit ordering at the same time as putting the data back from the work space to the original array. This implementation reduces the penalty of the reordering.
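As a rough illustration of folding the reordering into the copy-back (our sketch, not code from the paper), the second copy loop above can write each column to its bit-reversed position. The function bitrev(j, nbits), returning the nbits-bit reversal of j with nbits = log_2(M), is a hypothetical helper:

c     copy back from the work space in bit-reversed column order;
c     bitrev(j, nbits) is a hypothetical function returning the
c     nbits-bit reversal of j (nbits = log_2(M)).
      do j = 1, M
         jr = bitrev(j-1, nbits) + 1
         do k = 1, P
            A(i+k-1, jr) = W((j-1)*P+k)
         end do
      end do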

2.2 On Vector Type Super Computer

Most vector-type supercomputers are designed under a completely different policy from RISC workstations. A major difference is that supercomputers have a large number of sets of signal lines, called "banks", between the MPU and main memory. This structure can compensate for the latency of main memory access, especially for contiguous access; generally speaking, a supercomputer does not need cache memory thanks to its multi-bank structure. Another difference is the origin of the name "vector type": these machines have vector registers and pipelines for the vector registers.

The weak point of the multi-bank structure appears when the greatest common divisor of B and S does not equal 1, where B is the number of banks and S is the stride of the memory access. Memory accesses then concentrate on specific banks and the bandwidth decreases; this performance decrease is called "bank conflict". (For instance, with B = 256 banks and stride S = 128, only B / gcd(B,S) = 2 distinct banks are touched.)

There are two well-known strategies for the effective use of vector-type supercomputers. One is that the innermost loop count should be long enough to use the vector registers effectively. The other is to avoid allocating array variables whose sizes are powers of 2, because the number of banks equals a power of 2 on most supercomputers; allocating 2^n + 1 blocks instead of 2^n blocks, bank conflicts no longer occur. In the present implementation we have maximized the innermost loop count, but did not use the latter method.

In order to maximize the innermost loop count, we do the 1-dimensional FFT on the dimension whose index occupies the most significant bits (hereafter MSB) of the address space of the given array variable. Of course, we also transpose the data before doing the 1-dimensional FFT on the next dimension. For example, suppose the array X(i,j,k) (i, j, k in {1, 2, ..., 32}) is given and we have to do an FFT on X. We do the FFT on the last index k at the beginning; the innermost loop then runs over (i,j), so the innermost loop count is 32 x 32 = 1024, long enough for the vector pipelines. The next FFT will be done on the second index j, but if we do not reorganize the array variables, the total loop has a rather complex structure. So before doing the next FFT we transpose the array such that X(i,j,k) -> X(k,i,j).

In this strategy, transposing the array X might cause the fatal bank-conflict penalty, because the size of the array is a power of 2 and the number of banks is also a power of 2. But we found that a special array transposition algorithm resolves this problem: we reorganize the loop structure as in Figure 3, so that the stride of the data access never equals a power of 2. Therefore the greatest common divisor equals 1 and no bank conflict occurs.

Fig. 3. Illustration of the transposition of an array without bank conflict: SOURCE(P,Q), with P indexing (x,y) and Q indexing z, is transferred to DESTINATION(Q,P); the numbers in each cell describe the order of data transfer.
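The exact transfer order of Figure 3 is not recoverable from the text; as one concrete conflict-free pattern (our assumption, not necessarily the author's scheme), a diagonal traversal advances consecutive accesses by the odd stride N + 1, whose greatest common divisor with any power-of-2 bank count is 1:

c     Transpose SRC(N,N) into DST(N,N) along diagonals.  Consecutive
c     iterations advance the linear address of both arrays by N+1,
c     which is odd when N is a power of 2, so the accesses spread
c     evenly over a power-of-2 number of banks.
      do d = 0, N-1
         do i = 1, N
            j = 1 + mod(i-1+d, N)
            DST(j,i) = SRC(i,j)
         end do
      end do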

Typical supercomputers have multiple pipelines, so here too we have used up-to-16-point butterfly operations on multiple data sets. Moreover, we employ a self-sorting but not in-place algorithm. That is, we prepare two arrays of the same size, one for the original data and the other for the intermediate result, and we do the bit-reverse ordering at the same time as the butterfly operation. This point is similar to GPFA, but giving up the in-place property reduces the number of temporary variables and allows effective use of the vector registers. As mentioned in the previous subsection, GPFA needs 2 x 16^2 = 512 temporary variables to do 16-point butterfly operations, whereas our simple implementation needs far fewer.
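A self-sorting, out-of-place radix-2 FFT of this general kind is widely known as the Stockham autosort algorithm; the following minimal radix-2 sketch (our illustration under that assumption, not the author's tuned radix-16 code) alternates between the two arrays and needs only a few scalar temporaries per butterfly:

      subroutine stockham(n, x, y)
c     Self-sorting, out-of-place radix-2 FFT (Stockham autosort).
c     x holds the input and, on return, the transform in natural
c     order; y is a work array of the same size; n is a power of 2.
      implicit none
      integer n, l, s, p, q
      complex*16 x(0:n-1), y(0:n-1), a, b, w
      real*8 ang, pi
      parameter (pi = 3.141592653589793d0)
      l = n/2
      s = 1
      do while (l .ge. 1)
         do p = 0, l-1
            ang = -pi*dble(p)/dble(l)
            w = dcmplx(cos(ang), sin(ang))
            do q = 0, s-1
               a = x(q + s*p)
               b = x(q + s*(p+l))
               y(q + 2*s*p)     = a + b
               y(q + s*(2*p+1)) = (a - b)*w
            end do
         end do
c        the stage result becomes the next stage's input (a tuned
c        version would swap array roles instead of copying)
         do q = 0, n-1
            x(q) = y(q)
         end do
         l = l/2
         s = 2*s
      end do
      end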

3 Results and Discussions

Throughout the present paper we use "FLOPS" to mean "effective FLOPS", just as the authors of FFTW did: we count each butterfly operation as 10 FLOPs, although the actual averaged FLOP count is smaller than that.

3.1 On RISC Workstation

First of all, we must fix the "window size". Figure 4 clearly shows that the performance increases with the window size until the window size reaches 2^15, and decreases rapidly when the window size exceeds 2^16. This is a very reasonable result considering that the machine has a 2MB cache memory and a window size of 2^16 corresponds to 2MB of memory (notice 2^16 x (size of complex) = 2MB).

Figure 5 shows the 3D-FFT results in comparison with other algorithms. FFTW shows very good performance at each of the smaller sizes, but not at the greater ones. The present algorithm shows no performance degradation up to the biggest size.

Fig. 4. Window-size dependence of the 3D-FFT performance (MFLOPS against the window size in 16-byte units). Measurement is done on the workstation model J2240 of the Hewlett-Packard company (PA-8200, 236 MHz, 2MB/2MB instruction/data cache). Array size is 128x128x128.

Fig. 5. Size dependence of the 3-dimensional FFT speed (MFLOPS) for FFTW, GPFA and the present work (scalar version), for array sizes 4x4x4, 8x8x8, 16x16x16, 32x32x32, 64x64x64, 256x64x32, 16x1024x64 and 128x128x128. Measurement is done on the workstation model J2240 of the Hewlett-Packard company (PA-8200, 236 MHz, 2MB/2MB instruction/data cache).
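(For reference, assuming the standard radix-2 operation count, an N-point complex transform contains (N/2) log_2 N butterflies, so the effective rate plotted in these figures is 10 x (N/2) log_2 N / t = 5 N log_2 N / t for a measured run time t; this is our reading of the convention, matching FFTW's.)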

3.2 On Vector Type Super Computers

On vector-type supercomputers the performance shows almost no dependence on the system size, so we just show the results for an array of size 128x128x128. The results for two systems, a Hitachi S3800 and an NEC SX-4, are shown in Table 1.

In Table 1, MFFT means the library called "MFFT" [9,10]; since the original code is written for the CRAY compiler, all compiler directives were rewritten by the present author to match the compiler on each machine. FFTW is not designed for vector-type supercomputers, so it is a clear example of how ignorance of the target's structure causes unexpected results. ASL means the multiple 1-dimensional FFT routine from the library called "ASL", issued by NEC; ASL does not include multi-dimensional FFT code, so we attached our special transposition code (a part of the present work) to ASL's 1-dimensional FFT subroutine for multiple data. GPFA means the original GPFA library; here too, all CRAY compiler directives were rewritten by the present author to match the compiler on each machine. GPFA(+1) means GPFA with the special array allocation such that the leading dimension has size 2^n + 1. GPFA(TRNS) means GPFA with the special transposition code (a part of the present work).

NEC SX-4, size 128x128x128           Hitachi S3800, size 128x128x128
Algorithm                 MFLOPS     Algorithm                 MFLOPS
Present (vector version)  1:         Present (vector version)  3:
GPFA                      3:         GPFA                      3:
GPFA(+1)                  1:         GPFA(+1)                  2:
GPFA(TRNS)                1:         GPFA(TRNS)                2:
MFFT                      1:
FFTW                      2:
ASL                       1:

Table 1. Results of the calculation speed measurements on vector-type supercomputers.

Comparing GPFA and GPFA(TRNS), we find that the performance gain from the improved transposition routine reaches nearly a factor of 10. It is difficult to compare GPFA and the present algorithm directly, because GPFA can do mixed-radix FFTs and the latter cannot, and the result depends on the type of machine. But if one needs only a radix-2 FFT, the present algorithm has some advantage, at least on these machines.

4 Conclusion

On RISC computers, locality of the algorithm is the key point for effective calculation. The FFT has the property that each resultant datum needs global information from the original data; therefore it is hard to increase the locality of the algorithm. But this situation changes when one needs a multi-dimensional FFT, because a typical multi-dimensional FFT is a massive collection of small one-dimensional FFTs. We introduced a control parameter for the locality of the algorithm, named the "window size", and succeeded in reducing the penalty of large calculations.

On vector-type supercomputers, effective use of the vector pipelines and of the multi-bank structure are the key points for effective calculation. We constructed an effective code using up-to-16-point butterfly operations (i.e. a radix-16 algorithm) and a special matrix transposition routine for matrix sizes that are powers of 2.

Acknowledgements

The present author thanks M. Frigo and S. G. Johnson, who wrote "benchFFT" [4] (and also "FFTW"). "benchFFT" is very impressive and useful for comparing FFT algorithms. The present work has been done on the computer facilities of the Center for Promotion of Computational Science and Engineering of the Japan Atomic Energy Research Institute, the Institute of Molecular Science, and the Computer Centre of the University of Tokyo. The author thanks each organization for the use of its facility.

References

[1] J. W. Cooley and J. W. Tukey, An Algorithm for the Machine Calculation of Complex Fourier Series, Math. Comp. 19 (1965) 297-301.
[2] M. Frigo and S. G. Johnson, The Fastest Fourier Transform in the West, MIT-LCS-TR-728 (MIT technical report), 1997.
[3] The FFTW home page, http://www.fftw.org/
[4] The benchFFT home page, http://www.fftw.org/benchfft/
[5] C. Temperton, Self-sorting mixed-radix Fast Fourier Transforms, J. Comput. Phys. 52 (1983) 1-23.
[6] C. Temperton, Implementation of a Self-Sorting In-Place Prime Factor FFT Algorithm, J. Comput. Phys. 58 (1985).
[7] C. Temperton, Self-Sorting In-Place Fast Fourier Transforms, SIAM J. Sci. Stat. Comput. 12 (1991).
[8] C. Temperton, A Generalized Prime Factor FFT Algorithm for Any N = 2^p 3^q 5^r, SIAM J. Sci. Stat. Comput. 13 (1992).
[9] A. Nobile and V. Roberto, Efficient Implementation of Multidimensional Fast Fourier Transforms on a CRAY X-MP, Comput. Phys. Commun. 40 (1986).
[10] A. Nobile and V. Roberto, MFFT: A Package for Two- and Three-Dimensional Vectorized Discrete Fourier Transforms, Comput. Phys. Commun. 42 (1986).
