Effective Implementations of Multi-Dimensional Radix-2 FFT

Susumu Yamamoto
Department of Applied Physics, University of Tokyo, Hongo, Bunkyo-ku, Tokyo, JAPAN

Abstract

The Fast Fourier Transform (FFT) is well known as a fast method for the Discrete Fourier Transform (DFT); its calculation time is proportional to N log N, where N is the system size. Unfortunately, an inappropriate implementation that takes no care of the structure of the target machine can increase the proportionality coefficient by a factor of 10. We propose effective implementations of the multi-dimensional radix-2 FFT for the recent RISC workstation and the vector-type supercomputer, respectively.

Key words: FFT, Multi-Dimension, Radix-2, RISC, vector super computer
PACS: c Computational techniques

Preprint submitted to Elsevier Preprint, 8 June 1999
1 Introduction

By using the Fast Fourier Transform (FFT) [1] algorithm, we can compute the Discrete Fourier Transform (DFT) in a calculation time theoretically proportional to N log N, where N is the system size. In practice, however, the proportionality coefficient depends considerably on the implementation, and sometimes even on the system size N. For example, a recent RISC workstation has a fast small cache memory and a slow large main memory, so the performance of a small calculation is definitely better than that of a large one. When one intends to do a multi-dimensional FFT, the system size grows very rapidly with the length scale, so one suffers from long calculation times. Our motivation is to find out how massive multi-dimensional FFTs can be done effectively.

Because there are so many precise articles on the FFT, we begin with just a few definitions and explanations specific to the multi-dimensional radix-2 FFT. Suppose the data to transform is stored in the array X, dimensioned as follows (Fortran-like representation):

    DIMENSION X(0:N_1-1, 0:N_2-1, ..., 0:N_m-1)

where each N_j (j = 1, 2, ..., m) is a power of 2. Next we define the DFT of X as

    X^(k_1, k_2, ..., k_m) = Σ_{n_1,n_2,...,n_m} ω_{N_1}^{k_1 n_1} ω_{N_2}^{k_2 n_2} ... ω_{N_m}^{k_m n_m} X(n_1, n_2, ..., n_m),   (1)
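The definition above can be checked with a few lines of code. The following is our own illustrative sketch (not the paper's implementation): a naive O(N^2) evaluation of the one-dimensional case of Eq. (1), using the twiddle factor ω_N = exp(-2πi/N) of Eq. (2).

```python
import cmath

def dft(x):
    """Naive 1-D DFT following Eqs. (1)-(2):
    X[k] = sum_n w_N^(k*n) * x[n], with w_N = exp(-2*pi*i/N)."""
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(w ** (k * n) * x[n] for n in range(N)) for k in range(N)]
```

For a constant input of length 4 this returns (4, 0, 0, 0), as the definition predicts.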
where

    ω_{N_j} := exp(-2πi / N_j).   (2)

This operation is easily decomposed as follows. Define N = N_1 N_2 ... N_m and Q_j = N / N_j, and then:

i) Do a 1D-FFT on the j-th index n_j for each (n_1, ..., n_{j-1}, n_{j+1}, ..., n_m).
ii) Do operation i) for each j ∈ {1, ..., m}.

So, one important point for fast calculation is an effective implementation of the 1-dimensional FFT on multiple data sets.

2 Strategy for Effective Calculation

2.1 On RISC Workstation

As written in the previous section, recent RISC workstations have a fast but small cache memory (primary, secondary and, sometimes, third level) and a slow but large main memory. It is well known that locality of the operation is the key point of effective calculation.

The approach of the famous FFT library "FFTW" [2,3] to this problem is quite smart: recursive calls are employed to access main memory hierarchically. This technique is very effective when the total amount
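Steps i)-ii) can be sketched in executable form. The following is our own two-dimensional illustration (names are ours): 1-D DFTs along the second index, then along the first, reproduce the full double sum of Eq. (1).

```python
import cmath

def dft1(x):
    """Naive 1-D DFT with w_N = exp(-2*pi*i/N)."""
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(w ** (k * n) * x[n] for n in range(N)) for k in range(N)]

def dft2d(a):
    """2-D DFT via steps i)-ii): 1-D DFTs on index n2 for every n1,
    then 1-D DFTs on index n1 for every n2."""
    rows = [dft1(row) for row in a]                 # transform n2
    cols = [dft1(list(col)) for col in zip(*rows)]  # transform n1
    return [list(r) for r in zip(*cols)]            # restore index order
```

The result agrees with the direct evaluation of the double sum on small examples.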
of data is not much greater than the cache size. But in the multi-dimensional case, the size of the data grows so rapidly that the performance of FFTW on a 64x64x64 system is only 1/4 of its performance on a 32x32x32 system (see Figure 5).

Our strategy is based on a simpler idea, one that everyone might try at first. The basic idea is just to copy the 1-dimensional data to a small work space, do the 1-dimensional FFT there, and put the result back into the original array. This ought to use the cache memory effectively, because 1-dimensional data is usually small enough to be stored in cache; in other words, a 1-dimensional FFT is local enough compared with the size of the cache memory. Unfortunately, this simple idea does not succeed in most cases, because of the incredibly slow memory copy between main memory and cache memory. We considered why the copying is so slow and found a way to reduce the copying time.

To understand this slow-copy phenomenon, one must recall the architecture of recent RISC Micro Processing Units (hereafter MPUs). They perform burst transfers in order to compensate for the latency of main memory access, so a non-contiguous block transfer takes a long time in comparison with a fast contiguous block transfer. To illustrate the situation, suppose 2-D data is given as follows:

    DIMENSION A(M,N)
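The waste caused by strided access can be counted directly. This small demo is our own addition (the 4-word cache line matches Figure 1; the function name is ours): it counts how many cache lines (bursts) must be fetched to touch a given set of addresses.

```python
def cache_lines_touched(addresses, line_size=4):
    """Count distinct cache lines (burst transfers) needed to touch the
    given element addresses, with line_size elements per line."""
    return len({a // line_size for a in addresses})
```

For M = 8, copying along the first dimension touches 8 contiguous addresses and needs only 2 lines, while copying along the second dimension (stride M = 8) needs 8 lines for the same 8 data: each burst delivers 4 elements but only 1 is used, the 3/4 waste of Figure 1.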
Fig. 1. Illustration of burst transfer in the worst case: the block size of the burst transfer is equivalent to the size of 4 data, and only the first datum in each block is required.

For the first dimension, the 1-dimensional data {A(i,j) | i = 1,...,M} for any j are located in a contiguous block. But for the second dimension, the data {A(i,j) | j = 1,...,N} for any i are aligned with stride M. Suppose the block size of the burst transfer (the size of a cache line) is equivalent to 4 data; then copying {A(i,j) | j = 1,...,N} to the work space causes 3/4 of the transferred data to be discarded (Figure 1). This means the effective transfer rate is degraded to 1/4 of the theoretical maximum. This is a fatal penalty.

In order to avoid this penalty, we have re-organized the loop structure as explained below and in Figure 2, here for the case of the second dimension:

1) Determine L, the size of the work space (hereafter we also call L the "window size"), so that L is not greater than the size of the cache memory.
2) Determine P, the number of sets of 1-dimensional data treated at once by the 1-dimensional FFT subroutine for multiple data sets. Given the size M of the 1-dimensional data, P = L/M.
3) Copy multiple 1-dimensional data sets to the work space and do multiple 1-dimensional FFTs as follows:

      do i=1, N, P
        do j=1, M
          do k=1, P
            W((j-1)*P+k) = A(i+k-1,j)
          end do
        end do
        ( call subroutine for P 1-dimensional FFTs )
        do j=1, M
          do k=1, P
            A(i+k-1,j) = W((j-1)*P+k)
          end do
        end do
      end do

With this method, each data set for a 1-dimensional FFT is not allocated in a contiguous region, but is aligned with stride P. Instead, P sets of 1-dimensional data are stored entirely in cache, so the MPU need not access the slow main memory until P 1-dimensional FFTs are complete. Because recent MPUs have only a limited number of general-purpose registers, the desirable number of temporary variables is limited accordingly; so we have used up-to-16-point butterfly operations on multiple data sets.
Fig. 2. Illustration of the cache scheme: the FFT size (2^7) times the multiplicity (2^9) is merged into one work space of the cache size (2^16 data).

We did not use a self-sorting in-place algorithm such as GPFA [5-8], in spite of its good property that memory access is minimal. We once tried this sophisticated algorithm and found that it needs 2 N_p^2 temporary variables (the factor 2 is for complex arithmetic) to do an N_p-point butterfly operation, because GPFA exchanges the lowest log_2 N_p bits and the highest log_2 N_p bits of the array index during the N_p-point butterfly operation. Instead of introducing a self-sorting in-place algorithm, we do the reversed-bit ordering at the same time as putting the data back from the work space to the original array. This implementation reduces the penalty of the reordering.
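The idea of fusing the bit-reversal into the copy-back can be sketched as follows; this is our own illustration (function names are ours), not the paper's code. The FFT result sitting in the work space in bit-reversed order is scattered into the original array through the permuted index, so no separate reordering pass over main memory is needed.

```python
def bit_reverse(n, bits):
    """Reverse the low `bits` bits of n (e.g. 001 -> 100 for bits=3)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (n & 1)
        n >>= 1
    return r

def copy_back_bit_reversed(W, A, base, stride, M):
    """Scatter the length-M (M a power of 2) work-space result W into A,
    applying the bit-reversal permutation during the copy itself."""
    bits = M.bit_length() - 1
    for j in range(M):
        A[base + stride * bit_reverse(j, bits)] = W[j]
```

The permutation is its own inverse, so the same routine works in either direction.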
2.2 On Vector Type Super Computer

Most vector type super computers are designed under a completely different policy from RISC workstations. A major difference is that super computers have a large number of sets of signal lines, called "banks", between the MPU and main memory. This structure can compensate for the latency of main memory access, especially for contiguous access; generally speaking, a super computer does not need cache memory thanks to its multi-bank structure. Another difference is the origin of the name "vector type": they have vector registers and pipelines for the vector registers.

The weak point of the multi-bank structure appears when the greatest common divisor of B and S is not 1, where B is the number of banks and S is the stride of memory access. Memory accesses then concentrate on specific banks and the bandwidth decreases. This performance decrease is called "bank conflict".

There are two well-known strategies for effective use of vector type super computers. One is that the innermost loop count should be long enough to use the vector registers effectively. The other is to avoid allocating array variables whose sizes are powers of 2; this is because the number of banks equals a power of 2 on most super computers. Allocating 2^n + 1 blocks instead of 2^n blocks, one finds that bank conflict no longer occurs. In the present implementation, we have maximized the innermost loop count, but did not use the latter method.

In order to maximize the innermost loop count, we do the 1-dimensional FFT on the dimension whose index occupies the Most Significant Bit (hereafter MSB) of the addressing space of the given array variable. Of course, we also transpose the data before doing the 1-dimensional FFT on the next dimension. For example, consider the case that an array X(i,j,k) (i, j, k ∈ {1, 2, ..., 32}) is given and we have to do an FFT on X. We do the FFT on the last index k first; then the innermost loop runs over (i, j), so the innermost loop count is 32 x 32 = 1024. This is a long enough loop for the vector pipelines. The next FFT will be done on the second index j, but if we do not reorganize the array, the total loop has a rather complex structure. So before doing the next FFT, we transpose the array such that X(i,j,k) -> X(k,i,j).

In this strategy, transposing the array X might cause the fatal bank-conflict penalty, because the size of the array is a power of 2 and the number of banks is also a power of 2. But we found that a special array-transposition algorithm can resolve this problem. In order to avoid the bank conflict, we have re-organized the loop structure as in Figure 3. By using this algorithm, the stride of the data access is not a power of 2; therefore, the greatest common divisor equals 1 and no
bank conflict occurs.

Fig. 3. Illustration of transposition of an array, SOURCE(P,Q) -> DESTINATION(Q,P) with P:(x,y) and Q:z, without bank conflict. The numbers in each cell describe the order of data transfer.

Typical super computers have multiple pipelines, so we have used up-to-16-point butterfly operations on multiple data sets. Moreover, we employ a self-sorting but not in-place algorithm. That is, we prepare two arrays of the same size, one for the original data and the other for the intermediate result, and we do the bit-reverse ordering at the same time as the butterfly operation. This point is similar to GPFA. But not having the in-place property reduces the number of temporary variables required and allows effective
use of vector registers. As mentioned in the previous subsection, GPFA needs 2 N_p^2 = 512 temporary variables to do 16-point butterfly operations, while our simple implementation needs far fewer.

3 Results and Discussions

Throughout the present paper we use "FLOPS" to mean "effective FLOPS", as the authors of FFTW did. Namely, we count each butterfly operation as 10 FLOPs, although the actual averaged FLOP count is smaller than that.

3.1 On RISC Workstation

First of all, we must fix the "window size". Figure 4 clearly shows that performance increases with the window size until the window size reaches 2^15, and decreases rapidly when the window size exceeds 2^16. This is a very reasonable result considering that the machine has a 2MB cache memory and a window size of 2^16 corresponds to 2MB (notice 2^16 x (size of complex) = 2MB).

Figure 5 shows the 3D-FFT results in comparison with other algorithms. FFTW shows very good performance at each of the smaller sizes, but not at the greater ones. The present algorithm shows no performance degradation up to the biggest size.
Fig. 4. Window size dependence of 3D-FFT performance (MFLOPS vs window size, in 16-byte units). Measurement is done on the workstation model J2240 of the Hewlett-Packard company (PA-8200, 236 MHz, 2MB/2MB instruction/data cache). Array size is 128x128x128.

3.2 On Vector Type Super Computers

On vector type super computers, the performance shows almost no dependence on the system size, so we just show the results for a 128x128x128 array. Results for two systems, the Hitachi S3800 and the NEC SX-4, are shown in Table 1.

In Table 1, MFFT means the library called "MFFT" [9,10]. The original code is written for the CRAY compiler, so all the compiler directives were rewritten by the present author to match the compiler on each machine. FFTW is not designed for vector type super computers; this is a clear example that ignoring the target's structure causes unexpected results. ASL
Fig. 5. Size dependence of 3-dimensional FFT speed for FFTW, GPFA, and the present work (scalar version). Measurement is done on the workstation model J2240 of the Hewlett-Packard company (PA-8200, 236 MHz, 2MB/2MB instruction/data cache). Array sizes: 4x4x4, 8x8x8, 16x16x16, 32x32x32, 64x64x64, 256x64x32, 16x1024x64, 128x128x128.

means the multiple 1-dimensional FFT routine from the library called "ASL", which is issued by NEC. ASL does not include a multi-dimensional FFT code, so we attached our special transposition code (a part of the present work) to the 1-dimensional FFT subroutine for multiple data sets in ASL. GPFA means the original GPFA library; its code is also written for the CRAY compiler, and all the compiler directives were rewritten by the present author to match the compiler on each machine. GPFA(+1) means GPFA with the special array allocation such that the leading dimension has size 2^n + 1. GPFA(TRNS) means GPFA with the special transposition code (a part of
the present work).

Table 1. Results of calculation speed measurement on vector type super computers (size 128x128x128; values in MFLOPS).

    NEC SX-4                           Hitachi S3800
    Algorithm                 MFLOPS   Algorithm                 MFLOPS
    Present (vector version)  1:       Present (vector version)  3:
    GPFA                      3:       GPFA                      3:
    GPFA(+1)                  1:       GPFA(+1)                  2:
    GPFA(TRNS)                1:       GPFA(TRNS)                2:
    MFFT                      1:
    FFTW                      2:
    ASL                       1:

Comparing GPFA and GPFA(TRNS), we find that the performance gain from the improved transposition routine reaches nearly a factor of 10. It is difficult to compare GPFA and the present algorithm directly, because GPFA can do mixed-radix FFTs and the latter cannot, and the result depends on the type of machine. If one needs only a radix-2 FFT, the present algorithm has some advantage, at least on these machines.
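The benefit behind GPFA(+1)'s 2^n + 1 leading dimension can be shown numerically; this small demo is our own addition (names are ours). With B banks and stride S, addresses hit only B/gcd(B,S) distinct banks, so a power-of-2 stride on a power-of-2 bank count concentrates all accesses on a few banks, while an odd stride spreads them over every bank.

```python
from math import gcd

def banks_touched(n_banks, stride, n_accesses):
    """Set of memory banks hit by n_accesses consecutive elements at the
    given stride, with the mapping address mod n_banks."""
    return {(i * stride) % n_banks for i in range(n_accesses)}
```

With 16 banks, stride 8 touches only 16/gcd(16,8) = 2 banks (the bank-conflict case), whereas stride 9 (a 2^n + 1 leading dimension) touches all 16.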
4 Conclusion

On RISC computers, locality of the algorithm is the key point for effective calculation. The FFT has the property that each resultant datum needs global information about the original data, so it is hard to increase the locality of the algorithm. But this situation changes when one needs a multi-dimensional FFT, because a typical multi-dimensional FFT is a massive collection of small one-dimensional FFTs. We introduced a control parameter for the locality of the algorithm, named the "window size", and succeeded in reducing the penalty of large calculations.

On vector type super computers, effective use of the vector pipelines and the multi-bank structure are the key points for effective calculation. We constructed an effective code using up-to-16-point butterfly operations (i.e., a radix-16 algorithm) and a special matrix-transposition routine for matrix sizes that are powers of 2.

Acknowledgements

The present author thanks M. Frigo and S. G. Johnson, who wrote "bench FFT" [4] (and also "FFTW"). "bench FFT" is very impressive and useful for comparing FFT algorithms. The present work has been done on the computer facilities of the Center for Promotion of Computational Science and Engineering in the Japan Atomic Energy Research Institute, the Institute of Molecular Science,
and the Computer Centre of the University of Tokyo. The author thanks each organization a lot for the use of each facility.

References

[1] J. W. Cooley and J. W. Tukey, An Algorithm for the Machine Calculation of Complex Fourier Series, Math. Comp. 19 (1965).
[2] M. Frigo and S. G. Johnson, The Fastest Fourier Transform in the West, MIT-LCS-TR-728 (MIT technical report) (1997).
[3]
[4]
[5] C. Temperton, Self-sorting mixed-radix Fast Fourier Transforms, J. Comput. Phys. 52 (1983) 1-23.
[6] C. Temperton, Implementation of a Self-Sorting In-Place Prime Factor FFT Algorithm, J. Comput. Phys. 58 (1985).
[7] C. Temperton, Self-Sorting In-Place Fast Fourier Transforms, SIAM J. Sci. Stat. Comput. 12 (1991).
[8] C. Temperton, A Generalized Prime Factor FFT Algorithm for Any N = 2^p 3^q 5^r, SIAM J. Sci. Stat. Comput. 13 (1992).
[9] A. Nobile and V. Roberto, Efficient Implementation of Multidimensional Fast Fourier Transforms on a Cray X-MP, Comput. Phys. Commun. 40 (1986).
[10] A. Nobile and V. Roberto, MFFT: A Package for Two- and Three-Dimensional Vectorized Discrete Fourier Transforms, Comput. Phys. Commun. 42 (1986).
An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs Architecture optimized for Fast Ultra Long FFTs Parallel FFT structure reduces external memory bandwidth requirements Lengths from 32K to
More informationAdvanced Computing Research Laboratory. Adaptive Scientific Software Libraries
Adaptive Scientific Software Libraries and Texas Learning and Computation Center and Department of Computer Science University of Houston Challenges Diversity of execution environments Growing complexity
More informationARCHITECTURES FOR PARALLEL COMPUTATION
Datorarkitektur Fö 11/12-1 Datorarkitektur Fö 11/12-2 Why Parallel Computation? ARCHITECTURES FOR PARALLEL COMTATION 1. Why Parallel Computation 2. Parallel Programs 3. A Classification of Computer Architectures
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationCS 152 Computer Architecture and Engineering. Lecture 11 - Virtual Memory and Caches
CS 152 Computer Architecture and Engineering Lecture 11 - Virtual Memory and Caches Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationParallel-computing approach for FFT implementation on digital signal processor (DSP)
Parallel-computing approach for FFT implementation on digital signal processor (DSP) Yi-Pin Hsu and Shin-Yu Lin Abstract An efficient parallel form in digital signal processor can improve the algorithm
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationEfficiency Aspects for Advanced Fluid Finite Element Formulations
Proceedings of the 5 th International Conference on Computation of Shell and Spatial Structures June 1-4, 2005 Salzburg, Austria E. Ramm, W. A. Wall, K.-U. Bletzinger, M. Bischoff (eds.) www.iassiacm2005.de
More informationFormal Loop Merging for Signal Transforms
Formal Loop Merging for Signal Transforms Franz Franchetti Yevgen S. Voronenko Markus Püschel Department of Electrical & Computer Engineering Carnegie Mellon University This work was supported by NSF through
More informationImage Processing. Application area chosen because it has very good parallelism and interesting output.
Chapter 11 Slide 517 Image Processing Application area chosen because it has very good parallelism and interesting output. Low-level Image Processing Operates directly on stored image to improve/enhance
More informationHi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides
Hi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides for both problems first, and let you guys code them
More informationMemory and multiprogramming
Memory and multiprogramming COMP342 27 Week 5 Dr Len Hamey Reading TW: Tanenbaum and Woodhull, Operating Systems, Third Edition, chapter 4. References (computer architecture): HP: Hennessy and Patterson
More informationThe course that gives CMU its Zip! Memory System Performance. March 22, 2001
15-213 The course that gives CMU its Zip! Memory System Performance March 22, 2001 Topics Impact of cache parameters Impact of memory reference patterns memory mountain range matrix multiply Basic Cache
More informationChapter 11 Image Processing
Chapter Image Processing Low-level Image Processing Operates directly on a stored image to improve or enhance it. Stored image consists of a two-dimensional array of pixels (picture elements): Origin (0,
More informationCS 2461: Computer Architecture 1
Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code
More informationMemory Management Algorithms on Distributed Systems. Katie Becker and David Rodgers CS425 April 15, 2005
Memory Management Algorithms on Distributed Systems Katie Becker and David Rodgers CS425 April 15, 2005 Table of Contents 1. Introduction 2. Coarse Grained Memory 2.1. Bottlenecks 2.2. Simulations 2.3.
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance
More informationPit Pattern Classification of Zoom-Endoscopic Colon Images using D
Pit Pattern Classification of Zoom-Endoscopic Colon Images using DCT and FFT Leonhard Brunauer Hannes Payer Robert Resch Department of Computer Science University of Salzburg February 1, 2007 Outline 1
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationResource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs
In Proceedings of the International Conference on Distributed Smart Cameras, Como, Italy, August 2009. Resource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs Hojin
More informationLOW-POWER SPLIT-RADIX FFT PROCESSORS
LOW-POWER SPLIT-RADIX FFT PROCESSORS Avinash 1, Manjunath Managuli 2, Suresh Babu D 3 ABSTRACT To design a split radix fast Fourier transform is an ideal person for the implementing of a low-power FFT
More informationAn Optimum Design of FFT Multi-Digit Multiplier and Its VLSI Implementation
An Optimum Design of FFT Multi-Digit Multiplier and Its VLSI Implementation Syunji Yazaki Kôki Abe Abstract We designed a VLSI chip of FFT multiplier based on simple Cooly Tukey FFT using a floating-point
More informationENT 315 Medical Signal Processing CHAPTER 3 FAST FOURIER TRANSFORM. Dr. Lim Chee Chin
ENT 315 Medical Signal Processing CHAPTER 3 FAST FOURIER TRANSFORM Dr. Lim Chee Chin Outline Definition and Introduction FFT Properties of FFT Algorithm of FFT Decimate in Time (DIT) FFT Steps for radix
More informationB. Tech. Project Second Stage Report on
B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose Joe Stam Optimization GPUs are very fast BUT Poor programming can lead to disappointing performance Squeaking out the most speed
More informationAdaptive Scientific Software Libraries
Adaptive Scientific Software Libraries Lennart Johnsson Advanced Computing Research Laboratory Department of Computer Science University of Houston Challenges Diversity of execution environments Growing
More informationThe Fast Fourier Transform Algorithm and Its Application in Digital Image Processing
The Fast Fourier Transform Algorithm and Its Application in Digital Image Processing S.Arunachalam(Associate Professor) Department of Mathematics, Rizvi College of Arts, Science & Commerce, Bandra (West),
More informationScientific Computing. Some slides from James Lambers, Stanford
Scientific Computing Some slides from James Lambers, Stanford Dense Linear Algebra Scaling and sums Transpose Rank-one updates Rotations Matrix vector products Matrix Matrix products BLAS Designing Numerical
More informationIntroduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far
Chapter 5 Hashing 2 Introduction hashing performs basic operations, such as insertion, deletion, and finds in average time better than other ADTs we ve seen so far 3 Hashing a hash table is merely an hashing
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationLearning to Construct Fast Signal Processing Implementations
Journal of Machine Learning Research 3 (2002) 887-919 Submitted 12/01; Published 12/02 Learning to Construct Fast Signal Processing Implementations Bryan Singer Manuela Veloso Department of Computer Science
More informationAn Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs
HPEC 2004 Abstract Submission Dillon Engineering, Inc. www.dilloneng.com An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs Tom Dillon Dillon Engineering, Inc. This presentation outlines
More informationIMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign
SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT Alexander J. Yee University of Illinois Urbana-Champaign The Problem FFTs are extremely memory-intensive. Completely bound by memory access. Memory
More informationChapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,
Introduction Chapter 5 Hashing hashing performs basic operations, such as insertion, deletion, and finds in average time 2 Hashing a hash table is merely an of some fixed size hashing converts into locations
More informationStatistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform
Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Michael Andrews and Jeremy Johnson Department of Computer Science, Drexel University, Philadelphia, PA USA Abstract.
More informationCOMP171. Hashing.
COMP171 Hashing Hashing 2 Hashing Again, a (dynamic) set of elements in which we do search, insert, and delete Linear ones: lists, stacks, queues, Nonlinear ones: trees, graphs (relations between elements
More informationUniversity of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2.
University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2. Solutions Problem 1 Problem 3.12 in CSG (a) Clearly the partitioning can be
More informationA Hybrid GPU/CPU FFT Library for Large FFT Problems
A Hybrid GPU/CPU FFT Library for Large FFT Problems Shuo Chen Dept. of Electrical and Computer Engineering University of Delaware Newark, DE 19716 schen@udel.edu Abstract Graphic Processing Units (GPU)
More informationOn The Computational Cost of FFT-Based Linear Convolutions. David H. Bailey 07 June 1996 Ref: Not published
On The Computational Cost of FFT-Based Linear Convolutions David H. Bailey 07 June 1996 Ref: Not published Abstract The linear convolution of two n-long sequences x and y is commonly performed by extending
More informationThe Fast Fourier Transform
Chapter 7 7.1 INTRODUCTION The Fast Fourier Transform In Chap. 6 we saw that the discrete Fourier transform (DFT) could be used to perform convolutions. In this chapter we look at the computational requirements
More informationNAG Library Function Document nag_sum_fft_cosine (c06rfc)
NAG Library Function Document nag_sum_fft_cosine () 1 Purpose nag_sum_fft_cosine () computes the discrete Fourier cosine transforms of m sequences of real data values. The elements of each sequence and
More information