Effective Implementations of Multi-Dimensional Radix-2 FFT

Susumu Yamamoto
Department of Applied Physics, University of Tokyo, Hongo, Bunkyo-ku, Tokyo, JAPAN

Abstract

The Fast Fourier Transform (FFT) is well known as a fast method for the Discrete Fourier Transform (DFT); its calculation time is proportional to N log N, where N is the system size. Unfortunately, an inappropriate implementation that takes no care of the structure of the target machine can increase the proportionality coefficient by a factor of 10. We propose effective implementations of the multi-dimensional radix-2 FFT for the recent RISC workstation and the vector-type supercomputer, respectively.

Key words: FFT, Multi-Dimension, Radix-2, RISC, vector super computer
PACS: c Computational techniques

Preprint submitted to Elsevier Preprint, 8 June 1999
1 Introduction

By using the Fast Fourier Transform (FFT) [1] algorithm, we can compute the Discrete Fourier Transform (DFT) in a calculation time theoretically proportional to N log N, where N is the system size. In practice, however, the proportionality coefficient depends considerably on the implementation, and sometimes even on the system size N. For example, a recent RISC workstation has a fast small cache memory and a slow large main memory, so the performance of a small calculation is definitely better than that of a large one. When one intends to do a multi-dimensional FFT, the system size grows very rapidly with the length scale, so one suffers from long calculation times. Our motivation is to find out how massive multi-dimensional FFTs can be done effectively.

Because there are so many precise articles on the FFT, we begin with just a few definitions and explanations specific to the multi-dimensional radix-2 FFT. Suppose the data to transform is stored in the array X, dimensioned as follows (Fortran-like representation):

    DIMENSION X(0:N_1-1, 0:N_2-1, ..., 0:N_m-1)

where each N_j (j = 1, 2, ..., m) is a power of 2. Next we define the DFT of X as

    X^(k_1, k_2, ..., k_m) = Σ_{n_1,n_2,...,n_m} ω_{N_1}^{k_1 n_1} ω_{N_2}^{k_2 n_2} ... ω_{N_m}^{k_m n_m} X(n_1, n_2, ..., n_m),   (1)
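The definition above can be checked with a few lines of code. The following is our own illustrative sketch (not the paper's implementation): a naive O(N^2) evaluation of the one-dimensional case of Eq. (1), using the twiddle factor ω_N = exp(-2πi/N) of Eq. (2).

```python
import cmath

def dft(x):
    """Naive 1-D DFT following Eqs. (1)-(2):
    X[k] = sum_n w_N^(k*n) * x[n], with w_N = exp(-2*pi*i/N)."""
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(w ** (k * n) * x[n] for n in range(N)) for k in range(N)]
```

For a constant input of length 4 this returns (4, 0, 0, 0), as the definition predicts.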
where

    ω_{N_j} := exp(-2πi / N_j).   (2)

This operation is easily decomposed as follows. Define N = N_1 N_2 ... N_m and Q_j = N / N_j, and then:

i) Do a 1D-FFT on the j-th index n_j for each (n_1, ..., n_{j-1}, n_{j+1}, ..., n_m).
ii) Do operation i) for each j ∈ {1, ..., m}.

So, one important point for fast calculation is an effective implementation of the 1-dimensional FFT on multiple data sets.

2 Strategy for Effective Calculation

2.1 On RISC Workstation

As written in the previous section, recent RISC workstations have a fast but small cache memory (primary, secondary and, sometimes, third level) and a slow but large main memory. It is well known that locality of the operation is the key point of effective calculation.

The approach of the famous FFT library "FFTW" [2,3] to this problem is quite smart: recursive calls are employed to access main memory hierarchically. This technique is very effective when the total amount
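Steps i)-ii) can be sketched in executable form. The following is our own two-dimensional illustration (names are ours): 1-D DFTs along the second index, then along the first, reproduce the full double sum of Eq. (1).

```python
import cmath

def dft1(x):
    """Naive 1-D DFT with w_N = exp(-2*pi*i/N)."""
    N = len(x)
    w = cmath.exp(-2j * cmath.pi / N)
    return [sum(w ** (k * n) * x[n] for n in range(N)) for k in range(N)]

def dft2d(a):
    """2-D DFT via steps i)-ii): 1-D DFTs on index n2 for every n1,
    then 1-D DFTs on index n1 for every n2."""
    rows = [dft1(row) for row in a]                 # transform n2
    cols = [dft1(list(col)) for col in zip(*rows)]  # transform n1
    return [list(r) for r in zip(*cols)]            # restore index order
```

The result agrees with the direct evaluation of the double sum on small examples.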
of data is not much greater than the cache size. But in the multi-dimensional case, the size of the data grows so rapidly that the performance of FFTW on a 64x64x64 system is only 1/4 of its performance on a 32x32x32 system (see Figure 5).

Our strategy is based on a simpler idea, one that everyone might try at first. The basic idea is just to copy the 1-dimensional data to a small work space, do the 1-dimensional FFT there, and put the result back into the original array. This ought to use the cache memory effectively, because 1-dimensional data is usually small enough to be stored in cache; in other words, a 1-dimensional FFT is local enough compared with the size of the cache memory. Unfortunately, this simple idea does not succeed in most cases, because of the incredibly slow memory copy between main memory and cache memory. We considered why the copying is so slow and found a way to reduce the copying time.

To understand this slow-copy phenomenon, one must recall the architecture of recent RISC Micro Processing Units (hereafter MPUs). They perform burst transfers in order to compensate for the latency of main memory access, so a non-contiguous block transfer takes a long time in comparison with a fast contiguous block transfer. To illustrate the situation, suppose 2-D data is given as follows:

    DIMENSION A(M,N)
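The waste caused by strided access can be counted directly. This small demo is our own addition (the 4-word cache line matches Figure 1; the function name is ours): it counts how many cache lines (bursts) must be fetched to touch a given set of addresses.

```python
def cache_lines_touched(addresses, line_size=4):
    """Count distinct cache lines (burst transfers) needed to touch the
    given element addresses, with line_size elements per line."""
    return len({a // line_size for a in addresses})
```

For M = 8, copying along the first dimension touches 8 contiguous addresses and needs only 2 lines, while copying along the second dimension (stride M = 8) needs 8 lines for the same 8 data: each burst delivers 4 elements but only 1 is used, the 3/4 waste of Figure 1.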
Fig. 1. Illustration of burst transfer in the worst case: the block size of the burst transfer is equivalent to the size of 4 data, and only the first datum in each block is required.

For the first dimension, the 1-dimensional data {A(i,j) | i = 1,...,M} for any j are located in a contiguous block. But for the second dimension, the data {A(i,j) | j = 1,...,N} for any i are aligned with stride M. Suppose the block size of the burst transfer (the size of a cache line) is equivalent to 4 data; then copying {A(i,j) | j = 1,...,N} to the work space causes 3/4 of the transferred data to be discarded (Figure 1). This means the effective transfer rate is degraded to 1/4 of the theoretical maximum. This is a fatal penalty.

In order to avoid this penalty, we have re-organized the loop structure as explained below and in Figure 2, here for the case of the second dimension:

1) Determine L, the size of the work space (hereafter we also call L the "window size"), so that L is not greater than the size of the cache memory.
2) Determine P, the number of sets of 1-dimensional data treated at once by the 1-dimensional FFT subroutine for multiple data sets. Given the size M of the 1-dimensional data, P = L/M.
3) Copy multiple 1-dimensional data sets to the work space and do multiple 1-dimensional FFTs as follows:

      do i=1, N, P
        do j=1, M
          do k=1, P
            W((j-1)*P+k) = A(i+k-1,j)
          end do
        end do
        ( call subroutine for P 1-dimensional FFTs )
        do j=1, M
          do k=1, P
            A(i+k-1,j) = W((j-1)*P+k)
          end do
        end do
      end do

With this method, each data set for a 1-dimensional FFT is not allocated in a contiguous region, but is aligned with stride P. Instead, P sets of 1-dimensional data are stored entirely in cache, so the MPU need not access the slow main memory until P 1-dimensional FFTs are complete. Because recent MPUs have only a limited number of general-purpose registers, the desirable number of temporary variables is limited accordingly; so we have used up-to-16-point butterfly operations on multiple data sets.
Fig. 2. Illustration of the cache scheme: the FFT size (2^7) times the multiplicity (2^9) is merged into one work space of the cache size (2^16 data).

We did not use a self-sorting in-place algorithm such as GPFA [5-8], in spite of its good property that memory access is minimal. We once tried this sophisticated algorithm and found that it needs 2 N_p^2 temporary variables (the factor 2 is for complex arithmetic) to do an N_p-point butterfly operation, because GPFA exchanges the lowest log_2 N_p bits and the highest log_2 N_p bits of the array index during the N_p-point butterfly operation. Instead of introducing a self-sorting in-place algorithm, we do the reversed-bit ordering at the same time as putting the data back from the work space to the original array. This implementation reduces the penalty of the reordering.
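The idea of fusing the bit-reversal into the copy-back can be sketched as follows; this is our own illustration (function names are ours), not the paper's code. The FFT result sitting in the work space in bit-reversed order is scattered into the original array through the permuted index, so no separate reordering pass over main memory is needed.

```python
def bit_reverse(n, bits):
    """Reverse the low `bits` bits of n (e.g. 001 -> 100 for bits=3)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (n & 1)
        n >>= 1
    return r

def copy_back_bit_reversed(W, A, base, stride, M):
    """Scatter the length-M (M a power of 2) work-space result W into A,
    applying the bit-reversal permutation during the copy itself."""
    bits = M.bit_length() - 1
    for j in range(M):
        A[base + stride * bit_reverse(j, bits)] = W[j]
```

The permutation is its own inverse, so the same routine works in either direction.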
2.2 On Vector Type Super Computer

Most vector type super computers are designed under a completely different policy from RISC workstations. A major difference is that super computers have a large number of sets of signal lines, called "banks", between the MPU and main memory. This structure can compensate for the latency of main memory access, especially for contiguous access; generally speaking, a super computer does not need cache memory thanks to its multi-bank structure. Another difference is the origin of the name "vector type": they have vector registers and pipelines for the vector registers.

The weak point of the multi-bank structure appears when the greatest common divisor of B and S is not 1, where B is the number of banks and S is the stride of memory access. Memory accesses then concentrate on specific banks and the bandwidth decreases. This performance decrease is called "bank conflict".

There are two well-known strategies for effective use of vector type super computers. One is that the innermost loop count should be long enough to use the vector registers effectively. The other is to avoid allocating array variables whose sizes are powers of 2; this is because the number of banks equals a power of 2 on most super computers. Allocating 2^n + 1 blocks instead of 2^n blocks, one finds that bank conflict no longer occurs. In the present implementation, we have maximized the innermost loop count, but did not use the latter method.

In order to maximize the innermost loop count, we do the 1-dimensional FFT on the dimension whose index occupies the Most Significant Bit (hereafter MSB) of the addressing space of the given array variable. Of course, we also transpose the data before doing the 1-dimensional FFT on the next dimension. For example, consider the case that an array X(i,j,k) (i, j, k ∈ {1, 2, ..., 32}) is given and we have to do an FFT on X. We do the FFT on the last index k first; then the innermost loop runs over (i, j), so the innermost loop count is 32 x 32 = 1024. This is a long enough loop for the vector pipelines. The next FFT will be done on the second index j, but if we do not reorganize the array, the total loop has a rather complex structure. So before doing the next FFT, we transpose the array such that X(i,j,k) -> X(k,i,j).

In this strategy, transposing the array X might cause the fatal bank-conflict penalty, because the size of the array is a power of 2 and the number of banks is also a power of 2. But we found that a special array-transposition algorithm can resolve this problem. In order to avoid the bank conflict, we have re-organized the loop structure as in Figure 3. By using this algorithm, the stride of the data access is not a power of 2; therefore, the greatest common divisor equals 1 and no
bank conflict occurs.

Fig. 3. Illustration of transposition of an array, SOURCE(P,Q) -> DESTINATION(Q,P) with P:(x,y) and Q:z, without bank conflict. The numbers in each cell describe the order of data transfer.

Typical super computers have multiple pipelines, so we have used up-to-16-point butterfly operations on multiple data sets. Moreover, we employ a self-sorting but not in-place algorithm. That is, we prepare two arrays of the same size, one for the original data and the other for the intermediate result, and we do the bit-reverse ordering at the same time as the butterfly operation. This point is similar to GPFA. But not having the in-place property reduces the number of temporary variables required and allows effective
use of vector registers. As mentioned in the previous subsection, GPFA needs 2 N_p^2 = 512 temporary variables to do 16-point butterfly operations, while our simple implementation needs far fewer.

3 Results and Discussions

Throughout the present paper we use "FLOPS" to mean "effective FLOPS", as the authors of FFTW did. Namely, we count each butterfly operation as 10 FLOPs, although the actual averaged FLOP count is smaller than that.

3.1 On RISC Workstation

First of all, we must fix the "window size". Figure 4 clearly shows that performance increases with the window size until the window size reaches 2^15, and decreases rapidly when the window size exceeds 2^16. This is a very reasonable result considering that the machine has a 2MB cache memory and a window size of 2^16 corresponds to 2MB (notice 2^16 x (size of complex) = 2MB).

Figure 5 shows the 3D-FFT results in comparison with other algorithms. FFTW shows very good performance at each of the smaller sizes, but not at the greater ones. The present algorithm shows no performance degradation up to the biggest size.
Fig. 4. Window size dependence of 3D-FFT performance (MFLOPS vs window size, in 16-byte units). Measurement is done on the workstation model J2240 of the Hewlett-Packard company (PA-8200, 236 MHz, 2MB/2MB instruction/data cache). Array size is 128x128x128.

3.2 On Vector Type Super Computers

On vector type super computers, the performance shows almost no dependence on the system size, so we just show the results for a 128x128x128 array. Results for two systems, the Hitachi S3800 and the NEC SX-4, are shown in Table 1.

In Table 1, MFFT means the library called "MFFT" [9,10]. The original code is written for the CRAY compiler, so all the compiler directives were rewritten by the present author to match the compiler on each machine. FFTW is not designed for vector type super computers; this is a clear example that ignoring the target's structure causes unexpected results. ASL
Fig. 5. Size dependence of 3-dimensional FFT speed for FFTW, GPFA, and the present work (scalar version). Measurement is done on the workstation model J2240 of the Hewlett-Packard company (PA-8200, 236 MHz, 2MB/2MB instruction/data cache). Array sizes: 4x4x4, 8x8x8, 16x16x16, 32x32x32, 64x64x64, 256x64x32, 16x1024x64, 128x128x128.

means the multiple 1-dimensional FFT routine from the library called "ASL", which is issued by NEC. ASL does not include a multi-dimensional FFT code, so we attached our special transposition code (a part of the present work) to the 1-dimensional FFT subroutine for multiple data sets in ASL. GPFA means the original GPFA library; its code is also written for the CRAY compiler, and all the compiler directives were rewritten by the present author to match the compiler on each machine. GPFA(+1) means GPFA with the special array allocation such that the leading dimension has size 2^n + 1. GPFA(TRNS) means GPFA with the special transposition code (a part of
the present work).

Table 1. Results of calculation speed measurement on vector type super computers (size 128x128x128; values in MFLOPS).

    NEC SX-4                           Hitachi S3800
    Algorithm                 MFLOPS   Algorithm                 MFLOPS
    Present (vector version)  1:       Present (vector version)  3:
    GPFA                      3:       GPFA                      3:
    GPFA(+1)                  1:       GPFA(+1)                  2:
    GPFA(TRNS)                1:       GPFA(TRNS)                2:
    MFFT                      1:
    FFTW                      2:
    ASL                       1:

Comparing GPFA and GPFA(TRNS), we find that the performance gain from the improved transposition routine reaches nearly a factor of 10. It is difficult to compare GPFA and the present algorithm directly, because GPFA can do mixed-radix FFTs and the latter cannot, and the result depends on the type of machine. If one needs only a radix-2 FFT, the present algorithm has some advantage, at least on these machines.
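The benefit behind GPFA(+1)'s 2^n + 1 leading dimension can be shown numerically; this small demo is our own addition (names are ours). With B banks and stride S, addresses hit only B/gcd(B,S) distinct banks, so a power-of-2 stride on a power-of-2 bank count concentrates all accesses on a few banks, while an odd stride spreads them over every bank.

```python
from math import gcd

def banks_touched(n_banks, stride, n_accesses):
    """Set of memory banks hit by n_accesses consecutive elements at the
    given stride, with the mapping address mod n_banks."""
    return {(i * stride) % n_banks for i in range(n_accesses)}
```

With 16 banks, stride 8 touches only 16/gcd(16,8) = 2 banks (the bank-conflict case), whereas stride 9 (a 2^n + 1 leading dimension) touches all 16.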
4 Conclusion

On RISC computers, locality of the algorithm is the key point for effective calculation. The FFT has the property that each resultant datum needs global information about the original data, so it is hard to increase the locality of the algorithm. But this situation changes when one needs a multi-dimensional FFT, because a typical multi-dimensional FFT is a massive collection of small one-dimensional FFTs. We introduced a control parameter for the locality of the algorithm, named the "window size", and succeeded in reducing the penalty of large calculations.

On vector type super computers, effective use of the vector pipelines and the multi-bank structure are the key points for effective calculation. We constructed an effective code using up-to-16-point butterfly operations (i.e., a radix-16 algorithm) and a special matrix-transposition routine for matrix sizes that are powers of 2.

Acknowledgements

The present author thanks M. Frigo and S. G. Johnson, who wrote "bench FFT" [4] (and also "FFTW"). "bench FFT" is very impressive and useful for comparing FFT algorithms. The present work has been done on the computer facilities of the Center for Promotion of Computational Science and Engineering in the Japan Atomic Energy Research Institute, the Institute of Molecular Science,
and the Computer Centre of the University of Tokyo. The author thanks each organization a lot for the use of each facility.

References

[1] J. W. Cooley and J. W. Tukey, An Algorithm for the Machine Calculation of Complex Fourier Series, Math. Comp. 19 (1965).
[2] M. Frigo and S. G. Johnson, The Fastest Fourier Transform in the West, MIT-LCS-TR-728 (MIT technical report) (1997).
[3]
[4]
[5] C. Temperton, Self-sorting mixed-radix Fast Fourier Transforms, J. Comput. Phys. 52 (1983) 1-23.
[6] C. Temperton, Implementation of a Self-Sorting In-Place Prime Factor FFT Algorithm, J. Comput. Phys. 58 (1985).
[7] C. Temperton, Self-Sorting In-Place Fast Fourier Transforms, SIAM J. Sci. Stat. Comput. 12 (1991).
[8] C. Temperton, A Generalized Prime Factor FFT Algorithm for Any N = 2^p 3^q 5^r, SIAM J. Sci. Stat. Comput. 13 (1992).
[9] A. Nobile and V. Roberto, Efficient Implementation of Multidimensional Fast Fourier Transforms on a Cray X-MP, Comput. Phys. Commun. 40 (1986).
[10] A. Nobile and V. Roberto, MFFT: A Package for Two- and Three-Dimensional Vectorized Discrete Fourier Transforms, Comput. Phys. Commun. 42 (1986).
An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs Architecture optimized for Fast Ultra Long FFTs Parallel FFT structure reduces external memory bandwidth requirements Lengths from 32K to
More informationAdvanced Computing Research Laboratory. Adaptive Scientific Software Libraries
Adaptive Scientific Software Libraries and Texas Learning and Computation Center and Department of Computer Science University of Houston Challenges Diversity of execution environments Growing complexity
More informationARCHITECTURES FOR PARALLEL COMPUTATION
Datorarkitektur Fö 11/12-1 Datorarkitektur Fö 11/12-2 Why Parallel Computation? ARCHITECTURES FOR PARALLEL COMTATION 1. Why Parallel Computation 2. Parallel Programs 3. A Classification of Computer Architectures
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationCS 152 Computer Architecture and Engineering. Lecture 11 - Virtual Memory and Caches
CS 152 Computer Architecture and Engineering Lecture 11 - Virtual Memory and Caches Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationParallel-computing approach for FFT implementation on digital signal processor (DSP)
Parallel-computing approach for FFT implementation on digital signal processor (DSP) Yi-Pin Hsu and Shin-Yu Lin Abstract An efficient parallel form in digital signal processor can improve the algorithm
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationEfficiency Aspects for Advanced Fluid Finite Element Formulations
Proceedings of the 5 th International Conference on Computation of Shell and Spatial Structures June 1-4, 2005 Salzburg, Austria E. Ramm, W. A. Wall, K.-U. Bletzinger, M. Bischoff (eds.) www.iassiacm2005.de
More informationFormal Loop Merging for Signal Transforms
Formal Loop Merging for Signal Transforms Franz Franchetti Yevgen S. Voronenko Markus Püschel Department of Electrical & Computer Engineering Carnegie Mellon University This work was supported by NSF through
More informationImage Processing. Application area chosen because it has very good parallelism and interesting output.
Chapter 11 Slide 517 Image Processing Application area chosen because it has very good parallelism and interesting output. Low-level Image Processing Operates directly on stored image to improve/enhance
More informationHi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides
Hi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides for both problems first, and let you guys code them
More informationMemory and multiprogramming
Memory and multiprogramming COMP342 27 Week 5 Dr Len Hamey Reading TW: Tanenbaum and Woodhull, Operating Systems, Third Edition, chapter 4. References (computer architecture): HP: Hennessy and Patterson
More informationThe course that gives CMU its Zip! Memory System Performance. March 22, 2001
15-213 The course that gives CMU its Zip! Memory System Performance March 22, 2001 Topics Impact of cache parameters Impact of memory reference patterns memory mountain range matrix multiply Basic Cache
More informationChapter 11 Image Processing
Chapter Image Processing Low-level Image Processing Operates directly on a stored image to improve or enhance it. Stored image consists of a two-dimensional array of pixels (picture elements): Origin (0,
More informationCS 2461: Computer Architecture 1
Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code
More informationMemory Management Algorithms on Distributed Systems. Katie Becker and David Rodgers CS425 April 15, 2005
Memory Management Algorithms on Distributed Systems Katie Becker and David Rodgers CS425 April 15, 2005 Table of Contents 1. Introduction 2. Coarse Grained Memory 2.1. Bottlenecks 2.2. Simulations 2.3.
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam Optimization GPUs are very fast BUT Naïve programming can result in disappointing performance
More informationPit Pattern Classification of Zoom-Endoscopic Colon Images using D
Pit Pattern Classification of Zoom-Endoscopic Colon Images using DCT and FFT Leonhard Brunauer Hannes Payer Robert Resch Department of Computer Science University of Salzburg February 1, 2007 Outline 1
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationResource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs
In Proceedings of the International Conference on Distributed Smart Cameras, Como, Italy, August 2009. Resource-efficient Acceleration of 2-Dimensional Fast Fourier Transform Computations on FPGAs Hojin
More informationLOW-POWER SPLIT-RADIX FFT PROCESSORS
LOW-POWER SPLIT-RADIX FFT PROCESSORS Avinash 1, Manjunath Managuli 2, Suresh Babu D 3 ABSTRACT To design a split radix fast Fourier transform is an ideal person for the implementing of a low-power FFT
More informationAn Optimum Design of FFT Multi-Digit Multiplier and Its VLSI Implementation
An Optimum Design of FFT Multi-Digit Multiplier and Its VLSI Implementation Syunji Yazaki Kôki Abe Abstract We designed a VLSI chip of FFT multiplier based on simple Cooly Tukey FFT using a floating-point
More informationENT 315 Medical Signal Processing CHAPTER 3 FAST FOURIER TRANSFORM. Dr. Lim Chee Chin
ENT 315 Medical Signal Processing CHAPTER 3 FAST FOURIER TRANSFORM Dr. Lim Chee Chin Outline Definition and Introduction FFT Properties of FFT Algorithm of FFT Decimate in Time (DIT) FFT Steps for radix
More informationB. Tech. Project Second Stage Report on
B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose Joe Stam Optimization GPUs are very fast BUT Poor programming can lead to disappointing performance Squeaking out the most speed
More informationAdaptive Scientific Software Libraries
Adaptive Scientific Software Libraries Lennart Johnsson Advanced Computing Research Laboratory Department of Computer Science University of Houston Challenges Diversity of execution environments Growing
More informationThe Fast Fourier Transform Algorithm and Its Application in Digital Image Processing
The Fast Fourier Transform Algorithm and Its Application in Digital Image Processing S.Arunachalam(Associate Professor) Department of Mathematics, Rizvi College of Arts, Science & Commerce, Bandra (West),
More informationScientific Computing. Some slides from James Lambers, Stanford
Scientific Computing Some slides from James Lambers, Stanford Dense Linear Algebra Scaling and sums Transpose Rank-one updates Rotations Matrix vector products Matrix Matrix products BLAS Designing Numerical
More informationIntroduction. hashing performs basic operations, such as insertion, better than other ADTs we ve seen so far
Chapter 5 Hashing 2 Introduction hashing performs basic operations, such as insertion, deletion, and finds in average time better than other ADTs we ve seen so far 3 Hashing a hash table is merely an hashing
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationLearning to Construct Fast Signal Processing Implementations
Journal of Machine Learning Research 3 (2002) 887-919 Submitted 12/01; Published 12/02 Learning to Construct Fast Signal Processing Implementations Bryan Singer Manuela Veloso Department of Computer Science
More informationAn Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs
HPEC 2004 Abstract Submission Dillon Engineering, Inc. www.dilloneng.com An Efficient Architecture for Ultra Long FFTs in FPGAs and ASICs Tom Dillon Dillon Engineering, Inc. This presentation outlines
More informationIMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign
SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT Alexander J. Yee University of Illinois Urbana-Champaign The Problem FFTs are extremely memory-intensive. Completely bound by memory access. Memory
More informationChapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,
Introduction Chapter 5 Hashing hashing performs basic operations, such as insertion, deletion, and finds in average time 2 Hashing a hash table is merely an of some fixed size hashing converts into locations
More informationStatistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform
Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform Michael Andrews and Jeremy Johnson Department of Computer Science, Drexel University, Philadelphia, PA USA Abstract.
More informationCOMP171. Hashing.
COMP171 Hashing Hashing 2 Hashing Again, a (dynamic) set of elements in which we do search, insert, and delete Linear ones: lists, stacks, queues, Nonlinear ones: trees, graphs (relations between elements
More informationUniversity of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2.
University of Southern California Department of Electrical Engineering EE657 Spring 2K2 Instructor: Michel Dubois Homework #2. Solutions Problem 1 Problem 3.12 in CSG (a) Clearly the partitioning can be
More informationA Hybrid GPU/CPU FFT Library for Large FFT Problems
A Hybrid GPU/CPU FFT Library for Large FFT Problems Shuo Chen Dept. of Electrical and Computer Engineering University of Delaware Newark, DE 19716 schen@udel.edu Abstract Graphic Processing Units (GPU)
More informationOn The Computational Cost of FFT-Based Linear Convolutions. David H. Bailey 07 June 1996 Ref: Not published
On The Computational Cost of FFT-Based Linear Convolutions David H. Bailey 07 June 1996 Ref: Not published Abstract The linear convolution of two n-long sequences x and y is commonly performed by extending
More informationThe Fast Fourier Transform
Chapter 7 7.1 INTRODUCTION The Fast Fourier Transform In Chap. 6 we saw that the discrete Fourier transform (DFT) could be used to perform convolutions. In this chapter we look at the computational requirements
More informationNAG Library Function Document nag_sum_fft_cosine (c06rfc)
NAG Library Function Document nag_sum_fft_cosine () 1 Purpose nag_sum_fft_cosine () computes the discrete Fourier cosine transforms of m sequences of real data values. The elements of each sequence and
More information