Parallelization of FFT in AFNI

Huang, Jingshan and Xi, Hong
Department of Computer Science and Engineering
University of South Carolina, Columbia, SC
huang27@cse.sc.edu and xi@cse.sc.edu

Abstract. AFNI is a widely used software package for medical image processing. However, it is not a real-time system. We are working toward a parallelized version of AFNI that will be nearly real-time. As the first step of this large task, we parallelize the FFT part of AFNI. Our method is based on MPI. The results show that, for the FFT algorithm itself, we obtain a speedup of around 2.62 in CPU time.

1 Introduction

AFNI is a widely used software package for medical image processing. However, the system has one major drawback: it is not a real-time system, which to some extent has limited its application in the related areas. Our ultimate goal is to produce a parallelized version of AFNI that is nearly real-time. As a first step, we parallelize the FFT part of AFNI. One of the reasons we chose the FFT is that, inside AFNI, the FFT is called extensively by other functions; it is a fundamental function of the package. Therefore, a parallel version of the FFT in AFNI is a significant step toward our final goal.

2 Background

2.1 AFNI

AFNI stands for Analysis of Functional NeuroImages. It is a set of C programs (over 1,000 source code files) for processing, analyzing, and displaying functional MRI (FMRI) data, a technique for mapping human brain activity. AFNI runs on Unix+X11+Motif systems, including SGI, Solaris, Linux, and Mac OS X. AFNI is an interactive program for viewing the results of 3D functional neuroimaging. It can overlay the (usually) low-resolution results of functional brain scans onto higher-resolution structural volume data sets. By marking fiducial points, one may transform the data to the proportional grid (stereotaxic coordinates) of Talairach and Tournoux [1]. Time-dependent 3D volume data sets can also be created and viewed. In addition, auxiliary programs are provided for combining and editing 3D and 3D+time functional data sets. Although widely used in medical image processing, AFNI is not a real-time software package, which limits its application in some cases.

2.2 FFT [3]

The FFT, or Fast Fourier Transform, is an algorithm for computing the discrete Fourier transform, which maps a sequence from the discrete time domain to the discrete frequency domain. It is simply a different way of laying out the computation, one that is much faster for large values of N, where N is the number of samples in the sequence.

The FFT is an ingenious way of achieving O(N log2 N) timing rather than the DFT's clumsy O(N^2). To derive it, we go backwards, starting with the 2-point transform:

   V[k] = W_2^(0*k) v[0] + W_2^(1*k) v[1],   k = 0, 1

with the two components being:

   V[0] = W_2^0 v[0] + W_2^0 v[1] = v[0] + v[1]
   V[1] = W_2^0 v[0] + W_2^1 v[1] = v[0] + W_2 v[1]

Here V is a column vector that stores the FFT of the input array v, and W_2 is the principal 2nd root of unity, equal to e^(-2*pi*i/2). In the following text we use W_N to denote the principal Nth root of unity, W_N = e^(-2*pi*i/N). We can represent the two equations for the components of the 2-point transform graphically using the so-called butterfly.

Fig. 1. Butterfly calculation: v[0] and v[1] are combined with the weights W_2^0 and W_2^1 to give V[0] and V[1]

Furthermore, using the divide-and-conquer strategy, a 4-point transform can be reduced to two 2-point transforms: one for the even elements, one for the odd elements. The odd one is multiplied by W_4^k. Diagrammatically, this can be represented as two levels of butterflies. Notice that using the identity W_(N/2)^n = W_N^(2n), we can always express all the multipliers as powers of the same W_N (in this case we choose N = 4).

Fig. 2. Diagrammatic representation of the 4-point Fourier transform calculation

In fact, all the butterflies have a similar form, combining one input weighted by W_N^s with another weighted by W_N^(s + N/2).

Fig. 3. Generic butterfly graph

This graph can be further simplified using the identity

   W_N^(s + N/2) = W_N^s * W_N^(N/2) = -W_N^s,

which is true because

   W_N^(N/2) = e^(-2*pi*i*(N/2)/N) = e^(-pi*i) = cos(-pi) + i*sin(-pi) = -1.

Fig. 4. Simplified generic butterfly: the second input is weighted by W_N^s and the result is added to, or subtracted from (multiplication by -1), the first input

Using this result, we can simplify the 4-point diagram accordingly.

Fig. 5. 4-point FFT calculation

This diagram is the essence of the FFT algorithm. The main trick is that we do not calculate each component of the Fourier transform separately; that would involve unnecessary repetition of a substantial number of calculations. Instead, we do our calculations in stages. At each stage we start with N (in general complex) numbers and "butterfly" them to obtain a new set of N complex numbers. Those numbers, in turn, become the input for the next stage. The calculation of a 4-point FFT involves two stages: the input of the first stage is the 4 original samples, and the output of the second stage is the 4 components of the Fourier transform. Notice that each stage involves N/2 complex multiplications (or N real multiplications), N/2 sign inversions (multiplications by -1), and N complex additions. So each stage can be done in O(N) time. The number of stages is log2 N (which, since N is a power of 2, is the exponent m in N = 2^m). Altogether, the FFT requires on the order of O(N log2 N) calculations. This complexity follows from the recurrence

   T(N) = 2 T(N/2) + c N
        = 2 [2 T(N/4) + c N/2] + c N = 4 T(N/4) + 2 c N
        = ...
        = 2^k T(N/2^k) + k c N.

When 2^k = N, T(N/2^k) = T(1) = 1 and k = log2 N, so T(N) = N + (log2 N) c N = O(N log2 N).

Moreover, the calculations can be done in place, using a single buffer of N complex numbers. The trick is to initialize this buffer with appropriately scrambled samples. For N = 4, the order of samples is v[0], v[2], v[1], v[3]. In general, according to our basic identity, we first divide the samples into two groups, the even ones and the odd ones. Applying this division recursively, we split each of these groups into two further groups by selecting every other sample.
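For illustration, the divide-and-conquer scheme described above can be written as a short, textbook-style recursive radix-2 FFT. This is a generic sketch only, not the csfft_cox() code from AFNI's csfft.c; the names cfft_rec and cplx are ours, and the sketch trades the in-place, scrambled-order layout for clarity by allocating scratch buffers.

   #include <stdio.h>
   #include <stdlib.h>
   #include <math.h>

   typedef struct { double r, i; } cplx;      /* complex sample: real and imaginary parts */

   /* Recursive radix-2 decimation-in-time FFT of n samples (n a power of 2).
      Split into even/odd halves, transform each half, then apply the butterflies
      with twiddle factors W_n^k = e^(-2*pi*i*k/n). */
   static void cfft_rec(int n, cplx *x)
   {
      int k;
      cplx *ev, *od;

      if (n == 1) return;                      /* a 1-point transform is the sample itself */

      ev = malloc((n/2) * sizeof(cplx));       /* even-indexed samples */
      od = malloc((n/2) * sizeof(cplx));       /* odd-indexed samples  */
      for (k = 0; k < n/2; k++) { ev[k] = x[2*k]; od[k] = x[2*k+1]; }

      cfft_rec(n/2, ev);                       /* the two half-size problems: T(n) = 2 T(n/2) + c n */
      cfft_rec(n/2, od);

      for (k = 0; k < n/2; k++) {              /* butterfly producing X[k] and X[k + n/2] */
         double ang = -2.0 * M_PI * k / n;
         cplx w = { cos(ang), sin(ang) };      /* W_n^k */
         cplx t = { w.r*od[k].r - w.i*od[k].i, /* W_n^k * Odd[k] */
                    w.r*od[k].i + w.i*od[k].r };
         x[k].r       = ev[k].r + t.r;   x[k].i       = ev[k].i + t.i;
         x[k + n/2].r = ev[k].r - t.r;   x[k + n/2].i = ev[k].i - t.i;
      }
      free(ev); free(od);
   }

   int main(void)
   {
      /* 4-point check: the FFT of {1, 2, 3, 4} is 10, -2+2i, -2, -2-2i */
      cplx v[4] = { {1,0}, {2,0}, {3,0}, {4,0} };
      int k;
      cfft_rec(4, v);
      for (k = 0; k < 4; k++) printf("V[%d] = %6.2f %+6.2fi\n", k, v[k].r, v[k].i);
      return 0;
   }

AFNI's csfft.c reaches the same result differently: the small transforms (fft2 through fft32) are hard-coded as straight-line butterfly formulas, and larger lengths are built on top of them, as described in Section 3.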

In summary, the naive implementation of the N-point discrete Fourier transform involves calculating the scalar product of the sample buffer (treated as an N-dimensional vector) with each of N separate basis vectors. Since each scalar product involves N multiplications and N additions, the total time is proportional to N^2 (in other words, it is an O(N^2) algorithm). However, it turns out that by cleverly rearranging these operations, one can optimize the algorithm down to O(N log2 N), which for large N makes a huge difference. The idea behind the FFT is the standard strategy for speeding up an algorithm, divide and conquer, which breaks the original N-point sample into two N/2-point sequences. This works because a series of smaller problems is easier to solve than one large one. The DFT requires (N-1)^2 complex multiplications and N(N-1) complex additions, whereas the FFT breaks the problem down into a series of 2-point transforms, each requiring only 1 multiplication and 2 additions, plus a recombination of the points whose cost is minimal.

2.3 FFT in AFNI

Inside AFNI, there are many functions that use the FFT algorithm to perform the Fourier transform; the FFT is a very fundamental function in AFNI. As our first step toward parallelizing AFNI, we implement a parallelized version of the FFT here.

2.4 MPI

MPI, the Message-Passing Interface [2], is the most widely used approach to developing a parallel system. Rather than specifying a new language (and hence a new compiler), it specifies a library of functions that can be called from a C or Fortran program. The foundation of this library is a small group of functions that can be used to achieve parallelism by message passing. A message-passing function, simply a function that explicitly transmits data from one process to another, is a powerful and very general method of expressing parallelism. Message-passing programs can be used to create extremely efficient parallel programs, and message passing is currently the most widely used method of programming many types of parallel computers. However, its principal drawback is that it is very difficult to design and develop programs using message passing; indeed, it has been called the assembly language of parallel computing because it forces the programmer to deal with so much detail. The introduction of MPI makes it possible for developers of parallel software to write libraries of parallel programs that are both portable and efficient. Use of these libraries hides many of the details of parallel programming and, as a consequence, makes parallel computing much more accessible to professionals in all branches of science and engineering.
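The basic pattern we rely on later, every worker rank sending its buffer to rank 0 with matching MPI_Send/MPI_Recv calls, is illustrated by the minimal stand-alone program below. It is an illustrative example only and is not part of AFNI or of our modified fftest.c.

   #include <stdio.h>
   #include <mpi.h>

   #define M 4   /* illustrative message length */

   int main(int argc, char *argv[])
   {
      int rank, np, k;
      float buf[M];
      MPI_Status status;

      MPI_Init(&argc, &argv);                  /* start up MPI */
      MPI_Comm_size(MPI_COMM_WORLD, &np);      /* number of processes */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process */

      if (rank != 0) {
         /* every worker fills a buffer and ships it to rank 0 (message tag 0) */
         for (k = 0; k < M; k++) buf[k] = (float)(100 * rank + k);
         MPI_Send(buf, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      } else {
         /* rank 0 collects one message from each worker, in rank order */
         for (k = 1; k < np; k++) {
            MPI_Recv(buf, M, MPI_FLOAT, k, 0, MPI_COMM_WORLD, &status);
            printf("rank 0 received buf[0] = %g from rank %d\n", buf[0], k);
         }
      }

      MPI_Finalize();                          /* shut down MPI */
      return 0;
   }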

3 Method

AFNI contains thousands of source code files; our focus is on csfft.c. Inside this file, we concentrate on the function csfft_cox(), which performs the FFT and is called by many other functions.

3.1 General Flow Chart of csfft_cox

Fig. 6. Flow chart of csfft_cox. Depending on the transform length, csfft_cox() dispatches to the hard-coded routines fft2, fft4, fft8, fft16 and fft32; to fft64, fft128, fft256 and fft512; or to fft1024, fft2048, fft4096, fft8192, fft16384 and fft32768, which are built on fft_4dec. Lengths containing factors of 3 or 5 (3n, 5n) go through fft_3dec and fft_5dec. The SCLINV scaling step is applied before csfft_cox() returns.

3.2 Analysis of fftn functions

We analyze the hard-coded part of the five fftn functions, i.e. fft2(), fft4(), fft8(), fft16() and fft32(), and translate the original code into a parallelized version. Here we take fft4() as an example (only the real parts are shown):

   xcx[0].r = (xcx[0].r + xcx[2].r) + (xcx[1].r + xcx[3].r)
   xcx[2].r = (xcx[0].r + xcx[2].r) - (xcx[1].r + xcx[3].r)
   xcx[1].r = (xcx[0].r - xcx[2].r) - (xcx[1].r - xcx[3].r) * csp[2].i
   xcx[3].r = (xcx[0].r - xcx[2].r) + (xcx[1].r - xcx[3].r) * csp[2].i

For the other hard-coded functions, the reader is referred to Appendix A.

3.3 One-level parallelization

There are several options for parallelizing the csfft_cox() function. At present, we adopt the one-level parallelization method, that is, we parallelize the points where fft4096() calls fft1024() and where fft8192() calls fft2048().
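Conceptually, this one-level split follows the standard radix-4 decimation in time: the N-point input is divided into four interleaved subsequences, each subsequence gets an N/4-point FFT (these are the four fft_4dec calls that the ranks execute in parallel in Appendix B), and the four sub-spectra are then merged with twiddle factors. The sketch below is our own illustration of that merge step, assuming the subsequences are x[4n], x[4n+1], x[4n+2] and x[4n+3]; it is not the actual fft_4dec code from csfft.c, and the names radix4_combine, cmul and complex_f are ours.

   #include <stdio.h>
   #include <math.h>

   typedef struct { float r, i; } complex_f;          /* same r/i layout as the xcx arrays */

   static complex_f cmul(complex_f a, complex_f b)    /* complex multiplication */
   {
      complex_f c;
      c.r = a.r * b.r - a.i * b.i;
      c.i = a.r * b.i + a.i * b.r;
      return c;
   }

   /* Radix-4 decimation-in-time merge for an N-point forward FFT (N divisible by 4).
      aa, bb, cc, dd hold the N/4-point FFTs of x[4n], x[4n+1], x[4n+2], x[4n+3];
      X receives the combined N-point spectrum.  W = e^(-2*pi*i/N). */
   static void radix4_combine(int N, const complex_f *aa, const complex_f *bb,
                              const complex_f *cc, const complex_f *dd, complex_f *X)
   {
      int k, Q = N / 4;
      for (k = 0; k < Q; k++) {
         double ang = -2.0 * M_PI * k / N;
         complex_f w1 = { (float)cos(ang),   (float)sin(ang)   };   /* W^k  */
         complex_f w2 = { (float)cos(2*ang), (float)sin(2*ang) };   /* W^2k */
         complex_f w3 = { (float)cos(3*ang), (float)sin(3*ang) };   /* W^3k */
         complex_f b = cmul(w1, bb[k]), c = cmul(w2, cc[k]), d = cmul(w3, dd[k]);

         /* X[k + m*N/4] = a + (-i)^m b + (-1)^m c + (i)^m d,  m = 0..3 */
         X[k].r       = aa[k].r + b.r + c.r + d.r;   X[k].i       = aa[k].i + b.i + c.i + d.i;
         X[k + Q].r   = aa[k].r + b.i - c.r - d.i;   X[k + Q].i   = aa[k].i - b.r - c.i + d.r;
         X[k + 2*Q].r = aa[k].r - b.r + c.r - d.r;   X[k + 2*Q].i = aa[k].i - b.i + c.i - d.i;
         X[k + 3*Q].r = aa[k].r - b.i - c.r + d.i;   X[k + 3*Q].i = aa[k].i + b.r - c.i - d.r;
      }
   }

   int main(void)
   {
      /* Tiny end-to-end check with N = 8 and x[n] = n.  The 2-point FFT of {a, b}
         is {a+b, a-b}, so the four sub-FFTs can be written down directly. */
      complex_f aa[2] = { {4,0},  {-4,0} };   /* FFT of {x[0], x[4]} = {0, 4} */
      complex_f bb[2] = { {6,0},  {-4,0} };   /* FFT of {x[1], x[5]} = {1, 5} */
      complex_f cc[2] = { {8,0},  {-4,0} };   /* FFT of {x[2], x[6]} = {2, 6} */
      complex_f dd[2] = { {10,0}, {-4,0} };   /* FFT of {x[3], x[7]} = {3, 7} */
      complex_f X[8];
      int k;
      radix4_combine(8, aa, bb, cc, dd, X);
      for (k = 0; k < 8; k++) printf("X[%d] = %6.2f %+6.2fi\n", k, X[k].r, X[k].i);
      return 0;
   }

In the parallel version, the N/4-point sub-FFTs are what the four ranks compute concurrently; the merge is then a purely local O(N) pass on the rank that gathers the sub-spectra.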

fftest.c is a source code file written solely for the purpose of testing FFT speed. It reports the CPU time, the wall clock time (elapsed time), the number of bytes of data processed, and the MFLOP rate. First of all, we set up the usual MPI initialization calls in the main function of fftest.c:

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &np);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

Then, in the csfft_cox() function, we handle the 4096 and 8192 cases (for the details, refer to Appendix B). Finally, back in the main function of fftest.c, we terminate MPI:

   MPI_Finalize();

4 Experiment Results

The following charts show some of our experiment results. The header of each chart lists three parameters: the FFT length, the number of iterations, and the dimension of the vector (how many FFTs per iteration). For the 2- and 4-processor runs, the CPU Time column lists one value per rank (rank 0 first), and the Average CPU Time is taken over the ranks other than rank 0 (see Comprehensive Chart 3 below).

FFT 4096 * 5,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   3.2                          .4            3.3            -
   2 Processors    3.69  2.2                    2.93          16.27          2.2
   4 Processors    3.88  1.77  1.57  1.45       5.33          23.53          1.6

FFT 4096 * 10,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   7.93                         1.35          13.83          -
   2 Processors    8.54  4.99                   7.47          37.7           4.99
   4 Processors    7.77  2.88  3.23  3.12       11.15         5.21           3.8

FFT 4096 * 20,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   14.55                        1.4           2.52           -
   2 Processors    19.22  8.54                  13.89         89.31          8.54
   4 Processors    15.65  5.5  5.28  5.88       22.39         13.88          5.55

FFT 4096 * 100,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   74.31                        9.38          114.53         -
   2 Processors    75  46.66                    64.67         363.83         46.66
   4 Processors    82.48  29.4  29.93  32.71    11.58         518.1          3.56

Comprehensive Chart 1 for FFT 4096
                   System Time / Total CPU Time           System Time / Rank 0 CPU Time
   Iterations      5,000   10,000   20,000   100,000      5,000   10,000   20,000   100,000
   Original Code   .1      .17      .1       .13          .1      .17      .1       .13
   2 Processors    .51     .55      .5       .53          .79     .87      .72      .86
   4 Processors    .61     .66      .69      .63          1.37    1.44     1.43     1.34

Comprehensive Chart 2 for FFT 4096
                   Total CPU Time / Elapsed Time          Rank 0 CPU Time / Elapsed Time
   Iterations      5,000   10,000   20,000   100,000      5,000   10,000   20,000   100,000
   Original Code   .97     .57      .71      .65          .97     .57      .71      .65
   2 Processors    .35     .36      .31      .33          .23     .23      .22      .21
   4 Processors    .37     .34      .31      .34          .15     .15      .15      .16

Comprehensive Chart 3 for FFT 4096
                   Sequential CPU Time / Average CPU Time    Sequential CPU Time / Rank 0 CPU Time
                   (except for rank 0)
   Iterations      5,000   10,000   20,000   100,000      5,000   10,000   20,000   100,000
   2 Processors    1.58    1.59     1.7      1.59         .87     .93      .76      .99
   4 Processors    2.0     2.58     2.62     2.43         .82     1.2      .93      .9

FFT 8192 * 5,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   7.29                         .15           7.61           -
   2 Processors    1.43  5.7                    7.11          28.72          5.7
   4 Processors    8.74  3.76  3.98  3.46       1.63          38.69          3.73

FFT 8192 * 10,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   17.45                        .33           18.27          -
   2 Processors    19.99  13.44                 13.79         74.87          13.44
   4 Processors    19.8  7.59  7.57  7.51       21.17         86.57          7.56

FFT 8192 * 20,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   3.7                          .9            34.22          -
   2 Processors    39.37  23.13                 27.29         142.89         23.13
   4 Processors    35.77  15.2  15.15  13.75    42.9          169.48         14.7

FFT 8192 * 100,000 iterations * 1 FFT per iteration
                   CPU Time (per rank)          System Time   Elapsed Time   Average CPU Time
   Original Code   152.72                       3.81          166.58         -
   2 Processors    27.98  114.16                134.52        698.48         114.16
   4 Processors    197.67  71  78.35  75.97     228.45        96.3           75.11

Comprehensive Chart 1 for FFT 8192
                   System Time / Total CPU Time           System Time / Rank 0 CPU Time
   Iterations      5,000   10,000   20,000   100,000      5,000   10,000   20,000   100,000
   Original Code   .2      .2       .3       .2           .1      .17      .1       .13
   2 Processors    .44     .41      .44      .42          .68     .69      .69      .65
   4 Processors    .53     .51      .54      .54          1.22    1.11     1.2      1.16

Comprehensive Chart 2 for FFT 8192
                   Total CPU Time / Elapsed Time          Rank 0 CPU Time / Elapsed Time
   Iterations      5,000   10,000   20,000   100,000      5,000   10,000   20,000   100,000
   Original Code   .96     .96      .9       .92          .96     .96      .9       .92
   2 Processors    .56     .45      .44      .46          .36     .27      .28      .3
   4 Processors    .52     .48      .47      .47          .22     .22      .21      .22

Comprehensive Chart 3 for FFT 8192
                   Sequential CPU Time / Average CPU Time    Sequential CPU Time / Rank 0 CPU Time
                   (except for rank 0)
   Iterations      5,000   10,000   20,000   100,000      5,000   10,000   20,000   100,000
   2 Processors    1.28    1.3      1.33     1.34         .7      .87      .78      .73
   4 Processors    1.95    2.31     2.9      2.3          .83     .91      .86      .77

5 Discussion

5.1 The correctness of our code

In our driver program, fftest.c, which is used to test the speed of the FFT, we read data from a file that contains 2,048,000 complex numbers. After performing the FFT and then the IFFT, we obtain a set of complex numbers that are almost the same as the ones in the original data file; the small differences come from the fact that both the real and imaginary parts of those complex numbers are floating-point numbers. This result shows that our parallelized version of the FFT code functions correctly.

5.2 One-level parallelization versus multi-level parallelization

In the original sequential code, the calculation has many data dependencies among the instructions. We analyzed the code and worked out the formula for each element. With these formulas there is no interdependence among different elements, and that forms the basis of the parallelization. In our first attempt at a parallelized implementation, we targeted the most basic functions, i.e. the five fft functions fft2, fft4, fft8, fft16 and fft32. Our original reasoning was that those functions are the ones called by other functions hundreds (maybe thousands) of times. However, the experiments showed that the cost spent in communication among all the participating nodes, together with other overhead, kills the speedup obtained by the parallelization itself.

Therefore, we did not parallelize those basic functions. Instead, we shifted our focus to a higher level, namely the points where fft4096() calls fft1024() and where fft8192() calls fft2048(). In addition, each time we use 4 processors to distribute the calculation workload. Also, our implementation does not adopt the multi-level technique. The advantage of multi-level parallelization is obvious and straightforward: it parallelizes the sequential code to the greatest possible extent. The disadvantage is that this kind of parallelization greatly increases the difficulty of the implementation, and the communication cost and other overhead cannot be neglected. In future work, we might consider a two-level parallelization technique in the implementation.

5.3 Analysis of speedup

There are two kinds of time that we take into account when analyzing our experimental results: CPU time and wall clock time (elapsed time). The former is the time spent in the calculation part of the code; the latter is the total elapsed time from the viewpoint of the user. Ideally both of these would decrease. However, elapsed time is much more non-deterministic than CPU time and is largely outside the control of our code. Furthermore, different strategies for assigning tasks among processors give different results. We tried two methods of distributing the workload: one distributes it over two machines (nodes), the other uses four machines. Both methods assign the workload evenly among the nodes. In addition, before each run of our program, we check and make sure that the nodes on which our code runs have a CPU occupancy of less than 2 percent. From our experimental data, we find that with both distribution strategies the CPU time is decreased, to different degrees. When we use the data set 4096/20,000/1 on four machines, meaning an FFT length of 4096, 20,000 iterations and one FFT per iteration, we get the maximum speedup in CPU time, a factor of around 2.62. However, here we consider only the average CPU time of the nodes other than the head node. The CPU time on the head node occupies a certain fraction of the wall clock time and is always a little more than the sequential CPU time. We believe there are problems regarding the overhead on the head node, because that node is the machine that distributes the original data and gathers the results for further processing such as recombining. Possible reasons include a networking bottleneck, an inefficient way of sending and receiving messages among the different nodes, and perhaps too many iterations as well. As for wall clock time, both distribution methods increased it. Moreover, the total CPU time is only 31 to 37 percent of the wall clock time (for an FFT length of 4096) and 47 to 52 percent of the wall clock time (for an FFT length of 8192). Therefore, it is obvious that most of the wall clock time is spent in the system itself, and we are still working to find out the exact reason(s). With both strategies, we did not obtain the ideal speedup of 2 or 4 times. The main reasons are as follows. First of all, there is competition among different users for the same CPU. Each time, we choose the machines that are least occupied by other users to run our code, but we have no control over those CPUs the way an operating system does.
So it can happen that, while our process is running on one CPU, that specific CPU is picked up by other users and hence becomes more occupied than it appeared to be. Secondly, due to the communication cost and other overhead, it is impossible to obtain the ideal speedup on real machines. In fact, as the size of our data set increases, the speedup increases as well. However, it is not true that the speedup keeps increasing indefinitely as we feed more input data to the nodes; we believe the reason is that as the running time increases, the risk of the selected nodes being occupied by other processes also increases. Therefore, there is an optimal data-set size that yields the largest speedup. In our experiments, the parameters 4096/20,000/1 running on four nodes are the best.
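The speedup figure quoted above is the ratio used in Comprehensive Chart 3: the sequential CPU time divided by the average CPU time of the ranks other than rank 0. The small helper below makes this definition explicit; it is written only for this analysis (the name speedup_vs_workers is ours, and the interpretation of the table columns is as described in Section 4), and the sample values are those recorded for the 4096-point, 20,000-iteration run on four processors.

   #include <stdio.h>

   /* Speedup as reported in Comprehensive Chart 3: sequential CPU time divided by
      the average CPU time of the worker ranks.  Rank 0, which scatters the data
      and gathers the results, is excluded from the average. */
   static double speedup_vs_workers(double seq_cpu, const double *rank_cpu, int nranks)
   {
      double sum = 0.0;
      int r;
      for (r = 1; r < nranks; r++) sum += rank_cpu[r];   /* skip rank 0 */
      return seq_cpu / (sum / (nranks - 1));
   }

   int main(void)
   {
      /* 4096-point FFT, 20,000 iterations: sequential CPU time 14.55,
         per-rank CPU times on four processors 15.65, 5.5, 5.28 and 5.88. */
      double ranks[4] = { 15.65, 5.5, 5.28, 5.88 };
      printf("speedup = %.2f\n", speedup_vs_workers(14.55, ranks, 4));   /* prints 2.62 */
      return 0;
   }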

6 Conclusions

We have parallelized the FFT part of the AFNI software package. Our method is based on MPI. The results show that, for the FFT algorithm itself, we obtain a speedup of around 2.62 in CPU time. Future work includes parallelizing the workhorse analysis tool in AFNI, the 3dDeconvolve program, which carries out multiple linear regression on voxel time series and computes the associated statistics. Furthermore, our final goal is to make a near real-time AFNI system.

References

1. Jean Talairach and Pierre Tournoux: Co-Planar Stereotaxic Atlas of the Human Brain. Thieme Medical Publishers, New York (1988) 11-121
2. Peter S. Pacheco: Parallel Programming with MPI. Morgan Kaufmann Publishers, San Francisco, California (1997) 11-41
3. Quinn: Parallel Algorithm, Chapter 8: The Fast Fourier Transform. Morgan Kaufmann Publishers, San Francisco, California (1997) 198-213

Appendix A:

fft8:

   xcx[0].r = (xcx[0].r + xcx[4].r) + (xcx[2].r + xcx[6].r) + [(xcx[1].r + xcx[5].r) + (xcx[3].r + xcx[7].r)]
   xcx[4].r = (xcx[0].r + xcx[4].r) + (xcx[2].r + xcx[6].r) - [(xcx[1].r + xcx[5].r) + (xcx[3].r + xcx[7].r)]
   xcx[1].r = (xcx[0].r - xcx[4].r) + csp[2].i * (xcx[2].r - xcx[6].r) + csp[4].i * [(xcx[1].r - xcx[5].r) + csp[2].i * (xcx[3].r - xcx[7].r)]
   xcx[5].r = (xcx[0].r - xcx[4].r) + csp[2].i * (xcx[2].r - xcx[6].r) - csp[4].i * [(xcx[1].r - xcx[5].r) + csp[2].i * (xcx[3].r - xcx[7].r)]
   xcx[2].r = (xcx[0].r + xcx[4].r) - (xcx[2].r + xcx[6].r) + csp[5].i * [(xcx[1].r + xcx[5].r) - (xcx[3].r + xcx[7].r)]
   xcx[6].r = (xcx[0].r + xcx[4].r) - (xcx[2].r + xcx[6].r) - csp[5].i * [(xcx[1].r + xcx[5].r) - (xcx[3].r + xcx[7].r)]
   xcx[3].r = (xcx[0].r - xcx[4].r) - csp[2].i * (xcx[2].r - xcx[6].r) + csp[6].i * [(xcx[1].r - xcx[5].r) - csp[2].i * (xcx[3].r - xcx[7].r)]
   xcx[7].r = (xcx[0].r - xcx[4].r) - csp[2].i * (xcx[2].r - xcx[6].r) - csp[6].i * [(xcx[1].r - xcx[5].r) - csp[2].i * (xcx[3].r - xcx[7].r)]

fft16:

   xcx[0].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] + [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] + [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] + [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]
   xcx[8].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] + [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] - [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] - [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]
   xcx[1].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] + csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r + xcx[14].r)] + csp[8].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[8].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[9].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] + csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r + xcx[14].r)] - csp[8].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[8].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[2].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] + csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] + csp[9].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] + csp[9].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
   xcx[10].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] + csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] - csp[9].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] - csp[9].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
   xcx[3].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] + csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] + csp[10].i * [-csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[10].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[11].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] + csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] - csp[10].i * [-csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[10].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[4].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] - [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] + csp[11].i * [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] - csp[11].i * [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]
   xcx[12].r = [(xcx[0].r + xcx[8].r) + (xcx[4].r + xcx[12].r)] - [(xcx[2].r + xcx[10].r) + (xcx[6].r + xcx[14].r)] - csp[11].i * [(xcx[5].r + xcx[13].r) + (xcx[1].r + xcx[9].r)] + csp[11].i * [(xcx[3].r + xcx[11].r) + (xcx[7].r + xcx[15].r)]

   xcx[5].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] - csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r + xcx[14].r)] + csp[12].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[12].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[13].r = [(xcx[0].r - xcx[8].r) + csp[2].i * (xcx[4].r - xcx[12].r)] - csp[4].i * [(xcx[2].r - xcx[10].r) + csp[2].i * (xcx[6].r + xcx[14].r)] - csp[12].i * [csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[12].i * csp[4].i * [(xcx[3].r - xcx[11].r) + csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[6].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] - csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] + csp[13].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] - csp[13].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
   xcx[14].r = [(xcx[0].r + xcx[8].r) - (xcx[4].r + xcx[12].r)] - csp[5].i * [(xcx[2].r + xcx[10].r) - (xcx[6].r + xcx[14].r)] - csp[13].i * [(xcx[5].r + xcx[13].r) - (xcx[1].r + xcx[9].r)] + csp[13].i * csp[5].i * [(xcx[3].r + xcx[11].r) - (xcx[7].r + xcx[15].r)]
   xcx[7].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] - csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] + csp[14].i * [-csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] - csp[14].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]
   xcx[15].r = [(xcx[0].r - xcx[8].r) - csp[2].i * (xcx[4].r - xcx[12].r)] - csp[6].i * [(xcx[2].r - xcx[10].r) - csp[2].i * (xcx[6].r - xcx[14].r)] - csp[14].i * [-csp[2].i * (xcx[5].r - xcx[13].r) + (xcx[1].r - xcx[9].r)] + csp[14].i * csp[6].i * [(xcx[3].r - xcx[11].r) - csp[2].i * (xcx[7].r - xcx[15].r)]

Appendix B:

   case 4096:   // modified by Huang and Xi

      // calculate array aa
      if (rank == 0)
         fft_4dec(mode, 1024, aa);

      // calculate arrays bb, cc and dd, then split them into two subarrays each
      if (rank == 1) {
         fft_4dec(mode, 1024, bb);
         for (i = 0; i < M; i++) {
            bb_r[i] = bb[i].r;
            bb_i[i] = bb[i].i;
         }
      }
      if (rank == 2) {
         fft_4dec(mode, 1024, cc);
         for (i = 0; i < M; i++) {
            cc_r[i] = cc[i].r;
            cc_i[i] = cc[i].i;
         }
      }
      if (rank == 3) {
         fft_4dec(mode, 1024, dd);
         for (i = 0; i < M; i++) {
            dd_r[i] = dd[i].r;
            dd_i[i] = dd[i].i;
         }
      }

      MPI_Barrier(MPI_COMM_WORLD);

      // send and receive the arrays
      if (rank == 1) {
         MPI_Send(bb_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
         MPI_Send(bb_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      }
      if (rank == 2) {
         MPI_Send(cc_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
         MPI_Send(cc_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      }
      if (rank == 3) {
         MPI_Send(dd_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
         MPI_Send(dd_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      }
      if (rank == 0) {
         MPI_Recv(bb_r, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(bb_i, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(cc_r, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(cc_i, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(dd_r, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(dd_i, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);

         // recombine bb, cc and dd
         for (i = 0; i < M; i++) {
            bb[i].r = bb_r[i];  bb[i].i = bb_i[i];
            cc[i].r = cc_r[i];  cc[i].i = cc_i[i];
            dd[i].r = dd_r[i];  dd[i].i = dd_i[i];
         }
      }
      break;

   case 8192:   // modified by Huang and Xi

      // calculate array aa
      if (rank == 0)
         fft_4dec(mode, 2048, aa);

      // calculate arrays bb, cc and dd, then split them into two subarrays each
      if (rank == 1) {
         fft_4dec(mode, 2048, bb);
         for (i = 0; i < M; i++) {
            bb_r[i] = bb[i].r;
            bb_i[i] = bb[i].i;
         }
      }
      if (rank == 2) {
         fft_4dec(mode, 2048, cc);
         for (i = 0; i < M; i++) {
            cc_r[i] = cc[i].r;
            cc_i[i] = cc[i].i;
         }
      }
      if (rank == 3) {
         fft_4dec(mode, 2048, dd);
         for (i = 0; i < M; i++) {
            dd_r[i] = dd[i].r;
            dd_i[i] = dd[i].i;
         }
      }

      MPI_Barrier(MPI_COMM_WORLD);

      // send and receive the arrays
      if (rank == 1) {
         MPI_Send(bb_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
         MPI_Send(bb_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      }
      if (rank == 2) {
         MPI_Send(cc_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
         MPI_Send(cc_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      }
      if (rank == 3) {
         MPI_Send(dd_r, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
         MPI_Send(dd_i, M, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
      }
      if (rank == 0) {
         MPI_Recv(bb_r, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(bb_i, M, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(cc_r, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(cc_i, M, MPI_FLOAT, 2, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(dd_r, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);
         MPI_Recv(dd_i, M, MPI_FLOAT, 3, 0, MPI_COMM_WORLD, &status);

         // recombine bb, cc and dd
         for (i = 0; i < M; i++) {
            bb[i].r = bb_r[i];  bb[i].i = bb_i[i];
            cc[i].r = cc_r[i];  cc[i].i = cc_i[i];
            dd[i].r = dd_r[i];  dd[i].i = dd_i[i];
         }
      }
      break;