A Parallel Implementation of the 3D NUFFT on Distributed-Memory Systems



Yuanxun Bill Bao
May 31,

1 Introduction

The non-uniform fast Fourier transform (NUFFT) algorithm was originally introduced by Dutt and Rokhlin [1] to generalize the FFT algorithm to nonequispaced data on the interval $[-\pi, \pi]$. In $d$ dimensions, the NUFFT algorithm can achieve a complexity of $O\left(M^d \log M + N (\log \frac{1}{\epsilon})^d\right)$, where $\epsilon$ is the precision of computation, $M$ is the number of Fourier modes in each dimension, and $N$ is the total number of data points. The NUFFT algorithm arises in a variety of applications, and we refer the reader to the discussions in [2, 3].

In this report, we focus on the parallel implementation of the NUFFT on large distributed-memory systems. We note that there have been recent developments in parallelizing the NUFFT on massively parallel distributed-memory systems, namely the PNFFT library [5]. Unlike the PNFFT library, which is based on the NFFT library in C, our implementation is based on the P3DFFT library [4] in Fortran. Due to the time constraints and scope of the project, our implementation is not yet optimized for performance and is restricted to the transform from the non-uniform physical domain to the uniform frequency domain (type-1 transform).

2 The NUFFT

In 3D, the type-1 NUFFT is mainly concerned with evaluating the sum

$$F(k_1, k_2, k_3) = \frac{1}{N} \sum_{j=0}^{N-1} f_j \, e^{-i (k_1, k_2, k_3) \cdot x_j}, \tag{2.1}$$

where $\{x_j\}_{j=0}^{N-1}$ are non-uniformly distributed sources in the domain $[-\pi, \pi]^3$, $k_i \in \{-M/2, \ldots, M/2 - 1\}$ for $i = 1, 2, 3$, and the strength of the source $x_j$ is $f_j = f(x_j)$. Note that a direct evaluation of the sum (2.1) requires $O(N M^3)$ operations. Typically, when $N \gg M$, direct evaluation of the sum is computationally intractable.

The 1D NUFFT algorithm can be summarized in three steps:

1. Gridding: for each source $x_j$, spread the strength $f_j$ to the $M_s$ nearest oversampled regular grid points on each side by convolving with a Gaussian function. We use a Gaussian because, in higher dimensions, it can be written as a tensor product of 1D Gaussians. The number of oversampled regular grid points is typically set to $M_r = 2M$. More specifically, the contribution of source $x_j$ to the grid point with index $m + m'$ is

$$f_\tau\left(2\pi (m + m') / M_r\right) = f_j \, e^{-\left(x_j - 2\pi (m + m') / M_r\right)^2 / 4\tau}, \tag{2.2}$$

where $2\pi m / M_r$ is the nearest regular grid point to the source $x_j$ and $-M_s < m' \le M_s$.

2. FFT: take the FFT of $f_\tau$ to obtain $F_\tau(k)$.

3. Deconvolution: $F(k) = \sqrt{\pi / \tau} \, e^{k^2 \tau} F_\tau(k)$.

In practice, for $\epsilon = 10^{-12}$, we set $M_s = 12$ and $\tau = 12 / M^2$. The 1D NUFFT algorithm generalizes easily to higher dimensions. In $d$ dimensions, the gridding step takes $O(24^d N)$ exponential evaluations (each source touches $2 M_s = 24$ grid points per dimension), the FFT step takes $O(M^d \log M)$ operations, and the deconvolution step takes $O(M^d)$ multiplications. When $N \gg M$ and $M$ is large, a sequential NUFFT becomes quite expensive.

3 Distributed Memory Parallelism

We choose distributed-memory parallelism for the NUFFT because steps 1 and 3 are local and there are existing libraries (FFTW, P3DFFT) that compute the FFT on distributed-memory systems. In order to run the NUFFT on massively parallel distributed-memory systems such as Stampede, we employ a 2D (pencil-shaped) domain decomposition.
To be more precise, the x-direction of the computational domain is local to each processor, while the y- and z-directions are distributed among a 2D grid of processors (Figure 1). Each processor is responsible for a pencil-shaped chunk of the computational domain (Figure 2).

Figure 1: An illustration of a 4 × 4 2D grid of processors. Each processor has 8 neighbors.

Each processor loops through all of its sources and performs the gridding step. Inter-processor communication is necessary when nearby regular grid points of a source lie outside the computational domain of that processor (Figure 3a). To carry out this procedure efficiently, we extend the local computational domain of each processor to include a halo of ghost arrays. Gridding can therefore be done locally first, after which each processor sends its ghost arrays to the corresponding neighboring processors (Figure 3b). Once every processor has completed its ghost array exchanges, we call the parallel FFT provided by the P3DFFT library. The deconvolution step can be done locally within each processor. This completes the description of our parallel implementation of the NUFFT algorithm. We discuss the details of inter-processor communications next.
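As a serial reference point for the parallel scheme, the three steps of the 1D algorithm from Section 2 (gridding, FFT, deconvolution) can be sketched as follows. This is a minimal illustration in pure Python, not the Fortran/P3DFFT implementation described in this report. The function names `fft`, `nufft1d_type1` and `direct_sum` are hypothetical, the small radix-2 FFT stands in for a library FFT, the reference grid index `m` is taken as the grid point at or to the left of `x_j` (an assumption that keeps the stencil $-M_s < m' \le M_s$ centered on the source), and the normalizations $1/N$ and $1/M_r$ are applied explicitly in the last step.

```python
import cmath
import math

def fft(a):
    """Recursive radix-2 FFT, sign convention exp(-2*pi*i*k*m/n); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def nufft1d_type1(x, f, M, Ms=12):
    """Approximate F(k) = (1/N) sum_j f_j exp(-i k x_j) for k = -M/2, ..., M/2 - 1."""
    N, Mr = len(x), 2 * M              # oversampled grid size M_r = 2M
    tau = 12.0 / M**2                  # Gaussian width parameter from the text
    h = 2 * math.pi / Mr               # oversampled grid spacing
    grid = [0j] * Mr
    # Step 1 (gridding): spread each source onto 2*Ms periodic grid points, eq. (2.2).
    for xj, fj in zip(x, f):
        m = math.floor(xj / h)         # assumption: grid point at or left of x_j
        for mp in range(-Ms + 1, Ms + 1):
            d = xj - h * (m + mp)
            grid[(m + mp) % Mr] += fj * math.exp(-d * d / (4 * tau))
    # Step 2 (FFT) on the oversampled grid.
    Ftau = fft(grid)
    # Step 3 (deconvolution): sqrt(pi/tau) * exp(k^2 tau), with 1/(N*Mr) normalization.
    scale = math.sqrt(math.pi / tau) / (N * Mr)
    return [scale * math.exp(k * k * tau) * Ftau[k % Mr]
            for k in range(-(M // 2), M // 2)]

def direct_sum(x, f, M):
    """O(N*M) direct evaluation of the same sum, used only as a check."""
    N = len(x)
    return [sum(fj * cmath.exp(-1j * k * xj) for xj, fj in zip(x, f)) / N
            for k in range(-(M // 2), M // 2)]
```

For a small case such as `M = 16` with a few dozen sources in $[-\pi, \pi)$, the output of `nufft1d_type1` can be compared entry by entry against `direct_sum`. In the parallel setting, only step 1 requires the ghost-array exchange, and step 2 is delegated to the distributed FFT.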


Figure 2: The 2D domain decomposition of the computational domain (pencil-shaped).

Figure 3: (a) An illustration of a source point whose neighboring regular grid points lie outside the computational domain of a processor. (b) An illustration of ghost arrays being sent to their corresponding neighboring processors.
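The ghost-array idea illustrated in Figure 3 can be mimicked in a serial simulation. The sketch below is an assumption-laden toy rather than the report's code: it keeps only the two distributed directions (y and z), uses a `p` × `p` processor grid on a periodic `n` × `n` index space, spreads a constant weight over a square stencil instead of a Gaussian, and replaces the MPI exchange with a loop that folds every ghost cell into the processor that owns it. The names `global_spread` and `pencil_spread` are hypothetical.

```python
def global_spread(points, n, halo):
    """Reference: spread each weight over a periodic n x n (y, z) grid without decomposition."""
    grid = [[0.0] * n for _ in range(n)]
    for iy, iz, w in points:
        for dy in range(-halo + 1, halo + 1):
            for dz in range(-halo + 1, halo + 1):
                grid[(iy + dy) % n][(iz + dz) % n] += w
    return grid

def pencil_spread(points, n, p, halo):
    """Spread locally into halo-padded arrays on a p x p processor grid, then fold ghosts."""
    nl = n // p                                   # local extent in y and z
    assert n % p == 0 and halo <= nl              # ghosts reach only the 8 immediate neighbors
    size = nl + 2 * halo
    local = {(py, pz): [[0.0] * size for _ in range(size)]
             for py in range(p) for pz in range(p)}
    # Local gridding: each source is handled by its owner, no communication needed.
    for iy, iz, w in points:
        arr = local[(iy // nl, iz // nl)]
        ly, lz = iy % nl + halo, iz % nl + halo   # shift by halo into the padded array
        for dy in range(-halo + 1, halo + 1):
            for dz in range(-halo + 1, halo + 1):
                arr[ly + dy][lz + dz] += w
    # Ghost exchange (simulated): add every local cell to the global cell that owns it.
    grid = [[0.0] * n for _ in range(n)]
    for (py, pz), arr in local.items():
        for a in range(size):
            for b in range(size):
                grid[(py * nl + a - halo) % n][(pz * nl + b - halo) % n] += arr[a][b]
    return grid
```

With, for example, `n = 12`, `p = 3` and `halo = 2`, the decomposed result agrees with `global_spread` for sources placed anywhere, including next to pencil boundaries and domain corners. The condition `halo <= nl` reflects why eight neighbors suffice: the halo never extends past an adjacent pencil.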


4 Inter-Processor Communications

We are now ready to discuss how inter-processor communications are carried out in our implementation. After each processor completes the gridding step, it must send the corresponding ghost arrays to its eight neighbors. The order in which ghost arrays are transferred among processors is designed to avoid any hang (deadlock) at runtime. As an illustration, in Figure 4, we divide the 2D processor grid into two groups: odd-row and even-row processors. We demonstrate how North-South communications are carried out. First, the even-row processors send data to their North neighbors (MPI_Send) and wait to receive from their North neighbors (MPI_Recv). Meanwhile, the odd-row processors receive data from their South neighbors and send data to the South (Figure 4a). Next, the odd-row processors send data to the North and wait to receive from the North, while the even-row processors receive data from the South and send data to the South (Figure 4b). This completes all North-South data exchanges. The E-W, NE-SW, and SE-NW communications are carried out in a similar fashion. We note that the communication scheme discussed here is not optimal, but it effectively avoids hangs at runtime.

Figure 4: (a), (b) An illustration of North-South data exchange among even- and odd-row processors.
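The ordering argument above can be checked with a small simulation. The sketch below does not call MPI. It models each processor row as a list of blocking operations and assumes the most pessimistic behavior the MPI standard allows for a standard-mode send, namely that a send completes only once the matching receive is posted. The helper names `north_south_schedule`, `naive_schedule` and `run_rendezvous` are hypothetical, rows are taken to be periodic, and the number of rows is assumed to be even.

```python
def north_south_schedule(rows):
    """Two-phase even/odd ordering described in the text, one op list per processor row."""
    assert rows % 2 == 0
    ops = {}
    for r in range(rows):
        north, south = (r + 1) % rows, (r - 1) % rows
        to_north = [("send", north), ("recv", north)]   # send North, then receive from North
        to_south = [("recv", south), ("send", south)]   # receive from South, then send South
        # Even rows talk to the North first; odd rows serve their South neighbor first.
        ops[r] = to_north + to_south if r % 2 == 0 else to_south + to_north
    return ops

def naive_schedule(rows):
    """Every row sends North first: all rows block in 'send' at the same time."""
    return {r: [("send", (r + 1) % rows), ("recv", (r - 1) % rows),
                ("send", (r - 1) % rows), ("recv", (r + 1) % rows)]
            for r in range(rows)}

def run_rendezvous(ops):
    """Execute op lists with unbuffered sends; return True if every row finishes."""
    pc = {r: 0 for r in ops}                            # next operation index per row
    progressed = True
    while progressed:
        progressed = False
        for r in ops:
            if pc[r] == len(ops[r]):
                continue
            kind, peer = ops[r][pc[r]]
            if kind != "send" or pc[peer] == len(ops[peer]):
                continue
            # A send completes only if the peer is currently waiting to receive from r.
            if ops[peer][pc[peer]] == ("recv", r):
                pc[r] += 1
                pc[peer] += 1
                progressed = True
    return all(pc[r] == len(ops[r]) for r in ops)
```

Under these assumptions, `run_rendezvous(north_south_schedule(4))` completes, whereas `run_rendezvous(naive_schedule(4))` stalls with every row blocked in its first send. Whether the naive order actually hangs in practice depends on message size and on the buffering done by the MPI library.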

5 Results

We discuss the strong and weak scaling of our parallel implementation of the 3D NUFFT algorithm. For strong scaling, we consider $M = 1024$ and $N$ = billion sources in $[0, 2\pi]^3$. The oversampled grid resolution is set to $2M$. The other parameters are $\epsilon = 10^{-12}$, $M_s = 12$ and $\tau = 12 / M^2$. We run on Stampede with 512, 1024, 2048 and 4096 processors. In Figure 5a, we plot the average time per processor for the total computation, the gridding step, the MPI communication, and the FFT versus the number of processors on a log-log scale.

First, we notice that the cost of the algorithm is dominated by the gridding step. As the number of processors doubles, the total computation time and the gridding time are halved, which demonstrates the strong scaling of our implementation. The MPI communication time, though a minor part of the cost, also scales strongly. The FFT does not scale strongly because, after the domain is divided into pencil-shaped arrays, the arrays are too small for P3DFFT to show strong scaling. As a remark, we believe that our implementation would continue to scale strongly if more processors could be requested (the maximum normal queue size on Stampede is 4096). It is worth mentioning that, if the same input data were run on a single processor, setting aside whether the data would even fit into memory, it would take almost 2 days, as compared to 40 seconds on 4096 processors.

For weak scaling, we keep the workload of each processor the same. With our current implementation, we can only compare input data that differ in size by a factor of 8. We compare 512 vs 4096 processors with sources per processor, and 256 vs 2048 processors with sources per processor. Figure 5b shows that the time per processor required for each task is almost the same for 512 vs 4096 processors and for 256 vs 2048 processors, which demonstrates the weak scaling of our implementation.

6 Conclusion

In this project, we present a parallel implementation of the 3D NUFFT algorithm on distributed-memory systems. Our implementation features a 2D domain decomposition approach and is able to scale both weakly and strongly on a large distributed-memory system (e.g., Stampede). Future work includes optimizing memory access and data storage, implementing the type-2 and type-3 transforms, and comparing to a GPU implementation.

Figure 5: (a) Strong scaling of our parallel implementation of 3D NUFFT. (b) Weak scaling of our parallel implementation of 3D NUFFT.

References

[1] A. Dutt and V. Rokhlin. Fast Fourier transforms for nonequispaced data. SIAM J. Sci. Comput., 14(6).
[2] Leslie Greengard and June-Yub Lee. Accelerating the nonuniform fast Fourier transform. SIAM Rev., 46(3).
[3] June-Yub Lee and Leslie Greengard. The type 3 nonuniform FFT and its applications. J. Comput. Phys., 206(1):1–5.
[4] Dmitry Pekurovsky. P3DFFT: a framework for parallel computations of Fourier transforms in three dimensions. SIAM J. Sci. Comput., 34(4):C192–C209.
[5] Michael Pippig and Daniel Potts. Parallel three-dimensional nonequispaced fast Fourier transforms and their application to particle simulation. SIAM J. Sci. Comput., 35(4):C411–C437.
