Fast Multipole Method on GPU: Tackling 3-D Capacitance Extraction on Massively Parallel SIMD Platforms


Xueqian Zhao
Department of ECE
Michigan Technological University
Houghton, MI 49931

Zhuo Feng
Department of ECE
Michigan Technological University
Houghton, MI 49931
zhuofeng@mtu.edu

ABSTRACT

To facilitate full-chip capacitance extraction, field solvers are typically deployed for characterizing capacitance libraries for various interconnect structures and configurations. In the past decades, various algorithms for accelerating boundary element methods (BEM) have been developed to improve the efficiency of field solvers for capacitance extraction. This paper presents the first massively parallel capacitance extraction algorithm, FMMGpu, which accelerates the well-known fast multipole method (FMM) on modern Graphics Processing Units (GPUs). We propose GPU-friendly data structures and SIMD parallel algorithm flows to facilitate FMM-based 3-D capacitance extraction on GPU. Effective GPU performance modeling methods are also proposed to properly balance the workload of each critical kernel in our FMMGpu implementation, by taking advantage of the latest Fermi GPU's concurrent kernel executions on streaming multiprocessors (SMs). Our experimental results show that FMMGpu brings 22X to 30X speedups in capacitance extraction for various test cases. We also show that even for small test cases that may not well utilize the GPU's hardware resources, the proposed cube clustering and workload balancing techniques can bring 20% to 60% extra performance improvements.

Categories and Subject Descriptors: B.7.2 [Integrated Circuits]: Design Aids - Simulation

General Terms: Algorithms, Design

Keywords: Capacitance extraction, parallel fast multipole method, GPU

1. INTRODUCTION

Nowadays, high-performance integrated circuit (IC) designs require efficient and accurate extraction of interconnect parasitics, including resistance, inductance, and capacitance, which can be further integrated with other circuit components for SPICE simulations. Since on-chip capacitance significantly impacts circuit performance, such as operating speed and functionality, developing efficient algorithms for large-scale capacitance extraction problems becomes increasingly critical considering the aggressive semiconductor technology scaling. A variety of fast capacitance extraction approaches have been proposed in the past few decades, usually falling into two categories: boundary element methods (BEM) [1-4] and floating random walk (FRW) algorithms [5, 6], among which BEM methods are suitable for extracting coupling capacitance while FRW methods are typically adopted for calculating self-capacitance. The dramatic evolution of present-day multi-/many-core processors [7] brings huge opportunities for accelerating the computation-intensive capacitance extraction tasks.
However, little progress has been made in the past few years in leveraging such parallel computing platforms, considering the grand challenges brought by very complicated interconnect structures as well as parallel algorithm complexity. For instance, in [4], an FMM-based capacitance extraction program merely achieves up to 3X speedups on a quad-core CPU over the serial extraction program. On the other hand, general-purpose computing on graphics processors (GPUs) has recently become a very popular research area, which provides the desired energy-efficient parallel computing and, if suitably developed, attains much higher computing throughput than existing multi-core CPUs. The latest Fermi GPUs [7] integrate more than 500 cores into a single chip and deliver greater than one TFlops (10^12 floating-point operations per second) of peak computing performance. It is therefore quite desirable to leverage the latest GPU's computing power by developing energy-efficient GPU-based algorithms for solving large-scale capacitance extraction problems. In recent years, a few research projects have focused on developing parallel FMM algorithms on GPUs [8, 9]. However, existing GPU-based FMM methods merely exploit parallel computation of the multipole and local expansion coefficients on-the-fly during each sparse matrix-vector multiplication (SPMV), without optimizing the GPU data structures or algorithm flows according to the latest GPU hardware properties, and thus cannot be directly deployed for efficient GPU-based 3-D capacitance extraction. In this work, for the first time, we propose GPU-friendly data structures and FMM algorithm flows that effectively minimize the CPU-GPU data transfer cost for the key FMM kernel functions, including the preconditioning, direct summation, upward/downward, and evaluation passes. We also show simple yet effective GPU performance modeling and workload balancing techniques for properly distributing FMM's computing tasks among the GPU's streaming multiprocessors (SMs), which allows more efficient use of the latest GPU's concurrent kernel executions in capacitance extraction tasks.

2. BACKGROUND

2.1 Capacitance Extraction Problem

Consider a system containing m ideal conductors embedded in a homogeneous dielectric medium. The surfaces or edges of all conductors are broken into small panels or tiles. It is also assumed that on each panel i the charge q_i is uniformly distributed. The potential of each evaluation panel (receiver) can be obtained by summing up the potential contributions from all other panel charges (unknowns) using specific Green's functions. The final capacitance values of the m conductors can be summarized by an m x m capacitance matrix C, where the diagonal entry C_ii represents the self-capacitance of conductor i, and the off-diagonal entry C_ij represents the coupling capacitance between conductors i and j. The j-th column of C can be computed by finding the charge distributions on all conductors when the j-th conductor is raised to unit potential and the others are grounded. In fact, the charge distribution on the conductor surfaces can be solved using the first-kind integral equation:

    \psi(x) = \int_{\text{surfaces}} G(x, x') \, \sigma(x') \, da',    (1)

where x, x' \in R^3 denote the receiver and source locations, \sigma denotes the surface charge density, da' is the incremental surface area, \psi is the surface potential (which is known), and G(x, x') = 1/\|x - x'\| denotes the Green's function in free space [10]. Assume that the m conductors are discretized into a total of n panels. Then the potential at each evaluation panel k (receiver) can be computed by

    p_k = \sum_{i=1}^{n} \int_{\text{panel } i} \frac{\sigma_i(x')}{\|x_k - x'\|} \, da',    (2)

where x_k is the center of evaluation panel k, x' is a position on the surface of panel i, p_k is the potential at the center of evaluation panel k, and \sigma_i(x') is the surface charge density on panel i. Applying the above formula to all n panels, Eqn. (2) can be expressed as the linear system

    \Phi q = p,    (3)

where \Phi \in R^{n x n}, while q, p \in R^n are the charge and potential vectors. Since the charges are uniformly distributed on every panel, the entries of \Phi can be obtained by \Phi_{(k,i)} = \int_{\text{panel } i} 1/\|x_k - x'\| \, da', where k denotes the index of the evaluation panel and i denotes the index of the source panel. When solving for C_ij, the panel potentials of conductor j are raised to 1 while the panel potentials of all other conductors are set to 0. Subsequently, the unknown charge vector q can be obtained by solving Eqn. (3), and the ij-th element C_ij of the capacitance matrix C is obtained by summing up all the panel charges on the i-th conductor [1]:

    C_ij = \sum_{k \in \text{conductor } i} q_k.    (4)
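To make Eqns. (3)-(4) concrete, the following host-side sketch assembles one column of the capacitance matrix. It is our illustration, not the paper's code: all names (NPANELS, panel_of, solve_phi) are hypothetical, and a toy Gaussian elimination stands in for the FMM-accelerated GMRES solve actually used in this work.

```cuda
/* Sketch: computing the j-th column of C per Eqns. (3)-(4).
 * Illustrative only; a real extractor replaces solve_phi with an
 * FMM-accelerated GMRES solve. */
#define NPANELS 4
#define NCOND   2

static const int panel_of[NPANELS] = {0, 0, 1, 1}; /* panel -> conductor */

/* Toy dense solve of Phi*q = p by Gaussian elimination (no pivoting;
 * assumes a well-conditioned, nonsingular system). */
static void solve_phi(double Phi[NPANELS][NPANELS], double p[NPANELS],
                      double q[NPANELS]) {
    double A[NPANELS][NPANELS + 1];
    for (int r = 0; r < NPANELS; r++) {
        for (int c = 0; c < NPANELS; c++) A[r][c] = Phi[r][c];
        A[r][NPANELS] = p[r];             /* augmented column = p */
    }
    for (int k = 0; k < NPANELS; k++)     /* forward elimination  */
        for (int r = k + 1; r < NPANELS; r++) {
            double f = A[r][k] / A[k][k];
            for (int c = k; c <= NPANELS; c++) A[r][c] -= f * A[k][c];
        }
    for (int r = NPANELS - 1; r >= 0; r--) {  /* back substitution */
        q[r] = A[r][NPANELS];
        for (int c = r + 1; c < NPANELS; c++) q[r] -= A[r][c] * q[c];
        q[r] /= A[r][r];
    }
}

/* Raise conductor j to unit potential, ground the rest, solve for the
 * panel charges, then sum charges per conductor (Eqn. (4)). */
void capacitance_column(double Phi[NPANELS][NPANELS], int j,
                        double Ccol[NCOND]) {
    double p[NPANELS], q[NPANELS];
    for (int k = 0; k < NPANELS; k++)
        p[k] = (panel_of[k] == j) ? 1.0 : 0.0;
    solve_phi(Phi, p, q);
    for (int i = 0; i < NCOND; i++) Ccol[i] = 0.0;
    for (int k = 0; k < NPANELS; k++) Ccol[panel_of[k]] += q[k];
}
```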
2.2 The Fast Multipole Algorithm (FMM)

There are many existing techniques proposed to numerically approximate the matrix-vector multiplication \Phi q in Eqn. (3), including the FastCap algorithm [1, 10] based on the fast multipole method [11], the hierarchical capacitance extraction algorithm [3] based on Appel's algorithm [12], and the precorrected-FFT method [2], which projects panel charges onto regular 3-D grid points and evaluates the interactions of distant charges through FFT computations.

The fast multipole method (FMM) has been widely used for solving general N-body problems [11], and can be applied to compute panel k's total potential:

    p_k = \sum_{x_i \notin \Omega(x_k)} \Phi_{(k,i)} q_i + \sum_{x_i \in \Omega(x_k)} \Phi_{(k,i)} q_i,   x_i, x_k \in R^d,    (5)

where {x_i} denote the centers of source panels, {x_k} denote the centers of evaluation panels, d is the dimensionality of the problem, and \Omega(x_k) is some neighborhood of evaluation panel k. In Eqn. (5), the latter term, with x_i \in \Omega(x_k), is computed directly. All other FMM steps are dedicated to approximating the former term, with x_i \notin \Omega(x_k), using multipole and local expansions, according to a user-supplied error tolerance \epsilon.

2.3 Nvidia Fermi GPU

The latest Nvidia Fermi GPU architecture provides more opportunities and flexibility for general computation-intensive tasks [7]. Compared with previous GPU models, Fermi increases the number of streaming processors (SPs) in each streaming multiprocessor (SM) from 8 to 32, resulting in a total of 512 streaming processors on a single GPU. More importantly, the new GPU model also supports high-performance double-precision computing and concurrent kernel executions. On Fermi GPUs, up to 16 kernels can be launched concurrently on the 16 SMs [7], whereas in previous GPU architectures only one kernel can be launched on the GPU at a time.

Figure 1: Concurrent kernel executions on Fermi GPU vs. serial kernel executions on previous GPUs.

As shown in Fig. 1, running computing tasks concurrently is typically more efficient than running them in series, especially when a single task cannot fully occupy all the SMs of a GPU. In this work, we propose a simple workload balancing method that effectively assigns computing tasks among the streaming multiprocessors (SMs) and takes advantage of Fermi GPU's concurrent kernel executions.
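As a concrete illustration of this hardware feature, the following minimal CUDA sketch (ours, not from the paper; dummy_pass is a stand-in kernel, not one of the FMM kernels) launches several independent kernels into separate streams, which Fermi-class hardware may overlap when no single kernel fills the machine.

```cuda
/* Minimal sketch: concurrent kernel launches via CUDA streams. */
#include <cuda_runtime.h>

__global__ void dummy_pass(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;   /* stand-in for real FMM work */
}

int main(void) {
    const int nKernels = 4, n = 1 << 16;
    cudaStream_t s[nKernels];
    float *d[nKernels];
    for (int k = 0; k < nKernels; k++) {
        cudaStreamCreate(&s[k]);
        cudaMalloc(&d[k], n * sizeof(float));
        /* Kernels launched into distinct streams may share the GPU's
         * SMs on Fermi; on older parts they execute serially. */
        dummy_pass<<<n / 256, 256, 0, s[k]>>>(d[k], n);
    }
    for (int k = 0; k < nKernels; k++) {
        cudaStreamSynchronize(s[k]);
        cudaFree(d[k]);
        cudaStreamDestroy(s[k]);
    }
    return 0;
}
```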

3. FAST MULTIPOLE METHOD ON GPU

3.1 FMM Computation Decomposition

The FMM flow consists of five key steps; the computations associated with each step are summarized as follows.

The preconditioning pass requires solving for the panel charges by directly inverting the approximate block-diagonal potential matrix \Phi of Eqn. (3), obtained by including the panels that belong to the overlapped neighboring cubes. In this step, the potential matrix inverses are computed in advance, and subsequently many small dense matrix-vector multiplications are performed. It should be noted that the dense matrix sizes are similar in this step, and the computation time is roughly 10% of the overall FMM runtime.

The direct pass directly sums up all the potentials contributed by the source panel charges located in the self and neighboring cubes. Therefore, similarly to the preconditioning pass, banded sparse matrix-vector multiplications are needed. The computing time spent in this step is also around 10% of the total FMM runtime.

The upward pass generates the multipole expansions for the finest-level cubes and converts them to the expansions for the coarser to coarsest level cubes. Computations involved in this step are much less expensive when compared with the cost of the other FMM steps. This step can be parallelized in a level-by-level manner.

The downward and evaluation passes involve the most expensive and time-consuming computations in the FMM flow, usually taking more than 70% of the total runtime. The potential contributions from the cube panels that are not included in the self and neighboring cubes are evaluated in these steps. The major computations can again be done using small dense matrix-vector multiplications, but the dense matrix sizes for each evaluation cube can be quite different, depending on the orders of the multipole expansions to be used.

3.2 Coefficient Matrices in the FMM Algorithm

In the capacitance extraction problem, conductor surfaces are first discretized into many small cubes, and each cube is further decomposed into several panels that hold uniformly distributed charges. As described in [10], in a typical FMM method, a coefficient matrix \Phi^{(k,j)} \in R^{n x m} can be used to compute the panel potentials of the evaluation cube k (receivers) contributed by the charges or expansion sources in the source cube j, where n is the number of panels in evaluation cube k, and m is the number of panel charges or expansions associated with source cube j. Such coefficient matrices fall into several categories:

QP matrix (\Phi_QP): projects charges to potentials;
QM matrix (\Phi_QM): projects charges to multipole expansions;
MM matrix (\Phi_MM): converts the finest-level multipole expansions to coarser- and coarsest-level multipole expansions;
LL matrix (\Phi_LL): translates the coarsest local expansions to finer- and finest-level local expansions;
ML matrix (\Phi_ML): projects multipole expansions to local expansions;
MP matrix (\Phi_MP): projects multipole expansions to potentials;
LP matrix (\Phi_LP): projects local expansions to potentials.

In the above QP, MP, and LP matrices, the number of rows of each matrix equals the number of panels in the evaluation (receiver) cube, while the number of columns depends on the order of the multipole and local expansions, as well as the number of charges associated with the source cube.
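Since the preconditioning and direct passes of Section 3.1 boil down to many small, similarly sized dense matrix-vector products, one natural GPU mapping is sketched below. This is our illustration under assumed array layouts, not the paper's exact kernel: one thread block handles one cube, and each thread accumulates one output row.

```cuda
/* Sketch: batched small dense matrix-vector products, one cube per
 * thread block, one output row per thread. Assumed row-major layout. */
__global__ void block_gemv(const double *A,  /* [ncubes][rows*cols] */
                           const double *x,  /* [ncubes][cols]      */
                           double       *y,  /* [ncubes][rows]      */
                           int rows, int cols) {
    int cube = blockIdx.x;                /* one cube per block  */
    int r    = threadIdx.x;               /* one row per thread  */
    if (r >= rows) return;
    const double *Ac = A + (size_t)cube * rows * cols;
    const double *xc = x + (size_t)cube * cols;
    double acc = 0.0;
    for (int c = 0; c < cols; c++)
        acc += Ac[r * cols + c] * xc[c];
    y[(size_t)cube * rows + r] = acc;
}
/* Launch (assuming rows fits in one thread block):
 *   block_gemv<<<ncubes, rows>>>(dA, dx, dy, rows, cols); */
```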
3.3 GPU-Friendly Data Structure

When the FMM algorithm is applied to capacitance extraction problems, the order of the multipole expansions may vary according to the desired accuracy level. It has been shown that second-order multipole expansions are sufficient in FMM for achieving the desired accuracy levels. Since the typical coefficient matrices in FMM are very small and their numbers of columns may vary significantly, processing these small coefficient matrices on GPU may not be efficient, considering the GPU's single-instruction-multiple-data (SIMD) computing scheme and relatively large device memory access latencies. In this work, we propose a GPU-friendly data structure for efficiently storing and processing the coefficient matrices on GPU, specifically for capacitance extraction problems.

In reality, for a specific evaluation cube, there can be many source cubes that contribute to the total potentials of its panels. Since all the coefficient matrices \Phi^{(k,j)} (j = 1, ..., t) associated with the evaluation cube k always have the same row dimension (the number of receivers), we can pack all source-cube-related coefficient matrices together to form a larger coefficient matrix \Phi_k = [\Phi^{(k,1)} ... \Phi^{(k,t)}] by appending all those small coefficient matrices \Phi^{(k,j)} along the column dimension, where t is the number of source cubes that contribute to the total potentials of the panels in evaluation cube k.

Figure 2: Hierarchical coefficient/index matrix compositions for GPU-friendly FMM computations.

Fig. 2 demonstrates how to combine all the FMM coefficient matrices to form a larger one, where p_k \in R^n denotes the potential vector of the evaluation cube k, \Phi_QP^{(k,i)} and \Phi_MP^{(k,j)} denote the QP and MP coefficient matrices for the evaluation cube k and source cubes i and j, q_i \in R^m is the charge vector of source cube i, and m_j \in R^s is the multipole expansion vector of cube j. After combining the above coefficient matrices into a larger coefficient matrix, we get the resultant coefficient matrix \Phi_mix \in R^{U x V}, where U is the total number of evaluation cubes and V is the maximum column dimension of \Phi_k. We also form a global source cube vector Vec_global that includes all panel charges, multipole and local expansions for all coarsest to finest levels. Meanwhile, an index matrix I_mix is proposed to locate the source-cube-related charges and expansion coefficients stored in the global vector Vec_global.
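A minimal sketch of this packed data structure follows, with assumed field names (ours, not the paper's identifiers): Phi_mix and I_mix are stored as dense U-by-V arrays, padded where necessary, while Vec_global concatenates all panel charges and multipole/local expansion coefficients.

```cuda
/* Sketch of the packed FMM data of Section 3.3 (assumed names). */
typedef struct {
    int     U;           /* rows of the packed matrix (evaluation side) */
    int     V;           /* max packed column dimension across cubes    */
    double *Phi_mix;     /* [U*V] packed coefficient entries (padded)   */
    int    *I_mix;       /* [U*V] per-entry index into Vec_global       */
    double *Vec_global;  /* charges + multipole/local expansions,       */
                         /* coarsest to finest levels, concatenated     */
    int     len_global;  /* length of Vec_global                        */
} FmmGpuPack;
```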

Figure 3: Comparison of non-zero entries of the coefficient matrices before/after cube sorting.

3.4 Accelerating FMM on GPU

As discussed in Section 3.1, the five types of FMM computations are all related to dense coefficient matrix-vector multiplications, so once the coefficient matrices are stored on GPU, all steps can be performed in very similar manners. For the preconditioning and direct passes, the coefficient matrices are small and of similar sizes. On the contrary, for the downward and evaluation passes, the coefficient matrix sizes can be quite different for different cubes. We describe in detail how to accelerate the evaluation pass on GPU; the preconditioning, direct, upward, and downward passes can be handled in similar ways. The total potential of each evaluation panel can be obtained by summing up all the potential contributions from all the source cubes (including the charge contributions and expansion source contributions) in a very efficient way using the GPU's hundreds of streaming processors (SPs). As shown in Fig. 2, we define an element-wise operator to achieve this goal on GPU. The element-wise operation for the evaluation pass can be performed in the following steps (a kernel sketch follows this list):

1. Load panel charges and multipole or local expansions from Vec_global according to their corresponding indices stored in I_mix (as shown in Fig. 2);
2. Multiply the loaded charges or expansions with their corresponding coefficient matrix elements stored in \Phi_mix;
3. Sum up the multiplication results for each row and store the final result into the panel potential vector p.
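The following CUDA sketch implements the three-step element-wise operator under the assumed row-major layout of the FmmGpuPack structure above (ours, not the paper's kernel):

```cuda
/* Sketch: element-wise evaluation-pass operator. Each thread owns one
 * row of Phi_mix, gathers sources from Vec_global through I_mix,
 * multiplies, and row-sums into the panel potential vector p. */
__global__ void evaluation_pass(const double *Phi_mix,
                                const int    *I_mix,
                                const double *Vec_global,
                                double *p, int U, int V) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= U) return;
    double pot = 0.0;
    for (int c = 0; c < V; c++) {
        double coef = Phi_mix[(size_t)row * V + c]; /* 0 for padding */
        int    src  = I_mix [(size_t)row * V + c];
        pot += coef * Vec_global[src];              /* steps 1 and 2 */
    }
    p[row] = pot;                                   /* step 3        */
}
```

Note that a tuned kernel would more likely store Phi_mix and I_mix column-major so that consecutive threads touch consecutive addresses (coalesced accesses); the row-major layout here is only for readability.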
3.5 Cube Clustering

We want to emphasize that all the FMM coefficient matrices, as well as the index matrices, need to be generated and transferred to GPU only once, before the capacitance extraction algorithm starts. Subsequently, for each GMRES iteration using FMM, only two GPU-CPU data transfers, of the potential vector p and the charge vector q, are required. Since the sizes of these vectors are fairly small (equal to the number of panels), the total data communication time between CPU and GPU is negligible compared with the overall runtime of the GPU-based FMM computations.

As mentioned in Section 3.3, in realistic capacitance extraction problems, different evaluation cubes may be influenced by different numbers of source cubes or expansion sources, which may lead to drastically different column dimensions of the coefficient matrices \Phi_k. As an example, Fig. 3 shows the numbers of non-zero coefficients for all the panels obtained from a realistic capacitance extraction problem, where the black area reflects the numbers of non-zero coefficients in the coefficient matrix. To avoid GPU thread branching and inefficient GPU memory access patterns, we can fill dummy zero elements into the coefficient matrix \Phi_mix, as well as the index matrix I_mix (Fig. 2). However, this results in lower memory and computation efficiencies, especially when the column dimension of the coefficient matrix varies dramatically from one panel to another.

To further improve the memory efficiency, minimize the number of dummy coefficients in these coefficient matrices, and achieve better workload balancing during the GPU's parallel computing, a simple yet effective cube clustering technique is adopted to decompose the original coefficient matrix into several smaller coefficient matrices (of cube clusters). During the coefficient matrix decomposition (cube clustering) step, each of the new matrix clusters should maintain a sufficiently large row dimension to fully occupy at least one of the GPU's streaming multiprocessors (SMs). The cube clustering step can be done by putting the cubes whose \Phi_k matrices have similar column dimensions into the same cluster. The resultant clusters again form new coefficient matrices whose dummy elements are significantly reduced compared to the original coefficient matrix (as shown in Fig. 4).

Figure 4: Concurrent kernel executions for cube clusters.

Although both memory occupancy and GPU runtime performance can be further improved by using the cube clustering technique, for very small test cases the coefficient matrices after clustering can still be small and may not fully utilize all the computing resources of the streaming multiprocessors (SMs) on GPU if processed one after another. To gain higher GPU computing efficiency, in the following section we propose an efficient workload balancing method for concurrent kernel executions on the latest Fermi GPUs, which allows processing more than one cube cluster on GPU concurrently (as shown in Fig. 4).
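The clustering step can be sketched as follows (host code, ours; the knobs min_rows_per_cluster and max_pad_ratio are illustrative heuristics, not parameters from the paper): sort cubes by the column dimension of their packed matrix \Phi_k, then cut the sorted list into clusters, closing a cluster once it has enough rows to occupy at least one SM and the column dimension has grown past a padding-waste threshold.

```cuda
/* Sketch: cluster cubes by packed-matrix column dimension. */
#include <stdlib.h>

typedef struct { int cube_id; int cols; int rows; } CubeInfo;

static int by_cols(const void *a, const void *b) {
    return ((const CubeInfo *)a)->cols - ((const CubeInfo *)b)->cols;
}

/* cluster_of[i] receives the cluster index of the i-th cube in sorted
 * order; returns the number of clusters produced. Assumes cols >= 1. */
int cluster_cubes(CubeInfo *cubes, int n, int min_rows_per_cluster,
                  double max_pad_ratio, int *cluster_of) {
    qsort(cubes, n, sizeof(CubeInfo), by_cols);
    int cl = 0, rows = 0, min_cols = cubes[0].cols;
    for (int i = 0; i < n; i++) {
        double pad = (double)cubes[i].cols / min_cols;
        if (rows >= min_rows_per_cluster && pad > max_pad_ratio) {
            cl++; rows = 0; min_cols = cubes[i].cols;  /* new cluster */
        }
        cluster_of[i] = cl;
        rows += cubes[i].rows;
    }
    return cl + 1;
}
```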

3.6 Workload Balancing on GPU

As the problem size increases, more GPU computation time is needed for a kernel.

Figure 5: Runtime scalability for processing a cube cluster on multiple SMs during the downward pass.

As shown in Fig. 5, the blue circles sitting on the solid line denote the ideal linear-speedup runtimes (parallel efficiency of 1.0) when using different numbers of streaming multiprocessors, while the red dotted curve denotes the measured runtime results for the FMM downward pass kernel, which exhibits super-linear speedup when using fewer than four SMs (parallel efficiency greater than 1.0). Although at first glance it is hard to imagine such super-linear speedup, since the runtime should decrease at most linearly with the number of parallel processors, the observed high parallel efficiency can actually be achieved by carefully utilizing the on-chip cache memory and processing resources of the parallel computing platform. However, when more and more SMs are used for executing the kernel, the parallel efficiency may go down quickly, since some of the SMs can be idle due to insufficient computation tasks. From Fig. 5, we observe that when more than four SMs are assigned to execute the FMM downward pass kernel, the GPU runtime starts to saturate and the parallel efficiency can drop drastically.

In order to achieve optimal parallel computing efficiency using the latest Fermi GPU's concurrent kernel executions, we first characterize the optimal SM assignments for typical cluster sizes by running a group of tests. The optimal number of SMs (the super-linear speedup region illustrated in Fig. 5) to be used for processing a specific cube cluster can then be easily obtained. Next, in each of the following FMM procedures, the final SM assignment can be determined based on the actual coefficient matrix cluster sizes and the previously characterized optimal numbers of SMs:

    N_i = N_sm * S_i / \sum_{i=1}^{p} S_i,    (6)

where N_i denotes the number of SMs finally assigned to execute kernel i (for a cube cluster), N_sm denotes the total number of SMs available on the GPU, S_i denotes the measured optimal number of SMs for running kernel i, and p denotes the total number of kernels (cube clusters) to be executed at the same time. The above simple performance modeling and workload balancing method significantly improves FMMGpu's throughput, especially for small test cases (improvements of 20% to 60% have been observed), since the serial processing of these small cube clusters may not fully occupy all the SMs on GPU. However, for very large test cases, whose coefficient matrices (of a cube cluster) are large enough to occupy all the SMs, the above workload balancing is not necessary.

The proposed GPU-based capacitance extraction algorithm FMMGpu is summarized in Algorithm 1, where K is the user-supplied maximum number of GMRES iterations, r_o^(k) is the residual in the k-th GMRES iteration, and tol is the error tolerance set by the user.

Algorithm 1: FMMGpu Algorithm Flow
1: Pack all the coefficient matrices for evaluation cube i (receiver) into a single matrix \Phi_i, for i = 1 to n.
2: Order the cubes i (i = 1, ..., n) based on the column dimensions of the coefficient matrices \Phi_i.
3: Cluster \Phi_i (i = 1, ..., n) into several new coefficient matrices and generate the corresponding index matrices.
4: Transfer the coefficient matrices and index matrices to the GPU's global memory.
5: Characterize the optimal numbers of SMs for workload balancing by running a few small test cases.
6: Start the following FMMGpu computation and GMRES iterations:
7: for (k = 1; k < K; k++) do
8:   Transfer the panel charge vector q to GPU memory.
9:   Execute the Preconditioning pass on GPU and store the updated charges into Vec_global.
10:  Execute the Direct pass on GPU and store the direct potential contributions into p.
11:  Execute the Upward pass on GPU and store the multipole expansions into Vec_global.
12:  Execute the Downward pass on GPU with workload balancing and concurrent kernel executions; then store the local expansions into Vec_global.
13:  Execute the Evaluation pass on GPU with workload balancing and concurrent kernel executions; then sum the potentials on the evaluation panels into p.
14:  Transfer the potential vector p back to the CPU for the k-th GMRES iteration.
15:  Compute the residual r_o^(k) after this GMRES iteration and check convergence:
16:  if ||r_o^(k)|| < tol then
17:    Exit the loop and compute the capacitance matrix values.
18:  end if
19: end for
return all the computed capacitance matrix elements.
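A minimal host-side sketch of the SM assignment of Eqn. (6) follows (ours; rounding and leftover-SM redistribution are handled only crudely here):

```cuda
/* Sketch: distribute N_sm available SMs across p concurrently launched
 * cluster kernels in proportion to their characterized optimal SM
 * counts S[i], per Eqn. (6). */
void assign_sms(const int *S, int p, int N_sm, int *N_assigned) {
    int total = 0;
    for (int i = 0; i < p; i++) total += S[i];
    for (int i = 0; i < p; i++) {
        N_assigned[i] = (N_sm * S[i]) / total;    /* Eqn. (6), floored */
        if (N_assigned[i] < 1) N_assigned[i] = 1; /* every kernel runs */
    }
    /* Flooring/clamping may leave a few SMs unassigned; a production
     * implementation would redistribute the remainder. */
}
```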
4. EXPERIMENT RESULTS

Extensive experiments have been conducted to validate the proposed GPU-based fast multipole method for capacitance extraction. A set of bus-crossing test cases is created in our experiments; details are shown in Table 1. For all the test cases, the edge-to-inner-panel width ratio is fixed, and the multipole expansion order is set to 2. The proposed method is implemented in C and the GPU programming language CUDA [13]. All experiments are performed on a 64-bit Ubuntu system with a 2.66GHz quad-core CPU, 6GB of DRAM, and one Nvidia GeForce GTX480 GPU with 1.5GB of device memory.

Table 1: Test case details. c denotes the number of crossing conductors (5 x 5 means 10 conductors), n represents the number of panels per wire width, and N represents the total number of panels (test1 through test6).

Table 2 lists the CPU and GPU runtime results for the five key steps of the FMM algorithm, as well as the total runtime of a complete FMM iteration. Since the coefficient matrices associated with the Preconditioning and Direct passes have similar dimensions and do not require coefficient matrix decompositions, the clustering technique is only applied to the Downward and Evaluation passes. From Table 2, we observe speedups of up to 45X for the Preconditioning pass, up to 45X for the Direct pass, smaller speedups for the inexpensive Upward pass (a non-critical kernel), up to 33X for the Downward pass, up to 30X for the Evaluation pass, and 18X to 30X for the overall FMM iteration.

Furthermore, Table 3 shows the results of complete capacitance extraction runs, where the original CPU-based FMM algorithm and our FMMGpu algorithm have been run for all test cases. As observed, both algorithms converge in the same number of GMRES iterations. When including the CPU-based computations, such as the calculation of the GMRES residuals and the one-step Arnoldi process on CPU, the total runtime speedups of capacitance extraction on GPU are slightly smaller than the speedup numbers obtained for the single-FMM-iteration runs shown in Table 2.

Table 2: Runtime results of key FMM steps on CPU and GPU after cube clustering (Preconditioning, Direct, Upward, Downward, Evaluation, and Total times in ms, with per-test speedups).

Table 3: Capacitance extraction results on CPU and GPU (clustering-based). N_i is the number of GMRES iterations for a given error tolerance.

Figure 6: FMMGpu runtime w/ and w/o using workload balancing and concurrent kernel executions.

Finally, Fig. 6 shows the GPU runtime and speedup results of a single FMM iteration (SPMV) with and without the proposed workload balancing and concurrent kernel execution schemes. We obtain 18X to 30X speedups without concurrent kernel executions, and 22X to 30X speedups with concurrent kernel executions. It should be noted that running FMM on a multi-core CPU may bring very limited performance improvement (3X speedups are reported on a quad-core machine [4]), while our FMMGpu capacitance extraction easily brings more than 20X speedups for all test cases, achieving much higher runtime and energy efficiencies.

5. CONCLUSIONS

In this paper, we present a GPU-accelerated fast multipole algorithm, FMMGpu, for fast parallel 3-D capacitance extraction. As shown in extensive experiments, the proposed GPU-friendly FMMGpu algorithm flows and data structures allow highly efficient massively parallel computing on GPU. We obtain up to 30X speedups by running the capacitance extraction program on GPU compared with CPU-based serial executions. A simple yet effective workload balancing method is also proposed to facilitate concurrent kernel executions on the latest Fermi GPUs, which further improves the parallel FMM computing efficiency by 20% to 60% for a set of small test cases.

6. REFERENCES

[1] K. Nabors and J. White. FastCap: A multipole accelerated 3-D capacitance extraction program. IEEE Trans. on Computer-Aided Design, 10(11):1447-1459, Nov. 1991.
[2] J. Phillips and J. White. A precorrected-FFT method for electrostatic analysis of complicated 3-D structures. IEEE Trans. on Computer-Aided Design, 16(10):1059-1072, Oct. 1997.
[3] W. Shi, J. Liu, N. Kakani, and T. Yu. A fast hierarchical algorithm for 3-D capacitance extraction. In IEEE/ACM DAC, pages 212-217, June 1998.
[4] F. Gong, H. Yu, and L. He. PiCAP: A parallel and incremental capacitance extraction considering stochastic process variation. In IEEE/ACM DAC, Jul. 2009.
[5] R. Iverson and Y. Le Coz. A stochastic algorithm for high speed capacitance extraction in integrated circuits. Solid-State Electronics, 35(7), 1992.
[6] T. El-Moselhy, I. Elfadel, and L. Daniel. A hierarchical floating random walk algorithm for fabric-aware 3D capacitance extraction. In IEEE/ACM ICCAD, 2009.
[7] NVIDIA Corporation. Fermi compute architecture white paper. [Online].
[8] N. Gumerov and R. Duraiswami. Fast multipole methods on graphics processors. J. Comput. Phys., 227(18):8290-8313, 2008.
[9] T. Hamada, T. Narumi, R. Yokota, K. Yasuoka, K. Nitadori, and M. Taiji. 42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence. In SC'09, 2009.
[10] K. Nabors, S. Kim, and J. White. Fast capacitance extraction of general three-dimensional structures. IEEE Trans. on Microwave Theory and Techniques, 40(7):1496-1506, Jul. 1992.
[11] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comput. Phys., 73(2):325-348, 1987.
[12] A. Appel. An efficient program for many-body simulation. SIAM Journal on Scientific and Statistical Computing, 6(1):85-103, 1985.
[13] NVIDIA Corporation. NVIDIA CUDA C programming guide. [Online].


More information

Analysis and Visualization Algorithms in VMD

Analysis and Visualization Algorithms in VMD 1 Analysis and Visualization Algorithms in VMD David Hardy Research/~dhardy/ NAIS: State-of-the-Art Algorithms for Molecular Dynamics (Presenting the work of John Stone.) VMD Visual Molecular Dynamics

More information

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

Di Zhao Ohio State University MVAPICH User Group (MUG) Meeting, August , Columbus Ohio

Di Zhao Ohio State University MVAPICH User Group (MUG) Meeting, August , Columbus Ohio Di Zhao zhao.1029@osu.edu Ohio State University MVAPICH User Group (MUG) Meeting, August 26-27 2013, Columbus Ohio Nvidia Kepler K20X Intel Xeon Phi 7120 Launch Date November 2012 Q2 2013 Processor Per-processor

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Dense matching GPU implementation

Dense matching GPU implementation Dense matching GPU implementation Author: Hailong Fu. Supervisor: Prof. Dr.-Ing. Norbert Haala, Dipl. -Ing. Mathias Rothermel. Universität Stuttgart 1. Introduction Correspondence problem is an important

More information

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Transactions on Modelling and Simulation vol 20, 1998 WIT Press, ISSN X

Transactions on Modelling and Simulation vol 20, 1998 WIT Press,   ISSN X Parallel indirect multipole BEM analysis of Stokes flow in a multiply connected domain M.S. Ingber*, A.A. Mammoli* & J.S. Warsa* "Department of Mechanical Engineering, University of New Mexico, Albuquerque,

More information

Terascale on the desktop: Fast Multipole Methods on Graphical Processors

Terascale on the desktop: Fast Multipole Methods on Graphical Processors Terascale on the desktop: Fast Multipole Methods on Graphical Processors Nail A. Gumerov Fantalgo, LLC Institute for Advanced Computer Studies University of Maryland (joint work with Ramani Duraiswami)

More information

Computational Science and Engineering (Int. Master s Program)

Computational Science and Engineering (Int. Master s Program) Computational Science and Engineering (Int. Master s Program) Technische Universität München Master s Thesis A GPU-based Multi-level Subspace Decomposition Scheme for Hierarchical Tensor Product Bases

More information

Reconstruction Improvements on Compressive Sensing

Reconstruction Improvements on Compressive Sensing SCITECH Volume 6, Issue 2 RESEARCH ORGANISATION November 21, 2017 Journal of Information Sciences and Computing Technologies www.scitecresearch.com/journals Reconstruction Improvements on Compressive Sensing

More information

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M

More information

A New Methodology for Interconnect Parasitic Extraction Considering Photo-Lithography Effects

A New Methodology for Interconnect Parasitic Extraction Considering Photo-Lithography Effects A New Methodology for Interconnect Parasitic Extraction Considering Photo-Lithography Effects Ying Zhou, Yuxin Tian, Weiping Shi Texas A&M University Zhuo Li Pextra Corporation Frank Liu IBM Austin Research

More information

Parallel Hierarchical Cross Entropy Optimization for On-Chip Decap Budgeting

Parallel Hierarchical Cross Entropy Optimization for On-Chip Decap Budgeting Parallel Hierarchical Cross Entropy Optimization for On-Chip Decap Budgeting Xueqian Zhao, Yonghe Guo, Zhuo Feng and Shiyan Hu Department of Electrical and Computer Engineering Michigan Technological University,

More information

10th August Part One: Introduction to Parallel Computing

10th August Part One: Introduction to Parallel Computing Part One: Introduction to Parallel Computing 10th August 2007 Part 1 - Contents Reasons for parallel computing Goals and limitations Criteria for High Performance Computing Overview of parallel computer

More information

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center 3/17/2015 2014 IBM Corporation Outline IBM OpenPower Platform Accelerating

More information

Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX

Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX David Pfander*, Gregor Daiß*, Dominic Marcello**, Hartmut Kaiser**, Dirk Pflüger* * University of Stuttgart ** Louisiana State

More information

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography 1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography

More information

Challenges Simulating Real Fuel Combustion Kinetics: The Role of GPUs

Challenges Simulating Real Fuel Combustion Kinetics: The Role of GPUs Challenges Simulating Real Fuel Combustion Kinetics: The Role of GPUs M. J. McNenly and R. A. Whitesides GPU Technology Conference March 27, 2014 San Jose, CA LLNL-PRES-652254! This work performed under

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters Auto-Generation and Auto-Tuning of 3D Stencil s on GPU Clusters Yongpeng Zhang, Frank Mueller North Carolina State University CGO 2012 Outline Motivation DSL front-end and Benchmarks Framework Experimental

More information

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND Student Submission for the 5 th OpenFOAM User Conference 2017, Wiesbaden - Germany: SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND TESSA UROIĆ Faculty of Mechanical Engineering and Naval Architecture, Ivana

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Pattern: Sparse Matrices

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Pattern: Sparse Matrices CSE 599 I Accelerated Computing - Programming GPUS Parallel Pattern: Sparse Matrices Objective Learn about various sparse matrix representations Consider how input data affects run-time performance of

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

Multilevel Summation of Electrostatic Potentials Using GPUs

Multilevel Summation of Electrostatic Potentials Using GPUs Multilevel Summation of Electrostatic Potentials Using GPUs David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at

More information

Numerical Algorithms

Numerical Algorithms Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0

More information