Fast Multipole Method on GPU: Tackling 3-D Capacitance Extraction on Massively Parallel SIMD Platforms


Xueqian Zhao
Department of ECE
Michigan Technological University
Houghton, MI 49931

Zhuo Feng
Department of ECE
Michigan Technological University
Houghton, MI 49931
zhuofeng@mtu.edu

ABSTRACT

To facilitate full-chip capacitance extraction, field solvers are typically deployed for characterizing capacitance libraries for various interconnect structures and configurations. In the past decades, various algorithms for accelerating boundary element methods (BEM) have been developed to improve the efficiency of field solvers for capacitance extraction. This paper presents the first massively parallel capacitance extraction algorithm, FMMGpu, which accelerates the well-known fast multipole method (FMM) on modern Graphics Processing Units (GPUs). We propose GPU-friendly data structures and SIMD parallel algorithm flows to facilitate FMM-based 3-D capacitance extraction on GPU. Effective GPU performance modeling methods are also proposed to properly balance the workload of each critical kernel in our FMMGpu implementation, by taking advantage of the latest Fermi GPU's concurrent kernel executions on streaming multiprocessors (SMs). Our experimental results show that FMMGpu brings 22X to 30X speedups in capacitance extraction for various test cases. We also show that even for small test cases that may not well utilize the GPU's hardware resources, the proposed cube clustering and workload balancing techniques can bring 20% to 60% extra performance improvements.

Categories and Subject Descriptors: B.7.2 [Integrated Circuits]: Design Aids - Simulation

General Terms: Algorithms, Design

Keywords: Capacitance extraction, parallel fast multipole method, GPU

1. INTRODUCTION

Nowadays, high-performance integrated circuit (IC) designs require efficient and accurate extraction of interconnect parasitics, including resistance, inductance, and capacitance, which can be further integrated with other circuit components for SPICE simulations. Since on-chip capacitance significantly impacts circuit performance, such as operating speed and functionality, developing efficient algorithms for large-scale capacitance extraction problems becomes increasingly critical considering the aggressive semiconductor technology scaling. A variety of fast capacitance extraction approaches have been proposed in the past few decades, usually falling into two categories: boundary element methods (BEM) [1-4] and floating random walk (FRW) algorithms [5, 6], among which BEM methods are suitable for extracting coupling capacitance while FRW methods are typically adopted for calculating self-capacitance. The dramatic evolution of present-day multi-/many-core processors [7] brings huge opportunities for accelerating the computation-intensive capacitance extraction tasks.
However, little progress has been made in the past few years in leveraging such parallel computing platforms, considering the grand challenges brought by very complicated interconnect structures as well as parallel algorithm complexity. For instance, in [4], an FMM-based capacitance extraction program merely achieves up to 3X speedups on a quad-core CPU over the serial extraction program. On the other hand, general-purpose computing on graphics processors (GPUs) has recently become a very popular research area, which provides the desired energy-efficient parallel computing and, if suitably developed, attains much higher computing throughput than existing multi-core CPUs. The latest Fermi GPUs [7] integrate more than 500 cores into a single chip and deliver greater than one TFlops (10^12 floating-point operations per second) of peak computing performance. It is therefore quite desirable to leverage the latest GPU's computing power by developing energy-efficient GPU-based algorithms for solving large-scale capacitance extraction problems. In recent years, a few research projects have focused on developing parallel FMM algorithms on GPUs [8, 9]. However, existing GPU-based FMM methods merely exploit parallel computation of the multipole and local expansion coefficients on-the-fly during each sparse matrix-vector multiplication (SPMV), without optimizing the GPU data structures or algorithm flows according to the latest GPU hardware properties, and thus cannot be directly deployed for efficient GPU-based 3-D capacitance extraction. In this work, for the first time, we propose GPU-friendly data structures and FMM algorithm flows that effectively minimize the CPU-GPU data transfer cost for the key FMM kernel functions, including the preconditioning, direct summation, upward/downward, and evaluation passes. We also show simple yet effective GPU performance modeling and workload balancing techniques for properly distributing FMM's computing tasks among the GPU's streaming multiprocessors (SMs), which allows more efficient use of the latest GPU's concurrent kernel executions in capacitance extraction tasks.

2. BACKGROUND

2.1 Capacitance Extraction Problem

Consider a system containing m ideal conductors embedded in a homogeneous dielectric medium. The surfaces or edges of all conductors are broken into small panels or tiles. It is also assumed that on each panel i the charge q_i is uniformly distributed. The potential of each evaluation panel (receiver) can be obtained by summing up the potential contributions from all other panel charges (unknowns) using specific Green's functions. The final capacitance values of the m conductors can be summarized by an m x m capacitance matrix C, where the diagonal entry C_ii represents the self-capacitance of conductor i, and the off-diagonal entry C_ij represents the coupling capacitance between conductors i and j. The j-th column of C can be computed by finding the charge distributions on all conductors when the j-th conductor is raised to unit potential and the others are grounded. In fact, the charge distribution on the conductor surfaces can be solved using the first-kind integral equation:

    \psi(x) = \int_{\text{surfaces}} G(x, x') \, \sigma(x') \, da',    (1)

where x, x' \in R^3 denote the receiver and source locations, \sigma denotes the surface charge density, da' is the incremental surface area, \psi is the surface potential (which is known), and G(x, x') = 1/\|x - x'\| denotes the Green's function in free space [10]. Assume that the m conductors are discretized into a total of n panels. Then the potential at each evaluation panel k (receiver) can be computed by

    p_k = \sum_{i=1}^{n} \int_{\text{panel } i} \frac{\sigma_i(x')}{\|x_k - x'\|} \, da',    (2)

where x_k is the center of evaluation panel k, x' is a position on the surface of panel i, p_k is the potential at the center of evaluation panel k, and \sigma_i(x') is the surface charge density on panel i. Applying the above formula to all n panels, Eqn. (2) can be expressed as the linear system

    \Phi q = p,    (3)

where \Phi \in R^{n x n}, while q, p \in R^n are the charge and potential vectors. Since the charges are uniformly distributed on every panel, the entries of \Phi can be obtained by \Phi_{(k,i)} = \int_{\text{panel } i} 1/\|x_k - x'\| \, da', where k denotes the index of the evaluation panel and i denotes the index of the source panel. When solving for C_ij, the panel potentials of conductor j are raised to 1 while the panel potentials of all other conductors are set to 0. Subsequently, the unknown charge vector q can be obtained by solving Eqn. (3), and the ij-th element C_ij of the capacitance matrix C is obtained by summing up all the panel charges on the i-th conductor [1]:

    C_ij = \sum_{k \in \text{conductor } i} q_k.    (4)
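To make Eqns. (3)-(4) concrete, the following host-side sketch assembles one column of the capacitance matrix. It is our illustration, not the paper's code: all names (NPANELS, panel_of, solve_phi) are hypothetical, and a toy Gaussian elimination stands in for the FMM-accelerated GMRES solve actually used in this work.

```cuda
/* Sketch: computing the j-th column of C per Eqns. (3)-(4).
 * Illustrative only; a real extractor replaces solve_phi with an
 * FMM-accelerated GMRES solve. */
#define NPANELS 4
#define NCOND   2

static const int panel_of[NPANELS] = {0, 0, 1, 1}; /* panel -> conductor */

/* Toy dense solve of Phi*q = p by Gaussian elimination (no pivoting;
 * assumes a well-conditioned, nonsingular system). */
static void solve_phi(double Phi[NPANELS][NPANELS], double p[NPANELS],
                      double q[NPANELS]) {
    double A[NPANELS][NPANELS + 1];
    for (int r = 0; r < NPANELS; r++) {
        for (int c = 0; c < NPANELS; c++) A[r][c] = Phi[r][c];
        A[r][NPANELS] = p[r];             /* augmented column = p */
    }
    for (int k = 0; k < NPANELS; k++)     /* forward elimination  */
        for (int r = k + 1; r < NPANELS; r++) {
            double f = A[r][k] / A[k][k];
            for (int c = k; c <= NPANELS; c++) A[r][c] -= f * A[k][c];
        }
    for (int r = NPANELS - 1; r >= 0; r--) {  /* back substitution */
        q[r] = A[r][NPANELS];
        for (int c = r + 1; c < NPANELS; c++) q[r] -= A[r][c] * q[c];
        q[r] /= A[r][r];
    }
}

/* Raise conductor j to unit potential, ground the rest, solve for the
 * panel charges, then sum charges per conductor (Eqn. (4)). */
void capacitance_column(double Phi[NPANELS][NPANELS], int j,
                        double Ccol[NCOND]) {
    double p[NPANELS], q[NPANELS];
    for (int k = 0; k < NPANELS; k++)
        p[k] = (panel_of[k] == j) ? 1.0 : 0.0;
    solve_phi(Phi, p, q);
    for (int i = 0; i < NCOND; i++) Ccol[i] = 0.0;
    for (int k = 0; k < NPANELS; k++) Ccol[panel_of[k]] += q[k];
}
```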
2.2 The Fast Multipole Algorithm (FMM)

There are many existing techniques proposed to numerically approximate the matrix-vector multiplication \Phi q in Eqn. (3), including the FastCap algorithm [1, 10] based on the fast multipole method [11], the hierarchical capacitance extraction algorithm [3] based on Appel's algorithm [12], and the precorrected-FFT method [2], which projects panel charges onto regular 3-D grid points and evaluates the interactions of distant charges through FFT computations.

The fast multipole method (FMM) has been widely used for solving general N-body problems [11], and can be applied to compute panel k's total potential:

    p_k = \sum_{x_i \notin \Omega(x_k)} \Phi_{(k,i)} q_i + \sum_{x_i \in \Omega(x_k)} \Phi_{(k,i)} q_i,   x_i, x_k \in R^d,    (5)

where {x_i} denote the centers of source panels, {x_k} denote the centers of evaluation panels, d is the dimensionality of the problem, and \Omega(x_k) is some neighborhood of evaluation panel k. In Eqn. (5), the latter term, with x_i \in \Omega(x_k), is computed directly. All other FMM steps are dedicated to approximating the former term, with x_i \notin \Omega(x_k), using multipole and local expansions, according to a user-supplied error tolerance \epsilon.

2.3 Nvidia Fermi GPU

The latest Nvidia Fermi GPU architecture provides more opportunities and flexibility for general computation-intensive tasks [7]. Compared with previous GPU models, Fermi increases the number of streaming processors (SPs) in each streaming multiprocessor (SM) from 8 to 32, resulting in a total of 512 streaming processors on a single GPU. More importantly, the new GPU model also supports high-performance double-precision computing and concurrent kernel executions. On Fermi GPUs, up to 16 kernels can be launched concurrently on the 16 SMs [7], whereas in previous GPU architectures only one kernel can be launched on the GPU at a time.

Figure 1: Concurrent kernel executions on Fermi GPU vs. serial kernel executions on previous GPUs.

As shown in Fig. 1, running computing tasks concurrently is typically more efficient than running them in series, especially when a single task cannot fully occupy all the SMs of a GPU. In this work, we propose a simple workload balancing method that effectively assigns computing tasks among the streaming multiprocessors (SMs) and takes advantage of Fermi GPU's concurrent kernel executions.
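As a concrete illustration of this hardware feature, the following minimal CUDA sketch (ours, not from the paper; dummy_pass is a stand-in kernel, not one of the FMM kernels) launches several independent kernels into separate streams, which Fermi-class hardware may overlap when no single kernel fills the machine.

```cuda
/* Minimal sketch: concurrent kernel launches via CUDA streams. */
#include <cuda_runtime.h>

__global__ void dummy_pass(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;   /* stand-in for real FMM work */
}

int main(void) {
    const int nKernels = 4, n = 1 << 16;
    cudaStream_t s[nKernels];
    float *d[nKernels];
    for (int k = 0; k < nKernels; k++) {
        cudaStreamCreate(&s[k]);
        cudaMalloc(&d[k], n * sizeof(float));
        /* Kernels launched into distinct streams may share the GPU's
         * SMs on Fermi; on older parts they execute serially. */
        dummy_pass<<<n / 256, 256, 0, s[k]>>>(d[k], n);
    }
    for (int k = 0; k < nKernels; k++) {
        cudaStreamSynchronize(s[k]);
        cudaFree(d[k]);
        cudaStreamDestroy(s[k]);
    }
    return 0;
}
```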

3. FAST MULTIPOLE METHOD ON GPU

3.1 FMM Computation Decomposition

The FMM flow consists of five key steps; the computations associated with each step are summarized as follows.

The preconditioning pass requires solving for the panel charges by directly inverting the approximate block-diagonal potential matrix \Phi of Eqn. (3), obtained by including the panels that belong to the overlapped neighboring cubes. In this step, the potential matrix inverses are computed in advance, and subsequently many small dense matrix-vector multiplications are performed. It should be noted that the dense matrix sizes are similar in this step, and the computation time is roughly 10% of the overall FMM runtime.

The direct pass directly sums up all the potentials contributed by the source panel charges located in the self and neighboring cubes. Therefore, similarly to the preconditioning pass, banded sparse matrix-vector multiplications are needed. The computing time spent in this step is also around 10% of the total FMM runtime.

The upward pass generates the multipole expansions for the finest-level cubes and converts them to the expansions for the coarser to coarsest level cubes. Computations involved in this step are much less expensive when compared with the cost of the other FMM steps. This step can be parallelized in a level-by-level manner.

The downward and evaluation passes involve the most expensive and time-consuming computations in the FMM flow, usually taking more than 70% of the total runtime. The potential contributions from the cube panels that are not included in the self and neighboring cubes are evaluated in these steps. The major computations can again be done using small dense matrix-vector multiplications, but the dense matrix sizes for each evaluation cube can be quite different, depending on the orders of the multipole expansions to be used.

3.2 Coefficient Matrices in the FMM Algorithm

In the capacitance extraction problem, conductor surfaces are first discretized into many small cubes, and each cube is further decomposed into several panels that hold uniformly distributed charges. As described in [10], in a typical FMM method, a coefficient matrix \Phi^{(k,j)} \in R^{n x m} can be used to compute the panel potentials of the evaluation cube k (receivers) contributed by the charges or expansion sources in the source cube j, where n is the number of panels in evaluation cube k, and m is the number of panel charges or expansions associated with source cube j. Such coefficient matrices fall into several categories:

QP matrix (\Phi_QP): projects charges to potentials;
QM matrix (\Phi_QM): projects charges to multipole expansions;
MM matrix (\Phi_MM): converts the finest-level multipole expansions to coarser- and coarsest-level multipole expansions;
LL matrix (\Phi_LL): translates the coarsest local expansions to finer- and finest-level local expansions;
ML matrix (\Phi_ML): projects multipole expansions to local expansions;
MP matrix (\Phi_MP): projects multipole expansions to potentials;
LP matrix (\Phi_LP): projects local expansions to potentials.

In the above QP, MP, and LP matrices, the number of rows of each matrix equals the number of panels in the evaluation (receiver) cube, while the number of columns depends on the order of the multipole and local expansions, as well as the number of charges associated with the source cube.
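Since the preconditioning and direct passes of Section 3.1 boil down to many small, similarly sized dense matrix-vector products, one natural GPU mapping is sketched below. This is our illustration under assumed array layouts, not the paper's exact kernel: one thread block handles one cube, and each thread accumulates one output row.

```cuda
/* Sketch: batched small dense matrix-vector products, one cube per
 * thread block, one output row per thread. Assumed row-major layout. */
__global__ void block_gemv(const double *A,  /* [ncubes][rows*cols] */
                           const double *x,  /* [ncubes][cols]      */
                           double       *y,  /* [ncubes][rows]      */
                           int rows, int cols) {
    int cube = blockIdx.x;                /* one cube per block  */
    int r    = threadIdx.x;               /* one row per thread  */
    if (r >= rows) return;
    const double *Ac = A + (size_t)cube * rows * cols;
    const double *xc = x + (size_t)cube * cols;
    double acc = 0.0;
    for (int c = 0; c < cols; c++)
        acc += Ac[r * cols + c] * xc[c];
    y[(size_t)cube * rows + r] = acc;
}
/* Launch (assuming rows fits in one thread block):
 *   block_gemv<<<ncubes, rows>>>(dA, dx, dy, rows, cols); */
```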
3.3 GPU-Friendly Data Structure

When the FMM algorithm is applied to capacitance extraction problems, the order of the multipole expansions may vary according to the desired accuracy level. It has been shown that second-order multipole expansions are sufficient in FMM for achieving the desired accuracy levels. Since the typical coefficient matrices in FMM are very small and their numbers of columns may vary significantly, processing these small coefficient matrices on GPU may not be efficient, considering the GPU's single-instruction-multiple-data (SIMD) computing scheme and relatively large device memory access latencies. In this work, we propose a GPU-friendly data structure for efficiently storing and processing the coefficient matrices on GPU, specifically for capacitance extraction problems.

In reality, for a specific evaluation cube, there can be many source cubes that contribute to the total potentials of its panels. Since all the coefficient matrices \Phi^{(k,j)} (j = 1, ..., t) associated with the evaluation cube k always have the same row dimension (the number of receivers), we can pack all source-cube-related coefficient matrices together to form a larger coefficient matrix \Phi_k = [\Phi^{(k,1)} ... \Phi^{(k,t)}] by appending all those small coefficient matrices \Phi^{(k,j)} along the column dimension, where t is the number of source cubes that contribute to the total potentials of the panels in evaluation cube k.

Figure 2: Hierarchical coefficient/index matrix compositions for GPU-friendly FMM computations.

Fig. 2 demonstrates how to combine all the FMM coefficient matrices to form a larger one, where p_k \in R^n denotes the potential vector of the evaluation cube k, \Phi_QP^{(k,i)} and \Phi_MP^{(k,j)} denote the QP and MP coefficient matrices for the evaluation cube k and source cubes i and j, q_i \in R^m is the charge vector of source cube i, and m_j \in R^s is the multipole expansion vector of cube j. After combining the above coefficient matrices into a larger coefficient matrix, we get the resultant coefficient matrix \Phi_mix \in R^{U x V}, where U is the total number of evaluation cubes and V is the maximum column dimension of \Phi_k. We also form a global source cube vector Vec_global that includes all panel charges, multipole and local expansions for all coarsest to finest levels. Meanwhile, an index matrix I_mix is proposed to locate the source-cube-related charges and expansion coefficients stored in the global vector Vec_global.
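A minimal sketch of this packed data structure follows, with assumed field names (ours, not the paper's identifiers): Phi_mix and I_mix are stored as dense U-by-V arrays, padded where necessary, while Vec_global concatenates all panel charges and multipole/local expansion coefficients.

```cuda
/* Sketch of the packed FMM data of Section 3.3 (assumed names). */
typedef struct {
    int     U;           /* rows of the packed matrix (evaluation side) */
    int     V;           /* max packed column dimension across cubes    */
    double *Phi_mix;     /* [U*V] packed coefficient entries (padded)   */
    int    *I_mix;       /* [U*V] per-entry index into Vec_global       */
    double *Vec_global;  /* charges + multipole/local expansions,       */
                         /* coarsest to finest levels, concatenated     */
    int     len_global;  /* length of Vec_global                        */
} FmmGpuPack;
```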

Figure 3: Comparison of non-zero entries of the coefficient matrices before/after cube sorting.

3.4 Accelerating FMM on GPU

As discussed in Section 3.1, the five types of FMM computations are all related to dense coefficient matrix-vector multiplications, so once the coefficient matrices are stored on GPU, all steps can be performed in very similar manners. For the preconditioning and direct passes, the coefficient matrices are small and of similar sizes. On the contrary, for the downward and evaluation passes, the coefficient matrix sizes can be quite different for different cubes. We describe in detail how to accelerate the evaluation pass on GPU; the preconditioning, direct, upward, and downward passes can be handled in similar ways. The total potential of each evaluation panel can be obtained by summing up all the potential contributions from all the source cubes (including the charge contributions and expansion source contributions) in a very efficient way using the GPU's hundreds of streaming processors (SPs). As shown in Fig. 2, we define an element-wise operator to achieve this goal on GPU. The element-wise operation for the evaluation pass can be performed in the following steps (a kernel sketch follows this list):

1. Load panel charges and multipole or local expansions from Vec_global according to their corresponding indices stored in I_mix (as shown in Fig. 2);
2. Multiply the loaded charges or expansions with their corresponding coefficient matrix elements stored in \Phi_mix;
3. Sum up the multiplication results for each row and store the final result into the panel potential vector p.
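The following CUDA sketch implements the three-step element-wise operator under the assumed row-major layout of the FmmGpuPack structure above (ours, not the paper's kernel):

```cuda
/* Sketch: element-wise evaluation-pass operator. Each thread owns one
 * row of Phi_mix, gathers sources from Vec_global through I_mix,
 * multiplies, and row-sums into the panel potential vector p. */
__global__ void evaluation_pass(const double *Phi_mix,
                                const int    *I_mix,
                                const double *Vec_global,
                                double *p, int U, int V) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= U) return;
    double pot = 0.0;
    for (int c = 0; c < V; c++) {
        double coef = Phi_mix[(size_t)row * V + c]; /* 0 for padding */
        int    src  = I_mix [(size_t)row * V + c];
        pot += coef * Vec_global[src];              /* steps 1 and 2 */
    }
    p[row] = pot;                                   /* step 3        */
}
```

Note that a tuned kernel would more likely store Phi_mix and I_mix column-major so that consecutive threads touch consecutive addresses (coalesced accesses); the row-major layout here is only for readability.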
3.5 Cube Clustering

We want to emphasize that all the FMM coefficient matrices, as well as the index matrices, need to be generated and transferred to GPU only once, before the capacitance extraction algorithm starts. Subsequently, for each GMRES iteration using FMM, only two GPU-CPU data transfers, of the potential vector p and the charge vector q, are required. Since the sizes of these vectors are fairly small (equal to the number of panels), the total data communication time between CPU and GPU is negligible compared with the overall runtime of the GPU-based FMM computations.

As mentioned in Section 3.3, in realistic capacitance extraction problems, different evaluation cubes may be influenced by different numbers of source cubes or expansion sources, which may lead to drastically different column dimensions of the coefficient matrices \Phi_k. As an example, Fig. 3 shows the numbers of non-zero coefficients for all the panels obtained from a realistic capacitance extraction problem, where the black area reflects the numbers of non-zero coefficients in the coefficient matrix. To avoid GPU thread branching and inefficient GPU memory access patterns, we can fill dummy zero elements into the coefficient matrix \Phi_mix, as well as the index matrix I_mix (Fig. 2). However, this results in lower memory and computation efficiencies, especially when the column dimension of the coefficient matrix varies dramatically from one panel to another.

To further improve the memory efficiency, minimize the number of dummy coefficients in these coefficient matrices, and achieve better workload balancing during the GPU's parallel computing, a simple yet effective cube clustering technique is adopted to decompose the original coefficient matrix into several smaller coefficient matrices (of cube clusters). During the coefficient matrix decomposition (cube clustering) step, each of the new matrix clusters should maintain a sufficiently large row dimension to fully occupy at least one of the GPU's streaming multiprocessors (SMs). The cube clustering step can be done by putting the cubes whose \Phi_k matrices have similar column dimensions into the same cluster. The resultant clusters again form new coefficient matrices whose dummy elements are significantly reduced compared to the original coefficient matrix (as shown in Fig. 4).

Figure 4: Concurrent kernel executions for cube clusters.

Although both memory occupancy and GPU runtime performance can be further improved by using the cube clustering technique, for very small test cases the coefficient matrices after clustering can still be small and may not fully utilize all the computing resources of the streaming multiprocessors (SMs) on GPU if processed one after another. To gain higher GPU computing efficiency, in the following section we propose an efficient workload balancing method for concurrent kernel executions on the latest Fermi GPUs, which allows processing more than one cube cluster on GPU concurrently (as shown in Fig. 4).
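The clustering step can be sketched as follows (host code, ours; the knobs min_rows_per_cluster and max_pad_ratio are illustrative heuristics, not parameters from the paper): sort cubes by the column dimension of their packed matrix \Phi_k, then cut the sorted list into clusters, closing a cluster once it has enough rows to occupy at least one SM and the column dimension has grown past a padding-waste threshold.

```cuda
/* Sketch: cluster cubes by packed-matrix column dimension. */
#include <stdlib.h>

typedef struct { int cube_id; int cols; int rows; } CubeInfo;

static int by_cols(const void *a, const void *b) {
    return ((const CubeInfo *)a)->cols - ((const CubeInfo *)b)->cols;
}

/* cluster_of[i] receives the cluster index of the i-th cube in sorted
 * order; returns the number of clusters produced. Assumes cols >= 1. */
int cluster_cubes(CubeInfo *cubes, int n, int min_rows_per_cluster,
                  double max_pad_ratio, int *cluster_of) {
    qsort(cubes, n, sizeof(CubeInfo), by_cols);
    int cl = 0, rows = 0, min_cols = cubes[0].cols;
    for (int i = 0; i < n; i++) {
        double pad = (double)cubes[i].cols / min_cols;
        if (rows >= min_rows_per_cluster && pad > max_pad_ratio) {
            cl++; rows = 0; min_cols = cubes[i].cols;  /* new cluster */
        }
        cluster_of[i] = cl;
        rows += cubes[i].rows;
    }
    return cl + 1;
}
```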

3.6 Workload Balancing on GPU

As the problem size increases, more GPU computation time is needed for a kernel.

Figure 5: Runtime scalability for processing a cube cluster on multiple SMs during the downward pass.

As shown in Fig. 5, the blue circles sitting on the solid line denote the ideal linear-speedup runtimes (parallel efficiency of 1.0) when using different numbers of streaming multiprocessors, while the red dotted curve denotes the measured runtime results for the FMM downward pass kernel, which exhibits super-linear speedup when using fewer than four SMs (parallel efficiency greater than 1.0). Although at first glance it is hard to imagine such super-linear speedup, since the runtime should decrease at most linearly with the number of parallel processors, the observed high parallel efficiency can actually be achieved by carefully utilizing the on-chip cache memory and processing resources of the parallel computing platform. However, when more and more SMs are used for executing the kernel, the parallel efficiency may go down quickly, since some of the SMs can be idle due to insufficient computation tasks. From Fig. 5, we observe that when more than four SMs are assigned to execute the FMM downward pass kernel, the GPU runtime starts to saturate and the parallel efficiency can drop drastically.

In order to achieve optimal parallel computing efficiency using the latest Fermi GPU's concurrent kernel executions, we first characterize the optimal SM assignments for typical cluster sizes by running a group of tests. The optimal number of SMs (the super-linear speedup region illustrated in Fig. 5) to be used for processing a specific cube cluster can then be easily obtained. Next, in each of the following FMM procedures, the final SM assignment can be determined based on the actual coefficient matrix cluster sizes and the previously characterized optimal numbers of SMs:

    N_i = N_sm * S_i / \sum_{i=1}^{p} S_i,    (6)

where N_i denotes the number of SMs finally assigned to execute kernel i (for a cube cluster), N_sm denotes the total number of SMs available on the GPU, S_i denotes the measured optimal number of SMs for running kernel i, and p denotes the total number of kernels (cube clusters) to be executed at the same time. The above simple performance modeling and workload balancing method significantly improves FMMGpu's throughput, especially for small test cases (improvements of 20% to 60% have been observed), since the serial processing of these small cube clusters may not fully occupy all the SMs on GPU. However, for very large test cases, whose coefficient matrices (of a cube cluster) are large enough to occupy all the SMs, the above workload balancing is not necessary.

The proposed GPU-based capacitance extraction algorithm FMMGpu is summarized in Algorithm 1, where K is the user-supplied maximum number of GMRES iterations, r_o^(k) is the residual in the k-th GMRES iteration, and tol is the error tolerance set by the user.

Algorithm 1: FMMGpu Algorithm Flow
1: Pack all the coefficient matrices for evaluation cube i (receiver) into a single matrix \Phi_i, for i = 1 to n.
2: Order the cubes i (i = 1, ..., n) based on the column dimensions of the coefficient matrices \Phi_i.
3: Cluster \Phi_i (i = 1, ..., n) into several new coefficient matrices and generate the corresponding index matrices.
4: Transfer the coefficient matrices and index matrices to the GPU's global memory.
5: Characterize the optimal numbers of SMs for workload balancing by running a few small test cases.
6: Start the following FMMGpu computation and GMRES iterations:
7: for (k = 1; k < K; k++) do
8:   Transfer the panel charge vector q to GPU memory.
9:   Execute the Preconditioning pass on GPU and store the updated charges into Vec_global.
10:  Execute the Direct pass on GPU and store the direct potential contributions into p.
11:  Execute the Upward pass on GPU and store the multipole expansions into Vec_global.
12:  Execute the Downward pass on GPU with workload balancing and concurrent kernel executions; then store the local expansions into Vec_global.
13:  Execute the Evaluation pass on GPU with workload balancing and concurrent kernel executions; then sum the potentials on the evaluation panels into p.
14:  Transfer the potential vector p back to the CPU for the k-th GMRES iteration.
15:  Compute the residual r_o^(k) after this GMRES iteration and check convergence:
16:  if ||r_o^(k)|| < tol then
17:    Exit the loop and compute the capacitance matrix values.
18:  end if
19: end for
return all the computed capacitance matrix elements.
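A minimal host-side sketch of the SM assignment of Eqn. (6) follows (ours; rounding and leftover-SM redistribution are handled only crudely here):

```cuda
/* Sketch: distribute N_sm available SMs across p concurrently launched
 * cluster kernels in proportion to their characterized optimal SM
 * counts S[i], per Eqn. (6). */
void assign_sms(const int *S, int p, int N_sm, int *N_assigned) {
    int total = 0;
    for (int i = 0; i < p; i++) total += S[i];
    for (int i = 0; i < p; i++) {
        N_assigned[i] = (N_sm * S[i]) / total;    /* Eqn. (6), floored */
        if (N_assigned[i] < 1) N_assigned[i] = 1; /* every kernel runs */
    }
    /* Flooring/clamping may leave a few SMs unassigned; a production
     * implementation would redistribute the remainder. */
}
```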
4. EXPERIMENT RESULTS

Extensive experiments have been conducted to validate the proposed GPU-based fast multipole method for capacitance extraction. A set of bus-crossing test cases is created in our experiments; details are shown in Table 1. For all the test cases, the edge-to-inner-panel width ratio is fixed, and the multipole expansion order is set to 2. The proposed method is implemented in C and the GPU programming language CUDA [13]. All experiments are performed on a 64-bit Ubuntu system with a 2.66GHz quad-core CPU, 6GB of DRAM, and one Nvidia GeForce GTX480 GPU with 1.5GB of device memory.

Table 1: Test case details. c denotes the number of crossing conductors (5 x 5 means 10 conductors), n represents the number of panels per wire width, and N represents the total number of panels (test1 through test6).

Table 2 lists the CPU and GPU runtime results for the five key steps of the FMM algorithm, as well as the total runtime of a complete FMM iteration. Since the coefficient matrices associated with the Preconditioning and Direct passes have similar dimensions and do not require coefficient matrix decompositions, the clustering technique is only applied to the Downward and Evaluation passes. From Table 2, we observe speedups of up to 45X for the Preconditioning pass, up to 45X for the Direct pass, smaller speedups for the inexpensive Upward pass (a non-critical kernel), up to 33X for the Downward pass, up to 30X for the Evaluation pass, and 18X to 30X for the overall FMM iteration.

Furthermore, Table 3 shows the results of complete capacitance extraction runs, where the original CPU-based FMM algorithm and our FMMGpu algorithm have been run for all test cases. As observed, both algorithms converge in the same number of GMRES iterations. When including the CPU-based computations, such as the calculation of the GMRES residuals and the one-step Arnoldi process on CPU, the total runtime speedups of capacitance extraction on GPU are slightly smaller than the speedup numbers obtained for the single-FMM-iteration runs shown in Table 2.

Table 2: Runtime results of key FMM steps on CPU and GPU after cube clustering (Preconditioning, Direct, Upward, Downward, Evaluation, and Total times in ms, with per-test speedups).

Table 3: Capacitance extraction results on CPU and GPU (clustering-based). N_i is the number of GMRES iterations for a given error tolerance.

Figure 6: FMMGpu runtime w/ and w/o using workload balancing and concurrent kernel executions.

Finally, Fig. 6 shows the GPU runtime and speedup results of a single FMM iteration (SPMV) with and without the proposed workload balancing and concurrent kernel execution schemes. We obtain 18X to 30X speedups without concurrent kernel executions, and 22X to 30X speedups with concurrent kernel executions. It should be noted that running FMM on a multi-core CPU may bring very limited performance improvement (3X speedups are reported on a quad-core machine [4]), while our FMMGpu capacitance extraction easily brings more than 20X speedups for all test cases, achieving much higher runtime and energy efficiencies.

5. CONCLUSIONS

In this paper, we present a GPU-accelerated fast multipole algorithm, FMMGpu, for fast parallel 3-D capacitance extraction. As shown in extensive experiments, the proposed GPU-friendly FMMGpu algorithm flows and data structures allow highly efficient massively parallel computing on GPU. We obtain up to 30X speedups by running the capacitance extraction program on GPU compared with CPU-based serial executions. A simple yet effective workload balancing method is also proposed to facilitate concurrent kernel executions on the latest Fermi GPUs, which further improves the parallel FMM computing efficiency by 20% to 60% for a set of small test cases.

6. REFERENCES

[1] K. Nabors and J. White. FastCap: A multipole accelerated 3-D capacitance extraction program. IEEE Trans. on Computer-Aided Design, 10(11):1447-1459, Nov. 1991.
[2] J. Phillips and J. White. A precorrected-FFT method for electrostatic analysis of complicated 3-D structures. IEEE Trans. on Computer-Aided Design, 16(10):1059-1072, Oct. 1997.
[3] W. Shi, J. Liu, N. Kakani, and T. Yu. A fast hierarchical algorithm for 3-D capacitance extraction. In IEEE/ACM DAC, pages 212-217, June 1998.
[4] F. Gong, H. Yu, and L. He. PiCAP: A parallel and incremental capacitance extraction considering stochastic process variation. In IEEE/ACM DAC, Jul. 2009.
[5] R. Iverson and Y. Le Coz. A stochastic algorithm for high speed capacitance extraction in integrated circuits. Solid-State Electronics, 35(7), 1992.
[6] T. El-Moselhy, I. Elfadel, and L. Daniel. A hierarchical floating random walk algorithm for fabric-aware 3D capacitance extraction. In IEEE/ACM ICCAD, 2009.
[7] NVIDIA Corporation. Fermi compute architecture white paper. [Online].
[8] N. Gumerov and R. Duraiswami. Fast multipole methods on graphics processors. J. Comput. Phys., 227(18):8290-8313, 2008.
[9] T. Hamada, T. Narumi, R. Yokota, K. Yasuoka, K. Nitadori, and M. Taiji. 42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence. In SC'09, 2009.
[10] K. Nabors, S. Kim, and J. White. Fast capacitance extraction of general three-dimensional structures. IEEE Trans. on Microwave Theory and Techniques, 40(7):1496-1506, Jul. 1992.
[11] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comput. Phys., 73(2):325-348, 1987.
[12] A. Appel. An efficient program for many-body simulation. SIAM Journal on Scientific and Statistical Computing, 6(1):85-103, 1985.
[13] NVIDIA Corporation. NVIDIA CUDA C programming guide. [Online].


More information

Analysis and Visualization Algorithms in VMD

Analysis and Visualization Algorithms in VMD 1 Analysis and Visualization Algorithms in VMD David Hardy Research/~dhardy/ NAIS: State-of-the-Art Algorithms for Molecular Dynamics (Presenting the work of John Stone.) VMD Visual Molecular Dynamics

More information

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

Di Zhao Ohio State University MVAPICH User Group (MUG) Meeting, August , Columbus Ohio

Di Zhao Ohio State University MVAPICH User Group (MUG) Meeting, August , Columbus Ohio Di Zhao zhao.1029@osu.edu Ohio State University MVAPICH User Group (MUG) Meeting, August 26-27 2013, Columbus Ohio Nvidia Kepler K20X Intel Xeon Phi 7120 Launch Date November 2012 Q2 2013 Processor Per-processor

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Dense matching GPU implementation

Dense matching GPU implementation Dense matching GPU implementation Author: Hailong Fu. Supervisor: Prof. Dr.-Ing. Norbert Haala, Dipl. -Ing. Mathias Rothermel. Universität Stuttgart 1. Introduction Correspondence problem is an important

More information

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Transactions on Modelling and Simulation vol 20, 1998 WIT Press, ISSN X

Transactions on Modelling and Simulation vol 20, 1998 WIT Press,   ISSN X Parallel indirect multipole BEM analysis of Stokes flow in a multiply connected domain M.S. Ingber*, A.A. Mammoli* & J.S. Warsa* "Department of Mechanical Engineering, University of New Mexico, Albuquerque,

More information

Terascale on the desktop: Fast Multipole Methods on Graphical Processors

Terascale on the desktop: Fast Multipole Methods on Graphical Processors Terascale on the desktop: Fast Multipole Methods on Graphical Processors Nail A. Gumerov Fantalgo, LLC Institute for Advanced Computer Studies University of Maryland (joint work with Ramani Duraiswami)

More information

Computational Science and Engineering (Int. Master s Program)

Computational Science and Engineering (Int. Master s Program) Computational Science and Engineering (Int. Master s Program) Technische Universität München Master s Thesis A GPU-based Multi-level Subspace Decomposition Scheme for Hierarchical Tensor Product Bases

More information

Reconstruction Improvements on Compressive Sensing

Reconstruction Improvements on Compressive Sensing SCITECH Volume 6, Issue 2 RESEARCH ORGANISATION November 21, 2017 Journal of Information Sciences and Computing Technologies www.scitecresearch.com/journals Reconstruction Improvements on Compressive Sensing

More information

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M

More information

A New Methodology for Interconnect Parasitic Extraction Considering Photo-Lithography Effects

A New Methodology for Interconnect Parasitic Extraction Considering Photo-Lithography Effects A New Methodology for Interconnect Parasitic Extraction Considering Photo-Lithography Effects Ying Zhou, Yuxin Tian, Weiping Shi Texas A&M University Zhuo Li Pextra Corporation Frank Liu IBM Austin Research

More information

Parallel Hierarchical Cross Entropy Optimization for On-Chip Decap Budgeting

Parallel Hierarchical Cross Entropy Optimization for On-Chip Decap Budgeting Parallel Hierarchical Cross Entropy Optimization for On-Chip Decap Budgeting Xueqian Zhao, Yonghe Guo, Zhuo Feng and Shiyan Hu Department of Electrical and Computer Engineering Michigan Technological University,

More information

10th August Part One: Introduction to Parallel Computing

10th August Part One: Introduction to Parallel Computing Part One: Introduction to Parallel Computing 10th August 2007 Part 1 - Contents Reasons for parallel computing Goals and limitations Criteria for High Performance Computing Overview of parallel computer

More information

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center 3/17/2015 2014 IBM Corporation Outline IBM OpenPower Platform Accelerating

More information

Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX

Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX David Pfander*, Gregor Daiß*, Dominic Marcello**, Hartmut Kaiser**, Dirk Pflüger* * University of Stuttgart ** Louisiana State

More information

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography 1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography

More information

Challenges Simulating Real Fuel Combustion Kinetics: The Role of GPUs

Challenges Simulating Real Fuel Combustion Kinetics: The Role of GPUs Challenges Simulating Real Fuel Combustion Kinetics: The Role of GPUs M. J. McNenly and R. A. Whitesides GPU Technology Conference March 27, 2014 San Jose, CA LLNL-PRES-652254! This work performed under

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters Auto-Generation and Auto-Tuning of 3D Stencil s on GPU Clusters Yongpeng Zhang, Frank Mueller North Carolina State University CGO 2012 Outline Motivation DSL front-end and Benchmarks Framework Experimental

More information

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND

SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND Student Submission for the 5 th OpenFOAM User Conference 2017, Wiesbaden - Germany: SELECTIVE ALGEBRAIC MULTIGRID IN FOAM-EXTEND TESSA UROIĆ Faculty of Mechanical Engineering and Naval Architecture, Ivana

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Pattern: Sparse Matrices

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Pattern: Sparse Matrices CSE 599 I Accelerated Computing - Programming GPUS Parallel Pattern: Sparse Matrices Objective Learn about various sparse matrix representations Consider how input data affects run-time performance of

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

Multilevel Summation of Electrostatic Potentials Using GPUs

Multilevel Summation of Electrostatic Potentials Using GPUs Multilevel Summation of Electrostatic Potentials Using GPUs David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at

More information

Numerical Algorithms

Numerical Algorithms Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0

More information