Fast Multipole Method on GPU: Tackling 3-D Capacitance Extraction on Massively Parallel SIMD Platforms
Xueqian Zhao
Department of ECE
Michigan Technological University
Houghton, MI 49931

ABSTRACT
To facilitate full-chip capacitance extraction, field solvers are typically deployed for characterizing capacitance libraries for various interconnect structures and configurations. In the past decades, various algorithms for accelerating boundary element methods (BEM) have been developed to improve the efficiency of field solvers for capacitance extraction. This paper presents the first massively parallel capacitance extraction algorithm, FMMGpu, which accelerates the well-known fast multipole method (FMM) on modern Graphics Processing Units (GPUs). We propose GPU-friendly data structures and SIMD parallel algorithm flows to facilitate FMM-based 3-D capacitance extraction on GPU. Effective GPU performance modeling methods are also proposed to properly balance the workload of each critical kernel in our FMMGpu implementation, by taking advantage of the latest Fermi GPU's concurrent kernel executions on streaming multiprocessors (SMs). Our experimental results show that FMMGpu brings X to 3X speedups in capacitance extraction for various test cases. We also show that even for small test cases that may not fully utilize the GPU's hardware resources, the proposed cube clustering and workload balancing techniques can bring % to 6% extra performance improvements.

Categories and Subject Descriptors: B.7.2 [Integrated Circuits]: Design Aids - Simulation
General Terms: Algorithms, Design
Keywords: Capacitance extraction, parallel fast multipole method, GPU.
Zhuo Feng
Department of ECE
Michigan Technological University
Houghton, MI 49931
zhuofeng@mtu.edu

1. INTRODUCTION
Nowadays, high-performance integrated circuit (IC) designs require efficient and accurate extraction of interconnect parasitics, including resistance, inductance, and capacitance, which can be further integrated with other circuit components for SPICE simulations. Since on-chip capacitance significantly impacts circuit performance, such as operating speed and functionality, developing efficient algorithms for large-scale capacitance extraction problems becomes increasingly critical considering the aggressive semiconductor technology scaling. A variety of fast capacitance extraction approaches have been proposed in the past few decades that usually fall into two categories: the boundary element methods (BEM) [1-4] and floating random walk (FRW) algorithms [5, 6], among which BEM methods are suitable for extracting coupling capacitance while FRW methods are typically adopted for calculating self-capacitance. The dramatic evolution of present-day multi-/many-core processors [7] brings huge opportunities for accelerating the computation-intensive capacitance extraction tasks. However, little progress has been made in the past few years toward leveraging such parallel computing platforms, considering the grand challenges posed by very complicated interconnect structures as well as parallel algorithm complexity.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2011, June 5-10, 2011, San Diego, California, USA. Copyright 2011 ACM.
For instance, in [4], an FMM-based capacitance extraction program achieves merely up to 3X speedups on a quad-core CPU over the serial extraction program. On the other hand, general-purpose computing on graphics processing units (GPUs) has recently become a very popular research area; it provides the desired energy-efficient parallel computing and, if suitably exploited, delivers much higher computing throughput than existing multi-core CPUs. The latest Fermi GPUs [7] integrate more than 500 cores into a single chip and deliver greater than one TFlops (10^12 floating-point operations per second) of peak computing performance. It is therefore highly desirable to leverage the latest GPU's computing power by developing energy-efficient GPU-based algorithms for solving large-scale capacitance extraction problems. In recent years, a few research projects have focused on developing parallel FMM algorithms on GPUs [8, 9]. However, existing GPU-based FMM methods merely exploit parallel computation of multipole and local expansion coefficients on the fly during each sparse matrix-vector multiplication (SPMV), without optimizing the GPU data structures or algorithm flows according to the latest GPU hardware properties, and thus cannot be directly deployed for efficient acceleration of GPU-based 3-D capacitance extraction programs. In this work, for the first time, we propose GPU-friendly data structures and FMM algorithm flows that can effectively minimize the CPU-GPU data transfer cost for the key FMM kernel functions, which include the preconditioning, direct summation, upward/downward, and evaluation passes. We also present simple yet effective GPU performance modeling and workload balancing techniques for properly distributing the FMM's computing tasks among the GPU's streaming multiprocessors (SMs), which makes it possible to more efficiently leverage the latest GPU's concurrent kernel executions in capacitance extraction tasks.
2. BACKGROUND
2.1 Capacitance Extraction Problem
Consider a system containing m ideal conductors embedded in a homogeneous dielectric medium. The surfaces or edges of all conductors are broken into small panels or tiles. It is also assumed that on each panel i, the charge q_i is uniformly distributed. The potential of each evaluation panel (receiver) i can be obtained by summing up the potential contributions from all other panel charges (unknowns) using specific Green's functions. The final capacitance values of those m conductors can be summarized by an m × m capacitance matrix C, where the diagonal entry C_ii represents the self-capacitance of conductor i, and the off-diagonal entry C_ij represents the coupling capacitance between conductors i and j. The j-th column of C can be computed by finding the charge distributions on all conductors when the j-th conductor is raised to unit potential and all others are grounded. In fact, the charge distribution on the surface of the conductors can be solved using the first-kind integral equation:

    ψ(x) = ∫_surfaces G(x, x′) σ(x′) da′,    (1)

where x, x′ ∈ R^3 denote the receiver and source locations, σ denotes the surface charge density, da′ is the incremental surface area, ψ is the surface potential, which is known, and G(x, x′) denotes the Green's function, which is 1/|x − x′| in free space [10]. Assume that the m conductors are discretized into a total of n panels. Then the potential on each evaluation panel k (receiver) can be computed by:

    p_k = Σ_{i=1}^{n} ∫_{panel i} σ_i(x′) / |x′ − x_k| da′,    (2)

where x_k is the center of evaluation panel k, x′ is a position on the surface of panel i, p_k is the potential at the center of evaluation panel k, σ_i(x′) is the surface charge density on panel i, and da′ is the incremental surface area of panel i. Applying the above formula to all n panels, Eqn. (2) can be expressed as the following linear system of equations:

    Φ q = p,    (3)

where Φ ∈ R^{n×n}, while q and p ∈ R^n are the charge and potential vectors.
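As a concrete (and heavily simplified) illustration of the column-by-column procedure above, the sketch below builds a toy Φ using a point-collocation kernel 1/|x_k − x_i| in place of the true panel integrals, solves Φq = p for each unit-potential excitation, and accumulates per-conductor charge sums into C. The `self_term` placeholder for the singular self-integral and all function names are our own assumptions for this sketch, not part of the paper's implementation.

```python
import math

def green(xk, xi):
    # Free-space kernel 1/|x - x'| (physical constants omitted).
    return 1.0 / math.dist(xk, xi)

def solve(A, b):
    # Naive Gaussian elimination with partial pivoting (a real field
    # solver would use GMRES with FMM-accelerated matrix-vector products).
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def capacitance_matrix(panels, conductor_of, m, self_term=2.0):
    # Phi[k][i]: potential at panel center k due to unit charge on panel i.
    # self_term is a stand-in for the singular panel self-integral.
    n = len(panels)
    Phi = [[self_term if k == i else green(panels[k], panels[i])
            for i in range(n)] for k in range(n)]
    C = [[0.0] * m for _ in range(m)]
    for j in range(m):
        # Raise conductor j to unit potential, ground the others.
        p = [1.0 if conductor_of[k] == j else 0.0 for k in range(n)]
        q = solve(Phi, p)                      # panel charges, Eqn. (3)
        for k in range(n):
            C[conductor_of[k]][j] += q[k]      # column sums, Eqn. (4)
    return C
```

For two well-separated single-panel "conductors", the resulting C has positive self-capacitances and negative, symmetric coupling terms, as expected physically.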
Since the charges are uniformly distributed on every panel, the entries of the Φ matrix can be obtained by

    Φ_(k,i) = ∫_{panel i} 1 / |x′ − x_k| da′,

where k denotes the index of the evaluation panel and i denotes the index of the source panel. When solving for C_ij, the panel potentials of conductor j are raised to 1 while the panel potentials of all other conductors are set to 0. Subsequently, the unknown charge vector q can be obtained by solving Eqn. (3). Next, the ij-th element C_ij of the capacitance matrix C can be obtained by summing up all the panel charges on the i-th conductor [1]:

    C_ij = Σ_{k ∈ conductor i} q_k.    (4)

2.2 The Fast Multipole Algorithm (FMM)
Many techniques have been proposed to numerically approximate the matrix-vector multiplication Φq in Eqn. (3), including the FastCap algorithm [1, 10] based on fast multipole methods [11], the hierarchical capacitance extraction algorithm [3] based on Appel's algorithm [12], and the precorrected-FFT method [2], which projects panel charges onto regular 3-D grid points and evaluates the distant charge interactions through FFT computations.

[Figure 1: Concurrent kernel executions on the Fermi GPU vs. serial kernel executions on previous GPUs.]

The fast multipole method (FMM) has been widely used for solving general N-body problems [11], and it can be applied to compute panel k's total potential:

    p_k = Σ_{x_i ∉ Ω(x_k)} Φ_(k,i) q_i + Σ_{x_i ∈ Ω(x_k)} Φ_(k,i) q_i,    x_i, x_k ∈ R^d,    (5)

where {x_i} denote the centers of source panels, {x_k} denote the centers of evaluation panels, d is the dimensionality of the problem, and Ω(x_k) is some neighborhood of evaluation panel k. In Eqn. (5), the latter term, Σ Φ_(k,i) q_i with x_i ∈ Ω(x_k), is computed directly.
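The directly-computed near-field term of Eqn. (5) can be sketched as below; a minimal sketch assuming precomputed neighbor lists, with the function and argument names our own and the 1/|x − x'| point-kernel standing in for the panel-integral entries Φ_(k,i).

```python
import math

def direct_pass(eval_centers, src_centers, charges, neighbors):
    # Near-field term of Eqn. (5): for each evaluation panel k, sum
    # Phi_(k,i) * q_i only over source panels i inside Omega(x_k)
    # (the self and neighboring cubes); the far-field remainder is
    # what the multipole and local expansions approximate.
    p = [0.0] * len(eval_centers)
    for k, xk in enumerate(eval_centers):
        for i in neighbors[k]:                 # i with x_i in Omega(x_k)
            p[k] += charges[i] / math.dist(xk, src_centers[i])
    return p
```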
All other FMM steps are dedicated to approximating the term Σ Φ_(k,i) q_i with x_i ∉ Ω(x_k) using the multipole and local expansions, according to a user-supplied error tolerance ε.

2.3 Nvidia Fermi GPU
The latest Nvidia Fermi GPU architecture provides more opportunities and flexibility for general computation-intensive tasks [7]. Compared with previous GPU models, the Fermi GPU from Nvidia has increased the number of streaming processors (SPs) in each streaming multiprocessor (SM) from 8 to 32, resulting in 512 streaming processors in total on a GPU. More importantly, the new GPU model also supports high-performance double-precision computing and concurrent kernel executions. On Fermi GPUs, up to 16 kernels can be launched concurrently on the 16 SMs [7], whereas in previous GPU architectures, only one kernel can be launched at a time. As shown in Fig. 1, running computing tasks concurrently is typically more efficient than running them in series, especially when each single task cannot fully occupy all the SMs of a GPU. In this work, we propose a simple workload balancing method that can effectively assign computing tasks among the streaming multiprocessors (SMs) and take advantage of the Fermi GPU's concurrent kernel executions.

3. FAST MULTIPOLE METHOD ON GPU
3.1 FMM Computation Decomposition
The FMM consists of five key steps. The computations associated with each of the five steps are summarized as follows. The preconditioning pass requires solving for the panel charges by directly inverting the approximate block-diagonal potential matrix Φ in Eqn. (3), obtained by including the panels that belong to the overlapped neighboring cubes. In this step, the potential matrix inverses are computed in advance, and subsequently many small dense matrix-vector multiplications are performed. It should be noted that the dense matrices
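The benefit of concurrent kernel execution in Fig. 1 can be illustrated with a toy schedule model: serial execution pays the sum of all kernel times, while concurrent execution pays only the longest kernel within each co-scheduled batch whose combined SM demand fits on the chip. The greedy batching below is purely illustrative (the hardware scheduler behaves differently); all names and parameters are our own assumptions.

```python
def makespan(kernel_times, kernel_sms, total_sms=16, concurrent=True):
    # Toy schedule: serial execution runs kernels one after another;
    # concurrent execution greedily co-schedules kernels whose combined
    # SM demand fits on the chip (Fermi: up to 16 kernels on 16 SMs).
    if not concurrent:
        return sum(kernel_times)
    span, batch_sms, batch_time = 0.0, 0, 0.0
    for t, s in zip(kernel_times, kernel_sms):
        if batch_sms + s > total_sms:      # batch full: "run" it
            span += batch_time
            batch_sms, batch_time = 0, 0.0
        batch_sms += s
        batch_time = max(batch_time, t)    # batch finishes with slowest kernel
    return span + batch_time
```

Four kernels that each need 4 SMs for 4 ms fill a 16-SM chip in one concurrent batch (4 ms total) but take 16 ms serially, matching the intuition behind Fig. 1.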
are of similar sizes in this step, and the computation time is roughly % of the overall FMM runtime. The direct pass directly sums up all the potentials contributed by the source panel charges located in the self and neighboring cubes. Therefore, similarly to the preconditioning pass, banded sparse matrix-vector multiplications are needed. The computing time spent in this step is around % of the total FMM runtime. The upward pass generates the multipole expansions for the finest-level cubes and converts them to the expansions for the coarser to coarsest-level cubes. Computations involved in this step are less expensive (%) when compared with the cost of the other FMM steps. This step can be parallelized in a level-by-level manner. The downward and evaluation passes involve the most expensive and time-consuming computations in the FMM flow, which usually take more than 70% of the total runtime. The potential contributions from the cube panels that are not included in the self and neighboring cubes are evaluated in this step. The major computations can be done using small dense matrix-vector multiplications, but the dense matrix sizes for each evaluation cube can be quite different depending on the orders of multipole expansions to be used.

3.2 Coefficient Matrices in the FMM Algorithm
In the capacitance extraction problem, conductor surfaces are first discretized into many small cubes, and each of the cubes is further decomposed into several panels that hold uniformly distributed charges. As described in [10], in a typical FMM method, a coefficient matrix Φ^(k,j) ∈ R^{n×m} can be used to compute the panel potentials of evaluation cube k (receivers) contributed by the charges or expansion sources in source cube j, where n is the number of panels in evaluation cube k, and m is the number of panel charges or expansions associated with source cube j.
Such coefficient matrices fall into several categories:

- QP matrix (Φ_QP): projects charges to potentials;
- QM matrix (Φ_QM): projects charges to multipole expansions;
- MM matrix (Φ_MM): converts the finest-level multipole expansions to coarser and coarsest-level multipole expansions;
- LL matrix (Φ_LL): translates the coarsest local expansions to finer and finest-level local expansions;
- ML matrix (Φ_ML): projects multipole expansions to local expansions;
- MP matrix (Φ_MP): projects multipole expansions to potentials;
- LP matrix (Φ_LP): projects local expansions to potentials.

In the above QP, MP, and LP matrices, the number of rows of each matrix is equal to the number of panels in the evaluation (receiver) cube, while the number of columns depends on the order of the multipole and local expansions, as well as the number of charges associated with the source cube.

[Figure 2: Hierarchical coefficient/index matrix compositions for GPU-friendly FMM computations.]

3.3 GPU-Friendly Data Structure
When the FMM algorithm is applied to capacitance extraction problems, the order of multipole expansions may vary according to the desired accuracy level. It has been shown that second-order multipole expansions in FMM are sufficient for achieving the desired accuracy levels.
Since the typical coefficient matrices in FMM are very small and their numbers of columns may vary significantly, processing these small coefficient matrices on GPU may not be efficient, considering the GPU's single-instruction-multiple-data (SIMD) computing scheme and relatively large device memory access latencies. In this work, we propose a GPU-friendly data structure for efficiently storing and processing the coefficient matrices on GPU, specifically for capacitance extraction problems. In reality, for a specific evaluation cube, there can be many source cubes that contribute to the total potentials of its panels. Since all the coefficient matrices Φ^(k,j) (j = 1, ..., t) associated with evaluation cube k always have the same row dimension (the number of receivers), we can pack all source-cube-related coefficient matrices together to form a larger coefficient matrix Φ_k = [Φ^(k,1) ... Φ^(k,t)] by appending all those small coefficient matrices Φ^(k,j) along the column dimension, where t is the number of source cubes that contribute to the total potentials of the panels in evaluation cube k. Fig. 2 demonstrates how to combine all the FMM coefficient matrices to form a larger one, where p_k ∈ R^n denotes the potential vector of evaluation cube k, Φ^(k,i)_QP and Φ^(k,j)_MP denote the QP and MP coefficient matrices for evaluation cube k and source cubes i and j, q_i ∈ R^m is the charge vector in source cube i, and m_j ∈ R^s is the multipole expansion vector of cube j. After combining the above coefficient matrices into a larger coefficient matrix, we get the resultant coefficient matrix Φ_mix ∈ R^{U×V}, where U is the total number of evaluation cubes and V is the maximum column dimension of Φ_k. We also form a global source vector Vec_global that includes all panel charges, multipole and local expansions for all coarsest- to finest-level cubes. Meanwhile, an index matrix I_mix is proposed to locate the source-cube-related charges and expansion coefficients stored in the global vector Vec_global.

[Figure 3: Comparison of non-zero entries of the coefficient matrices before/after cube sorting.]

3.4 Accelerating FMM on GPU
As discussed in Section 3.1, all five types of FMM computations reduce to dense coefficient matrix-vector multiplications. Once the coefficient matrices are stored on the GPU, all steps can therefore be performed in very similar manners. For the preconditioning and direct passes, the coefficient matrices are small and have similar sizes. On the contrary, for the downward and evaluation passes, the coefficient matrix sizes can be quite different for different cubes. We describe in detail how to accelerate the evaluation pass on GPU, while the preconditioning, direct, upward, and downward passes can be performed in similar ways. The total potential for each evaluation panel can be obtained by summing up all the potential contributions from all the source cubes (including the charge contributions and expansion source contributions) in a very efficient way using the GPU's hundreds of streaming processors (SPs). As shown in Fig. 2, we define an element-wise operator to achieve this goal on GPU. The element-wise operation for the evaluation pass can be performed in the following steps:

1. Load panel charges, multipole or local expansions from Vec_global according to their corresponding indices stored in I_mix (as shown in Fig. 2);
2. Multiply the loaded charges or expansions with their corresponding coefficient matrix elements stored in Φ_mix;
3. Sum up the multiplication results for each row and store the final result into the panel potential vector p.
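The packed layout and the three element-wise steps above can be sketched as follows; a serial host-side model of the data structure, with function names, the `pad_index` convention, and the per-panel row representation all our own assumptions (the paper's implementation operates on these structures with CUDA kernels).

```python
def pack(eval_rows, eval_indices, pad_index=0):
    # Build Phi_mix and I_mix: each evaluation panel's per-source-cube
    # coefficient blocks are already concatenated along the column
    # dimension; pad shorter rows with dummy zero coefficients (paired
    # with a harmless index) so every row has the same width, giving a
    # SIMD-friendly rectangular layout.
    width = max(len(r) for r in eval_rows)
    phi_mix = [r + [0.0] * (width - len(r)) for r in eval_rows]
    i_mix = [ix + [pad_index] * (width - len(ix)) for ix in eval_indices]
    return phi_mix, i_mix

def evaluate(phi_mix, i_mix, vec_global):
    # The three element-wise steps: (1) gather sources from Vec_global
    # via I_mix, (2) multiply by the matching Phi_mix coefficients,
    # (3) reduce each row into the panel potential vector p.
    return [sum(c * vec_global[j] for c, j in zip(row, idx))
            for row, idx in zip(phi_mix, i_mix)]
```

Dummy entries contribute exactly zero to the row reduction, so padding changes the memory layout but not the computed potentials.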
3.5 Cube Clustering
We want to emphasize that all the FMM coefficient matrices, as well as the index matrices, need to be generated and transferred to the GPU only once, before the capacitance extraction algorithm starts. Subsequently, for each GMRES iteration using FMM, only two GPU-CPU data transfers, of the potential vector p and the charge vector q, are required. Since the sizes of these vectors are fairly small (equal to the number of panels), the total data communication time between CPU and GPU is negligible when compared with the overall runtime of the GPU-based FMM computations. As mentioned in Section 3.2, in realistic capacitance extraction problems, different evaluation cubes may be influenced by different numbers of source cubes or expansion sources, which may lead to drastically different column dimensions of the coefficient matrices Φ_k. As an example, in Fig. 3 we show the numbers of non-zero coefficients for all the panels obtained from a realistic capacitance extraction problem, where the black area reflects the numbers of non-zero coefficients in the coefficient matrix. To avoid GPU thread branching and inefficient GPU memory access patterns, we can fill dummy zero elements into the coefficient matrix Φ_mix, as well as the index matrix I_mix (Fig. 2). However, this will result in lower memory and computation efficiencies, especially when the column dimension of the coefficient matrix varies dramatically from one panel to another. To further improve the memory efficiency, minimize the number of dummy coefficients in these coefficient matrices, and achieve better workload balancing during the GPU's parallel computing, a simple yet effective cube clustering technique is adopted to decompose the original coefficient matrix into several smaller coefficient matrices (of cube clusters).

[Figure 4: Concurrent kernel executions for cube clusters.]
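The clustering idea, grouping cubes with similar coefficient-matrix column dimensions so that padding is only needed up to each cluster's local maximum, can be sketched as a greedy heuristic. The `min_cubes` and `width_slack` thresholds are invented tuning knobs for illustration; the paper does not specify its clustering criterion in this detail.

```python
def cluster_cubes(widths, min_cubes=2, width_slack=1.25):
    # Sort cube ids by the column dimension of their packed coefficient
    # matrix Phi_k, then greedily cut a new cluster whenever the next
    # cube's width exceeds the current cluster's smallest width by more
    # than `width_slack`, provided the cluster is already big enough to
    # occupy at least one SM (approximated here by a cube count).
    order = sorted(range(len(widths)), key=lambda c: widths[c])
    clusters, cur = [], []
    for c in order:
        if cur and len(cur) >= min_cubes and widths[c] > width_slack * widths[cur[0]]:
            clusters.append(cur)
            cur = []
        cur.append(c)
    if cur:
        clusters.append(cur)
    return clusters
```

Padding each resulting cluster to its own maximum width wastes far fewer dummy coefficients than padding every cube to the global maximum.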
During the coefficient matrix decomposition (cube clustering) step, each of the new matrix clusters should maintain a sufficiently large row dimension to fully occupy at least one of the GPU's streaming multiprocessors (SMs). The cube clustering step can be done by putting cubes whose Φ_k matrices have similar column dimensions into the same cluster. The resultant clusters again form new coefficient matrices whose dummy elements are significantly reduced compared to the original coefficient matrix (as shown in Fig. 4). Although both memory occupancy and GPU runtime performance can be further improved by using the cube clustering technique, for very small test cases the coefficient matrices after clustering can still be small, and may not fully utilize all the computing resources of the streaming multiprocessors (SMs) on GPU if processed one after another. To gain higher GPU computing efficiency, in the following section we propose an efficient workload balancing method for concurrent kernel executions on the latest Fermi GPUs, which allows more than one cube cluster to be processed on the GPU concurrently (as shown in Fig. 4).

3.6 Workload Balancing on GPU
As the problem size increases, more GPU computation time is needed for a kernel. As shown in Fig. 5, the blue circles sitting on the solid line denote the ideal linear-speedup runtimes (parallel efficiency of 1.0) when using different numbers of streaming multiprocessors, while the red dotted curve denotes the measured runtime results for the FMM downward pass kernel, which exhibits super-linear speedup when using fewer than four SMs (parallel efficiency greater than 1.0). Although at first glance such super-linear speedup may seem hard to imagine, since the runtime should decrease at most linearly with the number of parallel processors, the observed high parallel efficiency can actually be achieved by carefully utilizing the on-chip cache memory and processing resources of the parallel computing platform.
However, when more and more SMs are used for executing the kernel, the parallel efficiency may go down
quickly, since some of the SMs can be idle due to insufficient computation tasks.

[Figure 5: Runtime scalability for processing a cube cluster on multiple SMs during the downward pass.]

From Fig. 5, we observe that when more than four SMs are assigned to execute the FMM downward pass kernel, the GPU runtime starts to saturate and the parallel efficiency can drop drastically. In order to achieve optimal parallel computing efficiency using the latest Fermi GPU's concurrent kernel executions, we first characterize the optimal SM assignments for typical cluster sizes by running a group of tests. Then the optimal number of SMs (the super-linear speedup region illustrated in Fig. 5) to be used for processing a specific cube cluster can be easily obtained. Next, in each of the following FMM procedures, the final SM assignment can be determined based on the actual coefficient matrix cluster sizes and the previously characterized optimal numbers of SMs:

    N_i = N_sm * S_i / Σ_{j=1}^{p} S_j,    (6)

where N_i denotes the number of SMs finally assigned to execute kernel i (for a cube cluster), N_sm denotes the total number of SMs available on the GPU, S_i denotes the measured optimal number of SMs for running kernel i, and p denotes the total number of kernels (cube clusters) to be executed at the same time. The above simple performance modeling and workload balancing method significantly improves the FMMGpu throughput, especially for the small test cases (% to 6% improvement has been observed), since serial processing of these small cube clusters may not fully occupy all the SMs on the GPU. However, for very large test cases whose coefficient matrices (of a cube cluster) are large enough to occupy all the SMs, the above workload balancing is not necessary.
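Eqn. (6) can be sketched directly; since it generally produces non-integer values, some rounding policy is needed, and the floor-then-redistribute scheme below (at least one SM per kernel, leftovers to the most demanding kernels) is our own tie-breaking assumption rather than the paper's.

```python
def assign_sms(n_sm, s_opt):
    # Eqn. (6): N_i = N_sm * S_i / sum_j S_j, where S_i is the
    # pre-characterized optimal SM count for kernel (cube cluster) i
    # and N_sm is the total number of SMs on the GPU.
    total = sum(s_opt)
    n = [max(1, (n_sm * s) // total) for s in s_opt]
    # Grant SMs left over from flooring to the most demanding kernels.
    for i in sorted(range(len(s_opt)), key=lambda i: -s_opt[i]):
        if sum(n) >= n_sm:
            break
        n[i] += 1
    return n
```

With 16 SMs and measured optima of 4, 4, and 8 SMs, the three concurrent kernels receive exactly their optimal shares; when the optima do not divide evenly, the redistribution keeps the total at N_sm.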
The proposed GPU-based capacitance extraction algorithm FMMGpu is summarized in Algorithm 1, where K is the user-supplied maximum number of GMRES iterations, r_o^(k) is the residual in the k-th GMRES iteration, and tol is the error tolerance set by the user.

Algorithm 1: FMMGpu Algorithm Flow
1: Pack all the coefficient matrices for each evaluation cube i (receiver) into a single matrix Φ_i, for i = 1 to n.
2: Order the cubes i (i = 1, ..., n) based on the column dimensions of the coefficient matrices Φ_i.
3: Cluster Φ_i (i = 1, ..., n) into several new coefficient matrices and generate the corresponding index matrices.
4: Transfer the coefficient matrices and index matrices to the GPU's global memory.
5: Characterize the optimal numbers of SMs for workload balancing by running a few small test cases.
6: Start the following FMMGpu computation and GMRES iterations:
7: for (k = 1; k < K; k++) do
8:   Transfer the panel charge vector q to GPU memory.
9:   Execute the Preconditioning pass on GPU and store the updated charges into Vec_global.
10:  Execute the Direct pass on GPU and store the direct potential contributions into p.
11:  Execute the Upward pass on GPU and store the multipole expansions into Vec_global.
12:  Execute the Downward pass on GPU with workload balancing and concurrent kernel executions. Then store the local expansions into Vec_global.
13:  Execute the Evaluation pass on GPU with workload balancing and concurrent kernel executions. Then sum the potentials on the evaluation panels into p.
14:  Transfer the potential vector p back to the CPU for the k-th GMRES iteration.
15:  Compute the residual r_o^(k) after this GMRES iteration, and check the convergence:
16:  if r_o^(k) < tol then
17:    Exit the loop and compute the capacitance matrix values.
18:  end if
19: end for
20: return all the computed capacitance matrix elements.

4. EXPERIMENT RESULTS
Extensive experiments have been conducted to validate the proposed GPU-based fast multipole method for capacitance extraction. A set of bus-crossing test cases is created in our experiments, with details shown in Table 1. For all the test cases, we use the same edge-to-inner panel width ratio and a multipole expansion order of two. The proposed method is implemented in C and the GPU programming language CUDA [13]. All experiments are performed on a 64-bit Ubuntu system with a 2.66GHz quad-core CPU, 6GB of DRAM, and one Nvidia GeForce GTX480 GPU with 1.5GB of device memory.

[Table 1: Test case details (test1-test6). c denotes the number of crossing conductors (5 × 5 means 10 conductors), n represents the number of panels per wire width, and N represents the total number of panels.]

Table 2 demonstrates the CPU and GPU runtime results for the five key steps of the FMM algorithm, as well as the total runtime of a complete FMM iteration. Since the coefficient matrices associated with the Preconditioning and Direct passes have similar dimensions that do not require coefficient matrix decompositions, the clustering technique is only applied to the Downward and Evaluation passes. From Table 2, we observe X to 45X speedups for the Preconditioning pass, 3X to 45X speedups for the Direct pass, .X to 4X speedups for the Upward pass (non-critical kernel), 5X to 33X speedups for the Downward pass, X to 3X speedups for the Evaluation pass, and 8X to 3X speedups for the overall FMM iteration.

[Table 2: Runtime results of key FMM steps on CPU and GPU after cube clustering (per-step CPU/GPU times in ms with per-step speedups for test1-test6).]

Furthermore, Table 3 shows the results of complete capacitance extraction runs, where the original CPU-based FMM algorithm and our FMMGpu algorithm have been run for all test cases. As observed, both algorithms converge in the same number of GMRES iterations.

[Table 3: Capacitance extraction results on CPU and GPU (clustering-based). N_i is the number of GMRES iterations for a given error tolerance.]

Including the CPU-based computations, such as the calculation of GMRES residues and the one-step Arnoldi process on the CPU, the total runtime speedups of capacitance extraction on GPU are slightly smaller than the speedup numbers obtained for the single-FMM-iteration runs shown in Table 2. Finally, Fig. 6 shows the GPU runtime and speedup results of a single FMM iteration (SPMV) with and without the proposed workload balancing and concurrent kernel execution schemes. We obtain 8X to 3X speedups without concurrent kernel executions, and X to 3X speedups with concurrent kernel executions.

[Figure 6: FMMGpu runtime with and without workload balancing and concurrent kernel executions.]

It should be noted that running FMM on a multi-core CPU may bring very limited performance improvement (3X speedups are reported on a quad-core machine [4]), while our FMMGpu capacitance extraction easily brings more than X speedups for all test cases, achieving much higher runtime and energy efficiencies.
5. CONCLUSIONS
In this paper, we present a GPU-accelerated fast multipole algorithm, FMMGpu, for fast parallel 3-D capacitance extraction. As shown in extensive experiments, the proposed GPU-friendly FMMGpu algorithm flow and data structures enable highly efficient massively parallel computing on GPU. We obtain up to 3X speedups by running the capacitance extraction program on GPU when compared with CPU-based serial executions. A simple yet effective workload balancing method is also proposed to facilitate concurrent kernel executions on the latest Fermi GPUs, which further improves the parallel FMM computing efficiency by % to 6% for a set of small test cases.

6. REFERENCES
[1] K. Nabors and J. White. FastCap: a multipole accelerated 3-D capacitance extraction program. IEEE Trans. on Computer-Aided Design, 10(11):1447-1459, Nov. 1991.
[2] J. Phillips and J. White. A precorrected-FFT method for electrostatic analysis of complicated 3-D structures. IEEE Trans. on Computer-Aided Design, 16(10):1059-1072, Oct. 1997.
[3] W. Shi, J. Liu, N. Kakani, and T. Yu. A fast hierarchical algorithm for 3-D capacitance extraction. In IEEE/ACM DAC, June 1998.
[4] F. Gong, H. Yu, and L. He. PiCAP: A parallel and incremental capacitance extraction considering stochastic process variation. In IEEE/ACM DAC, Jul. 2009.
[5] R. Iverson and Y. Le Coz. A stochastic algorithm for high speed capacitance extraction in integrated circuits. Solid-State Electronics, 35(7), 1992.
[6] T. El-Moselhy, I. Elfadel, and L. Daniel. A hierarchical floating random walk algorithm for fabric-aware 3D capacitance extraction. In IEEE/ACM ICCAD, 2009.
[7] NVIDIA Corporation. Fermi compute architecture white paper. [Online].
[8] N. Gumerov and R. Duraiswami. Fast multipole methods on graphics processors. J. Comput. Phys., 227(18):8290-8313, 2008.
[9] T. Hamada, T. Narumi, R. Yokota, K. Yasuoka, K. Nitadori, and M. Taiji. 42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence. In SC '09, 2009.
[10] K. Nabors, S. Kim, and J. White. Fast capacitance extraction of general three-dimensional structures. IEEE Trans. on Microwave Theory and Techniques, 40(7):1496-1506, Jul. 1992.
[11] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comput. Phys., 73(2):325-348, 1987.
[12] A. Appel. An efficient program for many-body simulation. SIAM Journal on Scientific and Statistical Computing, 6(1):85-103, 1985.
[13] NVIDIA Corporation. NVIDIA CUDA C programming guide. [Online].