The Fast Multipole Method on NVIDIA GPUs and Multicore Processors
1 The Fast Multipole Method on NVIDIA GPUs and Multicore Processors

Toru Takahashi (a), Cris Cecka (b), Eric Darve (c)

(a) Department of Mechanical Science and Engineering, Nagoya University
(b) Institute for Applied Computational Science, Harvard University
(c) Institute for Computational and Mathematical Engineering, Stanford University

May

Takahashi, Cecka, Darve FMM 1 / 25
2 Outline

1. Fast multipole method
2. Multipole-to-local (M2L) operator: blocking and data access pattern
3. Particle-particle interactions
4. Numerical results: multicore CPU and NVIDIA GPU
3 Fast Multipole Method

Applications:
- Gravitational simulations, molecular dynamics, electrostatic forces
- Integral equations for the Laplace, Poisson, Stokes, and Navier kernels; both fluid and solid mechanics
- Interpolation using radial basis functions: meshless methods, graphics (data smoothing)
- Imaging and inverse problems: subsurface characterization and monitoring, imaging
- O(N) fast linear algebra: fast matrix-vector and matrix-matrix products, matrix inverse, LU factorization, for dense and sparse matrices (FastLAPACK)
4 Matrix-vector products

The FMM is a method to calculate

    φ(x_i) = Σ_{j=1}^{N} K(x_i, x_j) σ_j,   1 ≤ i ≤ N

in O(N) or O(N ln^α N) arithmetic operations. There are many different FMM formulations. Most of them rely on a low-rank approximation of the kernel function K of the form

    K(x, y) = Σ_{q=1}^{p} Σ_{s=1}^{p} u_q(x) T_{qs} v_s(y) + ε
5 FMM low-rank approximation

Many formulas are possible to obtain a low-rank approximation. We use an approach based on interpolation formulas for smooth functions. This allows building an FMM that is:
- Very general: any smooth (even oscillatory) kernel can be used.
- Simple to apply: we only need a routine that evaluates K.
- Rapidly convergent, with optimal rank for a given L2 error.
It is, however, less efficient than specialized methods for K(x, y) = 1/|x − y|.
6 Interpolation scheme

The low-rank approximation is obtained through a Chebyshev interpolation scheme:

    K(x, y) ≈ Σ_l Σ_m R_n(x̄_l, x) K(x̄_l, ȳ_m) R_n(ȳ_m, y)

where R_n denotes the Chebyshev interpolation operator of order n and x̄_l, ȳ_m are the Chebyshev nodes in the target and source clusters.

(Figure: Chebyshev nodes x̄_0, …, x̄_7 on an interval.)

W. Fong, E. Darve. The black-box fast multipole method. J. Comput. Phys., 228(23), 2009.
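A minimal 1D sketch of this scheme (a hypothetical setup, not the bbFMM code): interpolate K(x, y) = 1/|x − y| between two separated intervals on Chebyshev nodes and check the accuracy of the resulting low-rank approximation. The intervals, order, and point counts are demo choices:

```python
import numpy as np

def cheb_nodes(n, a, b):
    """n Chebyshev nodes mapped to the interval [a, b]."""
    k = np.arange(n)
    return 0.5 * (a + b) + 0.5 * (b - a) * np.cos((2 * k + 1) * np.pi / (2 * n))

def interp_matrix(nodes, pts):
    """R[i, l] = l-th Lagrange basis on the given nodes, evaluated at pts[i]."""
    R = np.ones((len(pts), len(nodes)))
    for l in range(len(nodes)):
        for m in range(len(nodes)):
            if m != l:
                R[:, l] *= (pts - nodes[m]) / (nodes[l] - nodes[m])
    return R

n = 8
x = np.linspace(0.0, 1.0, 50)        # targets in [0, 1]
y = np.linspace(3.0, 4.0, 50)        # sources in [3, 4], well separated
xb = cheb_nodes(n, 0.0, 1.0)
yb = cheb_nodes(n, 3.0, 4.0)

# Exact kernel matrix vs. interpolated low-rank form R_x K(xb, yb) R_y^T.
K = 1.0 / np.abs(x[:, None] - y[None, :])
K_nodes = 1.0 / np.abs(xb[:, None] - yb[None, :])
Kt = interp_matrix(xb, x) @ K_nodes @ interp_matrix(yb, y).T

rel_err = np.abs(K - Kt).max() / np.abs(K).max()
print(rel_err)   # decays rapidly with the interpolation order n
```

The only kernel-specific ingredient is the routine evaluating K, which is what makes the interpolation approach "black box".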
7 FMM operators

(Figure: the M2M, M2L, and L2L operators acting between levels of the cluster tree.)

The M2M and L2L operators are transposes of each other. The M2L operator depends only on the relative position of the two clusters; it is independent of the location of the particles inside each cluster.
8 Data structures and well-separated clusters

(Figure 1: two spheres containing r_i and r_j, illustrating the restricted domain of validity of the low-rank approximation (1).)

(Figure 2, 2D illustration: (a) a square is subdivided into four quadrants; the well-separated condition (1) does not apply to any pair of quadrants. (b) Each quadrant is further subdivided into 4 quadrants; the interaction between particles in the two gray clusters can now be computed using the low-rank approximation (1), while the interaction between the gray cluster and its neighboring clusters cannot be computed at this level of the tree and requires a further decomposition of each quadrant.)

Each octant is further decomposed into octants, which allows computing all interactions between pairs of clusters that are not nearest neighbors. Interactions between particles in neighboring clusters are computed by further subdividing each octant into smaller octants. This process is repeated recursively until the clusters are small enough that interactions between particles in neighboring clusters can be computed directly, that is, by direct evaluation of Σ_j K(r_i, r_j) σ_j.

In a typical situation, a cluster is required to interact only with clusters that are neighbors (i.e., they share a vertex, edge, or face) of its parent cluster. Since it cannot interact with clusters that are its own neighbors, each cluster therefore interacts with only a relatively small number of clusters: at most 189 (= 6^3 − 3^3) in 3D.

(Figure 3, 2D illustration: basic interaction list. The gray cluster near the center has to interact with all the light gray clusters around it; in 3D there are 189 of them.)

(Figure 4: illustration of the three passes in the FMM: the upward pass (P2M, M2M), the M2L pass, and the downward pass (L2L, L2P).)

E. Darve, C. Cecka, T. Takahashi. The fast multipole method on parallel clusters, multicore processors, and graphics processing units. Comptes Rendus Mécanique, 339(2-3), 2011.
9 M2L operator

The M2L operator K(x̄_l, ȳ_m) needs to be applied many times. It is independent of the particle positions x_i, which allows certain optimizations to be computed off-line. A singular value decomposition is used to compress the operator K(x̄_l, ȳ_m) to an optimal rank r:

    K̃_IJ = Ω_o^{1/2} K_IJ Ω_s^{1/2} = U Σ V^T

The cost of applying the compressed operator is 2rN instead of N^2.
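The off-line compression can be sketched as follows (a stand-in operator with synthetic singular-value decay, not the actual M2L matrix): the truncated factors are computed once, and each on-line application then costs two thin products, roughly 2rp flops for a p x p operator:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in M2L operator: a p x p matrix with rapidly decaying singular
# values, mimicking K(xbar_l, ybar_m) for well-separated node sets.
p, r = 64, 16
U0, _, V0t = np.linalg.svd(rng.standard_normal((p, p)))
decay = 0.5 ** np.arange(p)          # synthetic singular-value decay
T = (U0 * decay) @ V0t               # T = U0 diag(decay) V0^T

# Off-line: compress once with an SVD truncated to rank r.
U, s, Vt = np.linalg.svd(T)
Ur = U[:, :r] * s[:r]                # p x r, singular values folded in
Vr = Vt[:r, :]                       # r x p

# On-line: each application is two thin matrix-vector products.
M = rng.standard_normal(p)           # multipole coefficients
L_full = T @ M                       # p*p flops (x2)
L_fast = Ur @ (Vr @ M)               # ~2*r*p flops (x2)

rel_err = np.linalg.norm(L_full - L_fast) / np.linalg.norm(L_full)
print(rel_err)   # bounded by the first discarded singular value
```

Since the factors are reused for every cluster pair and every iteration, the one-time SVD cost is quickly amortized.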
10 Convergence of Chebyshev interpolation and SVD compression

(Left plot: relative error ε vs. SVD cutoff for K = 1/r on a 2D prolate mesh, for interpolation orders n = 3, …, 10. Right plot: error vs. target accuracy ε_ACA for K = e^{ikr}/r, for levels l = 4, …, 13.)

M. Messner, M. Schanz, E. Darve. Fast directional multilevel summation for oscillatory kernels based on Chebyshev interpolation. J. Comput. Phys.
11 M2L operator

Inside a given level, the M2L operator leads to a set of nested loops:

    for all clusters C in this level {
      for all clusters C' in the interaction list of C {
        for q = 1 to p {
          for s = 1 to p {
            L(q)(C) += T(q,s)(C,C') * M(s)(C');
          }
        }
      }
    }

C and C' are clusters that are well separated. The matrix T(q,s) is the M2L operator; the vector M holds the multipole coefficients and L the local coefficients. The core operation is therefore a matrix-vector product.
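The nested loops can be sketched in runnable form. The 1D level layout, interaction-list rule, and random transfer matrices below are made up for the demo, but T is indexed by the relative offset of the pair only, as noted on slide 7:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 8                 # number of multipole/local coefficients per cluster
n_cells = 8           # clusters at this level, laid out on a 1D line of cells

M = {c: rng.standard_normal(p) for c in range(n_cells)}   # multipole coeffs
L = {c: np.zeros(p) for c in range(n_cells)}              # local coeffs

# T depends only on the relative offset of the pair (well separated: |o| >= 2).
offsets = [o for o in range(-n_cells + 1, n_cells) if abs(o) >= 2]
T = {o: rng.standard_normal((p, p)) for o in offsets}

for c in range(n_cells):                       # all clusters C in this level
    for cp in range(n_cells):                  # clusters C' in the list of C
        o = cp - c
        if abs(o) >= 2:
            L[c] += T[o] @ M[cp]               # core op: matrix-vector product

# Sanity check: explicit quadruple loop for cluster 0 gives the same result.
ref = np.zeros(p)
for cp in range(n_cells):
    if abs(cp) >= 2:
        for q in range(p):
            for s in range(p):
                ref[q] += T[cp][q, s] * M[cp][s]
```

Writing the innermost (q, s) loops as a single `T[o] @ M[cp]` is exactly the matrix-vector view that the blocking strategies on the next slides build on.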
12 Blocking strategies

Four different strategies:

1. Basic approach: matrix-vector products.

2. Sibling blocking: reuse T(q,s)(C,C') for all pairs (C,C') that share the same T. This allows using matrix-matrix products instead of matrix-vector products (64 MV products become 27 MM products).

(Figure: sibling source clusters S_0, …, S_3 and observation clusters O_0, …, O_3 sharing the same transfer matrix T(q,s)(C,C').)

3. Further block siblings into larger chunks. This is possible if the number of clusters is sufficient.
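The equivalence behind sibling blocking is the column-stacking identity below (demo sizes are arbitrary): applying one shared T to several multipole vectors as a single matrix-matrix product gives the same result as the individual matrix-vector products, while reading T from memory only once:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n_siblings = 8, 4

# One transfer matrix T shared by several (C, C') pairs with the same
# relative position, and the multipole vectors of those source siblings.
T = rng.standard_normal((p, p))
M = rng.standard_normal((p, n_siblings))   # columns = multipole vectors

# Basic approach: one matrix-vector product per pair (T is re-read each time).
L_mv = np.stack([T @ M[:, k] for k in range(n_siblings)], axis=1)

# Sibling blocking: one matrix-matrix product over the stacked vectors.
L_mm = T @ M

assert np.allclose(L_mv, L_mm)   # identical results, better data reuse
```

The matrix-matrix form maps to BLAS level 3 (or a batched GPU kernel), which is what raises the arithmetic intensity of the M2L phase.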
13 Fourth blocking strategy

4. Move the (q, s) loop outside. This allows the greatest degree of reuse of T(q,s)(C,C') across all 316 pairs of clusters. Downside: the calculation is slightly more irregular due to the different treatment of each sibling. This only works at level 3 and below.

T. Takahashi, C. Cecka, E. Darve. Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units. Int. J. Numer. Meth. Engng.
14 Arithmetic intensity

(Left table: arithmetic intensity of each M2L algorithm variant, i.e., the ratio of the number of flops performed by the kernel to the number of words read from memory; low-intensity variants are memory bound, high-intensity variants arithmetic bound. Right table: Gflops achieved by the CPU and by the GPU with algorithms 1-4 for M2L in single precision.)

The arithmetic intensity must be large enough for peak efficiency. Technical details: medium accuracy, rank = 32, N = 10^7, K = 1/r, cube with random uniform distribution, Dell PowerEdge 1950 with two quad-core Intel Xeon CPUs (OpenMP), Tesla C2050.
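The flop-to-word ratios behind such a table can be estimated with a back-of-the-envelope model (illustrative only; it ignores caches, writes, and vector widths): a p x p matrix-vector product does about 2*p*p flops while reading p*p + p words, whereas batching k vectors into a matrix-matrix product reads the matrix once, so the intensity grows with k:

```python
def intensity_mv(p):
    """Flops per word read for one p x p matrix-vector product."""
    return 2.0 * p * p / (p * p + p)

def intensity_mm(p, k):
    """Flops per word read when k vectors are batched into one product."""
    return 2.0 * p * p * k / (p * p + p * k)

p = 32
print(intensity_mv(p))        # close to 2 flops/word: memory bound
print(intensity_mm(p, 27))    # substantially higher with sibling blocking
```

This is why the blocked variants of the M2L operator move from memory bound toward arithmetic bound.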
15 Comparison of algorithms on CPU (under development)

Gflops in the M2L computation (tree depth in parentheses); algorithms 2 and 4 on a 6-core Xeon X5680 3.33 GHz CPU:

    Alg.  P  n   d   Gflops (depth)
    2     s  4   0      (3)  64(4)  65(5)  65(6)  65(7)
    4     s  4   0     5(3)  12(4)  22(5)  25(6)  27(7)
    2     s  8  -1      (2)  28(3)  31(4)  31(5)  32(6)
    4     s  8  -1     1(2)   5(3)  13(4)  22(5)  27(6)

P: precision [s|d]; n: parameter that determines the accuracy [4|8]; d: parameter that adjusts the depth of the tree; N: number of particles (one column per N).
16 Short-range interactions

Assign one thread-block to one leaf box T, with I threads per thread-block. The I threads process I observation points of T in each iteration. For every source box S in the neighbor list N(T), all the threads load data into shared memory (x_j and σ_j) for the J sources in S. The loop over the sources (the summation over j) is unrolled to improve performance. With this approach a high arithmetic intensity can be reached.

(Figure: leaf box T with a group of I observation points x_i; each thread copies data to shared memory and computes the (i, j) interactions with the J sources of charge σ_j at x_j in source box S.)
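A CPU-side sketch of this tiling (a Python stand-in for the CUDA kernel; the box contents and tile size are made up): each chunk of sources is staged once and applied to all observation points, which is what the shared-memory scheme exploits to raise arithmetic intensity:

```python
import numpy as np

rng = np.random.default_rng(4)
N, TILE = 64, 16               # TILE plays the role of the shared-memory chunk

x = rng.random((N, 3))         # observation points of leaf box T
y = rng.random((N, 3)) + 2.0   # sources of a neighbor box S (disjoint points)
sigma = rng.random(N)          # source charges

# Reference: direct double loop, phi_i = sum_j sigma_j / |x_i - y_j|.
phi_ref = np.array([sum(sigma[j] / np.linalg.norm(x[i] - y[j])
                        for j in range(N))
                    for i in range(N)])

# Tiled evaluation mimicking the GPU scheme: stage TILE sources at a time
# (here the staging is a slice; on the GPU it is a copy to shared memory),
# then let every observation point consume the staged chunk.
phi = np.zeros(N)
for j0 in range(0, N, TILE):
    ys, ss = y[j0:j0 + TILE], sigma[j0:j0 + TILE]          # staged chunk
    d = np.linalg.norm(x[:, None, :] - ys[None, :, :], axis=2)
    phi += (ss / d).sum(axis=1)

assert np.allclose(phi, phi_ref)   # tiling does not change the result
```

Each staged chunk is read once from "global" memory but reused by all N observation points, so the flops-per-word ratio grows with the tile size.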
17 Validation of both codes

Relative l2 error of the CPU and GPU codes versus the direct O(N^2) code, for K(x, y) = 1/|x − y| and points randomly uniformly distributed in a box:

(Table: for n = 4, d = 0, the single- and double-precision errors are about 1e-4; for n = 8, d = -1, the single-precision errors are about 1e-5 and the double-precision errors about 1e-7. The CPU and GPU codes agree.)
18 Hardware used for tests

(Table: processor characteristics of the Xeon X5680 (six-core, 3.33 GHz, two sockets, 12 cores in total) and the Tesla C2050: peak performance in Gflop/s (top number single precision, bottom double precision), bandwidth in GB/s, and flop-to-word ratio.)

Hyper-threading is not used. GPU peak performance is quoted for the MAD operation. CPU bandwidth is 32 GB/s per socket. ECC is enabled on the GPU.
19 Benchmark results: serial CPU breakdown

(Four plots: percentage of the running time vs. N for the serial CPU code, in single and double precision, for (n = 4, d = 0) and (n = 8, d = -1); the components are setup, P2M, M2M, M2L, L2L, L2P, and direct.)

P = particle, M = multipole expansion, L = local expansion. The direct phase (P2P, for leaf clusters) and the M2L phase occupy 60-98% of the total running time.
20 Benchmark results: CPU speedup using 12 cores (OpenMP), single precision

(Two plots: speed-up vs. N for (n = 4, d = 0) and (n = 8, d = -1), showing the total together with the direct, L2P, L2L, M2M, P2M, and setup phases.)
21 Benchmark results: CPU speedup using 12 cores (OpenMP), double precision

(Two plots: speed-up vs. N for (n = 4, d = 0) and (n = 8, d = -1), showing the total together with the direct, L2P, L2L, M2M, P2M, and setup phases.)
22 Speed-up of GPU vs. CPU-12x

(Four plots: GPU vs. CPU speed-up (total and direct) vs. N, in single and double precision, for (n = 4, d = 0) and (n = 8, d = -1).)
23 Gflops performance: M2L computation

Gflops in the M2L computation (tree depth in parentheses):

    Platform  P  n   d   Gflops (depth)
    CPU-12x   s  4   0     (3) 120(4) 123(5) 126(6) 128(7)
    GPU       s  4   0     (3) 173(4) 302(5) 359(6) 379(7)
    CPU-12x   s  8  -1     (2)  39(3)  41(4)  43(5)  46(6)
    GPU       s  8  -1     (2)  65(3) 192(4) 319(5) 375(6)
    CPU-12x   d  4   0     (3)  73(4)  75(5)  75(6)  NA
    GPU       d  4   0     (3)  83(4) 140(5) 165(6)  NA
    CPU-12x   d  8  -1     (2)  18(3)  19(4)  20(5)  NA
    GPU       d  8  -1     (2)  30(3)  88(4) 146(5)  NA
24 Gflops performance: P2P computation

Gflops in the P2P computation (tree depth in parentheses):

    Platform  P  n   d   Gflops (depth)
    CPU-12x   s  4   0     (3) 169(4) 182(5) 194(6) 210(7)
    GPU       s  4   0     (3) 282(4) 376(5) 491(6) 629(7)
    CPU-12x   s  8  -1     (2) 261(3) 275(4) 282(5) 286(6)
    GPU       s  8  -1     (2) 705(3) 826(4) 852(5) 861(6)
    CPU-12x   d  4   0     (3)  38(4)  39(5)  40(6)  NA
    GPU       d  4   0     (3)  96(4) 124(5) 157(6)  NA
    CPU-12x   d  8  -1     (2)  38(3)  40(4)  41(5)  NA
    GPU       d  8  -1     (2) 207(3) 243(4) 249(5)  NA
25 Conclusion

- Efficient and flexible fast multipole method: it can be used with any analytic kernel, smooth or oscillatory.
- A good fraction of peak performance can be achieved. There is still room for improvement (e.g., instruction throughput optimization).
- We have not considered the case of an adaptive tree with a varying number of levels.
- Current work: use StarPU to parallelize the method on multiple GPUs and CPUs; overcome the tree-root bottleneck by dynamically scheduling the M2M, M2L, L2L, and P2P operators; increase arithmetic intensity using symmetries.
- The code used for this work, bbfmm, is available online.
More informationFast and reliable linear system solutions on new parallel architectures
Fast and reliable linear system solutions on new parallel architectures Marc Baboulin Université Paris-Sud Chaire Inria Saclay Île-de-France Séminaire Aristote - Ecole Polytechnique 15 mai 2013 Marc Baboulin
More informationAutomated Finite Element Computations in the FEniCS Framework using GPUs
Automated Finite Element Computations in the FEniCS Framework using GPUs Florian Rathgeber (f.rathgeber10@imperial.ac.uk) Advanced Modelling and Computation Group (AMCG) Department of Earth Science & Engineering
More informationMAGMA. Matrix Algebra on GPU and Multicore Architectures
MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/
More informationAsynchronous OpenCL/MPI numerical simulations of conservation laws
Asynchronous OpenCL/MPI numerical simulations of conservation laws Philippe HELLUY 1,3, Thomas STRUB 2. 1 IRMA, Université de Strasbourg, 2 AxesSim, 3 Inria Tonus, France IWOCL 2015, Stanford Conservation
More informationBig Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures
Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid
More informationA Random Variable Shape Parameter Strategy for Radial Basis Function Approximation Methods
A Random Variable Shape Parameter Strategy for Radial Basis Function Approximation Methods Scott A. Sarra, Derek Sturgill Marshall University, Department of Mathematics, One John Marshall Drive, Huntington
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationAccelerating Implicit LS-DYNA with GPU
Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,
More informationApplications of Berkeley s Dwarfs on Nvidia GPUs
Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse
More informationParallel and Distributed Systems Lab.
Parallel and Distributed Systems Lab. Department of Computer Sciences Purdue University. Jie Chi, Ronaldo Ferreira, Ananth Grama, Tzvetan Horozov, Ioannis Ioannidis, Mehmet Koyuturk, Shan Lei, Robert Light,
More information1.2 Numerical Solutions of Flow Problems
1.2 Numerical Solutions of Flow Problems DIFFERENTIAL EQUATIONS OF MOTION FOR A SIMPLIFIED FLOW PROBLEM Continuity equation for incompressible flow: 0 Momentum (Navier-Stokes) equations for a Newtonian
More informationNasim Mahmood, Guosheng Deng, and James C. Browne
Nasim Mahmood, Guosheng Deng, and James C. Browne Motivation Goals Programming Model Example Case Study Conclusions Current and Ongoing Work Future Directions Optimization and adaptation of parallel programs
More informationSparse Matrices in Image Compression
Chapter 5 Sparse Matrices in Image Compression 5. INTRODUCTION With the increase in the use of digital image and multimedia data in day to day life, image compression techniques have become a major area
More informationImplementation of Adaptive Coarsening Algorithm on GPU using CUDA
Implementation of Adaptive Coarsening Algorithm on GPU using CUDA 1. Introduction , In scientific computing today, the high-performance computers grow
More informationOpenFOAM + GPGPU. İbrahim Özküçük
OpenFOAM + GPGPU İbrahim Özküçük Outline GPGPU vs CPU GPGPU plugins for OpenFOAM Overview of Discretization CUDA for FOAM Link (cufflink) Cusp & Thrust Libraries How Cufflink Works Performance data of
More informationCUDA Experiences: Over-Optimization and Future HPC
CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign
More informationLarge and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference
Large and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference 2012 Waters Corporation 1 Agenda Overview of LC/IMS/MS 3D Data Processing 4D Data Processing
More informationACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016
ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016 Challenges What is Algebraic Multi-Grid (AMG)? AGENDA Why use AMG? When to use AMG? NVIDIA AmgX Results 2
More informationHierarchical Low-Rank Approximation Methods on Distributed Memory and GPUs
JH160041-NAHI Hierarchical Low-Rank Approximation Methods on Distributed Memory and GPUs Rio Yokota(Tokyo Institute of Technology) Hierarchical low-rank approximation methods such as H-matrix, H2-matrix,
More informationFast Methods with Sieve
Fast Methods with Sieve Matthew G Knepley Mathematics and Computer Science Division Argonne National Laboratory August 12, 2008 Workshop on Scientific Computing Simula Research, Oslo, Norway M. Knepley
More informationJournal of Engineering Research and Studies E-ISSN
Journal of Engineering Research and Studies E-ISS 0976-79 Research Article SPECTRAL SOLUTIO OF STEADY STATE CODUCTIO I ARBITRARY QUADRILATERAL DOMAIS Alavani Chitra R 1*, Joshi Pallavi A 1, S Pavitran
More informationOptimizing Data Locality for Iterative Matrix Solvers on CUDA
Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,
More informationAn Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos
An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm Martin Burtscher Department of Computer Science Texas State University-San Marcos Mapping Regular Code to GPUs Regular codes Operate on
More informationTwo-Phase flows on massively parallel multi-gpu clusters
Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous
More informationFFT, FMM, OR MULTIGRID? A COMPARATIVE STUDY OF STATE-OF-THE-ART POISSON SOLVERS FOR UNIFORM AND NON-UNIFORM GRIDS IN THE UNIT CUBE
FFT, FMM, OR MULTIGRID? A COMPARATIVE STUDY OF STATE-OF-THE-ART POISSON SOLVERS FOR UNIFORM AND NON-UNIFORM GRIDS IN THE UNIT CUBE AMIR GHOLAMI, DHAIRYA MALHOTRA, HARI SUNDAR, AND GEORGE BIROS Abstract.
More informationIntroducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method
Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method G. Wellein, T. Zeiser, G. Hager HPC Services Regional Computing Center A. Nitsure, K. Iglberger, U. Rüde Chair for System
More informationParallel Interpolation in FSI Problems Using Radial Basis Functions and Problem Size Reduction
Parallel Interpolation in FSI Problems Using Radial Basis Functions and Problem Size Reduction Sergey Kopysov, Igor Kuzmin, Alexander Novikov, Nikita Nedozhogin, and Leonid Tonkov Institute of Mechanics,
More informationEvaluating the Potential of Graphics Processors for High Performance Embedded Computing
Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng Department of Micro-/Nano-electronics Tsinghua University Outline
More informationA comparison of parallel rank-structured solvers
A comparison of parallel rank-structured solvers François-Henry Rouet Livermore Software Technology Corporation, Lawrence Berkeley National Laboratory Joint work with: - LSTC: J. Anton, C. Ashcraft, C.
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationStudy and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou
Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm
More informationANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation
ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation Ray Browell nvidia Technology Theater SC12 1 2012 ANSYS, Inc. nvidia Technology Theater SC12 HPC Revolution Recent
More informationS0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS
S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS John R Appleyard Jeremy D Appleyard Polyhedron Software with acknowledgements to Mark A Wakefield Garf Bowen Schlumberger Outline of Talk Reservoir
More informationHow HPC Hardware and Software are Evolving Towards Exascale
How HPC Hardware and Software are Evolving Towards Exascale Kathy Yelick Associate Laboratory Director and NERSC Director Lawrence Berkeley National Laboratory EECS Professor, UC Berkeley NERSC Overview
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationLong time integrations of a convective PDE on the sphere by RBF collocation
Long time integrations of a convective PDE on the sphere by RBF collocation Bengt Fornberg and Natasha Flyer University of Colorado NCAR Department of Applied Mathematics Institute for Mathematics Applied
More information