The Fast Multipole Method on NVIDIA GPUs and Multicore Processors

Size: px
Start display at page:

Download "The Fast Multipole Method on NVIDIA GPUs and Multicore Processors"

Transcription

1 The Fast Multipole Method on NVIDIA GPUs and Multicore Processors Toru Takahashi, a Cris Cecka, b Eric Darve c a b c Department of Mechanical Science and Engineering, Nagoya University Institute for Applied Computational Science, Harvard University Institute for Computational and Mathematical Engineering, Stanford University May Takahashi, Cecka, Darve FMM 1 / 25

2 Outline 1 Fast multipole method 2 Multipole-to-local operator: blocking and data access pattern 3 Particle-particle interactions 4 Numerical results: multicore CPU and NVIDIA GPU Takahashi, Cecka, Darve FMM 2 / 25

3 Fast Multipole Method Applications: Gravitational simulations, molecular dynamics, electrostatic forces Integral equations for Laplace, Poisson, Stokes, Navier. Both fluid and solid mechanics. Interpolation using radial basis functions: meshless methods, graphics (data smoothing) Imaging and inverse problems: subsurface characterization and monitoring, imaging O(N) fast linear algebra: fast matrix-vector and matrix-matrix products, matrix inverse, LU factorization, for dense and sparse matrices, FastLAPACK Takahashi, Cecka, Darve FMM 3 / 25

4 Matrix-vector products FMM: a method to calculate N φ(x i ) = K(x i, x j ) σ j, j=1 1 i N in O(N) or O(N ln α N) arithmetic operations. There are many different FMM formulations. Most of them rely on low-rank approximations of the kernel function K in the form: p K(x, y) = u q (x) q=1 p T qs v s (y) + ε s=1 Takahashi, Cecka, Darve FMM 4 / 25

5 FMM low-rank approximation Many formulas are possible to obtain a low-rank approximation. We use an approach based on interpolation formulas for smooth functions. This allows building an FMM that is: Very general: any smooth (even oscillatory) kernels can be used. Simple to apply: we only need a routine that evaluates K. Converges rapidly and has optimal rank given an L 2 error. It is however less efficient than specialized methods for K(x, y) = 1/ x y. Takahashi, Cecka, Darve FMM 5 / 25

6 Interpolation scheme Low-rank approximation is obtained through a Chebyshev interpolation scheme: ( ) l λ K(x, y) 2 k+1 R n ( x l, x) K( x l, ȳ m ) R n (ȳ m, y) l m x 0 x 1 x 0 x 1 x 2 x 3 x 4 x 5 x 6 x 7 W. Fong, E. Darve, The black-box fast multipole method, J. Comput. Phys., 228(23) (2009) Takahashi, Cecka, Darve FMM 6 / 25

7 FMM operators x 0 x 1 x 2 x 3 x 4 x 5 x 6 x 7 L2L M2M L2L x 0 x 1 x 2 x 3 x 4 x 5 x 6 x 7 x 0 x 1 x 2 x 3 x 4 x 5 x 6 x 7 x 0 x 1 x 2 x 3 x 4 x 5 x 6 x 7 x 0 x 1 x 2 x 3 x 4 x 5 x 6 x 7 M2M and L2L operators are transpose of each other. The operator depends only on the relative position of each cluster. It is independent of the location of particles inside each cluster. Takahashi, Cecka, Darve FMM 7 / 25

8 C C C 0 C 0 r Data structures and well-separated j r j of clusters P j K(ri,rj) j. each octant into smaller octants. This process is recursively repeated until the clusters are small enough tha interactions between particles in neighboring clusters can be computed directly, that is by direct evaluatio In a typical situation, a cluster is required to interact only with clusters that are neighbors (i.e., they shar a vertex, edge, or face) of their parent cluster. In addition, since they cannot interact with clusters that ar their own neighbors, each cluster is therefore only required to interact with a relatively small number o Figure 1: Figure Two 1: spheres Two spheres containing containing r i and r ji illustrating and r j illustrating the restricted the restricted domain domain of validity of validity of (1). of (1). clusters, at most 189 (= ) in 3D. This is shown in Figure 3. E. Darve, C. Cecka, T. Takahashi (a) (a) (b) (b) Figure 3: 2D illustration: basic interaction list. In a typical situation, the gray cluster near the center has t interact with all the light gray clusters around it. In 3D there are 189 of them. Figure 2: Figure 2D illustration: 2: 2D illustration: (a) A square (a) A is square subdivided is subdivided into four into quadrants; four quadrants; approximation approximation (1) (well-separated (1) (well-separated condition) condition) does not does apply notoapply any pair to any of quadrants. pair of quadrants. (b) Each(b) quadrant Each quadrant from Figure from 2(a) Figure is further 2(a) is further subdivided subdivided into 4 quadrants. into 4 quadrants. The interaction The interaction between between particles particles in the two in the gray two clusters gray clusters can Treenow Root can benow computed be computed using the using the low-rank low-rank approximation approximation (1). The(1). interaction The interaction between between the graythe cluster gray and cluster its neighboring and its neighboring clusters clusters (with the (with the diagonal diagonal pattern) pattern) can notcan be computed not be computed at this level at this oflevel the tree. of the A tree. further A further decomposition decomposition of eachof quadrant each quadrant is is required. required. Upward Pass Downward Pass Instead Instead each octant eachis octant further is further decomposed decomposed M into octants as shown in Figure 2(b). This allows computing C0 into octants as shown in Figure 2(b). This allows computing L C0 s q all interactions all interactions between between pairs ofpairs clusters of clusters that arethat not nearest not nearest neighbors, neighbors, e.g., thee.g., pairthe of pair grayof clusters gray clusters in in Figure Figure 2(b). Interactions 2(b). Interactions betweenm2m particles particles in neighboring in neighboring clusters clusters are computed are computed by further by further subdividing subdividing L2L each octant eachinto octant smaller into smaller octants. octants. This process This process is recursively is recursively repeated repeated until theuntil clusters the clusters are small areenough small enough that that interactions between particles in neighboring clusters can be computed directly, that is by direct evaluation of P interactions between particles in neighboring clusters can be computed directly, that is by direct evaluation j K(ri,rj) of P j K(ri,rj) j. j. Ms C L C q In a typical In a typical situation, situation, a cluster a is cluster required is required to interact to interact only with only clusters with clusters that arethat neighbors are neighbors (i.e., they (i.e., share they share a vertex, E. Darve, a vertex, edge, C. oredge, Cecka, face) or of and face) their T. of parent Takahashi. theircluster. parent The cluster. In addition, fast Figure In multipole addition, since 4: Illustration they method since cannot they on of parallel cannot interact the three interact clusters, with passes clusters with multicore inclusters the thatfmm. are processors, are and graphics processing units. theircomptes own neighbors, own Rendus neighbors, each Mécanique, cluster each cluster is 339(2 3): , therefore is therefore only required only required to interact to interact with a relatively with a relatively small number small number of of clusters, clusters, at mostat 189 most (= 189 Takahashi, 6 3 (= 3 3 6) 3 in Cecka, 3D. 3 3 ) This in 3D. Darve is shown This is in shown Figure in 3. Figure 3. FMM 8 / 25

9 operator The operator K( x l, ȳ m ) needs to be applied many times. It is independent of the particles positions x i. This allows certain optimizations to be computed off-line. A singular value decomposition is used to compress the operator K( x l, ȳ m ) to an optimal rank: K IJ = = Ω 1/2 o U Σ V T Ω 1/2 s = K IJ σ J = ( σ J ) Cost = 2rN instead of N 2 Takahashi, Cecka, Darve FMM 9 / 25

10 Convergence of Chebyshev interpolation and SVD compression omputational time in R 2 ; is the prescribed accuracy for ACA; is the Chebyshev interpolation order n = 3 n = 4 n = 5 n = 6 n = 7 n = 8 n = 9 n = ,000 Relative error ε L l=4 l=5 l=6 l=7 l=8 l=9 l=10 l=11 l=12 l= Target accuracy ε ACA Fig. 8. Example geometries. Error vs SVD cutoff for K = 1/r 2D prolate mesh Error vs target accuracy for K = e ikr /r e not developed M. Messner, a method M. to Schanz, optimize and thee. parameters Darve. Fast in the directional method. Asmultilevel a result thesummation running times for oscillatory kernels be improved by changing our choice of parameters. In order to be consistent throughout all studies, based on Chebyshev interpolation. J. Comp. Phys., he following parameter setting. Recall that w and k denote the length of the side of a cluster and the Takahashi, Cecka, Darve FMM 10 / 25

11 operator The operator, inside a given level, leads to a set of nested loops: For all cluster C in this level { For all cluster C in the interaction list of C { For all q=1 to p { For all s=1 to p { L(q)(C) += T(q,s)(C,C ) M(s)(C ); } } } } C and C are clusters that are well separated. The matrix T(q,s) is the operator. The vector M has the multipole coefficients and L the local coefficients. The core operation is therefore a matrix-vector product. Takahashi, Cecka, Darve FMM 11 / 25

12 Blocking strategies Four different strategies: 1 Basic approach: matrix-vector products. 2 Sibling blocking. Reuse T(q,s)(C,C ) for all (C,C ) that share the same T. This allows using matrix-matrix products instead of matrix-vector products. 64 MV 27 MM T(q,s)(C,C ) S 0 S 1 S 2 S 3 S 0 S 1 S 0 S 1 S 0 S 1 S 2 S 3 O 0 O 1 O 0 O 1 S 2 S 3 O 0 O 1 S 2 S 3 O 0 O 1 O 2 O 3 O 2 O 3 O 2 O 3 O 2 O 3 3 Further block siblings into larger chunks. This is possible if the number of clusters is sufficient. Takahashi, Cecka, Darve FMM 12 / 25

13 Fourth blocking strategy C C 4 Move the q s loop outside. This allows the greatest degree of reuse of T(q,s)(C,C ) across all 316 pairs of clusters. Downside: calculation is slightly more irregular due to the different treatment of each sibling. This only works at level 3 and below. T. Takahashi, C. Cecka, and E. Darve. Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units. Int. J. Num. Meth. Engr., Takahashi, Cecka, Darve FMM 13 / 25

14 Arithmetic intensity Algorithm variant Arithmetic intensity Algorithm Algorithm Algorithm Algorithm Platform and algorithm Gflops CPU GPU A GPU A GPU A GPU A : memory bound; arithmetic bound Left: ratio of number of flops performed by kernel divided by number of words read from memory. Arithmetic intensity should be around for peak efficiency. Right: Performance of in single-precision. Tech. details: medium accuracy, rank = 32, N = 10 7, K = 1/r, cube with random uniform dist., DELL Poweredge 1950 with two quad-core Intel Xeon E GHz CPUs (OpenMP), Tesla C2050. Takahashi, Cecka, Darve FMM 14 / 25

15 Comparison of algorithms on CPU (under development) Alg. P n d N = CPU-6x 2 s (3) 64(4) 65(5) 65(6) 65(7) CPU-6x 4 s 4 0 5(3) 12(4) 22(5) 25(6) 27(7) CPU-6x 2 s (2) 28(3) 31(4) 31(5) 32(6) CPU-6x 4 s 8-1 1(2) 5(3) 13(4) 22(5) 27(6) Gflops in computation (tree depth); algo. 2 and 4 on CPU 6-core Xeon X GHz P: precision [s d]; n: parameter that determines the accuracy [4 8] d: parameter to adjust the depth of the tree; N: number of particles Takahashi, Cecka, Darve FMM 15 / 25

16 Short-range interactions Assign one thread-block to one leaf box T. I : number of threads per thread-block. I threads process I observation points in T each iteration. For every source box S in N (T ), all the threads load data in shared memory (x j and σ j ) for J sources in S. The loop over sources (summation over j) is unrolled to improve performance. With this approach a high-arithmetic intensity can be reached. Leaf box T Group of I observation points x i Thread copies data to shared memory Thread computes (i, j) interactions J sources with charge σ j at x j Source box S Takahashi, Cecka, Darve FMM 16 / 25

17 Validation of both codes P n d CPU s 4 0 GPU s e e e e e e-4 CPU s e e e-5 GPU s e e e-5 CPU d 4 0 GPU d e e e e e e-4 CPU d e e e-7 GPU d e e e-7 Relative l 2 error of the CPU and GPU codes versus N 2 code. K(x, y) = 1/ x y. Points are randomly uniformly distributed in a box. Takahashi, Cecka, Darve FMM 17 / 25

18 Hardware used for tests Peak performance: top is single precision while bottom is double precision. Processor Xeon X5680 Tesla C2050 six-core, 3.33GHz two sockets # of cores in total Peak performance [Gflop/s] Bandwidth [GB/s] Flop-to-word ratio Hyper-threading is not used. Performance for MAD operation. 32 GB/s per socket. ECC is enabled. Takahashi, Cecka, Darve FMM 18 / 25

19 Benchmark results: serial CPU breakdown Time [%] Time [%] CPU code (Single precision, n=4, d=0) N CPU code (Double precision, n=4, d=0) direct L2P L2L M2M P2M setup direct L2P L2L M2M P2M setup Time [%] Time [%] CPU code (Single precision, n=8, d=-1) N CPU code (Double precision, n=8, d=-1) direct L2P L2L M2M P2M setup direct L2P L2L M2M P2M setup P = particle M = multipole exp. L = local exp. Direct (P2P for leaf clusters) and occupy 60 98% of the total running time N N Takahashi, Cecka, Darve FMM 19 / 25

20 Benchmark results: CPU speedup using 12 cores (OpenMP) Single precision Speed-up CPU code (Single precision, n=4, d=0) N Total direct L2P L2L M2M P2M setup Speed-up CPU code (Single precision, n=8, d=-1) N Total direct L2P L2L M2M P2M setup Takahashi, Cecka, Darve FMM 20 / 25

21 Benchmark results: CPU speedup using 12 cores (OpenMP) Double precision Speed-up CPU code (Double precision, n=4, d=0) N Total direct L2P L2L M2M P2M setup Speed-up CPU code (Double precision, n=8, d=-1) N Total direct L2P L2L M2M P2M setup Takahashi, Cecka, Darve FMM 21 / 25

22 Speed-up of GPU vs CPU-12x Speed-up GPU vs CPU (Single precision, n=4, d=0) Total direct Speed-up GPU vs CPU (Single precision, n=8, d=-1) Total direct N N GPU vs CPU (Double precision, n=4, d=0) 4 3 Total direct GPU vs CPU (Double precision, n=8, d=-1) Total direct Speed-up 2 Speed-up N N Takahashi, Cecka, Darve FMM 22 / 25

23 Gflops performance: computation P n d CPU-12x s (3) 120(4) 123(5) 126(6) 128(7) GPU s (3) 173(4) 302(5) 359(6) 379(7) CPU-12x s (2) 39(3) 41(4) 43(5) 46(6) GPU s (2) 65(3) 192(4) 319(5) 375(6) CPU-12x d (3) 73(4) 75(5) 75(6) NA GPU d (3) 83(4) 140(5) 165(6) NA CPU-12x d (2) 18(3) 19(4) 20(5) NA GPU d (2) 30(3) 88(4) 146(5) NA Takahashi, Cecka, Darve FMM 23 / 25

24 Gflops performance: P2P P2P computation P n d CPU-12x s (3) 169(4) 182(5) 194(6) 210(7) GPU s (3) 282(4) 376(5) 491(6) 629(7) CPU-12x s (2) 261(3) 275(4) 282(5) 286(6) GPU s (2) 705(3) 826(4) 852(5) 861(6) CPU-12x d (3) 38(4) 39(5) 40(6) NA GPU d (3) 96(4) 124(5) 157(6) NA CPU-12x d (2) 38(3) 40(4) 41(5) NA GPU d (2) 207(3) 243(4) 249(5) NA Takahashi, Cecka, Darve FMM 24 / 25

25 Conclusion Efficient and flexible fast multipole method: can be used for any analytic kernel, smooth or oscillatory. Good fraction of peak performance can be achieved. There is still room for improvement (e.g., instruction throughput optimization). We have not considered the case of an adaptive tree with a varying number of levels. Current work: use starpu to parallelize the method on multiple GPUs and CPUs. Overcome the tree-root bottleneck by dynamically scheduling the M2M,, L2L and P2P operators. Increase arithmetic intensity using symmetries. The code, bbfmm, used for this work is available from Takahashi, Cecka, Darve FMM 25 / 25

Cauchy Fast Multipole Method for Analytic Kernel

Cauchy Fast Multipole Method for Analytic Kernel Cauchy Fast Multipole Method for Analytic Kernel Pierre-David Létourneau 1 Cristopher Cecka 2 Eric Darve 1 1 Stanford University 2 Harvard University ICIAM July 2011 Pierre-David Létourneau Cristopher

More information

Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units

Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN ENGINEERING Int. J. Numer. Meth. Engng 2012; 89:105 133 Published online 3 August 2011 in Wiley Online Library (wileyonlinelibrary.com)..3240 Optimizing the

More information

A Kernel-independent Adaptive Fast Multipole Method

A Kernel-independent Adaptive Fast Multipole Method A Kernel-independent Adaptive Fast Multipole Method Lexing Ying Caltech Joint work with George Biros and Denis Zorin Problem Statement Given G an elliptic PDE kernel, e.g. {x i } points in {φ i } charges

More information

Center for Computational Science

Center for Computational Science Center for Computational Science Toward GPU-accelerated meshfree fluids simulation using the fast multipole method Lorena A Barba Boston University Department of Mechanical Engineering with: Felipe Cruz,

More information

Fast Multipole and Related Algorithms

Fast Multipole and Related Algorithms Fast Multipole and Related Algorithms Ramani Duraiswami University of Maryland, College Park http://www.umiacs.umd.edu/~ramani Joint work with Nail A. Gumerov Efficiency by exploiting symmetry and A general

More information

Fast Multipole Method on the GPU

Fast Multipole Method on the GPU Fast Multipole Method on the GPU with application to the Adaptive Vortex Method University of Bristol, Bristol, United Kingdom. 1 Introduction Particle methods Highly parallel Computational intensive Numerical

More information

Kernel Independent FMM

Kernel Independent FMM Kernel Independent FMM FMM Issues FMM requires analytical work to generate S expansions, R expansions, S S (M2M) translations S R (M2L) translations R R (L2L) translations Such analytical work leads to

More information

Iterative methods for use with the Fast Multipole Method

Iterative methods for use with the Fast Multipole Method Iterative methods for use with the Fast Multipole Method Ramani Duraiswami Perceptual Interfaces and Reality Lab. Computer Science & UMIACS University of Maryland, College Park, MD Joint work with Nail

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular

More information

ExaFMM. Fast multipole method software aiming for exascale systems. User's Manual. Rio Yokota, L. A. Barba. November Revision 1

ExaFMM. Fast multipole method software aiming for exascale systems. User's Manual. Rio Yokota, L. A. Barba. November Revision 1 ExaFMM Fast multipole method software aiming for exascale systems User's Manual Rio Yokota, L. A. Barba November 2011 --- Revision 1 ExaFMM User's Manual i Revision History Name Date Notes Rio Yokota,

More information

A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang

A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang University of Massachusetts Amherst Introduction Singular Value Decomposition (SVD) A: m n matrix (m n) U, V: orthogonal

More information

Evaluation of Intel Memory Drive Technology Performance for Scientific Applications

Evaluation of Intel Memory Drive Technology Performance for Scientific Applications Evaluation of Intel Memory Drive Technology Performance for Scientific Applications Vladimir Mironov, Andrey Kudryavtsev, Yuri Alexeev, Alexander Moskovsky, Igor Kulikov, and Igor Chernykh Introducing

More information

FMM implementation on CPU and GPU. Nail A. Gumerov (Lecture for CMSC 828E)

FMM implementation on CPU and GPU. Nail A. Gumerov (Lecture for CMSC 828E) FMM implementation on CPU and GPU Nail A. Gumerov (Lecture for CMSC 828E) Outline Two parts of the FMM Data Structure Flow Chart of the Run Algorithm FMM Cost/Optimization on CPU Programming on GPU Fast

More information

cuibm A GPU Accelerated Immersed Boundary Method

cuibm A GPU Accelerated Immersed Boundary Method cuibm A GPU Accelerated Immersed Boundary Method S. K. Layton, A. Krishnan and L. A. Barba Corresponding author: labarba@bu.edu Department of Mechanical Engineering, Boston University, Boston, MA, 225,

More information

Leveraging Matrix Block Structure In Sparse Matrix-Vector Multiplication. Steve Rennich Nvidia Developer Technology - Compute

Leveraging Matrix Block Structure In Sparse Matrix-Vector Multiplication. Steve Rennich Nvidia Developer Technology - Compute Leveraging Matrix Block Structure In Sparse Matrix-Vector Multiplication Steve Rennich Nvidia Developer Technology - Compute Block Sparse Matrix Vector Multiplication Sparse Matrix-Vector Multiplication

More information

8. Hardware-Aware Numerics. Approaching supercomputing...

8. Hardware-Aware Numerics. Approaching supercomputing... Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 48 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum

More information

Stokes Preconditioning on a GPU

Stokes Preconditioning on a GPU Stokes Preconditioning on a GPU Matthew Knepley 1,2, Dave A. Yuen, and Dave A. May 1 Computation Institute University of Chicago 2 Department of Molecular Biology and Physiology Rush University Medical

More information

8. Hardware-Aware Numerics. Approaching supercomputing...

8. Hardware-Aware Numerics. Approaching supercomputing... Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 22 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum

More information

Accelerating Molecular Modeling Applications with Graphics Processors

Accelerating Molecular Modeling Applications with Graphics Processors Accelerating Molecular Modeling Applications with Graphics Processors John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ SIAM Conference

More information

AmgX 2.0: Scaling toward CORAL Joe Eaton, November 19, 2015

AmgX 2.0: Scaling toward CORAL Joe Eaton, November 19, 2015 AmgX 2.0: Scaling toward CORAL Joe Eaton, November 19, 2015 Agenda Introduction to AmgX Current Capabilities Scaling V2.0 Roadmap for the future 2 AmgX Fast, scalable linear solvers, emphasis on iterative

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

A Comprehensive Study on the Performance of Implicit LS-DYNA

A Comprehensive Study on the Performance of Implicit LS-DYNA 12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four

More information

Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement

Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement Tim Davis (Texas A&M University) with Sanjay Ranka, Mohamed Gadou (University of Florida) Nuri Yeralan (Microsoft) NVIDIA

More information

On the limits of (and opportunities for?) GPU acceleration

On the limits of (and opportunities for?) GPU acceleration On the limits of (and opportunities for?) GPU acceleration Aparna Chandramowlishwaran, Jee Choi, Kenneth Czechowski, Murat (Efe) Guney, Logan Moon, Aashay Shringarpure, Richard (Rich) Vuduc HotPar 10,

More information

Task-based FMM for heterogeneous architectures

Task-based FMM for heterogeneous architectures Task-based FMM for heterogeneous architectures Emmanuel Agullo, Berenger Bramas, Olivier Coulaud, Eric Darve, Matthias Messner, Toru Takahashi To cite this version: Emmanuel Agullo, Berenger Bramas, Olivier

More information

Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures

Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures Qi Hu huqi@cs.umd.edu Nail A. Gumerov gumerov@umiacs.umd.edu Ramani Duraiswami ramani@umiacs.umd.edu Institute for Advanced Computer

More information

MAGMA: a New Generation

MAGMA: a New Generation 1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release

More information

Large scale Imaging on Current Many- Core Platforms

Large scale Imaging on Current Many- Core Platforms Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,

More information

Fast multipole methods for axisymmetric geometries

Fast multipole methods for axisymmetric geometries Fast multipole methods for axisymmetric geometries Victor Churchill New York University May 13, 2016 Abstract Fast multipole methods (FMMs) are one of the main numerical methods used in the solution of

More information

CMSC 858M/AMSC 698R. Fast Multipole Methods. Nail A. Gumerov & Ramani Duraiswami. Lecture 20. Outline

CMSC 858M/AMSC 698R. Fast Multipole Methods. Nail A. Gumerov & Ramani Duraiswami. Lecture 20. Outline CMSC 858M/AMSC 698R Fast Multipole Methods Nail A. Gumerov & Ramani Duraiswami Lecture 20 Outline Two parts of the FMM Data Structures FMM Cost/Optimization on CPU Fine Grain Parallelization for Multicore

More information

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M

More information

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends Imagine stream processor; Bill Dally, Stanford Connection Machine CM; Thinking Machines Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid Jeffrey Bolz Eitan Grinspun Caltech Ian Farmer

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

GREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES. Nikolay Markovskiy Peter Messmer

GREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES. Nikolay Markovskiy Peter Messmer GREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES Nikolay Markovskiy Peter Messmer ABOUT CP2K Atomistic and molecular simulations of solid state From ab initio DFT and Hartree-Fock

More information

QR Decomposition on GPUs

QR Decomposition on GPUs QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of

More information

GPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging

GPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging GPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging Ramani Duraiswami University of Maryland, College Park http://www.umiacs.umd.edu/~ramani With Nail A. Gumerov,

More information

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea

More information

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER AVISHA DHISLE PRERIT RODNEY ADHISLE PRODNEY 15618: PARALLEL COMPUTER ARCHITECTURE PROF. BRYANT PROF. KAYVON LET S

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency 1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming

More information

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013 GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»

More information

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA

3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA 3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report ESPRESO ExaScale PaRallel FETI Solver Hybrid FETI Solver Report Lubomir Riha, Tomas Brzobohaty IT4Innovations Outline HFETI theory from FETI to HFETI communication hiding and avoiding techniques our new

More information

Efficient O(N log N) algorithms for scattered data interpolation

Efficient O(N log N) algorithms for scattered data interpolation Efficient O(N log N) algorithms for scattered data interpolation Nail Gumerov University of Maryland Institute for Advanced Computer Studies Joint work with Ramani Duraiswami February Fourier Talks 2007

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

An Approximate Singular Value Decomposition of Large Matrices in Julia

An Approximate Singular Value Decomposition of Large Matrices in Julia An Approximate Singular Value Decomposition of Large Matrices in Julia Alexander J. Turner 1, 1 Harvard University, School of Engineering and Applied Sciences, Cambridge, MA, USA. In this project, I implement

More information

Optimisation Myths and Facts as Seen in Statistical Physics

Optimisation Myths and Facts as Seen in Statistical Physics Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX

Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX David Pfander*, Gregor Daiß*, Dominic Marcello**, Hartmut Kaiser**, Dirk Pflüger* * University of Stuttgart ** Louisiana State

More information

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA 2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance

More information

Modern GPUs (Graphics Processing Units)

Modern GPUs (Graphics Processing Units) Modern GPUs (Graphics Processing Units) Powerful data parallel computation platform. High computation density, high memory bandwidth. Relatively low cost. NVIDIA GTX 580 512 cores 1.6 Tera FLOPs 1.5 GB

More information

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,

More information

Empirical Analysis of Space Filling Curves for Scientific Computing Applications

Empirical Analysis of Space Filling Curves for Scientific Computing Applications Empirical Analysis of Space Filling Curves for Scientific Computing Applications Daryl DeFord 1 Ananth Kalyanaraman 2 1 Department of Mathematics 2 School of Electrical Engineering and Computer Science

More information

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

Tree-based methods on GPUs

Tree-based methods on GPUs Tree-based methods on GPUs Felipe Cruz 1 and Matthew Knepley 2,3 1 Department of Mathematics University of Bristol 2 Computation Institute University of Chicago 3 Department of Molecular Biology and Physiology

More information

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Accelerating GPU computation through mixed-precision methods Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Outline Motivation Truncated Precision using CUDA Solving Linear

More information

Fast and reliable linear system solutions on new parallel architectures

Fast and reliable linear system solutions on new parallel architectures Fast and reliable linear system solutions on new parallel architectures Marc Baboulin Université Paris-Sud Chaire Inria Saclay Île-de-France Séminaire Aristote - Ecole Polytechnique 15 mai 2013 Marc Baboulin

More information

Automated Finite Element Computations in the FEniCS Framework using GPUs

Automated Finite Element Computations in the FEniCS Framework using GPUs Automated Finite Element Computations in the FEniCS Framework using GPUs Florian Rathgeber (f.rathgeber10@imperial.ac.uk) Advanced Modelling and Computation Group (AMCG) Department of Earth Science & Engineering

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

Asynchronous OpenCL/MPI numerical simulations of conservation laws

Asynchronous OpenCL/MPI numerical simulations of conservation laws Asynchronous OpenCL/MPI numerical simulations of conservation laws Philippe HELLUY 1,3, Thomas STRUB 2. 1 IRMA, Université de Strasbourg, 2 AxesSim, 3 Inria Tonus, France IWOCL 2015, Stanford Conservation

More information

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid

More information

A Random Variable Shape Parameter Strategy for Radial Basis Function Approximation Methods

A Random Variable Shape Parameter Strategy for Radial Basis Function Approximation Methods A Random Variable Shape Parameter Strategy for Radial Basis Function Approximation Methods Scott A. Sarra, Derek Sturgill Marshall University, Department of Mathematics, One John Marshall Drive, Huntington

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

Accelerating Implicit LS-DYNA with GPU

Accelerating Implicit LS-DYNA with GPU Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

Parallel and Distributed Systems Lab.

Parallel and Distributed Systems Lab. Parallel and Distributed Systems Lab. Department of Computer Sciences Purdue University. Jie Chi, Ronaldo Ferreira, Ananth Grama, Tzvetan Horozov, Ioannis Ioannidis, Mehmet Koyuturk, Shan Lei, Robert Light,

More information

1.2 Numerical Solutions of Flow Problems

1.2 Numerical Solutions of Flow Problems 1.2 Numerical Solutions of Flow Problems DIFFERENTIAL EQUATIONS OF MOTION FOR A SIMPLIFIED FLOW PROBLEM Continuity equation for incompressible flow: 0 Momentum (Navier-Stokes) equations for a Newtonian

More information

Nasim Mahmood, Guosheng Deng, and James C. Browne

Nasim Mahmood, Guosheng Deng, and James C. Browne Nasim Mahmood, Guosheng Deng, and James C. Browne Motivation Goals Programming Model Example Case Study Conclusions Current and Ongoing Work Future Directions Optimization and adaptation of parallel programs

More information

Sparse Matrices in Image Compression

Sparse Matrices in Image Compression Chapter 5 Sparse Matrices in Image Compression 5. INTRODUCTION With the increase in the use of digital image and multimedia data in day to day life, image compression techniques have become a major area

More information

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA Implementation of Adaptive Coarsening Algorithm on GPU using CUDA 1. Introduction , In scientific computing today, the high-performance computers grow

More information

OpenFOAM + GPGPU. İbrahim Özküçük

OpenFOAM + GPGPU. İbrahim Özküçük OpenFOAM + GPGPU İbrahim Özküçük Outline GPGPU vs CPU GPGPU plugins for OpenFOAM Overview of Discretization CUDA for FOAM Link (cufflink) Cusp & Thrust Libraries How Cufflink Works Performance data of

More information

CUDA Experiences: Over-Optimization and Future HPC

CUDA Experiences: Over-Optimization and Future HPC CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign

More information

Large and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference

Large and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference Large and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference 2012 Waters Corporation 1 Agenda Overview of LC/IMS/MS 3D Data Processing 4D Data Processing

More information

ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016

ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016 ACCELERATING CFD AND RESERVOIR SIMULATIONS WITH ALGEBRAIC MULTI GRID Chris Gottbrath, Nov 2016 Challenges What is Algebraic Multi-Grid (AMG)? AGENDA Why use AMG? When to use AMG? NVIDIA AmgX Results 2

More information

Hierarchical Low-Rank Approximation Methods on Distributed Memory and GPUs

Hierarchical Low-Rank Approximation Methods on Distributed Memory and GPUs JH160041-NAHI Hierarchical Low-Rank Approximation Methods on Distributed Memory and GPUs Rio Yokota(Tokyo Institute of Technology) Hierarchical low-rank approximation methods such as H-matrix, H2-matrix,

More information

Fast Methods with Sieve

Fast Methods with Sieve Fast Methods with Sieve Matthew G Knepley Mathematics and Computer Science Division Argonne National Laboratory August 12, 2008 Workshop on Scientific Computing Simula Research, Oslo, Norway M. Knepley

More information

Journal of Engineering Research and Studies E-ISSN

Journal of Engineering Research and Studies E-ISSN Journal of Engineering Research and Studies E-ISS 0976-79 Research Article SPECTRAL SOLUTIO OF STEADY STATE CODUCTIO I ARBITRARY QUADRILATERAL DOMAIS Alavani Chitra R 1*, Joshi Pallavi A 1, S Pavitran

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos

An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm Martin Burtscher Department of Computer Science Texas State University-San Marcos Mapping Regular Code to GPUs Regular codes Operate on

More information

Two-Phase flows on massively parallel multi-gpu clusters

Two-Phase flows on massively parallel multi-gpu clusters Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous

More information

FFT, FMM, OR MULTIGRID? A COMPARATIVE STUDY OF STATE-OF-THE-ART POISSON SOLVERS FOR UNIFORM AND NON-UNIFORM GRIDS IN THE UNIT CUBE

FFT, FMM, OR MULTIGRID? A COMPARATIVE STUDY OF STATE-OF-THE-ART POISSON SOLVERS FOR UNIFORM AND NON-UNIFORM GRIDS IN THE UNIT CUBE FFT, FMM, OR MULTIGRID? A COMPARATIVE STUDY OF STATE-OF-THE-ART POISSON SOLVERS FOR UNIFORM AND NON-UNIFORM GRIDS IN THE UNIT CUBE AMIR GHOLAMI, DHAIRYA MALHOTRA, HARI SUNDAR, AND GEORGE BIROS Abstract.

More information

Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method

Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method G. Wellein, T. Zeiser, G. Hager HPC Services Regional Computing Center A. Nitsure, K. Iglberger, U. Rüde Chair for System

More information

Parallel Interpolation in FSI Problems Using Radial Basis Functions and Problem Size Reduction

Parallel Interpolation in FSI Problems Using Radial Basis Functions and Problem Size Reduction Parallel Interpolation in FSI Problems Using Radial Basis Functions and Problem Size Reduction Sergey Kopysov, Igor Kuzmin, Alexander Novikov, Nikita Nedozhogin, and Leonid Tonkov Institute of Mechanics,

More information

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng Department of Micro-/Nano-electronics Tsinghua University Outline

More information

A comparison of parallel rank-structured solvers

A comparison of parallel rank-structured solvers A comparison of parallel rank-structured solvers François-Henry Rouet Livermore Software Technology Corporation, Lawrence Berkeley National Laboratory Joint work with: - LSTC: J. Anton, C. Ashcraft, C.

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation

ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation Ray Browell nvidia Technology Theater SC12 1 2012 ANSYS, Inc. nvidia Technology Theater SC12 HPC Revolution Recent

More information

S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS

S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS S0432 NEW IDEAS FOR MASSIVELY PARALLEL PRECONDITIONERS John R Appleyard Jeremy D Appleyard Polyhedron Software with acknowledgements to Mark A Wakefield Garf Bowen Schlumberger Outline of Talk Reservoir

More information

How HPC Hardware and Software are Evolving Towards Exascale

How HPC Hardware and Software are Evolving Towards Exascale How HPC Hardware and Software are Evolving Towards Exascale Kathy Yelick Associate Laboratory Director and NERSC Director Lawrence Berkeley National Laboratory EECS Professor, UC Berkeley NERSC Overview

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Long time integrations of a convective PDE on the sphere by RBF collocation

Long time integrations of a convective PDE on the sphere by RBF collocation Long time integrations of a convective PDE on the sphere by RBF collocation Bengt Fornberg and Natasha Flyer University of Colorado NCAR Department of Applied Mathematics Institute for Mathematics Applied

More information