Fast Multipole and Related Algorithms

Size: px
Start display at page:

Download "Fast Multipole and Related Algorithms"

Transcription

1 Fast Multipole and Related Algorithms Ramani Duraiswami University of Maryland, College Park Joint work with Nail A. Gumerov

2 Efficiency by exploiting symmetry and A general paradigm structure Develop theory, algorithms and software that apply to general classes of data Recognize underlying mathematical symmetry Recognize structure in data Adapt algorithms to particular data How does this apply to the FMM?

3 FMM Fast summation of singular kernel functions Green s functions Φ(x,y) for classical operator L Solution y to forcing at points x Applications Electrostatics, Molecular/Stellar dynamics, Vortex Boundary Integral Methods Particle discretization of field equations

4 What is the FMM? Decompose singular sum in to local or sparse part + far-field and dense part

5 Factorization Trick apply to dense part Not FMM, but has some key ideas Consider v(y j )= N i=1 ij u i = N i=1 u i (x i y j ) 2 j=1,,m Naïve way to evaluate sum requires MN operations Instead can write the sum as v(y j )= ( N i=1 u i ) y j2 +( N i=1 u i x i2 ) 2y j ( N i=1 u i x i ) Can evaluate each bracketed sum over i =( N i=1 u i ), = ( N i=1 u i x i2 ), = ( N i=1 u i x i ) and evaluate v(y j )= y 2 j + 2y j Requires O(M+N) operations

6 Reduction of Complexity Straightforward (nested loops): Factorized: Complexity: O(MN) If p << min(m,n) then complexity reduces! Complexity: O(pN+pM)

7 Take this far-near decomposition and apply it recursively Far field and near field Take near-field and divide further Use translations Trees L2L M2M M2L W r1 (x W R * +t) x y 2 x x * +t t *2 r 1 R (R R) x *1* r W r (x * ) W 1 W r1 (x * +t) W r (x * ) y S W S 2 x *1 x * +t x (S S) t *2 W 1 x i r 1 r W r1 (x * +t) x * +t r y 1 (S R) S t S x (S S) *1 x *2 r W 1 x i

8 Data Structures Separate data into local and global Manage the factorizations Basic geometric construction --- WSPD Stuff inside one area, expanded as a series Translate series representation to other area Expand in the other area

9 Data Structures Need to automatically structure the source and target point sets Can t tile space automatically with nonoverlapping circles/spheres Classically, use quadtree/octrees A recent paper

10 Box and its neighbors

11 Data structures must support Separation: Cells satisfy local separation and WSPD properties across all levels. Indexing: Cell indices satisfy some order relation in memory. Spatial addressing: Cell centers are computable from addresses and each particle can find its bounding cell. Hierarchical addressing: Cell children, parent, and neighbor indices are computable from addresses. All this must be done in O(N log N) and must be parallelizable 1. Nail A. Gumerov, Ramani Duraiswami, and Y.A. Borovikov. Data structures, optimal choice of parameters, and complexity results for generalized fast multipole methods in d dimensions. CS/UMIACS UMIACS-TR , CS-TR Qi Hu, Nail Gumerov and Ramani Duraiswami. Parallel Algorithms for FMM Data Structures Yuancheng Luo and Ramani Duraiswami Alternative Tilings for the Fast Multipole Method on the Plane submitted. Available on arxiv. 2012

12 Direct method vs. FMM

13 Translations Representation size p 1 needs to be translated to another representation of size p 2 In general this is a O(p 1 p 2 ) computation Can reduce this cost by looking at symmetries and approximations of the matrix Coaxial and rotational symmetries exponential expansions and Integral Relations SVD based approximations Nail A. Gumerov and Ramani Duraiswami. A broadband fast multipole accelerated boundary element method for the three dimensional Helmholtz equation. Journal of the Acoustical Society of America, 125: , Nail A. Gumerov and Ramani Duraiswami. Comparison of the efficiency of translation operators used in the fast multipole method for the 3D Laplace equation. Technical Report CS-TR-4701 and UMIACS-TR , University of Maryland Nail A. Gumerov and Ramani Duraiswami. Recursions for the computation of multipole translation and rotation coefficients for the 3-D Helmholtz equation. SIAM Journal on Scientific Computing, 25(4): , 2003.

14 Helmholtz equation Issue: size of special function based factorizations grow with wavenumber times distance Means FMM is actually slower than direct multiplication at high frequencies! Rokhlin solution: diagonal factorization Our solution: level dependent truncation number; use of rotation based symmetries and local expansion with special functions; switch to diagonal representations later More numerically stable; easier to implement in BEM; cheaper

15 FMM on the GPU GPUs are extremely fast at doing computations which have high arithmetic intensity for each load/store, esp. MADD Do other computations, but not with as much speed-up A few cores Hundreds cores Control ALU ALU ALU ALU Cache DRAM CPU DRAM GPU NVIDIA Tesla C2050: 1.25 Tflops single 0.52 Tflops double 448 cores

16 GPU/heterogeneous FMM Moved to implement FMM on GPU almost with the introduction of CUDA (2007) FMM translation based on rotation translation, and showing that it is competitive or superior to asymptotically faster diagonal translations Real valued formulations Algorithm splitting Extension to heterogeneous architectures by mapping local sums to GPU (where they are most efficient) and tree based FMM to CPU

17 Time (s) Local summation on GPU (FMM) 1.E+02 1.E+01 Sparse Matrix-Vector Product s max =60 y=bx CPU: Time=CNs, s=8 -lmax N 1.E+00 CPU l max =5 y=cx GPU: 1.E-01 y=ax 2 4 Time=A 1 N+B 1 N/s+C 1 Ns 1.E-02 1.E GPU y=8 -lmax dx 2 +ex+8 lmax f b/c=16 read/write float computations 1.E-04 b/c=16 3D Laplace 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 Number of Sources access to box data These parameters depend on the hardware Nail A. Gumerov and Ramani Duraiswami. Fast multipole methods on graphical processors. Journal of Computational Physics, 227: , 2008.

18 FMM on heterogeneous architectures Insight --- why do gymnastics to fit tree algorithms with irregular access on the GPU GPUs hosted usually with a multicore CPU Let the GPU do local sum, rest on the CPU

19 Single node algorithm GPU work CPU work ODE solver: source receiver update particle positions, source strength data structure (octree and neighbors) source S-expansions translation stencils local direct sum upward S S downward S R R R receiver R-expansions time loop final sum of far-field and near-field interactions Task 3.5: Computational Considerations in Brownout Simulations 19

20 The algorithm flow chart Node A Node B positions, source strength positions, source strength ODE solver: source receiver update data structure (octree) merge octree data structure (neighbors) data structure (octree) merge octree data structure (neighbors) ODE solver: source receiver update redistributed particle redistributed particle single heterogeneous node algorithm single heterogeneous node algorithm exchange final R-expansions exchange final R-expansions final sum final sum Task 3.5: Computational Considerations in Brownout Simulations 20

21 Fast Gauss Transforms in high dimensions Sums of Gaussians arise in many applications in many dimensions Vortex methods; heat kernels; statistical density estimation; kernel methods in machine learning FMM data structures and factorizations do not work in higher dimensions Alternate global factorization; Data structures (based on k- center algorithm); and a automatically tuned algorithm developed in a series of papers Downloadable code from SourceForge (FigTree) Many papers applying this to vision, machine learning etc. Vikas C. Raykar and Ramani Duraiswami. The improved fast Gauss transform with applications to machine learning. In L. Bottou, O. Chapelle, D. Decoste, and J. Weston, editors, Large Scale Kernel Machines, pages MIT Press, 2007.

22 A few pretty pictures from papers 1 kd=0.96 (250 Hz) x z y p 4 x p 3 p 3 z y y p 3 z y z x x

23 Teaching the FMM Course notes since 2004 online Java Applet that demos the FMM 8 papers directly from the course enabled solidifying broad themes

GPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging

GPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging GPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging Ramani Duraiswami University of Maryland, College Park http://www.umiacs.umd.edu/~ramani With Nail A. Gumerov,

More information

CMSC 858M/AMSC 698R. Fast Multipole Methods. Nail A. Gumerov & Ramani Duraiswami. Lecture 20. Outline

CMSC 858M/AMSC 698R. Fast Multipole Methods. Nail A. Gumerov & Ramani Duraiswami. Lecture 20. Outline CMSC 858M/AMSC 698R Fast Multipole Methods Nail A. Gumerov & Ramani Duraiswami Lecture 20 Outline Two parts of the FMM Data Structures FMM Cost/Optimization on CPU Fine Grain Parallelization for Multicore

More information

FMM implementation on CPU and GPU. Nail A. Gumerov (Lecture for CMSC 828E)

FMM implementation on CPU and GPU. Nail A. Gumerov (Lecture for CMSC 828E) FMM implementation on CPU and GPU Nail A. Gumerov (Lecture for CMSC 828E) Outline Two parts of the FMM Data Structure Flow Chart of the Run Algorithm FMM Cost/Optimization on CPU Programming on GPU Fast

More information

Iterative methods for use with the Fast Multipole Method

Iterative methods for use with the Fast Multipole Method Iterative methods for use with the Fast Multipole Method Ramani Duraiswami Perceptual Interfaces and Reality Lab. Computer Science & UMIACS University of Maryland, College Park, MD Joint work with Nail

More information

Terascale on the desktop: Fast Multipole Methods on Graphical Processors

Terascale on the desktop: Fast Multipole Methods on Graphical Processors Terascale on the desktop: Fast Multipole Methods on Graphical Processors Nail A. Gumerov Fantalgo, LLC Institute for Advanced Computer Studies University of Maryland (joint work with Ramani Duraiswami)

More information

FMM Data Structures. Content. Introduction Hierarchical Space Subdivision with 2 d -Trees Hierarchical Indexing System Parent & Children Finding

FMM Data Structures. Content. Introduction Hierarchical Space Subdivision with 2 d -Trees Hierarchical Indexing System Parent & Children Finding FMM Data Structures Nail Gumerov & Ramani Duraiswami UMIACS [gumerov][ramani]@umiacs.umd.edu CSCAMM FAM4: 4/9/4 Duraiswami & Gumerov, -4 Content Introduction Hierarchical Space Subdivision with d -Trees

More information

A Kernel-independent Adaptive Fast Multipole Method

A Kernel-independent Adaptive Fast Multipole Method A Kernel-independent Adaptive Fast Multipole Method Lexing Ying Caltech Joint work with George Biros and Denis Zorin Problem Statement Given G an elliptic PDE kernel, e.g. {x i } points in {φ i } charges

More information

Efficient O(N log N) algorithms for scattered data interpolation

Efficient O(N log N) algorithms for scattered data interpolation Efficient O(N log N) algorithms for scattered data interpolation Nail Gumerov University of Maryland Institute for Advanced Computer Studies Joint work with Ramani Duraiswami February Fourier Talks 2007

More information

FMM accelerated BEM for 3D Helmholtz equation

FMM accelerated BEM for 3D Helmholtz equation FMM accelerated BEM for 3D Helmholtz equation Nail A. Gumerov and Ramani Duraiswami Institute for Advanced Computer Studies University of Maryland, U.S.A. also @ Fantalgo, LLC, U.S.A. www.umiacs.umd.edu/~gumerov

More information

GPUML: Graphical processors for speeding up kernel machines

GPUML: Graphical processors for speeding up kernel machines GPUML: Graphical processors for speeding up kernel machines http://www.umiacs.umd.edu/~balajiv/gpuml.htm Balaji Vasan Srinivasan, Qi Hu, Ramani Duraiswami Department of Computer Science, University of

More information

Fast Multipole Method on the GPU

Fast Multipole Method on the GPU Fast Multipole Method on the GPU with application to the Adaptive Vortex Method University of Bristol, Bristol, United Kingdom. 1 Introduction Particle methods Highly parallel Computational intensive Numerical

More information

Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures

Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures Qi Hu huqi@cs.umd.edu Nail A. Gumerov gumerov@umiacs.umd.edu Ramani Duraiswami ramani@umiacs.umd.edu Institute for Advanced Computer

More information

Fast Multipole Methods. Linear Systems. Matrix vector product. An Introduction to Fast Multipole Methods.

Fast Multipole Methods. Linear Systems. Matrix vector product. An Introduction to Fast Multipole Methods. An Introduction to Fast Multipole Methods Ramani Duraiswami Institute for Advanced Computer Studies University of Maryland, College Park http://www.umiacs.umd.edu/~ramani Joint work with Nail A. Gumerov

More information

Fast Multipole Accelerated Indirect Boundary Elements for the Helmholtz Equation

Fast Multipole Accelerated Indirect Boundary Elements for the Helmholtz Equation Fast Multipole Accelerated Indirect Boundary Elements for the Helmholtz Equation Nail A. Gumerov Ross Adelman Ramani Duraiswami University of Maryland Institute for Advanced Computer Studies and Fantalgo,

More information

Low-rank Properties, Tree Structure, and Recursive Algorithms with Applications. Jingfang Huang Department of Mathematics UNC at Chapel Hill

Low-rank Properties, Tree Structure, and Recursive Algorithms with Applications. Jingfang Huang Department of Mathematics UNC at Chapel Hill Low-rank Properties, Tree Structure, and Recursive Algorithms with Applications Jingfang Huang Department of Mathematics UNC at Chapel Hill Fundamentals of Fast Multipole (type) Method Fundamentals: Low

More information

Scientific Computing on Graphical Processors: FMM, Flagon, Signal Processing, Plasma and Astrophysics

Scientific Computing on Graphical Processors: FMM, Flagon, Signal Processing, Plasma and Astrophysics Scientific Computing on Graphical Processors: FMM, Flagon, Signal Processing, Plasma and Astrophysics Ramani Duraiswami Computer Science & UMIACS University of Maryland, College Park Joint work with Nail

More information

Center for Computational Science

Center for Computational Science Center for Computational Science Toward GPU-accelerated meshfree fluids simulation using the fast multipole method Lorena A Barba Boston University Department of Mechanical Engineering with: Felipe Cruz,

More information

The Fast Multipole Method on NVIDIA GPUs and Multicore Processors

The Fast Multipole Method on NVIDIA GPUs and Multicore Processors The Fast Multipole Method on NVIDIA GPUs and Multicore Processors Toru Takahashi, a Cris Cecka, b Eric Darve c a b c Department of Mechanical Science and Engineering, Nagoya University Institute for Applied

More information

Scalable Distributed Fast Multipole Methods

Scalable Distributed Fast Multipole Methods Scalable Distributed Fast Multipole Methods Qi Hu, Nail A. Gumerov, Ramani Duraiswami University of Maryland Institute for Advanced Computer Studies (UMIACS) Department of Computer Science, University

More information

Fast multipole methods for axisymmetric geometries

Fast multipole methods for axisymmetric geometries Fast multipole methods for axisymmetric geometries Victor Churchill New York University May 13, 2016 Abstract Fast multipole methods (FMMs) are one of the main numerical methods used in the solution of

More information

ExaFMM. Fast multipole method software aiming for exascale systems. User's Manual. Rio Yokota, L. A. Barba. November Revision 1

ExaFMM. Fast multipole method software aiming for exascale systems. User's Manual. Rio Yokota, L. A. Barba. November Revision 1 ExaFMM Fast multipole method software aiming for exascale systems User's Manual Rio Yokota, L. A. Barba November 2011 --- Revision 1 ExaFMM User's Manual i Revision History Name Date Notes Rio Yokota,

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

Kernel Independent FMM

Kernel Independent FMM Kernel Independent FMM FMM Issues FMM requires analytical work to generate S expansions, R expansions, S S (M2M) translations S R (M2L) translations R R (L2L) translations Such analytical work leads to

More information

Tree-based methods on GPUs

Tree-based methods on GPUs Tree-based methods on GPUs Felipe Cruz 1 and Matthew Knepley 2,3 1 Department of Mathematics University of Bristol 2 Computation Institute University of Chicago 3 Department of Molecular Biology and Physiology

More information

Intermediate Parallel Programming & Cluster Computing

Intermediate Parallel Programming & Cluster Computing High Performance Computing Modernization Program (HPCMP) Summer 2011 Puerto Rico Workshop on Intermediate Parallel Programming & Cluster Computing in conjunction with the National Computational Science

More information

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs 3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs H. Knibbe, C. W. Oosterlee, C. Vuik Abstract We are focusing on an iterative solver for the three-dimensional

More information

FMM CMSC 878R/AMSC 698R. Lecture 13

FMM CMSC 878R/AMSC 698R. Lecture 13 FMM CMSC 878R/AMSC 698R Lecture 13 Outline Results of the MLFMM tests Itemized Asymptotic Complexity of the MLFMM; Optimization of the Grouping (Clustering) Parameter; Regular mesh; Random distributions.

More information

Empirical Testing of Fast Kernel Density Estimation Algorithms

Empirical Testing of Fast Kernel Density Estimation Algorithms Empirical Testing of Fast Kernel Density Estimation Algorithms Dustin Lang Mike Klaas { dalang, klaas, nando }@cs.ubc.ca Dept. of Computer Science University of British Columbia Vancouver, BC, Canada V6T1Z4

More information

The Fast Multipole Method (FMM)

The Fast Multipole Method (FMM) The Fast Multipole Method (FMM) Motivation for FMM Computational Physics Problems involving mutual interactions of N particles Gravitational or Electrostatic forces Collective (but weak) long-range forces

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

Fast Methods with Sieve

Fast Methods with Sieve Fast Methods with Sieve Matthew G Knepley Mathematics and Computer Science Division Argonne National Laboratory August 12, 2008 Workshop on Scientific Computing Simula Research, Oslo, Norway M. Knepley

More information

Fast Multipole Method on GPU: Tackling 3-D Capacitance Extraction on Massively Parallel SIMD Platforms

Fast Multipole Method on GPU: Tackling 3-D Capacitance Extraction on Massively Parallel SIMD Platforms Fast Multipole Method on GPU: Tackling 3-D Capacitance Extraction on Massively Parallel SIMD Platforms Xueqian Zhao Department of ECE Michigan Technological University Houghton, MI, 4993 xueqianz@mtu.edu

More information

c 2007 Society for Industrial and Applied Mathematics

c 2007 Society for Industrial and Applied Mathematics SIAM J. SCI. COMPUT. Vol. 29, No. 5, pp. 1876 1899 c 2007 Society for Industrial and Applied Mathematics FAST RADIAL BASIS FUNCTION INTERPOLATION VIA PRECONDITIONED KRYLOV ITERATION NAIL A. GUMEROV AND

More information

Fast Spherical Filtering in the Broadband FMBEM using a nonequally

Fast Spherical Filtering in the Broadband FMBEM using a nonequally Fast Spherical Filtering in the Broadband FMBEM using a nonequally spaced FFT Daniel R. Wilkes (1) and Alec. J. Duncan (1) (1) Centre for Marine Science and Technology, Department of Imaging and Applied

More information

Di Zhao Ohio State University MVAPICH User Group (MUG) Meeting, August , Columbus Ohio

Di Zhao Ohio State University MVAPICH User Group (MUG) Meeting, August , Columbus Ohio Di Zhao zhao.1029@osu.edu Ohio State University MVAPICH User Group (MUG) Meeting, August 26-27 2013, Columbus Ohio Nvidia Kepler K20X Intel Xeon Phi 7120 Launch Date November 2012 Q2 2013 Processor Per-processor

More information

Multi-Domain Pattern. I. Problem. II. Driving Forces. III. Solution

Multi-Domain Pattern. I. Problem. II. Driving Forces. III. Solution Multi-Domain Pattern I. Problem The problem represents computations characterized by an underlying system of mathematical equations, often simulating behaviors of physical objects through discrete time

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear

More information

CUDA Experiences: Over-Optimization and Future HPC

CUDA Experiences: Over-Optimization and Future HPC CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign

More information

GPU Implementation of Implicit Runge-Kutta Methods

GPU Implementation of Implicit Runge-Kutta Methods GPU Implementation of Implicit Runge-Kutta Methods Navchetan Awasthi, Abhijith J Supercomputer Education and Research Centre Indian Institute of Science, Bangalore, India navchetanawasthi@gmail.com, abhijith31792@gmail.com

More information

Reconstruction of Trees from Laser Scan Data and further Simulation Topics

Reconstruction of Trees from Laser Scan Data and further Simulation Topics Reconstruction of Trees from Laser Scan Data and further Simulation Topics Helmholtz-Research Center, Munich Daniel Ritter http://www10.informatik.uni-erlangen.de Overview 1. Introduction of the Chair

More information

An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos

An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm Martin Burtscher Department of Computer Science Texas State University-San Marcos Mapping Regular Code to GPUs Regular codes Operate on

More information

Computational Fluid Dynamics (CFD) using Graphics Processing Units

Computational Fluid Dynamics (CFD) using Graphics Processing Units Computational Fluid Dynamics (CFD) using Graphics Processing Units Aaron F. Shinn Mechanical Science and Engineering Dept., UIUC Accelerators for Science and Engineering Applications: GPUs and Multicores

More information

Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX

Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX David Pfander*, Gregor Daiß*, Dominic Marcello**, Hartmut Kaiser**, Dirk Pflüger* * University of Stuttgart ** Louisiana State

More information

Empirical Comparisons of Fast Methods

Empirical Comparisons of Fast Methods Empirical Comparisons of Fast Methods Dustin Lang and Mike Klaas {dalang, klaas}@cs.ubc.ca University of British Columbia December 17, 2004 Fast N-Body Learning - Empirical Comparisons p. 1 Sum Kernel

More information

The Fast Multipole Method and the Radiosity Kernel

The Fast Multipole Method and the Radiosity Kernel and the Radiosity Kernel The Fast Multipole Method Sharat Chandran http://www.cse.iitb.ac.in/ sharat January 8, 2006 Page 1 of 43 (Joint work with Alap Karapurkar and Nitin Goel) 1 Copyright c 2005 Sharat

More information

PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea.

PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea. Abdulrahman Manea PhD Student Hamdi Tchelepi Associate Professor, Co-Director, Center for Computational Earth and Environmental Science Energy Resources Engineering Department School of Earth Sciences

More information

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Accelerating GPU computation through mixed-precision methods Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Outline Motivation Truncated Precision using CUDA Solving Linear

More information

A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang

A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang University of Massachusetts Amherst Introduction Singular Value Decomposition (SVD) A: m n matrix (m n) U, V: orthogonal

More information

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular

More information

Accelerated flow acoustic boundary element solver and the noise generation of fish

Accelerated flow acoustic boundary element solver and the noise generation of fish Accelerated flow acoustic boundary element solver and the noise generation of fish JUSTIN W. JAWORSKI, NATHAN WAGENHOFFER, KEITH W. MOORED LEHIGH UNIVERSITY, BETHLEHEM, USA FLINOVIA PENN STATE 27 APRIL

More information

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh.

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh. Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization Dmitry I. Lyakh liakhdi@ornl.gov This research used resources of the Oak Ridge Leadership Computing Facility at the

More information

Cauchy Fast Multipole Method for Analytic Kernel

Cauchy Fast Multipole Method for Analytic Kernel Cauchy Fast Multipole Method for Analytic Kernel Pierre-David Létourneau 1 Cristopher Cecka 2 Eric Darve 1 1 Stanford University 2 Harvard University ICIAM July 2011 Pierre-David Létourneau Cristopher

More information

Hierarchical Low-Rank Approximation Methods on Distributed Memory and GPUs

Hierarchical Low-Rank Approximation Methods on Distributed Memory and GPUs JH160041-NAHI Hierarchical Low-Rank Approximation Methods on Distributed Memory and GPUs Rio Yokota(Tokyo Institute of Technology) Hierarchical low-rank approximation methods such as H-matrix, H2-matrix,

More information

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends Imagine stream processor; Bill Dally, Stanford Connection Machine CM; Thinking Machines Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid Jeffrey Bolz Eitan Grinspun Caltech Ian Farmer

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

Optimisation Myths and Facts as Seen in Statistical Physics

Optimisation Myths and Facts as Seen in Statistical Physics Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY

More information

8. Hardware-Aware Numerics. Approaching supercomputing...

8. Hardware-Aware Numerics. Approaching supercomputing... Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 48 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum

More information

Comparison of different solvers for two-dimensional steady heat conduction equation ME 412 Project 2

Comparison of different solvers for two-dimensional steady heat conduction equation ME 412 Project 2 Comparison of different solvers for two-dimensional steady heat conduction equation ME 412 Project 2 Jingwei Zhu March 19, 2014 Instructor: Surya Pratap Vanka 1 Project Description The purpose of this

More information

8. Hardware-Aware Numerics. Approaching supercomputing...

8. Hardware-Aware Numerics. Approaching supercomputing... Approaching supercomputing... Numerisches Programmieren, Hans-Joachim Bungartz page 1 of 22 8.1. Hardware-Awareness Introduction Since numerical algorithms are ubiquitous, they have to run on a broad spectrum

More information

Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement

Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement Tim Davis (Texas A&M University) with Sanjay Ranka, Mohamed Gadou (University of Florida) Nuri Yeralan (Microsoft) NVIDIA

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Parallel Implementations of Gaussian Elimination

Parallel Implementations of Gaussian Elimination s of Western Michigan University vasilije.perovic@wmich.edu January 27, 2012 CS 6260: in Parallel Linear systems of equations General form of a linear system of equations is given by a 11 x 1 + + a 1n

More information

A Broadband Fast Multipole Accelerated Boundary Element Method for the 3D Helmholtz Equation. Abstract

A Broadband Fast Multipole Accelerated Boundary Element Method for the 3D Helmholtz Equation. Abstract A Broadband Fast Multipole Accelerated Boundary Element Method for the 3D Helmholtz Equation Nail A. Gumerov and Ramani Duraiswami (Dated: 31 December 2007, Revised 22 July 2008, Revised 03 October 2008.)

More information

Multigrid Pattern. I. Problem. II. Driving Forces. III. Solution

Multigrid Pattern. I. Problem. II. Driving Forces. III. Solution Multigrid Pattern I. Problem Problem domain is decomposed into a set of geometric grids, where each element participates in a local computation followed by data exchanges with adjacent neighbors. The grids

More information

Parallel multi-frontal solver for isogeometric finite element methods on GPU

Parallel multi-frontal solver for isogeometric finite element methods on GPU Parallel multi-frontal solver for isogeometric finite element methods on GPU Maciej Paszyński Department of Computer Science, AGH University of Science and Technology, Kraków, Poland email: paszynsk@agh.edu.pl

More information

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large

More information

FAST MATRIX-VECTOR PRODUCT BASED FGMRES FOR KERNEL MACHINES

FAST MATRIX-VECTOR PRODUCT BASED FGMRES FOR KERNEL MACHINES FAST MATRIX-VECTOR PRODUCT BASED FGMRES FOR KERNEL MACHINES BALAJI VASAN SRINIVASAN, M.S. SCHOLARLY PAPER DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF MARYLAND, COLLEGE PARK Abstract. Kernel based approaches

More information

GPU-based Distributed Behavior Models with CUDA

GPU-based Distributed Behavior Models with CUDA GPU-based Distributed Behavior Models with CUDA Courtesy: YouTube, ISIS Lab, Universita degli Studi di Salerno Bradly Alicea Introduction Flocking: Reynolds boids algorithm. * models simple local behaviors

More information

An Auto-Tuning Framework for Parallel Multicore Stencil Computations

An Auto-Tuning Framework for Parallel Multicore Stencil Computations Software Engineering Seminar Sebastian Hafen An Auto-Tuning Framework for Parallel Multicore Stencil Computations Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams 1 Stencils What is a

More information

On the limits of (and opportunities for?) GPU acceleration

On the limits of (and opportunities for?) GPU acceleration On the limits of (and opportunities for?) GPU acceleration Aparna Chandramowlishwaran, Jee Choi, Kenneth Czechowski, Murat (Efe) Guney, Logan Moon, Aashay Shringarpure, Richard (Rich) Vuduc HotPar 10,

More information

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA

More information

Robot Mapping. Least Squares Approach to SLAM. Cyrill Stachniss

Robot Mapping. Least Squares Approach to SLAM. Cyrill Stachniss Robot Mapping Least Squares Approach to SLAM Cyrill Stachniss 1 Three Main SLAM Paradigms Kalman filter Particle filter Graphbased least squares approach to SLAM 2 Least Squares in General Approach for

More information

Graphbased. Kalman filter. Particle filter. Three Main SLAM Paradigms. Robot Mapping. Least Squares Approach to SLAM. Least Squares in General

Graphbased. Kalman filter. Particle filter. Three Main SLAM Paradigms. Robot Mapping. Least Squares Approach to SLAM. Least Squares in General Robot Mapping Three Main SLAM Paradigms Least Squares Approach to SLAM Kalman filter Particle filter Graphbased Cyrill Stachniss least squares approach to SLAM 1 2 Least Squares in General! Approach for

More information

EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY. Stephen Abbott, March

EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY. Stephen Abbott, March EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY Stephen Abbott, March 26 2018 ACKNOWLEDGEMENTS Collaborators: Oak Ridge Nation Laboratory- Ed D Azevedo NVIDIA - Peng

More information

GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis

GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis Abstract: Lower upper (LU) factorization for sparse matrices is the most important computing step for circuit simulation

More information

The success of multipole methods: Are we there yet?

The success of multipole methods: Are we there yet? King Abdullah University of Science and Technology Saudi Arabia, Apr. 2830, 2012 The success of multipole methods: Are we there yet? Lorena A Barba, Boston University 1948 The world s first computation

More information

Spatial Data Structures

Spatial Data Structures CSCI 420 Computer Graphics Lecture 17 Spatial Data Structures Jernej Barbic University of Southern California Hierarchical Bounding Volumes Regular Grids Octrees BSP Trees [Angel Ch. 8] 1 Ray Tracing Acceleration

More information

Accelerating Molecular Modeling Applications with Graphics Processors

Accelerating Molecular Modeling Applications with Graphics Processors Accelerating Molecular Modeling Applications with Graphics Processors John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ SIAM Conference

More information

Scaling Fast Multipole Methods up to 4000 GPUs

Scaling Fast Multipole Methods up to 4000 GPUs Scaling Fast Multipole Methods up to 4000 GPUs Rio Yokota King Abdullah University of Science and Technology 4700 KAUST, Thuwal 23955-6900, Saudi Arabia +966-2-808-0321 rio.yokota@kaust.edu.sa Lorena Barba

More information

Implementation of rotation-based operators for Fast Multipole Method in X10 Andrew Haigh

Implementation of rotation-based operators for Fast Multipole Method in X10 Andrew Haigh School of Computer Science College of Engineering and Computer Science Implementation of rotation-based operators for Fast Multipole Method in X10 Andrew Haigh Supervisors: Dr Eric McCreath Josh Milthorpe

More information

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA 2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance

More information

DIMENSION REDUCTION FOR HYPERSPECTRAL DATA USING RANDOMIZED PCA AND LAPLACIAN EIGENMAPS

DIMENSION REDUCTION FOR HYPERSPECTRAL DATA USING RANDOMIZED PCA AND LAPLACIAN EIGENMAPS DIMENSION REDUCTION FOR HYPERSPECTRAL DATA USING RANDOMIZED PCA AND LAPLACIAN EIGENMAPS YIRAN LI APPLIED MATHEMATICS, STATISTICS AND SCIENTIFIC COMPUTING ADVISOR: DR. WOJTEK CZAJA, DR. JOHN BENEDETTO DEPARTMENT

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

T6: Position-Based Simulation Methods in Computer Graphics. Jan Bender Miles Macklin Matthias Müller

T6: Position-Based Simulation Methods in Computer Graphics. Jan Bender Miles Macklin Matthias Müller T6: Position-Based Simulation Methods in Computer Graphics Jan Bender Miles Macklin Matthias Müller Jan Bender Organizer Professor at the Visual Computing Institute at Aachen University Research topics

More information

Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units

Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN ENGINEERING Int. J. Numer. Meth. Engng 2012; 89:105 133 Published online 3 August 2011 in Wiley Online Library (wileyonlinelibrary.com)..3240 Optimizing the

More information

The Immersed Interface Method

The Immersed Interface Method The Immersed Interface Method Numerical Solutions of PDEs Involving Interfaces and Irregular Domains Zhiiin Li Kazufumi Ito North Carolina State University Raleigh, North Carolina Society for Industrial

More information

Panel methods are currently capable of rapidly solving the potential flow equation on rather complex

Panel methods are currently capable of rapidly solving the potential flow equation on rather complex A Fast, Unstructured Panel Solver John Moore 8.337 Final Project, Fall, 202 A parallel high-order Boundary Element Method accelerated by the Fast Multipole Method is presented in this report. The case

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Spatial Data Structures

Spatial Data Structures CSCI 480 Computer Graphics Lecture 7 Spatial Data Structures Hierarchical Bounding Volumes Regular Grids BSP Trees [Ch. 0.] March 8, 0 Jernej Barbic University of Southern California http://www-bcf.usc.edu/~jbarbic/cs480-s/

More information

A comparison of parallel rank-structured solvers

A comparison of parallel rank-structured solvers A comparison of parallel rank-structured solvers François-Henry Rouet Livermore Software Technology Corporation, Lawrence Berkeley National Laboratory Joint work with: - LSTC: J. Anton, C. Ashcraft, C.

More information

Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic

Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

The meshfree computation of stationary electric current densities in complex shaped conductors using 3D boundary element methods

The meshfree computation of stationary electric current densities in complex shaped conductors using 3D boundary element methods Boundary Elements and Other Mesh Reduction Methods XXXVII 121 The meshfree computation of stationary electric current densities in complex shaped conductors using 3D boundary element methods A. Buchau

More information

CS560 Lecture Parallel Architecture 1

CS560 Lecture Parallel Architecture 1 Parallel Architecture Announcements The RamCT merge is done! Please repost introductions. Manaf s office hours HW0 is due tomorrow night, please try RamCT submission HW1 has been posted Today Isoefficiency

More information

Automatic online tuning for fast Gaussian summation

Automatic online tuning for fast Gaussian summation Automatic online tuning for fast Gaussian summation Vlad I. Morariu 1, Balaji V. Srinivasan 1, Vikas C. Raykar 2, Ramani Duraiswami 1, and Larry S. Davis 1 1 University of Maryland, College Park, MD 20742

More information

How to Optimize Geometric Multigrid Methods on GPUs

How to Optimize Geometric Multigrid Methods on GPUs How to Optimize Geometric Multigrid Methods on GPUs Markus Stürmer, Harald Köstler, Ulrich Rüde System Simulation Group University Erlangen March 31st 2011 at Copper Schedule motivation imaging in gradient

More information

OpenACC programming for GPGPUs: Rotor wake simulation

OpenACC programming for GPGPUs: Rotor wake simulation DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

TERM PAPER ON The Compressive Sensing Based on Biorthogonal Wavelet Basis

TERM PAPER ON The Compressive Sensing Based on Biorthogonal Wavelet Basis TERM PAPER ON The Compressive Sensing Based on Biorthogonal Wavelet Basis Submitted By: Amrita Mishra 11104163 Manoj C 11104059 Under the Guidance of Dr. Sumana Gupta Professor Department of Electrical

More information

2D vector fields 3. Contents. Line Integral Convolution (LIC) Image based flow visualization Vector field topology. Fast LIC Oriented LIC

2D vector fields 3. Contents. Line Integral Convolution (LIC) Image based flow visualization Vector field topology. Fast LIC Oriented LIC 2D vector fields 3 Scientific Visualization (Part 8) PD Dr.-Ing. Peter Hastreiter Contents Line Integral Convolution (LIC) Fast LIC Oriented LIC Image based flow visualization Vector field topology 2 Applied

More information