Fast Multipole and Related Algorithms

Similar documents
GPU accelerated heterogeneous computing for Particle/FMM Approaches and for Acoustic Imaging

CMSC 858M/AMSC 698R. Fast Multipole Methods. Nail A. Gumerov & Ramani Duraiswami. Lecture 20. Outline

FMM implementation on CPU and GPU. Nail A. Gumerov (Lecture for CMSC 828E)

Iterative methods for use with the Fast Multipole Method

Terascale on the desktop: Fast Multipole Methods on Graphical Processors

FMM Data Structures. Content. Introduction Hierarchical Space Subdivision with 2 d -Trees Hierarchical Indexing System Parent & Children Finding

A Kernel-independent Adaptive Fast Multipole Method

Efficient O(N log N) algorithms for scattered data interpolation

FMM accelerated BEM for 3D Helmholtz equation

GPUML: Graphical processors for speeding up kernel machines

Fast Multipole Method on the GPU

Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures

Fast Multipole Methods. Linear Systems. Matrix vector product. An Introduction to Fast Multipole Methods.

Fast Multipole Accelerated Indirect Boundary Elements for the Helmholtz Equation

Low-rank Properties, Tree Structure, and Recursive Algorithms with Applications. Jingfang Huang Department of Mathematics UNC at Chapel Hill

Scientific Computing on Graphical Processors: FMM, Flagon, Signal Processing, Plasma and Astrophysics

Center for Computational Science

The Fast Multipole Method on NVIDIA GPUs and Multicore Processors

Scalable Distributed Fast Multipole Methods

Fast multipole methods for axisymmetric geometries

ExaFMM. Fast multipole method software aiming for exascale systems. User's Manual. Rio Yokota, L. A. Barba. November Revision 1

Using GPUs to compute the multilevel summation of electrostatic forces

Kernel Independent FMM

Tree-based methods on GPUs

Intermediate Parallel Programming & Cluster Computing

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs

FMM CMSC 878R/AMSC 698R. Lecture 13

Empirical Testing of Fast Kernel Density Estimation Algorithms

The Fast Multipole Method (FMM)

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Fast Methods with Sieve

Fast Multipole Method on GPU: Tackling 3-D Capacitance Extraction on Massively Parallel SIMD Platforms

c 2007 Society for Industrial and Applied Mathematics

Fast Spherical Filtering in the Broadband FMBEM using a nonequally

Di Zhao Ohio State University MVAPICH User Group (MUG) Meeting, August , Columbus Ohio

Multi-Domain Pattern. I. Problem. II. Driving Forces. III. Solution

HPC Algorithms and Applications

CUDA Experiences: Over-Optimization and Future HPC

GPU Implementation of Implicit Runge-Kutta Methods

Reconstruction of Trees from Laser Scan Data and further Simulation Topics

An Efficient CUDA Implementation of a Tree-Based N-Body Algorithm. Martin Burtscher Department of Computer Science Texas State University-San Marcos

Computational Fluid Dynamics (CFD) using Graphics Processing Units

Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX

Empirical Comparisons of Fast Methods

The Fast Multipole Method and the Radiosity Kernel

PhD Student. Associate Professor, Co-Director, Center for Computational Earth and Environmental Science. Abdulrahman Manea.

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University

A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Accelerated flow acoustic boundary element solver and the noise generation of fish

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh.

Cauchy Fast Multipole Method for Analytic Kernel

Hierarchical Low-Rank Approximation Methods on Distributed Memory and GPUs

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Optimisation Myths and Facts as Seen in Statistical Physics

8. Hardware-Aware Numerics. Approaching supercomputing...

Comparison of different solvers for two-dimensional steady heat conduction equation ME 412 Project 2

8. Hardware-Aware Numerics. Approaching supercomputing...

Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Parallel Implementations of Gaussian Elimination

A Broadband Fast Multipole Accelerated Boundary Element Method for the 3D Helmholtz Equation. Abstract

Multigrid Pattern. I. Problem. II. Driving Forces. III. Solution

Parallel multi-frontal solver for isogeometric finite element methods on GPU

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

FAST MATRIX-VECTOR PRODUCT BASED FGMRES FOR KERNEL MACHINES

GPU-based Distributed Behavior Models with CUDA

An Auto-Tuning Framework for Parallel Multicore Stencil Computations

On the limits of (and opportunities for?) GPU acceleration

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

Robot Mapping. Least Squares Approach to SLAM. Cyrill Stachniss

Graphbased. Kalman filter. Particle filter. Three Main SLAM Paradigms. Robot Mapping. Least Squares Approach to SLAM. Least Squares in General

EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY. Stephen Abbott, March

GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis

The success of multipole methods: Are we there yet?

Spatial Data Structures

Accelerating Molecular Modeling Applications with Graphics Processors

Scaling Fast Multipole Methods up to 4000 GPUs

Implementation of rotation-based operators for Fast Multipole Method in X10 Andrew Haigh

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA

DIMENSION REDUCTION FOR HYPERSPECTRAL DATA USING RANDOMIZED PCA AND LAPLACIAN EIGENMAPS

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

T6: Position-Based Simulation Methods in Computer Graphics. Jan Bender Miles Macklin Matthias Müller

Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units

The Immersed Interface Method

Panel methods are currently capable of rapidly solving the potential flow equation on rather complex

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Spatial Data Structures

A comparison of parallel rank-structured solvers

Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic

The meshfree computation of stationary electric current densities in complex shaped conductors using 3D boundary element methods

CS560 Lecture Parallel Architecture 1

Automatic online tuning for fast Gaussian summation

How to Optimize Geometric Multigrid Methods on GPUs

OpenACC programming for GPGPUs: Rotor wake simulation

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

TERM PAPER ON The Compressive Sensing Based on Biorthogonal Wavelet Basis

2D vector fields 3. Contents. Line Integral Convolution (LIC) Image based flow visualization Vector field topology. Fast LIC Oriented LIC

Transcription:

Fast Multipole and Related Algorithms Ramani Duraiswami University of Maryland, College Park http://www.umiacs.umd.edu/~ramani Joint work with Nail A. Gumerov

Efficiency by exploiting symmetry and A general paradigm structure Develop theory, algorithms and software that apply to general classes of data Recognize underlying mathematical symmetry Recognize structure in data Adapt algorithms to particular data How does this apply to the FMM?

FMM Fast summation of singular kernel functions Green s functions Φ(x,y) for classical operator L Solution y to forcing at points x Applications Electrostatics, Molecular/Stellar dynamics, Vortex Boundary Integral Methods Particle discretization of field equations

What is the FMM? Decompose singular sum in to local or sparse part + far-field and dense part

Factorization Trick apply to dense part Not FMM, but has some key ideas Consider v(y j )= N i=1 ij u i = N i=1 u i (x i y j ) 2 j=1,,m Naïve way to evaluate sum requires MN operations Instead can write the sum as v(y j )= ( N i=1 u i ) y j2 +( N i=1 u i x i2 ) 2y j ( N i=1 u i x i ) Can evaluate each bracketed sum over i =( N i=1 u i ), = ( N i=1 u i x i2 ), = ( N i=1 u i x i ) and evaluate v(y j )= y 2 j + 2y j Requires O(M+N) operations

Reduction of Complexity Straightforward (nested loops): Factorized: Complexity: O(MN) If p << min(m,n) then complexity reduces! Complexity: O(pN+pM)

Take this far-near decomposition and apply it recursively Far field and near field Take near-field and divide further Use translations Trees L2L M2M M2L W r1 (x W R * +t) x y 2 x x * +t t *2 r 1 R (R R) x *1* r W r (x * ) W 1 W r1 (x * +t) W r (x * ) y S W S 2 x *1 x * +t x (S S) t *2 W 1 x i r 1 r W r1 (x * +t) x * +t r y 1 (S R) S t S x (S S) *1 x *2 r W 1 x i

Data Structures Separate data into local and global Manage the factorizations Basic geometric construction --- WSPD Stuff inside one area, expanded as a series Translate series representation to other area Expand in the other area

Data Structures Need to automatically structure the source and target point sets Can t tile space automatically with nonoverlapping circles/spheres Classically, use quadtree/octrees A recent paper

Box and its neighbors

Data structures must support Separation: Cells satisfy local separation and WSPD properties across all levels. Indexing: Cell indices satisfy some order relation in memory. Spatial addressing: Cell centers are computable from addresses and each particle can find its bounding cell. Hierarchical addressing: Cell children, parent, and neighbor indices are computable from addresses. All this must be done in O(N log N) and must be parallelizable 1. Nail A. Gumerov, Ramani Duraiswami, and Y.A. Borovikov. Data structures, optimal choice of parameters, and complexity results for generalized fast multipole methods in d dimensions. CS/UMIACS UMIACS-TR-2003-28, CS-TR-4458. 2. Qi Hu, Nail Gumerov and Ramani Duraiswami. Parallel Algorithms for FMM Data Structures. 2011 3. Yuancheng Luo and Ramani Duraiswami Alternative Tilings for the Fast Multipole Method on the Plane submitted. Available on arxiv. 2012

Direct method vs. FMM

Translations Representation size p 1 needs to be translated to another representation of size p 2 In general this is a O(p 1 p 2 ) computation Can reduce this cost by looking at symmetries and approximations of the matrix Coaxial and rotational symmetries exponential expansions and Integral Relations SVD based approximations Nail A. Gumerov and Ramani Duraiswami. A broadband fast multipole accelerated boundary element method for the three dimensional Helmholtz equation. Journal of the Acoustical Society of America, 125:191 205, 2009. Nail A. Gumerov and Ramani Duraiswami. Comparison of the efficiency of translation operators used in the fast multipole method for the 3D Laplace equation. Technical Report CS-TR-4701 and UMIACS-TR 2005-09, University of Maryland Nail A. Gumerov and Ramani Duraiswami. Recursions for the computation of multipole translation and rotation coefficients for the 3-D Helmholtz equation. SIAM Journal on Scientific Computing, 25(4):1344 1381, 2003.

Helmholtz equation Issue: size of special function based factorizations grow with wavenumber times distance Means FMM is actually slower than direct multiplication at high frequencies! Rokhlin solution: diagonal factorization Our solution: level dependent truncation number; use of rotation based symmetries and local expansion with special functions; switch to diagonal representations later More numerically stable; easier to implement in BEM; cheaper

FMM on the GPU GPUs are extremely fast at doing computations which have high arithmetic intensity for each load/store, esp. MADD Do other computations, but not with as much speed-up A few cores Hundreds cores Control ALU ALU ALU ALU Cache DRAM CPU DRAM GPU NVIDIA Tesla C2050: 1.25 Tflops single 0.52 Tflops double 448 cores

GPU/heterogeneous FMM Moved to implement FMM on GPU almost with the introduction of CUDA (2007) FMM translation based on rotation translation, and showing that it is competitive or superior to asymptotically faster diagonal translations Real valued formulations Algorithm splitting Extension to heterogeneous architectures by mapping local sums to GPU (where they are most efficient) and tree based FMM to CPU

Time (s) Local summation on GPU (FMM) 1.E+02 1.E+01 Sparse Matrix-Vector Product s max =60 y=bx CPU: Time=CNs, s=8 -lmax N 1.E+00 CPU l max =5 y=cx GPU: 1.E-01 y=ax 2 4 Time=A 1 N+B 1 N/s+C 1 Ns 1.E-02 1.E-03 2 3 GPU y=8 -lmax dx 2 +ex+8 lmax f b/c=16 read/write float computations 1.E-04 b/c=16 3D Laplace 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 Number of Sources access to box data These parameters depend on the hardware Nail A. Gumerov and Ramani Duraiswami. Fast multipole methods on graphical processors. Journal of Computational Physics, 227:8290 8313, 2008.

FMM on heterogeneous architectures Insight --- why do gymnastics to fit tree algorithms with irregular access on the GPU GPUs hosted usually with a multicore CPU Let the GPU do local sum, rest on the CPU

Single node algorithm GPU work CPU work ODE solver: source receiver update particle positions, source strength data structure (octree and neighbors) source S-expansions translation stencils local direct sum upward S S downward S R R R receiver R-expansions time loop final sum of far-field and near-field interactions Task 3.5: Computational Considerations in Brownout Simulations 19

The algorithm flow chart Node A Node B positions, source strength positions, source strength ODE solver: source receiver update data structure (octree) merge octree data structure (neighbors) data structure (octree) merge octree data structure (neighbors) ODE solver: source receiver update redistributed particle redistributed particle single heterogeneous node algorithm single heterogeneous node algorithm exchange final R-expansions exchange final R-expansions final sum final sum Task 3.5: Computational Considerations in Brownout Simulations 20

Fast Gauss Transforms in high dimensions Sums of Gaussians arise in many applications in many dimensions Vortex methods; heat kernels; statistical density estimation; kernel methods in machine learning FMM data structures and factorizations do not work in higher dimensions Alternate global factorization; Data structures (based on k- center algorithm); and a automatically tuned algorithm developed in a series of papers Downloadable code from SourceForge (FigTree) Many papers applying this to vision, machine learning etc. Vikas C. Raykar and Ramani Duraiswami. The improved fast Gauss transform with applications to machine learning. In L. Bottou, O. Chapelle, D. Decoste, and J. Weston, editors, Large Scale Kernel Machines, pages 175 202. MIT Press, 2007.

A few pretty pictures from papers 1 kd=0.96 (250 Hz) x z y p 4 x p 3 p 3 z y y p 3 z y z x x

Teaching the FMM Course notes since 2004 online www.umiacs.umd.edu/~ramani/cmsc878r Java Applet that demos the FMM 8 papers directly from the course enabled solidifying broad themes