Fast Multipole and Related Algorithms
Ramani Duraiswami, University of Maryland, College Park
http://www.umiacs.umd.edu/~ramani
Joint work with Nail A. Gumerov
A general paradigm: efficiency by exploiting symmetry and structure
Develop theory, algorithms, and software that apply to general classes of data
Recognize underlying mathematical symmetry
Recognize structure in the data
Adapt algorithms to the particular data
How does this apply to the FMM?
FMM: fast summation of singular kernel functions
Green's functions Φ(x,y) for a classical operator L
Solution at points y due to forcing at points x
Applications: electrostatics, molecular/stellar dynamics, vortex methods, boundary integral methods, particle discretizations of field equations
What is the FMM? Decompose the singular sum into a local, sparse part plus a far-field, dense part.
Factorization Trick (applied to the dense part)
Not the FMM, but it has some of the key ideas. Consider

v(y_j) = \sum_{i=1}^{N} \Phi_{ij} u_i = \sum_{i=1}^{N} u_i (x_i - y_j)^2,  j = 1, ..., M

The naive evaluation of the sum requires MN operations. Instead, expanding (x_i - y_j)^2, the sum can be written as

v(y_j) = \left(\sum_{i=1}^{N} u_i\right) y_j^2 + \left(\sum_{i=1}^{N} u_i x_i^2\right) - 2 y_j \left(\sum_{i=1}^{N} u_i x_i\right)

Evaluate each bracketed sum over i once,

\alpha = \sum_{i=1}^{N} u_i,  \beta = \sum_{i=1}^{N} u_i x_i^2,  \gamma = \sum_{i=1}^{N} u_i x_i,

and then evaluate

v(y_j) = \alpha y_j^2 + \beta - 2 \gamma y_j

This requires only O(M+N) operations.
Reduction of Complexity
Straightforward (nested loops): complexity O(MN)
Factorized: complexity O(pN + pM)
If p << min(M, N), the complexity is reduced!
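The factorization trick above can be sketched in a few lines. This is an illustrative NumPy sketch (the function names `naive_sum` and `factored_sum` are not from the original), comparing the O(MN) nested-loop evaluation with the O(M+N) factored form for the kernel (x - y)^2:

```python
# Sketch of the factorization trick for the kernel Phi(x, y) = (x - y)^2.
import numpy as np

def naive_sum(u, x, y):
    # O(M*N): evaluate v(y_j) = sum_i u_i (x_i - y_j)^2 directly.
    return np.array([np.sum(u * (x - yj) ** 2) for yj in y])

def factored_sum(u, x, y):
    # O(M+N): precompute the three bracketed sums once ...
    alpha = np.sum(u)            # sum_i u_i
    beta  = np.sum(u * x ** 2)   # sum_i u_i x_i^2
    gamma = np.sum(u * x)        # sum_i u_i x_i
    # ... then each evaluation point y_j costs O(1).
    return alpha * y ** 2 + beta - 2.0 * y * gamma

rng = np.random.default_rng(0)
u, x, y = rng.random(1000), rng.random(1000), rng.random(500)
assert np.allclose(naive_sum(u, x, y), factored_sum(u, x, y))
```

Because this kernel is exactly a degree-2 polynomial, p = 3 coefficients suffice and the factorization is exact; for general kernels the FMM uses truncated expansions instead.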
Take this far-near decomposition and apply it recursively
Split the field into a near field and a far field; subdivide the near field further
Use translations of the series representations between expansion centers: multipole-to-multipole (M2M, or S|S), multipole-to-local (M2L, or S|R), and local-to-local (L2L, or R|R)
Organize the recursion on a tree
[Figure: expansions W_r(x_*) translated to W_{r1}(x_* + t) under the (S|S), (S|R), and (R|R) operators]
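The translations above re-center a truncated series expansion. As a hedged one-dimensional sketch (not the 3D operators used in the actual FMM), here is an R|R (local-to-local) translation of a Taylor expansion: re-centering f(x) = sum_k c_k (x - x0)^k at x0 + t via the binomial theorem, which is exact for a finite (polynomial) expansion:

```python
# Hedged sketch: a 1-D "R|R" (local-to-local) translation of a truncated
# Taylor expansion f(x) = sum_k c[k] * (x - x0)^k, re-centered at x0 + t.
from math import comb

def rr_translate(c, t):
    # New coefficients: c'[m] = sum_{k>=m} c[k] * C(k, m) * t^(k-m),
    # from (x - x0)^k = ((x - x0 - t) + t)^k and the binomial theorem.
    p = len(c)
    return [sum(c[k] * comb(k, m) * t ** (k - m) for k in range(m, p))
            for m in range(p)]

def eval_exp(c, center, x):
    # Evaluate sum_k c[k] * (x - center)^k.
    return sum(ck * (x - center) ** k for k, ck in enumerate(c))

c = [1.0, -2.0, 0.5, 3.0]          # expansion about x0 = 0
c2 = rr_translate(c, 0.7)          # same polynomial, now about x0 = 0.7
assert abs(eval_exp(c, 0.0, 1.3) - eval_exp(c2, 0.7, 1.3)) < 1e-12
```

In the FMM proper the same idea is applied to multipole (S) and local (R) expansions of the Green's function, where translation additionally controls the truncation error.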
Data Structures
Separate data into local and global parts; manage the factorizations
Basic geometric construction: the well-separated pair decomposition (WSPD)
Sources inside one region are expanded as a series; the series representation is translated to another region and re-expanded there
Data Structures
Need to automatically structure the source and target point sets
Can't tile space automatically with nonoverlapping circles/spheres
Classically, use quadtrees/octrees
A recent paper considers alternative tilings
Box and its neighbors
Data structures must support
Separation: cells satisfy local separation and WSPD properties across all levels
Indexing: cell indices satisfy some order relation in memory
Spatial addressing: cell centers are computable from addresses, and each particle can find its bounding cell
Hierarchical addressing: cell children, parent, and neighbor indices are computable from addresses
All this must be done in O(N log N) and must be parallelizable
1. Nail A. Gumerov, Ramani Duraiswami, and Y.A. Borovikov. Data structures, optimal choice of parameters, and complexity results for generalized fast multipole methods in d dimensions. University of Maryland Technical Reports UMIACS-TR-2003-28 and CS-TR-4458, 2003.
2. Qi Hu, Nail Gumerov, and Ramani Duraiswami. Parallel algorithms for FMM data structures, 2011.
3. Yuancheng Luo and Ramani Duraiswami. Alternative tilings for the fast multipole method on the plane. Submitted; available on arXiv, 2012.
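The hierarchical-addressing requirement is commonly met with bit-interleaved (Morton) cell indices, from which parent, children, and neighbor addresses follow by bit arithmetic. A minimal 2-D (quadtree) sketch, with illustrative function names and the same idea extending to octrees with d = 3 bits per level:

```python
# Hedged sketch of hierarchical addressing via bit interleaving (Morton codes)
# for a quadtree; an octree uses 3 interleaved coordinates and shifts of 3.

def interleave2(ix, iy):
    # Build the Morton index by interleaving the bits of the cell coordinates.
    code = 0
    for b in range(16):
        code |= ((ix >> b) & 1) << (2 * b) | ((iy >> b) & 1) << (2 * b + 1)
    return code

def parent(code):
    # The parent index drops the last d = 2 bits.
    return code >> 2

def children(code):
    # The 2^d = 4 children append each possible 2-bit pattern.
    return [(code << 2) | q for q in range(4)]

c = interleave2(5, 3)            # the cell at coordinates (5, 3) on some level
assert c in children(parent(c))  # parent/child indices computable from the address
```

Sorting particles by Morton code also gives the in-memory order relation the slide asks for, and it parallelizes well (a parallel sort plus prefix operations), which is why this layout appears in GPU FMM codes.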
Direct method vs. FMM
Translations
A representation of size p1 must be translated to another representation of size p2
In general this is an O(p1 p2) computation
The cost can be reduced by exploiting symmetries and approximations of the translation matrix:
Coaxial and rotational symmetries; exponential expansions and integral relations; SVD-based approximations
Nail A. Gumerov and Ramani Duraiswami. A broadband fast multipole accelerated boundary element method for the three dimensional Helmholtz equation. Journal of the Acoustical Society of America, 125:191-205, 2009.
Nail A. Gumerov and Ramani Duraiswami. Comparison of the efficiency of translation operators used in the fast multipole method for the 3D Laplace equation. Technical Reports CS-TR-4701 and UMIACS-TR-2005-09, University of Maryland.
Nail A. Gumerov and Ramani Duraiswami. Recursions for the computation of multipole translation and rotation coefficients for the 3-D Helmholtz equation. SIAM Journal on Scientific Computing, 25(4):1344-1381, 2003.
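The SVD-based approximation mentioned above can be illustrated with a synthetic low-rank translation matrix (the matrix here is made up for the demonstration, not an actual FMM operator). If the p1 x p2 operator has numerical rank r << min(p1, p2), applying it through its truncated SVD factors costs O(r(p1 + p2)) instead of O(p1 p2):

```python
# Hedged sketch: compressing a dense p1 x p2 "translation matrix" with a
# truncated SVD, then applying it through the factors.
import numpy as np

rng = np.random.default_rng(1)
p1, p2, r = 200, 150, 8
T = rng.random((p1, r)) @ rng.random((r, p2))   # synthetic rank-r operator

U, s, Vt = np.linalg.svd(T, full_matrices=False)
Ur, sr, Vtr = U[:, :r], s[:r], Vt[:r, :]        # keep the r dominant modes

c = rng.random(p2)                               # incoming expansion coefficients
fast = Ur @ (sr * (Vtr @ c))                     # O(r*(p1+p2)) application
assert np.allclose(fast, T @ c)
```

For genuinely full-rank operators the truncation rank is set by an error tolerance rather than being exact, and the compression can be precomputed once per translation stencil.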
Helmholtz equation
Issue: the size of special-function-based factorizations grows with wavenumber times distance
This means the FMM is actually slower than direct multiplication at high frequencies!
Rokhlin's solution: diagonal factorization
Our solution: level-dependent truncation number; rotation-based symmetries and local expansion with special functions; switch to diagonal representations at later levels
More numerically stable, easier to implement in a BEM, and cheaper
FMM on the GPU
GPUs are extremely fast at computations with high arithmetic intensity (many operations per load/store), especially multiply-add (MADD)
They can do other computations, but not with as much speed-up
[Figure: CPU with a few cores, large control logic and cache, vs. GPU with hundreds of cores]
NVIDIA Tesla C2050: 448 cores; 1.25 Tflops single precision, 0.52 Tflops double precision
GPU/heterogeneous FMM
Moved to implement the FMM on the GPU almost as soon as CUDA was introduced (2007)
FMM translation based on rotation-translation decomposition, shown to be competitive with or superior to asymptotically faster diagonal translations
Real-valued formulations
Algorithm splitting
Extension to heterogeneous architectures by mapping local sums to the GPU (where they are most efficient) and the tree-based FMM to the CPU
Local summation on the GPU (FMM)
[Figure: time (s) for the sparse matrix-vector product vs. number of sources (10^2 to 10^7), 3D Laplace, s_max = 60, l_max = 5; access to box data accounts for part of the GPU cost]
Cost models (the parameters depend on the hardware):
CPU: Time = C N s, with cluster size s = 8^{-l_max} N
GPU: Time = A1 N + B1 N/s + C1 N s
Nail A. Gumerov and Ramani Duraiswami. Fast multipole methods on graphical processors. Journal of Computational Physics, 227:8290-8313, 2008.
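The GPU cost model above has an optimal cluster size: setting dTime/ds = -B1 N/s^2 + C1 N = 0 gives s_opt = sqrt(B1/C1). A small sketch with made-up constants (A1, B1, C1, and N below are illustrative, not measured values):

```python
# Hedged sketch: optimizing the cluster size s in the GPU cost model
# Time(s) = A1*N + B1*N/s + C1*N*s from the slide. Constants are made up.
import math

A1, B1, C1, N = 1e-8, 4e-7, 1e-9, 10**6

def time_model(s):
    return A1 * N + B1 * N / s + C1 * N * s

s_opt = math.sqrt(B1 / C1)   # stationary point of Time(s); = 20 here
# The analytic optimum beats nearby cluster sizes:
assert time_model(s_opt) <= min(time_model(s_opt / 2), time_model(2 * s_opt))
```

In practice the hardware-dependent constants are fit from benchmark runs, after which the octree depth (and hence s) is chosen near this optimum.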
FMM on heterogeneous architectures
Insight: why do gymnastics to fit tree algorithms, with their irregular memory access, onto the GPU?
GPUs are usually hosted by a multicore CPU
Let the GPU do the local sums and the CPU do the rest
Single-node algorithm (work split between GPU and CPU, inside the time loop)
ODE solver: source/receiver update; particle positions and source strengths
Data structure (octree and neighbors); translation stencils
Source S-expansions; upward S|S pass; downward S|R and R|R passes; receiver R-expansions
Local direct sum; final sum of far-field and near-field interactions
Task 3.5: Computational Considerations in Brownout Simulations
The algorithm flow chart (two nodes, A and B)
On each node: positions and source strengths -> ODE solver (source/receiver update) -> data structure (octree) -> merge octrees across nodes -> data structure (neighbors) -> redistribute particles -> single heterogeneous-node algorithm -> exchange final R-expansions -> final sum
Fast Gauss Transforms in high dimensions
Sums of Gaussians arise in many applications and in many dimensions: vortex methods; heat kernels; statistical density estimation; kernel methods in machine learning
FMM data structures and factorizations do not work in higher dimensions
An alternate global factorization, data structures (based on the k-center algorithm), and an automatically tuned algorithm were developed in a series of papers
Downloadable code on SourceForge (FigTree)
Many papers apply this to vision, machine learning, etc.
Vikas C. Raykar and Ramani Duraiswami. The improved fast Gauss transform with applications to machine learning. In L. Bottou, O. Chapelle, D. Decoste, and J. Weston, editors, Large Scale Kernel Machines, pages 175-202. MIT Press, 2007.
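The k-center-style clustering mentioned above is typically built with the greedy farthest-point (Gonzalez) heuristic, which works in any dimension. A hedged sketch (illustrative names; the actual FigTree implementation may differ in details):

```python
# Hedged sketch of greedy farthest-point (Gonzalez) clustering, a
# 2-approximation to the k-center problem, usable in high dimensions.
import numpy as np

def farthest_point_clustering(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]          # start from a random point
    d = np.linalg.norm(X - X[centers[0]], axis=1)  # distance to nearest center
    for _ in range(k - 1):
        nxt = int(np.argmax(d))                    # farthest point -> new center
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return centers, float(d.max())                 # centers and covering radius

X = np.random.default_rng(2).random((500, 3))
centers, radius = farthest_point_clustering(X, 16)
```

Each point is then assigned to its nearest center, and the Gaussian sum is factored about the center of each cluster; the covering radius controls the truncation error of the factorization.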
A few pretty pictures from papers
[Figures: field plots from Helmholtz computations, e.g. kd = 0.96 (250 Hz)]
Teaching the FMM
Course notes online since 2004: www.umiacs.umd.edu/~ramani/cmsc878r
A Java applet that demos the FMM
8 papers came directly from the course; teaching it helped solidify the broad themes