FMM implementation on CPU and GPU. Nail A. Gumerov (Lecture for CMSC 828E)

Outline
- Two parts of the FMM
- Data structure
- Flow chart of the run algorithm
- FMM cost/optimization on CPU
- Programming on GPU
- Fast matrix-vector multiplication on GPU with an on-core computable kernel
- FMM sparse matrix-vector multiplication on GPU
- Change of the cost/optimization scheme on GPU compared to CPU
- Other steps
- Test results: profile, accuracy, overall performance

Publication. The major part of this work has been published: N.A. Gumerov and R. Duraiswami, Fast multipole methods on graphics processors, Journal of Computational Physics, 227, 8290-8313, 2008.

Two parts of the FMM
- Generate the data structure (define the matrix);
- Perform the matrix-vector product (run).
Two different types of problems:
- Static matrix (e.g. for iterative solution of a large linear system): the data structure is generated once, while the run is performed many times with different input vectors;
- Dynamic matrix (e.g. the N-body problem): the data structure must be regenerated each time a matrix-vector multiplication is needed.

Data Structure
- We need an O(N log N) procedure to index all the boxes and to find children, parents, and neighbors;
- Implemented on CPU using bit interleaving (Morton-Peano indexing in the octree) and sorting;
- We generate lists of children/parents/neighbors/E4-neighbors and store them in global memory;
- Recently, graduate students Michael Lieberman and Adam O'Donovan implemented the basic data structures on the GPU (prefix sort, etc.) and found speedups on the order of 10 over the CPU for large data sets.
In the present lecture we focus on the run algorithm, assuming that the necessary lists are already generated and stored in global memory. A small sketch of the box-indexing idea follows.
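
To make the bit-interleaving idea concrete, here is a minimal host/device sketch (not the lecture's actual Fortran/CUDA code) of Morton-Peano indexing for an octree: the index of a box at level l is obtained by interleaving the bits of its integer coordinates (ix, iy, iz), and parent/child indices follow by dropping or appending one octal digit. Names such as morton_encode and box_parent are illustrative only.

```cuda
#include <cstdint>
#include <cstdio>

// Spread the low 21 bits of x so that they occupy every third bit of a 64-bit word.
__host__ __device__ inline uint64_t spread_bits3(uint64_t x) {
    x &= 0x1FFFFFULL;                                  // keep 21 bits
    x = (x | (x << 32)) & 0x1F00000000FFFFULL;
    x = (x | (x << 16)) & 0x1F0000FF0000FFULL;
    x = (x | (x << 8))  & 0x100F00F00F00F00FULL;
    x = (x | (x << 4))  & 0x10C30C30C30C30C3ULL;
    x = (x | (x << 2))  & 0x1249249249249249ULL;
    return x;
}

// Morton (Z-order) index of the box with integer coordinates (ix, iy, iz).
__host__ __device__ inline uint64_t morton_encode(uint32_t ix, uint32_t iy, uint32_t iz) {
    return spread_bits3(ix) | (spread_bits3(iy) << 1) | (spread_bits3(iz) << 2);
}

// Parent and first child are obtained by dropping/appending one octal digit.
__host__ __device__ inline uint64_t box_parent(uint64_t m)      { return m >> 3; }
__host__ __device__ inline uint64_t box_first_child(uint64_t m) { return m << 3; }

int main() {
    uint64_t m = morton_encode(5, 3, 6);               // a box at level 3 (coordinates < 2^3)
    printf("box %llu, parent %llu, first child %llu\n",
           (unsigned long long)m,
           (unsigned long long)box_parent(m),
           (unsigned long long)box_first_child(m));
    return 0;
}
```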

Flow Chart of the FMM Run
1. Level l_max: get S-expansion coefficients (directly).
2. Levels l_max-1, ..., 2: get S-expansion coefficients from children (S|S translation).
3. Level 2: get R-expansion coefficients from far neighbors (S|R translation).
4. Levels 3, ..., l_max: get R-expansion coefficients from far neighbors (S|R translation).
5. Levels 3, ..., l_max: get R-expansion coefficients from parents (R|R translation).
6. Level l_max: evaluate R-expansions (directly).
7. Level l_max: sum sources in the close neighborhood (directly).
Steps 1-6 constitute the approximate dense matrix-vector multiplication; step 7 is the direct sparse matrix-vector multiplication. A schematic host-side sketch of this flow is given below.
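
The following is a minimal host-side sketch of the run flow, assuming the octree lists are already built. The functions get_S_expansions_leaf, translate_S_to_S, etc. are placeholders for the lecture's high-performance CUDA routines, not their actual names or signatures.

```cuda
#include <cstdio>

// Placeholder steps; in the real code these are CUDA kernels operating on
// expansion coefficients stored in GPU global memory.
void get_S_expansions_leaf()         { /* step 1: sources -> S expansions at level l_max */ }
void translate_S_to_S(int level)     { /* step 2: children -> parent (S|S)               */ }
void translate_S_to_R(int level)     { /* steps 3-4: far neighbors (S|R)                 */ }
void translate_R_to_R(int level)     { /* step 5: parent -> children (R|R)               */ }
void evaluate_R_expansions_leaf()    { /* step 6: R expansions -> receivers              */ }
void sum_close_neighborhood_direct() { /* step 7: sparse direct part                     */ }

// Run algorithm for a fixed data structure with maximum octree level l_max.
void fmm_run(int l_max) {
    // Upward pass (part of the approximate dense matrix-vector product).
    get_S_expansions_leaf();                     // level l_max
    for (int l = l_max - 1; l >= 2; --l)
        translate_S_to_S(l);                     // levels l_max-1, ..., 2

    // Downward pass.
    translate_S_to_R(2);                         // level 2
    for (int l = 3; l <= l_max; ++l) {
        translate_S_to_R(l);                     // levels 3, ..., l_max
        translate_R_to_R(l);
    }
    evaluate_R_expansions_leaf();                // level l_max

    // Direct sparse matrix-vector product.
    sum_close_neighborhood_direct();             // level l_max
}

int main() { fmm_run(4); printf("FMM run flow executed (stubs).\n"); return 0; }
```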

FMM Cost (Operations)
Points are clustered with clustering parameter s, meaning no more than s points in the smallest octree box. Expansions have length p^2 (number of coefficients). There are O(N) points and O(N_boxes) = O(N/s) boxes in the octree. The costs of the steps of the run flow chart are:
1. Get S-expansion coefficients (directly), level l_max: O(N p^2);
2. S|S translations, levels l_max-1, ..., 2: O(N_boxes p^3);
3. S|R translations, level 2: O(p^3);
4. S|R translations, levels 3, ..., l_max: O(N_boxes p^3);
5. R|R translations from parents: O(N_boxes p^3);
6. Evaluate R-expansions (directly), level l_max: O(N p^2);
7. Sum sources in the close neighborhood (directly), level l_max: O(N s).

FMM Cost/Optimization
Since p = O(1), the operation count is
TotalCost = A N + B N_boxes + C N s = N (A + B/s + C s).
The A N term is constant in s, the B/s term (translations) decreases with s, and the C s term (direct summation) grows with s, so the total cost has a minimum at s_opt = (B/C)^{1/2}. A one-line derivation of this optimum follows.
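
The optimum is obtained by standard calculus (this derivation is added here for completeness; it is not on the slide):

```latex
\text{Cost}(s) = N\left(A + \frac{B}{s} + Cs\right), \qquad
\frac{d\,\text{Cost}}{ds} = N\left(-\frac{B}{s^{2}} + C\right) = 0
\;\Longrightarrow\;
s_{\mathrm{opt}} = \sqrt{\frac{B}{C}}, \qquad
\text{Cost}(s_{\mathrm{opt}}) = N\left(A + 2\sqrt{BC}\right).
```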

Challenges
- Complex FMM data structure;
- The problem is not native to SIMD semantics;
- Non-uniformity of the data makes efficient load balancing difficult (given the large number of threads);
- Serial algorithms rely on recursive computations;
- Existing libraries (CUBLAS) and the middleware approach are not sufficient; high-performing FMM functions have to be redesigned and written in CUDA;
- Too little fast (shared/constant) memory for efficient implementation of the translation operators;
- Absence of good debugging tools for the GPU.

FMM on GPU

Implementation
- The main algorithm is in Fortran 95;
- Data structures are computed on the CPU;
- The interface between Fortran and the GPU is provided by our middleware (Flagon), a layer between Fortran and C++/CUDA;
- High-performance custom subroutines are written in CUDA.

Availability. The Fortran-9X version is released for free public use. It is called FLAGON: http://flagon.wiki.sourceforge.net/

High-performance direct summation on GPU (total)
[Figure: wall-clock time (s) vs. number of sources for direct computation of the 3D Laplace potential; both the CPU and GPU curves scale as O(N^2), with the GPU roughly 600 times faster than the serial CPU for large N.]
- CPU: 2.67 GHz Intel Core 2 Extreme QX 7400 (2 GB RAM, one of four cores employed); serial code, no partial caching, no loop unrolling (simple execution of the nested loop).
- GPU: NVIDIA GeForce 8800 GTX (peak 330 GFLOPS); estimated achieved rate: 190 GFLOPS.

Algorithm (in words)
1. All source data, including coordinates and source intensities, are stored in a single array in global GPU memory (the size of the array is 4N for potential evaluation and 7N for potential/force evaluation). Two more large arrays reside in global memory: one with the receiver coordinates (size 3M), and the other allocated for the result of the matrix-vector product (M for potential computations and 4M for potential and force computations). In case the sources and receivers are the same, as in particle dynamics computations, these can be reduced.
2. The routine is executed in CUDA on a one-dimensional grid of one-dimensional blocks of threads. By default each block contains 256 threads, which was determined via an empirical study of the optimal thread-block size. This study also showed that for good performance the thread-block grid should contain not less than 32 blocks for a 16-multiprocessor configuration. If the number of receivers is not a multiple of the block size, only the remaining number of threads is employed in the last block.
3. Each thread in the block handles the computation of one element of the output vector (one receiver). For this purpose one register for the potential (and three registers for the force components) are allocated and initialized to zero. A thread can take care of another receiver only after the entire block of threads has executed, the threads are synchronized, and the shared memory can be overwritten.
4. Each thread reads the respective receiver coordinates directly from global memory and puts them into shared memory.
5. For a given block the source data are processed in a loop in batches of B sources, where B depends on the type of the kernel (4 or 7 floats per source for monopole or monopole/dipole summations, respectively) and on the size of the available shared memory. Currently we use 8 kB, or 2048 floats, of shared memory (so, e.g., for monopole summation B = 512). If the number of sources is not a multiple of B, the last batch handles the remaining sources.
6. All threads of the block are employed to read the source data for the given batch from global memory and put them into shared memory (consecutive reads, one thread per float, until the entire batch is loaded, followed by thread synchronization).
7. Each thread computes the product of one matrix element (kernel) with the source intensity (or, in the case of dipoles, of the kernel with the dipole moment) and adds it to the content of its summation register.
8. After summation of the contributions of all sources (all batches), each thread writes the sum from its register into global memory.
A hedged CUDA sketch of this scheme is given below.
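
The following is a minimal CUDA sketch of the scheme just described, for monopole (potential-only) summation with the 3D Laplace kernel 1/r. It is an illustration written for this write-up, not the actual FLAGON code; the kernel name direct_potential, the use of float4/float3 arrays, and the constants BLOCK_SIZE, BATCH, and EPS2 are assumptions.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256      // threads per block (one receiver per thread)
#define BATCH      512      // sources staged in shared memory per iteration (2048 floats)
#define EPS2       1e-12f   // softening to avoid division by zero at r = 0

// src: N sources (x, y, z, intensity), rcv: M receiver coordinates, phi: M potentials.
__global__ void direct_potential(const float4* __restrict__ src, int nsrc,
                                 const float3* __restrict__ rcv, int nrcv,
                                 float* __restrict__ phi)
{
    __shared__ float4 s_src[BATCH];

    int i = blockIdx.x * blockDim.x + threadIdx.x;        // receiver handled by this thread
    float3 r = (i < nrcv) ? rcv[i] : make_float3(0.f, 0.f, 0.f);
    float sum = 0.f;                                       // summation register

    // Loop over all sources in batches staged through shared memory.
    for (int base = 0; base < nsrc; base += BATCH) {
        int nbatch = min(BATCH, nsrc - base);

        // All threads cooperate to load the batch (coalesced reads), then synchronize.
        for (int j = threadIdx.x; j < nbatch; j += blockDim.x)
            s_src[j] = src[base + j];
        __syncthreads();

        if (i < nrcv) {
            for (int j = 0; j < nbatch; ++j) {
                float dx = r.x - s_src[j].x;
                float dy = r.y - s_src[j].y;
                float dz = r.z - s_src[j].z;
                float r2 = dx * dx + dy * dy + dz * dz + EPS2;
                sum += s_src[j].w * rsqrtf(r2);            // q_j / |r_i - x_j|
            }
        }
        __syncthreads();                                   // before shared memory is overwritten
    }

    if (i < nrcv) phi[i] = sum;                            // write result to global memory
}

// Host-side launch, assuming the device arrays d_src, d_rcv, d_phi are already filled.
void launch_direct(const float4* d_src, int nsrc, const float3* d_rcv, int nrcv, float* d_phi)
{
    int grid = (nrcv + BLOCK_SIZE - 1) / BLOCK_SIZE;
    direct_potential<<<grid, BLOCK_SIZE>>>(d_src, nsrc, d_rcv, nrcv, d_phi);
}
```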

Direct summation on GPU (final step in the FMM)
[Figure: time (s) vs. number of sources for the sparse matrix-vector product (computation of the potential), with settings optimal for the CPU (s_max = 60); the octree level l_max increases stepwise (2, 3, 4, 5) as N grows, and for these settings the GPU is about 16 times faster than the CPU (linear-fit coefficient ratio b/c = 16). 3D Laplace kernel.]
- CPU: Time = C N s, with s = 8^{-l_max} N.
- GPU: Time = A_1 N + B_1 N/s + C_1 N s, the three terms roughly corresponding to read/write of floats, access to the box data, and computations; these parameters depend on the hardware.

Algorithm modification
1. Each block of threads executes the computations for one receiver box (block-per-box parallelization).
2. Each block of threads accesses the data structure.
3. The loop over all sources contributing to the result requires a partially non-coalesced read of the source data (box by box).
A hedged sketch of such a kernel follows.
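
A minimal CUDA sketch of the block-per-box idea, assuming a CSR-like layout of the box lists in global memory (box_start/box_count for the points of each box, nbr_start/nbr_list for its neighbor boxes). These array names and the layout are assumptions for illustration, not the lecture's actual data structure; in the real code the source data of each neighbor box would typically also be staged through shared memory, as in the dense kernel above.

```cuda
#include <cuda_runtime.h>

#define EPS2 1e-12f

// One thread block per receiver box; each thread handles one receiver of that box
// and loops over the sources of all neighbor boxes (read box by box, hence only
// partially coalesced).
__global__ void sparse_mvp_per_box(const float4* __restrict__ src,     // sources, sorted by box
                                   const float3* __restrict__ rcv,     // receivers, sorted by box
                                   float* __restrict__ phi,            // output potentials
                                   const int* __restrict__ box_start,  // first point of box b
                                   const int* __restrict__ box_count,  // number of points in box b
                                   const int* __restrict__ nbr_start,  // CSR offsets (nboxes + 1)
                                   const int* __restrict__ nbr_list)   // neighbor box indices
{
    int b = blockIdx.x;                                // receiver box handled by this block
    int nrcv = box_count[b];

    for (int t = threadIdx.x; t < nrcv; t += blockDim.x) {
        int i = box_start[b] + t;                      // global receiver index
        float3 r = rcv[i];
        float sum = 0.f;

        // Loop over the neighbor boxes of box b (including b itself).
        for (int k = nbr_start[b]; k < nbr_start[b + 1]; ++k) {
            int nb = nbr_list[k];
            int s0 = box_start[nb], s1 = s0 + box_count[nb];
            for (int j = s0; j < s1; ++j) {            // sources of one neighbor box
                float dx = r.x - src[j].x;
                float dy = r.y - src[j].y;
                float dz = r.z - src[j].z;
                float r2 = dx * dx + dy * dy + dz * dz + EPS2;
                sum += src[j].w * rsqrtf(r2);
            }
        }
        phi[i] += sum;                                 // add the sparse (near-field) contribution
    }
}
```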

Optimization
[Figure: schematic plots of time vs. cluster size s. For both CPU and GPU, the sparse (direct) M-V product cost grows with s while the dense (translation) part decays with s, and their sum has a minimum at s_opt. On the GPU the sparse part is nearly flat up to some cluster size s_1, which pushes s_opt to larger values than on the CPU.]

Direct summation on GPU (final step in the FMM)
Compare the GPU final-summation complexity, Cost = A_1 N + B_1 N/s + C_1 N s, with the total FMM complexity, Cost = A N + B N/s + C N s. The optimal cluster size for the direct-summation step alone is s_opt = (B_1/C_1)^{1/2}, and this can only increase for the full algorithm, since its complexity is Cost = (A + A_1) N + (B + B_1) N/s + C_1 N s, giving s_opt = ((B + B_1)/C_1)^{1/2}.

Direct summation on GPU (final step in the FMM)
[Figure: time (s) vs. number of sources for the sparse matrix-vector product (computation of the potential), with settings optimal for the GPU (s_max = 320, maximum octree level l_max = 4); at these settings the GPU is about 300 times faster than the CPU (linear-fit coefficient ratio b/c = 300). 3D Laplace kernel.]

Direct summation on GPU (final step in the FMM)
Computations of the potential, optimal settings for CPU and GPU:

N          Serial CPU (s)   GPU (s)    Time ratio
4096       3.51E-02         1.50E-03   23
8192       1.34E-01         2.96E-03   45
16384      5.26E-01         7.50E-03   70
32768      3.22E-01         1.51E-02   21
65536      1.23E+00         2.75E-02   45
131072     4.81E+00         7.13E-02   68
262144     2.75E+00         1.20E-01   23
524288     1.05E+01         2.21E-01   47
1048576    4.10E+01         5.96E-01   69

Important conclusion: since the optimal maximum level of the octree on the GPU is lower than on the CPU, the importance of optimizing the translation subroutines diminishes.

Other steps of the FMM on GPU: accelerations in the range 5-60; effective accelerations for N = 1,048,576 (taking into account the reduction of the maximum level): 30-60.

Profile

Accuracy
- Error measure: relative L2 norm (see below);
- CPU single-precision direct summation was taken as "exact";
- 100 sampling points were used.
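
The slide's formula is not preserved in the transcription; the standard relative L2 error over M sampling points, consistent with the description above, is:

```latex
\epsilon_2 =
\left(
\frac{\sum_{j=1}^{M}\bigl|\phi^{\mathrm{FMM}}(\mathbf{y}_j)-\phi^{\mathrm{exact}}(\mathbf{y}_j)\bigr|^{2}}
     {\sum_{j=1}^{M}\bigl|\phi^{\mathrm{exact}}(\mathbf{y}_j)\bigr|^{2}}
\right)^{1/2},
\qquad M = 100 .
```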

What is more accurate for the solution of large problems on the GPU: direct summation or the FMM?
[Figure: relative L2 error in the computed potential vs. number of sources, for the FMM with p = 4, 8, 12 and for direct summation; filled symbols = GPU, empty symbols = CPU.]
The error is computed over a grid of 729 sampling points, relative to the exact solution, which is direct summation in double precision. A possible reason why the GPU error in direct summation grows with problem size: systematic roundoff error in the computation of the function 1/sqrt(x) (still a question). A small illustration of how this reciprocal square root can be evaluated on the GPU is given below.
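
For illustration only (not from the lecture): single-precision GPU code often evaluates 1/sqrt with a fast approximate instruction, and the sketch below contrasts that with plain single precision and a double-precision reference. Whether this fully explains the observed error growth is, as the slide says, an open question; the kernel name rsqrt_compare and the test value are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Evaluate 1/sqrt(r2) in three ways: fast approximate single precision,
// plain single precision, and a double-precision reference.
__global__ void rsqrt_compare(float r2, float* fast, float* plain, double* ref)
{
    *fast  = rsqrtf(r2);               // hardware approximation (a few ulp of error)
    *plain = 1.0f / sqrtf(r2);         // with --use_fast_math this also compiles to the fast path
    *ref   = 1.0 / sqrt((double)r2);   // double-precision reference
}

int main()
{
    float *d_fast, *d_plain; double *d_ref;
    cudaMalloc(&d_fast, sizeof(float));
    cudaMalloc(&d_plain, sizeof(float));
    cudaMalloc(&d_ref, sizeof(double));

    rsqrt_compare<<<1, 1>>>(3.0f, d_fast, d_plain, d_ref);

    float fast, plain; double ref;
    cudaMemcpy(&fast, d_fast, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&plain, d_plain, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&ref, d_ref, sizeof(double), cudaMemcpyDeviceToHost);
    printf("rsqrtf: %.9g  1/sqrtf: %.9g  double: %.17g\n", fast, plain, ref);
    return 0;
}
```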

Performance

N = 1,048,576 (potential only):
p      Serial CPU   GPU        Ratio
4      22.25 s      0.683 s    33
8      51.17 s      0.908 s    56
12     66.56 s      1.395 s    48

N = 1,048,576 (potential + forces (gradient)):
p      Serial CPU   GPU        Ratio
4      28.37 s      0.979 s    29
8      88.09 s      1.227 s    72
12     116.1 s      1.761 s    66

Performance
[Figure: three panels (p = 4, 8, 12) of run time (s) vs. number of sources for the 3D Laplace problem, comparing CPU direct, CPU FMM, GPU direct, and GPU FMM. Direct summation scales as O(N^2), the FMM as O(N). In all panels the CPU/GPU ratio for direct summation is a/b ≈ 600; the CPU/GPU ratio for the FMM is c/d ≈ 30 for p = 4 and ≈ 50 for p = 8 and p = 12. FMM errors: ~2×10⁻⁴ (p = 4), ~10⁻⁵ (p = 8), ~10⁻⁶ (p = 12).]

Performance
Computations of the potential and forces: the peak performance of the GPU for direct summation is 290 Gigaflops, while for the FMM on the GPU effective rates in the range 25-50 Teraflops are observed (following the citation below).

M.S. Warren, J.K. Salmon, D.J. Becker, M.P. Goda, T. Sterling & G.S. Winckelmans, "Pentium Pro inside: I. A treecode at 430 Gigaflops on ASCI Red," Bell Prize winning paper at SC'97, 1997.