CMSC 858M/AMSC 698R. Fast Multipole Methods. Nail A. Gumerov & Ramani Duraiswami. Lecture 20.

Outline

- Two parts of the FMM
- Data structures
- FMM cost/optimization on CPU
- Fine-grain parallelization for multicore nodes
- Graphics processors (GPUs)
- Fast matrix-vector multiplication on GPU with an on-core computable kernel
- Change of the cost/optimization scheme on GPU compared to CPU
- Other steps
- Test results: profile, accuracy, overall performance
- Heterogeneous algorithm

Publication

The major part of this work has been published:
- N.A. Gumerov and R. Duraiswami. Fast multipole methods on graphics processors. Journal of Computational Physics, 227, 8290-8313, 2008.
- Q. Hu, N.A. Gumerov, and R. Duraiswami. Scalable fast multipole methods on distributed heterogeneous architectures. Proceedings of the Supercomputing '11 Conference (SC11), Seattle, November 12-18, 2011. (Nominated for the best student paper award.)

Two parts of the FMM

- Generate the data structure (define the matrix);
- Perform the matrix-vector product (the "run").

Two different types of problems:
- Static matrix (e.g., iterative solution of a large linear system): the data structure is generated once, while the run is performed many times with different input vectors;
- Dynamic matrix (e.g., the N-body problem): the data structure must be regenerated each time a matrix-vector multiplication is needed.

Data Structures

- We need an O(N log N) procedure to index all the boxes and to find children, parents, and neighbors;
- Implemented on the CPU using bit interleaving (Morton-Peano indexing in the octree) and sorting;
- We generate lists of children/parents/neighbors/E4-neighbors and store them in global memory;
- Graduate students Michael Lieberman and Adam O'Donovan implemented the basic data structures on the GPU (prefix sort, etc.) and observed speedups of order 10x over the CPU for large data sets;
- The most recent work on GPU data structures is by Qi Hu (up to 100x speedup compared to the standard CPU code).

In the present lecture we focus on the run algorithm, assuming that the necessary lists have been generated and stored in global memory.
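As an illustration of the bit-interleaved (Morton-Peano) indexing mentioned above, here is a generic sketch; the 10-bits-per-dimension packing (octree levels up to 10) is an assumption, and this is not the authors' actual code.

```cuda
// Sketch of Morton (bit-interleaved) box indexing for an octree, used to order
// boxes and find parents/children.  Generic illustration; 10 bits per
// dimension (maximum level 10) is assumed.
#include <cstdint>

// Spread the lower 10 bits of v so that they occupy every third bit position.
__host__ __device__ inline uint32_t spread_bits(uint32_t v) {
    v &= 0x3FF;                       // keep 10 bits
    v = (v | (v << 16)) & 0x030000FF;
    v = (v | (v <<  8)) & 0x0300F00F;
    v = (v | (v <<  4)) & 0x030C30C3;
    v = (v | (v <<  2)) & 0x09249249;
    return v;
}

// Morton index of the box with integer grid coordinates (ix, iy, iz) at a given level.
__host__ __device__ inline uint32_t morton_index(uint32_t ix, uint32_t iy, uint32_t iz) {
    return spread_bits(ix) | (spread_bits(iy) << 1) | (spread_bits(iz) << 2);
}

// Parents and children follow directly from the index (2^d-tree property).
__host__ __device__ inline uint32_t parent(uint32_t m)       { return m >> 3; }
__host__ __device__ inline uint32_t child(uint32_t m, int k) { return (m << 3) | k; } // k = 0..7
```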

FMM on CPU

Flow Chart of the FMM Run

Upward pass:
1. Get S-expansion coefficients directly from the sources (level l_max);
2. Get S-expansion coefficients from children via S|S translation (levels l_max-1, ..., 2).

Downward pass:
3. Get R-expansion coefficients from far neighbors via S|R translation (level 2);
4. Get R-expansion coefficients from far neighbors via S|R translation (levels 3, ..., l_max);
5. Get R-expansion coefficients from parents via R|R translation (levels 3, ..., l_max).

Final summation (level l_max):
6. Evaluate R-expansions directly at the receivers;
7. Sum the sources in the close neighborhood directly.

Steps 1-6 together form the approximate dense matrix-vector multiplication (far field); step 7 is the direct sparse matrix-vector multiplication (near field).

FMM Cost (Operations)

Points are clustered with clustering parameter s, meaning no more than s points in the smallest octree box. Expansions have length p^2 (number of coefficients). There are O(N) points and O(N_boxes) = O(N/s) boxes in the octree. Per-step operation counts:
- Step 1: O(N p^2)
- Step 2: O(N_boxes p^3)
- Step 3: O(p^3)
- Step 4: O(N_boxes p^3)
- Step 5: O(N_boxes p^3)
- Step 6: O(N p^2)
- Step 7: O(N s)
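In code, the run stage can be sketched as a host-side driver over the seven steps; all function names below are hypothetical placeholders, not the lecture's actual routines.

```cuda
// Hypothetical skeleton of the FMM run stage (steps 1-7 above), assuming the
// octree and neighbor/E4 lists are already built and stored in memory.
struct Octree;   // octree + neighbor/E4 lists from the data-structure stage

void get_S_expansions(const Octree&, int level, const float* strengths);
void translate_SS(const Octree&, int level);
void translate_SR(const Octree&, int level);
void translate_RR(const Octree&, int level);
void evaluate_R_expansions(const Octree&, int level, float* phi);
void add_local_direct_sum(const Octree&, int level, const float* strengths, float* phi);

void fmm_run(const Octree& tree, int l_max, const float* strengths, float* phi) {
    get_S_expansions(tree, l_max, strengths);          // step 1
    for (int l = l_max - 1; l >= 2; --l)               // step 2: upward pass
        translate_SS(tree, l);
    translate_SR(tree, 2);                             // step 3
    for (int l = 3; l <= l_max; ++l) {                 // levels 3..l_max
        translate_SR(tree, l);                         // step 4: S|R from far neighbors
        translate_RR(tree, l);                         // step 5: R|R from parents
    }
    evaluate_R_expansions(tree, l_max, phi);           // step 6: far field
    add_local_direct_sum(tree, l_max, strengths, phi); // step 7: near field
}
```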

FMM Cost/Optimization (operations)

Since p = O(1), we have

    TotalCost = A*N + B*N_boxes + C*N*s = N*(A + B/s + C*s),

which is minimized with respect to the cluster size at s_opt = (B/C)^(1/2).
[Figure: total cost vs. s; the B/s term decays, the C*s term grows, and their sum has a minimum at s_opt.]

Fine-Grain Parallelization for Multicore Nodes

Serial algorithm: constructor (allocate global memory), Step 1 (Loop 1), ..., Step n (Loop n), destructor (deallocate).
Parallel algorithm: constructor (allocate global memory), then the loop of each step is executed by a team of threads (fork parallelization, e.g. threads 1-4 on a quad-core node), destructor (deallocate).
- Fork parallelization with OpenMP (C++, Fortran 9x); see the sketch below;
- Also auto-vectorization of loops (SSE);
- Threads cannot write to the same memory address!
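A minimal OpenMP sketch of the fork parallelization above, assuming C++; the point is that each thread writes only to its own output entries, so no two threads ever write the same address. Illustration only, not the lecture's code.

```cuda
// Fork parallelization of one FMM step: each loop iteration (e.g., one box)
// writes only to its own slot of the output array.
#include <omp.h>
#include <vector>

void parallel_step(const std::vector<float>& input, std::vector<float>& output) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < static_cast<int>(input.size()); ++i) {
        // ... work for box/point i (reads may be shared, writes are private) ...
        output[i] = 2.0f * input[i];   // placeholder for the real per-box work
    }
}
```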

Some test results for OMP parallelization (4 cores)

GPUs

A Quick Introduction to the GPU

- The graphics processing unit (GPU) is a highly parallel, multithreaded, many-core processor with high computational power and memory bandwidth;
- The GPU is designed for single-instruction, multiple-data (SIMD) computation: more transistors are devoted to processing than to data caching and flow control;
- [Figure: schematic CPU (a few cores, large control logic and cache, DRAM) vs. GPU (hundreds of cores, small control/cache, DRAM).]
- Example: NVIDIA Tesla C2050: 1.25 Tflops single precision, 0.52 Tflops double precision, 448 cores.

Is It Expensive?

- Any PC has a GPU, which probably performs faster than its CPU;
- GPUs with teraflops performance are used in game consoles;
- Tens of millions of GPUs are produced each year;
- The price of one good GPU is in the range $200-500;
- Prices for the most advanced NVIDIA GPUs for general-purpose computing (e.g., Tesla C2050) are in the range $1K-$2K;
- A modern research supercomputer with several GPUs can be purchased for a few thousand dollars;
- GPUs provide the best Gflops/$ ratio, and also the best Gflops/watt.

Floating-Point Operations for CPU and GPU

[Figure: floating-point operations per second, CPU vs. GPU, over recent hardware generations.]

Is It Easy to Program a GPU?

For inexperienced GPU programmers:
- MATLAB Parallel Toolbox;
- Middleware to program the GPU from Fortran: relatively easy to incorporate into existing codes, developed by the authors at UMD, free (available online).

For advanced users:
- CUDA, a C-like programming language;
- Math libraries are available; custom functions can be implemented;
- Requires careful memory management;
- Free (available online).

Memory hierarchy: local (fast/shared) memory ~50 KB, GPU global memory ~1-4 GB, host memory ~4-128 GB.
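To make the memory hierarchy concrete, here is a minimal generic CUDA example of the usual host-side pattern (allocate global memory, copy inputs in, launch a kernel, copy results back); it illustrates plain CUDA, not the FLAGON middleware interface.

```cuda
// Host/global-memory traffic pattern: allocate GPU global memory, copy inputs
// from host memory, launch a kernel, and copy the result back.
#include <cuda_runtime.h>
#include <vector>

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;              // one thread per element (SIMD style)
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n, 1.0f);     // host memory

    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float)); // GPU global memory
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return 0;
}
```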

Availability

The Fortran-9X version has been released for free public use. It is called FLAGON.
http://flagon.wiki.sourceforge.net/

FMM on GPU

Challenges

- Complex FMM data structure;
- The problem is not native to SIMD semantics: non-uniformity of the data makes it hard to load the large number of threads efficiently, and the serial algorithms use recursive computations;
- Existing libraries (CUBLAS) and the middleware approach are not sufficient: high-performance FMM functions must be redesigned and written in CUDA (.cu);
- Little fast (shared/constant) memory for an efficient implementation of the translation operators;
- Absence of good debugging tools for the GPU.

Implementation

- The main algorithm is in Fortran 95;
- Data structures are computed on the CPU;
- The interface between Fortran and the GPU is provided by our middleware (FLAGON), a layer between Fortran and C++/CUDA;
- High-performance custom subroutines are written in CUDA.

Fast Kernel Matrix-Vector Multiplication on GPU

High-performance direct summation on GPU (total)

[Figure: wall-clock time vs. number of sources (10^3 to 10^6) for direct computation of the 3D Laplace potential. Both the CPU and GPU times grow quadratically with N (the GPU curve fits y = b*x^2 + c*x + d), and the ratio of CPU to GPU direct-summation time is about 600 at large N.]
- CPU: 2.67 GHz Intel Core 2 Extreme QX 7400 (2 GB RAM; one of four CPU cores employed);
- GPU: NVIDIA GeForce 8800 GTX (peak 330 GFLOPS);
- Estimated achieved rate on the GPU: 190 GFLOPS;
- CPU direct: serial code, no partial caching, no loop unrolling (simple execution of the nested loop).

Algorithm for direct summation on GPU (kernel matrix multiplication)

- The receivers are partitioned among threads: the i-th thread reads its receiver data from global GPU memory into thread registers with a coalesced (fast) read;
- The source data are processed in blocks: each block of source data is copied from global GPU memory into the shared (fast) memory with a coalesced read;
- For each source j in the shared-memory block, the i-th thread computes and adds the contribution of the j-th source to the sum it handles;
- After all source blocks have been processed, each thread writes its result back to global GPU memory.
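A minimal CUDA sketch of this tiled scheme for the 3D Laplace potential kernel; the block size, the float4 packing (x, y, z, strength), and all names are illustrative assumptions rather than the lecture's code. A kernel of this general form is what sustains the ~190 GFLOPS direct-summation rate quoted above.

```cuda
// Shared-memory tiled direct summation for the 3D Laplace kernel
// (potential of n_src sources at n_rec receivers).  Assumes blockDim.x == BLOCK.
#define BLOCK 256

__global__ void direct_sum(const float4* __restrict__ sources, int n_src,
                           const float4* __restrict__ receivers, int n_rec,
                           float* __restrict__ potential) {
    __shared__ float4 tile[BLOCK];                 // one block of source data

    int i = blockIdx.x * blockDim.x + threadIdx.x; // receiver handled by this thread
    float4 r = (i < n_rec) ? receivers[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float phi = 0.0f;

    for (int base = 0; base < n_src; base += BLOCK) {
        // Coalesced copy of one block of sources into shared (fast) memory.
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n_src) ? sources[j]
                                        : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();

        // Each thread accumulates the contribution of every source in the tile.
        int count = min(BLOCK, n_src - base);
        for (int k = 0; k < count; ++k) {
            float dx = r.x - tile[k].x;
            float dy = r.y - tile[k].y;
            float dz = r.z - tile[k].z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 > 0.0f) phi += tile[k].w * rsqrtf(r2);  // q_j / |r_i - r_j|
        }
        __syncthreads();
    }

    if (i < n_rec) potential[i] = phi;             // write result to global memory
}
```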

Change of Cost/Optimization on GPU

Algorithm modification for direct summation in the FMM

1. Each block of threads executes the computations for one receiver box (block-per-box parallelization; see the sketch below);
2. Each block of threads accesses the FMM data structure;
3. The loop over all sources contributing to the result requires a partially non-coalesced read of the source data (box by box).

Direct summation on GPU (final step in the FMM)

[Figure: time of the sparse matrix-vector product (final FMM summation step) vs. number of sources (10^2 to 10^7), 3D Laplace potential, with the octree settings optimal for the CPU (s_max = 60, l_max = 5). At these settings the GPU is roughly 16x faster than the CPU for this step.]
- CPU: Time = C*N*s, with s = 8^(-l_max)*N;
- GPU: Time = A_1*N + B_1*N/s + C_1*N*s, where A_1 accounts for read/write of the point data, B_1 for access to the box data, and C_1 for the floating-point computations; these parameters depend on the hardware. For fixed l_max the GPU curve correspondingly fits y = 8^(-l_max)*d*x^2 + e*x + 8^(l_max)*f (d, e, f corresponding to C_1, A_1, B_1).
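A hedged sketch of the block-per-box modification above; the CSR-style data-structure layout (box_point_start, neighbor_start, neighbor_list) is an assumption for illustration, not the lecture's actual structures.

```cuda
// Block-per-box direct summation: one thread block handles one receiver box
// and loops over the source boxes in its close neighborhood.
__global__ void direct_sum_fmm(const float4* sources, const float4* receivers,
                               const int* box_point_start,   // points of box b: [start[b], start[b+1])
                               const int* neighbor_start,    // neighbor list of box b (CSR offsets)
                               const int* neighbor_list,
                               float* potential) {
    int box = blockIdx.x;                                    // one block per receiver box
    int r_begin = box_point_start[box], r_end = box_point_start[box + 1];

    // Each thread takes one receiver of this box (s <= blockDim.x assumed).
    int i = r_begin + threadIdx.x;
    float4 r = (i < r_end) ? receivers[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float phi = 0.0f;

    // Loop over the source boxes in the neighborhood (box-by-box reads,
    // only partially coalesced).
    for (int nb = neighbor_start[box]; nb < neighbor_start[box + 1]; ++nb) {
        int src_box = neighbor_list[nb];
        for (int j = box_point_start[src_box]; j < box_point_start[src_box + 1]; ++j) {
            float4 s = sources[j];
            float dx = r.x - s.x, dy = r.y - s.y, dz = r.z - s.z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 > 0.0f) phi += s.w * rsqrtf(r2);
        }
    }
    if (i < r_end) potential[i] = phi;
}
```

In practice each source box would also be staged through shared memory, as in the previous sketch; this is omitted here for brevity.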

Optimization

[Figure: schematic of the total run time as the sum of the sparse (direct) and dense (translation) matrix-vector product times, as a function of the cluster size s, for the CPU and for the GPU. On the GPU the sparse part has an additional overhead term that decays with s, so the minimum of the total time shifts to a larger s_opt.]

Direct summation on GPU (final step in the FMM)

Compare the GPU final-summation complexity

    Cost = A_1*N + B_1*N/s + C_1*N*s

with the total FMM complexity

    Cost = A*N + B*N/s + C*N*s.

The optimal cluster size for the direct-summation step alone is s_opt = (B_1/C_1)^(1/2), and this can only increase for the full algorithm, since its complexity is

    Cost = (A + A_1)*N + (B + B_1)*N/s + C_1*N*s,

so s_opt = ((B + B_1)/C_1)^(1/2).
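The optimal cluster size quoted above follows from a one-line minimization; the derivation below is standard calculus added for completeness, not from the slides.

```latex
% Minimizing the per-point cost f(s) = A + B/s + C s over the cluster size s:
\[
  \frac{d}{ds}\left(A + \frac{B}{s} + Cs\right) = -\frac{B}{s^{2}} + C = 0
  \quad\Longrightarrow\quad
  s_{\mathrm{opt}} = \sqrt{\frac{B}{C}},\qquad
  \mathrm{Cost}_{\min} = N\left(A + 2\sqrt{BC}\right).
\]
% For the full FMM on the GPU the same argument gives a larger optimum:
\[
  \mathrm{Cost} = N\Big((A+A_1) + \frac{B+B_1}{s} + C_1 s\Big)
  \;\Longrightarrow\;
  s_{\mathrm{opt}} = \sqrt{\frac{B+B_1}{C_1}} \;\ge\; \sqrt{\frac{B_1}{C_1}}.
\]
```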

Direct summation on GPU (final step in the FMM)

[Figure: time of the sparse matrix-vector product vs. number of sources (10^3 to 10^7), 3D Laplace potential, with the settings optimal for the GPU (s_max = 320, l_max = 4). At these settings the GPU is roughly 300x faster than the CPU for this step.]

Computations of the potential, optimal settings for CPU and GPU:

    N          Serial CPU (s)   GPU (s)    Time ratio
    4096       3.51E-02         1.50E-03   23
    8192       1.34E-01         2.96E-03   45
    16384      5.26E-01         7.50E-03   70
    32768      3.22E-01         1.51E-02   21
    65536      1.23E+00         2.75E-02   45
    131072     4.81E+00         7.13E-02   68
    262144     2.75E+00         1.20E-01   23
    524288     1.05E+01         2.21E-01   47
    1048576    4.10E+01         5.96E-01   69

Other steps of the FMM

Important conclusion: since the optimal maximum level of the octree is lower on the GPU than on the CPU, the importance of optimizing the translation subroutines diminishes.

Other steps of the FMM on GPU

- Accelerations in the range 5-60x;
- Effective accelerations for N = 1,048,576 (taking into account the reduction of the maximum level): 30-60x.

Test results

Profile

[Figure: timing profile of the FMM steps.]

Accuracy

Relative L2-norm error measure: the CPU single-precision direct summation was taken as "exact"; 100 sampling points were used.
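The error measure itself is not spelled out on the slide; presumably it is the standard relative L2 norm over the M = 100 sampling points:

```latex
% Relative L2-norm error over M sampling points x_i, with phi_exact taken as
% the (single-precision, CPU) direct summation:
\[
  \epsilon_2 \;=\;
  \left(
    \frac{\sum_{i=1}^{M}\left|\phi(x_i)-\phi_{\mathrm{exact}}(x_i)\right|^{2}}
         {\sum_{i=1}^{M}\left|\phi_{\mathrm{exact}}(x_i)\right|^{2}}
  \right)^{1/2}.
\]
```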

What is more accurate for the solution of large problems on GPU: direct summation or the FMM?

[Figure: relative L2 error in the computed potential vs. number of sources (10^2 to 10^7) for the FMM with p = 4, 8, 12 and for direct summation, on GPU (filled markers) and CPU (empty markers). The error is computed over a grid of 729 sampling points, relative to the "exact" solution, i.e., direct summation in double precision. The FMM errors stay at roughly fixed levels (about 2x10^-4 for p = 4, 10^-5 for p = 8, 10^-6 for p = 12), while the GPU direct-summation error grows with N.]

A possible reason why the GPU error in direct summation grows: a systematic roundoff error in the computation of the function 1/sqrt(x) (still a question).

Performance

N = 1,048,576, potential only:

    p     Serial CPU    GPU        Ratio
    4     22.25 s       0.683 s    33
    8     51.17 s       0.908 s    56
    12    66.56 s       1.395 s    48

N = 1,048,576, potential + forces (gradient):

    p     Serial CPU    GPU        Ratio
    4     28.37 s       0.979 s    29
    8     88.09 s       1.227 s    72
    12    116.1 s       1.761 s    66

Performance

[Figure: three panels (p = 4, 8, 12) of run time vs. number of sources (10^3 to 10^7) for the 3D Laplace potential, comparing CPU direct, CPU FMM, GPU direct, and GPU FMM. The direct-summation curves grow as N^2 and the FMM curves as N; the CPU-to-GPU time ratio is about 600 for direct summation and about 30 (p = 4) to 50 (p = 8, 12) for the FMM. FMM errors: ~2x10^-4 (p = 4), ~10^-5 (p = 8), ~10^-6 (p = 12).]

Performance of the FMM on GPU

Computations of the potential and forces: the peak performance of the GPU for direct summation is 290 Gigaflops, while for the FMM on the GPU effective rates in the range 25-50 Teraflops are observed (effective rates counted as in the citation below).

M.S. Warren, J.K. Salmon, D.J. Becker, M.P. Goda, T. Sterling & G.S. Winckelmans. Pentium Pro inside: I. A treecode at 430 Gigaflops on ASCI Red. Gordon Bell prize winning paper at SC'97, 1997.

Heterogeneous Algorithm

Single-node algorithm, embedded in the time loop of an ODE solver (source/receiver update):

GPU work:
- particle positions and source strengths;
- source S-expansions;
- local direct sum;
- receiver R-expansions;
- final sum of the far-field and near-field interactions.

CPU work:
- data structure (octree and neighbors);
- translation stencils;
- upward pass (S|S translations);
- downward pass (S|R and R|R translations).
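A hedged sketch of how such a single-node heterogeneous loop could be organized, with the local direct sum on the GPU overlapping the CPU translations; all names are hypothetical placeholders, and the actual synchronization scheme of the lecture's code may differ.

```cuda
// Hypothetical single-node heterogeneous time loop: the GPU computes the
// S-expansions and the local direct sum while the CPU performs the
// translations (S|S, S|R, R|R); the GPU then evaluates the R-expansions and
// adds near and far fields.
struct Octree;  struct GpuState;

void cpu_build_data_structure(Octree&, const GpuState&);     // octree, neighbors, stencils
void gpu_get_S_expansions(GpuState&, const Octree&);
void gpu_local_direct_sum_async(GpuState&, const Octree&);   // overlaps with CPU work
void cpu_translations(Octree&);                              // upward S|S, downward S|R, R|R
void gpu_evaluate_R_and_sum(GpuState&, const Octree&);
void gpu_update_positions(GpuState&);                        // ODE step: sources/receivers

void heterogeneous_fmm_loop(Octree& tree, GpuState& gpu, int n_steps) {
    for (int step = 0; step < n_steps; ++step) {
        cpu_build_data_structure(tree, gpu);
        gpu_get_S_expansions(gpu, tree);
        gpu_local_direct_sum_async(gpu, tree);   // GPU: near field ...
        cpu_translations(tree);                  // ... while CPU: far-field translations
        gpu_evaluate_R_and_sum(gpu, tree);       // far field + near field at receivers
        gpu_update_positions(gpu);               // advance the ODE
    }
}
```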

Advantages

- The CPU and the GPU are each tasked with the jobs they do most efficiently;
- Faster translations: the CPU code can be better optimized, which requires complex translation-stencil data structures;
- Faster local direct sum: many cores on the GPU; the same kernel evaluation applied to multiple data (SIMD);
- The CPU is not idle during the GPU-based computations;
- High-accuracy translations without much cost penalty on the CPU;
- Easy visualization: all data reside in GPU RAM;
- Smaller data transfer between the CPU and the GPU.

Single node tests

The billion-size test case

- Using all 32 Chimera nodes;
- Timing excludes the partitioning step;
- A 1-billion-point run with the potential kernel completes in 21.6 s;
- Actual performance: 10-30 TFLOPS;
- To perform the same computation without the FMM in the same time would require about 1 EFLOPS.