Terascale on the desktop: Fast Multipole Methods on Graphical Processors


1 Terascale on the desktop: Fast Multipole Methods on Graphical Processors. Nail A. Gumerov, Fantalgo, LLC, and Institute for Advanced Computer Studies, University of Maryland (joint work with Ramani Duraiswami). This work has been supported by NASA. Presented on September 25, 2007 at the Numerical Analysis Seminar, UMD, College Park.

2 Outline: Introduction; General purpose programming on graphical processors (GPU); Fast Multipole Method (FMM); FMM on GPU; Conclusions.

3 Introduction: Large computational tasks; Moore's law; Graphical processors (GPU).


4 Large problems. Example 1: Sound scattering from complex shapes. Problem: Boundary value problem for the Helmholtz equation in a complex 3D domain for a range of frequencies (e.g. 200 frequencies from 20 Hz to 20 kHz). Mesh: elements vertices KEMAR Mesh: elements, vertices

5 Large problems. Example 2: Stellar dynamics. Problem: Compute the dynamics of a star cluster (solve a large system of ODEs). Info: A galaxy like the Milky Way has on the order of 100 billion stars and evolves over billions of years.

6 Large problems. Example 3: Imaging (Medical, Geo, Weather, etc.), Computer Vision and Graphics. Problems: 3D, 4D (and higher-dimensional) interpolation of scattered data; discrete transforms; data compression and representation; and much more.

7 Moore's law. In 1965, Intel cofounder Gordon Moore saw the future. His prediction, now popularly known as Moore's Law, states that the number of transistors on a chip doubles about every two years. Other versions: every 18 months, every X months.

8 Some other laws. Wirth's law: Software is decelerating faster than hardware is accelerating. Gates' law: The speed of commercial software generally slows by fifty percent every 18 months (never formulated explicitly by Bill Gates; it rather relates to Microsoft products).

9 Graphical processors (GPU). GPU RSX at 550 MHz: 1.8 teraflops floating point performance.


10 Graphical processors (GPU). NVIDIA GeForce 8800 GTX (July 2007, $500). The NVIDIA Tesla C870 GPU Computing processor and D870 Deskside Supercomputer will be available in October.

GPU          Memory   GFLOPS
8800 GTX     768 MB   330
C870 Tesla   1.5 GB   500

11 General purpose programming on GPU (GPGPU): Challenges; Programming languages; NVIDIA and CUDA; Math libraries (CUBLAS, CUFFT); Our middleware library; Examples of programming using the middleware library.

12 Challenges
SIMD (single instruction, multiple data) semantics: common algorithms must be mapped to SIMD; performance degrades heavily for operations on structured data; optimal sizing depends on the GPU architecture (e.g. 256 threads per block, 32n blocks, 8 processors, 16 multiprocessors).
Lack of high-level language compilers and environments friendly to scientific programming: a scientific programmer must handle many low-level programming issues; debugging is difficult.
Elementary computing that is unusual for scientific programming: accuracy and error handling not compliant with the IEEE standards; native single-precision float computations; lack of many basic math functions.
Substantial differences in access speed between memory types: little local (cache or fast shared) memory; algorithms must be redesigned to take into account several types of memory (size and access speed).
Skillful programming is needed to realize the full potential of the hardware (learning, training, experience, etc.).

13 Programming languages OpenGL, ActiveX, Direct3D, ; Cg; Cu.

14 Compute Unified Device Architecture (CUDA) of NVIDIA. Ideology: The main flow of the algorithm is realized on the CPU with conventional C or C++ programming; high-performance functions are implemented in Cu and precompiled to objects (analogous to dynamically linked libraries); an API is used to communicate with the GPU: allocate/deallocate GPU global memory; transfer data between the GPU and CPU; manage global variables on the GPU; call global functions implemented in Cu; several high-performance functions, including FFT and BLAS functions, are implemented and callable as library functions.

15 Math libraries (CUBLAS and CUFFT). Callable from C and C++; CUBLAS is supplied with a Fortran wrapper; CUFFT can also be easily wrapped, e.g. to be called from Fortran.

16 Our middleware library (main concepts). Global algorithm instructions should be executed on the CPU, while data should reside on the GPU (minimize data exchange between the CPU and GPU); variables allocated in global GPU memory should be transparent to the CPU (easy dereferencing, access to parts of variables); easy handling of different types, sizes, and shapes of variables; short function headers, similar to those in high-level languages (C, Fortran); full use of CUBLAS, CUFFT, and other future math libraries; easy writing of new custom functions and their inclusion in the library.

17 Realization Use structures available in C and Fortran 9x (encapsulation); Use CUDA; Wrap some CUDA and CUBLAS/CUFFT functions; Use customizable modules in Cu for easy modification of existing or writing new functions.

18 Example of fast GPU-ization of a Fortran 90 subroutine using the middleware library. The original Fortran subroutine is from the pseudospectral plasma simulation program of William Dorland.

19 Example: Performance of a code generated using the middleware library. [Figure: 2D MHD simulations (100 time steps in pseudospectral method); time (s) versus N (equivalent grid N x N) for CPU and GPU, with fits y=ax^2 and y=bx^2, a/b=25.] CPU: 2.67 GHz Intel Core 2 Extreme QX 7400 (2 GB RAM and one of four CPUs employed). GPU: NVIDIA GeForce 8800 GTX.

20 Fast Multipole Method (FMM) About Algorithm Data structures Translation theory Complexity and optimizations Example applications

21 About the FMM. Introduced by Rokhlin & Greengard (1987, 1988) for computation of 2D and 3D fields for the Laplace equation; reduces the complexity of the matrix-vector product from $O(N^2)$ to $O(N)$ or $O(N \log N)$ (depending on the data structure); hundreds of publications for various 1D, 2D, and 3D problems (Laplace, Helmholtz, Maxwell, Yukawa potentials, etc.); we taught the first course in the country on FMM fundamentals and applications at the University of Maryland (2002); our reports on the fundamentals of the FMM and our lectures are available online (visit our web pages).

22 About the FMM. Problem: Compute the matrix-vector product. Some kernels: Laplace 3D; Helmholtz 3D; Gaussian nD.
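For reference, the matrix-vector product and the named kernels are commonly written as below. This is a sketch of the standard textbook forms, up to normalization constants that vary between authors. The symbols $\Phi$, $q_i$, $k$, $\sigma$, $N$, and $M$ are assumptions introduced here and are not taken from the slide.

```latex
\[
  v_j = \sum_{i=1}^{N} \Phi(y_j, x_i)\, q_i, \qquad j = 1, \dots, M,
\]
\[
  \text{Laplace 3D:}\ \ \Phi(y, x) = \frac{1}{|y - x|}, \qquad
  \text{Helmholtz 3D:}\ \ \Phi(y, x) = \frac{e^{ik|y - x|}}{|y - x|}, \qquad
  \text{Gaussian } n\text{D:}\ \ \Phi(y, x) = e^{-|y - x|^2/\sigma^2}.
\]
```

Here $x_i$ are the $N$ sources with strengths $q_i$ and $y_j$ are the $M$ receivers, so direct evaluation costs $O(NM)$ operations.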

23 Major principle: Use expansions. Theorem 1: The field of a single source located at $x_0$ can be locally expanded about center $x_*$ into an absolutely and uniformly convergent series in the domain $|y - x_*| < r < R < |x_0 - x_*|$ (local, or R-expansion). Corollary: the same holds for the field of $s$ sources located outside the sphere, $|x_i - x_*| > R$. Theorem 2: The field of a single source located at $x_0$ can be expanded about center $x_*$ into an absolutely and uniformly convergent series in the domain $|y - x_*| > R > r > |x_0 - x_*|$ (multipole, or S-expansion). Corollary: the same holds for the field of $s$ sources located inside the sphere, $|x_i - x_*| < r$. Theorem 3: R- and S-expansions can be translated (change of the basis and/or the expansion center, subject to geometric constraints). E.g. for Laplace kernel:
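One concrete instance of Theorem 2 for the 3D Laplace kernel is the classical Legendre series $1/|y - x_0| = \sum_n (r_0^n / r^{n+1}) P_n(\cos\gamma)$, valid for $r = |y| > r_0 = |x_0|$ with the expansion center at the origin. The following is a minimal sketch of that textbook series, not the formula from the slide; the function name and the sample points are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre


def s_expansion_potential(x0, y, p):
    """Truncated multipole (S) expansion of 1/|y - x0| about the origin.

    Uses p terms of the Legendre series; requires |y| > |x0| > 0.
    """
    x0 = np.asarray(x0, dtype=float)
    y = np.asarray(y, dtype=float)
    r0 = np.linalg.norm(x0)          # distance of the source from the center
    r = np.linalg.norm(y)            # distance of the receiver from the center
    cos_gamma = np.dot(x0, y) / (r0 * r)
    n = np.arange(p)
    coeffs = r0**n / r**(n + 1)      # coefficient of P_n(cos gamma)
    return legendre.legval(cos_gamma, coeffs)


if __name__ == "__main__":
    source = np.array([0.3, 0.1, -0.2])
    receiver = np.array([1.5, -1.0, 2.0])
    exact = 1.0 / np.linalg.norm(receiver - source)
    for p in (4, 8, 12):
        approx = s_expansion_potential(source, receiver, p)
        print(p, abs(approx - exact))
```

The truncation error decays roughly like $(r_0/r)^p$, which is why a modest truncation number is enough when sources and receivers are well separated.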

24 FMM algorithm. The computational domain (an nD cube) is partitioned by a quadtree (2D) or octree (3D). Upward pass: get S-expansions for all boxes, skipping empty ones (S-expansion means singular, or multipole, or far-field expansion). 1. Get S-expansions at Max Level. 2. Get S-expansions for the other levels (use S|S-translations).

25 FMM algorithm. Downward pass: get R-expansions for all boxes, skipping empty ones (R-expansion means regular, or local, or near-field expansion). 1. Get R-expansions from the S-expansions of the boxes in the neighborhood (use S|R-translations). 2. Get R-expansions from the parent (use R|R-translations).

26 FMM algorithm. Final evaluation: evaluate R-expansions for boxes at Max Level and directly sum the contributions of sources in the neighborhood of receivers. 1. Evaluate R-expansions. 2. Direct summation.

27 Data structures. Binary trees, quadtrees, or octrees (1D, 2D, and 3D); Max Level is determined from the clustering parameter; requires building lists of neighbors and a neighborhood data structure (e.g. for fast determination of boxes in the neighborhood of the parent box); requires indexing of sources and receivers in boxes; we do all this serially on the CPU using a bit interleaving technique, sorting, and operations on sets (union, intersection, etc.); the overall complexity of our algorithm is $O(N \log N)$; normally the cost of generating the data structure is lower than that of the run part of the algorithm; additional amortization: in many problems the data structure needs to be set only once, while the run part can be executed many times (iterative solution of a linear system); parallelizing this algorithm is a non-trivial task, which we expect to address in the near future.
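The bit interleaving mentioned above can be sketched as follows. This is a minimal illustration of Morton-style indexing for an octree on the unit cube, assuming points in $[0, 1)^3$; the function names and the bit ordering (x, y, z from most to least significant) are choices made here, not the authors' implementation.

```python
def interleave_bits(ix, iy, iz, level):
    """Interleave the low `level` bits of three integer box coordinates."""
    index = 0
    for bit in range(level):
        index |= ((ix >> bit) & 1) << (3 * bit + 2)
        index |= ((iy >> bit) & 1) << (3 * bit + 1)
        index |= ((iz >> bit) & 1) << (3 * bit)
    return index


def box_index(point, level):
    """Index of the level-`level` octree box containing a point of [0, 1)^3."""
    n = 1 << level                                   # boxes per dimension
    ix, iy, iz = (min(int(c * n), n - 1) for c in point)
    return interleave_bits(ix, iy, iz, level)


def parent(index):
    """Parent box index: drop the last three interleaved bits."""
    return index >> 3


def children(index):
    """Indices of the eight children of a box."""
    return [(index << 3) | c for c in range(8)]


if __name__ == "__main__":
    points = [(0.30, 0.70, 0.55), (0.31, 0.71, 0.56), (0.90, 0.10, 0.20)]
    # Sorting by box index groups the points that share a box.
    for p in sorted(points, key=lambda q: box_index(q, 3)):
        print(box_index(p, 3), p)
```

With this indexing, parent and child relations reduce to bit shifts, and sorting the sources by box index gives contiguous per-box lists, which is what makes an $O(N \log N)$ setup natural.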

28 Translation theory. Standard translation method: apply a $p^2 \times p^2$ matrix to a $p^2$ vector of expansion coefficients, $O(p^4)$ complexity. There exist $O(p^3)$ methods: currently we use the RCR-decomposition of the translation operators (Rotation, Coaxial Translation, Back Rotation); sparse matrix decomposition (a bit slower, but needs less local memory). There exist $O(p^2)$ or $O(p^2 \log p)$ methods based on diagonal forms of translation operators: Greengard-Rokhlin exponential forms for a truncated conical domain (require some $O(p^3)$ transforms); FFT-based methods (large asymptotic constants); our own $O(p^2)$ diagonal form method (problems with numerical stability, especially for single precision). Diagonal forms require larger function representations (samplings on a grid) than spectral expansions, and the effective asymptotic constants are larger than for the RCR. For relatively low $p$, which is sufficient for single precision ($p = 4, \dots, 12$), the RCR-method is comparable in speed with the fastest $O(p^2)$ methods.

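29 Translation theory (RCR-decomposition). From group theory it follows that a general translation can be reduced to a rotation, a coaxial translation, and a back rotation. [Figure: a general translation of cost $p^4$ replaced by three steps of cost $p^3$ each.]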

30 Translation theory Reduction of translation complexity by using translation stencils and variable truncation number: e.g. S-expansions from the shaded boxes can be translated to the center of the parent box, with the same error bound as from the white box to the child. Also for each box its own truncation number can be used.

31 Complexity of the FMM. Operations that do not depend on the number of boxes (generation of S-expansions and evaluation of R-expansions): complexity $\sim AN$ (assuming that the numbers of sources and receivers are of the same order). Translation operations, which do not depend on the number of sources/receivers (only on the number of boxes): complexity $B N_{boxes} \sim BN/s$ ($s$ is the clustering parameter). Direct summation, which depends on the number of sources/receivers and the number of sources in the neighborhood of receivers: complexity $\sim CNs$. Total complexity: $Cost = Cost_{exp} + Cost_{trans} + Cost_{dir} \sim AN + BN/s + CNs$.

32 Optimization of the FMM (uniform distributions). Total complexity: $Cost(s) = AN + BN/s + CNs$. Optimal clustering parameter: $s_{opt} = (B/C)^{1/2}$. Optimal Max Level of the octree: $l_{max} = \log_8(N/s_{opt})$.
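The optimization above can be checked numerically. The following sketch evaluates the cost model and the resulting $s_{opt}$ and $l_{max}$; the constants A, B, and C used in the example are made-up illustrative values, not measurements from the talk, and rounding $l_{max}$ to the nearest integer is a choice made here.

```python
import math


def fmm_cost(n, s, a, b, c):
    """Cost model from the slide: expansions + translations + direct sums."""
    return a * n + b * n / s + c * n * s


def optimal_clustering(n, b, c):
    """Return (s_opt, l_max) for a uniform distribution in an octree."""
    s_opt = math.sqrt(b / c)                      # minimizes b*n/s + c*n*s
    l_max = max(0, round(math.log(n / s_opt, 8)))  # nearest integer level
    return s_opt, l_max


if __name__ == "__main__":
    n = 8**5 * 10
    a, b, c = 1.0, 100.0, 1.0   # hypothetical hardware-dependent constants
    s_opt, l_max = optimal_clustering(n, b, c)
    print("s_opt =", s_opt, "l_max =", l_max)
    for s in (s_opt / 2, s_opt, 2 * s_opt):
        print("s =", s, "cost =", fmm_cost(n, s, a, b, c))
```

Because the translation cost falls as $1/s$ while the direct summation cost grows as $s$, a cheaper direct summation (smaller C) pushes $s_{opt}$ up and $l_{max}$ down.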

33 Example Applications. Boundary Element Method accelerated by a GMRES/FMM solver. Potential external Dirichlet and Neumann problems: 1000 randomly oriented ellipsoids; 488,000 vertices and 972,000 elements.

34 Example Applications. [Figure: BEM Dirichlet problem for ellipsoids (3D Laplace), FMM with p = 8; total CPU time (s) versus number of vertices N for GMRES+FMM, GMRES+Low Mem Direct, GMRES+High Mem Direct, and LU-decomposition, with reference fits y=ax, y=bx^2, y=cx^3 and the O(N^2) memory threshold marked.]

35 Example Applications. [Figure: Single matrix-vector multiplication (3D Laplace, FMM with p = 8); CPU time (s) versus number of vertices N for FMM (y=ax), multiplication of a stored matrix, and direct summation with matrix entries computation (y=bx^2); the number of GMRES iterations is also shown.]

36 Example Applications. Sound pressure: kd=0.96 (250 Hz), kd=9.6 (2.5 kHz), kd=96 (25 kHz). BEM for the Helmholtz equation (FGMRES/FMM). Mesh: 132,072 elements, 65,539 vertices.

37 FMM on GPU Challenges Effect of GPU architecture on FMM complexity and optimization Accuracy Performance

38 Challenges. Complex FMM data structure; the problem is not native to SIMD semantics: non-uniformity of data causes problems with efficient workload distribution (taking into account the large number of threads); serial algorithms use recursive computations; existing libraries (CUBLAS) and the middleware approach are not sufficient; high-performance FMM functions should be redesigned and written in Cu; little fast (shared/constant) memory for efficient implementation of translation operators; absence of good debugging tools for the GPU.

39 High performance direct summation on GPU (total). [Figure: Computations of potential (3D Laplace); time (s) versus number of sources for CPU direct and GPU direct summation, with fits y=ax^2, y=bx^2, and y=bx^2+cx+d, b/a=600.] CPU: 2.67 GHz Intel Core 2 Extreme QX 7400 (2 GB RAM and one of four CPUs employed). GPU: NVIDIA GeForce 8800 GTX (peak 330 GFLOPS). Estimated achieved rate: 190 GFLOPS. CPU direct: serial code; no use of partial caching; no loop unrolling (simple execution of a nested loop).
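The direct summation being timed here is the plain $O(N^2)$ evaluation of the potential. A minimal NumPy sketch of that reference computation for the 3D Laplace kernel $1/|y - x|$ is given below; it is a CPU illustration only, not the authors' GPU code, and the function name and the convention of skipping coincident points are assumptions.

```python
import numpy as np


def direct_potential(sources, charges, receivers):
    """Potential at each receiver: sum_i q_i / |y_j - x_i| (O(N*M) work).

    Contributions from a source that coincides with the receiver are skipped.
    """
    sources = np.asarray(sources, dtype=float)
    charges = np.asarray(charges, dtype=float)
    receivers = np.asarray(receivers, dtype=float)
    diff = receivers[:, None, :] - sources[None, :, :]   # shape (M, N, 3)
    dist = np.linalg.norm(diff, axis=2)                  # shape (M, N)
    with np.errstate(divide="ignore"):
        inv = np.where(dist > 0.0, 1.0 / dist, 0.0)      # drop self terms
    return inv @ charges


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random((1000, 3))
    q = rng.random(1000)
    phi = direct_potential(x, q, x)   # receivers coincide with sources
    print(phi[:3])
```

The dense (M, N) distance matrix makes the quadratic cost explicit; a production version would process receivers in blocks to bound memory use.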

40 Direct summation on GPU (final step in the FMM). [Figure: Computations of potential, optimal settings for CPU (sparse matrix-vector product, 3D Laplace, $s_{max} = 60$, $l_{max} = 5$); time (s) versus number of sources for CPU and GPU, with fits y=ax^2, y=bx, y=cx, and $y = 8^{-l_{max}} d x^2 + e x + 8^{l_{max}} f$, b/c=16.] CPU: $Time = CNs$, $s = 8^{-l_{max}} N$. GPU: $Time = A_1 N + B_1 N/s + C_1 N s$, where the terms correspond to read/write, access to box data, and float computations, respectively. These parameters depend on the hardware.

41 Direct summation on GPU (final step in the FMM). Compare the GPU final summation complexity, $Cost = A_1 N + B_1 N/s + C_1 N s$, with the total FMM complexity, $Cost = AN + BN/s + CNs$. The optimal cluster size for the direct summation step of the FMM is $s_{opt} = (B_1/C_1)^{1/2}$, and this can only increase for the full algorithm, since its complexity is $Cost = (A + A_1)N + (B + B_1)N/s + C_1 N s$, giving $s_{opt} = ((B + B_1)/C_1)^{1/2}$.
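The optimal cluster size for the full algorithm follows from setting the derivative of the cost with respect to $s$ to zero. The short derivation below uses only the symbols of this slide and assumes all constants are non-negative.

```latex
\[
  \frac{d\,\mathrm{Cost}}{ds}
    = -\frac{(B + B_1)\,N}{s^2} + C_1 N = 0
  \quad\Longrightarrow\quad
  s_{opt} = \left(\frac{B + B_1}{C_1}\right)^{1/2}
          \;\ge\; \left(\frac{B_1}{C_1}\right)^{1/2},
\]
\[
  \mathrm{Cost}(s_{opt}) = (A + A_1)\,N + 2N\sqrt{(B + B_1)\,C_1}.
\]
```

The inequality holds because $B \ge 0$, which is why adding the translation cost can only enlarge the optimal cluster size.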

42 Direct summation on GPU (final step in the FMM). [Figure: Computations of potential, optimal settings for GPU (sparse matrix-vector product, 3D Laplace, $s_{max} = 320$, $l_{max} = 4$); time (s) versus number of sources for CPU and GPU, with fits y=ax^2, y=bx, y=cx, and $y = 8^{-l_{max}} d x^2 + e x + 8^{l_{max}} f$, b/c=300.]

43 Direct summation on GPU (final step in the FMM). Computations of potential, optimal settings for CPU and GPU. [Table: N, Serial CPU (s), GPU (s), Time Ratio.]

44 Important conclusion: Since the optimal max level of the octree is lower when using the GPU than for the CPU, the importance of optimizing the translation subroutines diminishes.

45 Other steps of the FMM on GPU Accelerations in range 5-60; Effective accelerations for N=1,048,576 (taking into account max level reduction):

46 Accuracy. Relative $L_2$-norm error measure; CPU single-precision direct summation was taken as "exact"; 100 sampling points were used.
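A relative $L_2$-norm error of this kind is commonly defined as below. This is the usual textbook definition, given as an assumption about the measure meant here; the symbols $\epsilon_2$, $\phi$, $\phi_{exact}$, and $M$ are introduced for illustration.

```latex
\[
  \epsilon_2 =
  \left(
    \frac{\sum_{j=1}^{M} \left| \phi(y_j) - \phi_{exact}(y_j) \right|^2}
         {\sum_{j=1}^{M} \left| \phi_{exact}(y_j) \right|^2}
  \right)^{1/2},
\]
```

where $y_j$ are the $M$ sampling points, $\phi$ is the computed potential, and $\phi_{exact}$ is the reference solution.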

47 What is more accurate for solution of large problems on GPU: direct summation or FMM? [Figure: Error in computations of potential; relative L2 error versus number of sources for direct summation and for the FMM with p=4, p=8, and p=12; filled markers = GPU, empty markers = CPU.] Error computed over a grid of 729 sampling points, relative to the exact solution, which is direct summation with double precision. Possible reason why the GPU error in direct summation grows: systematic roundoff error in computation of the function 1/sqrt(x) (still an open question).
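The kind of comparison described above can be reproduced on a CPU by running the same direct summation in single and double precision and measuring the relative L2 error between them. The sketch below uses NumPy with random test data; it illustrates ordinary single-precision roundoff only and does not model the GPU-specific 1/sqrt(x) behavior. All names and problem sizes are hypothetical.

```python
import numpy as np


def relative_l2_error(approx, exact):
    """Relative L2-norm error of `approx` with respect to `exact`."""
    approx = np.asarray(approx, dtype=np.float64)
    exact = np.asarray(exact, dtype=np.float64)
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)


def potential(sources, charges, receivers, dtype):
    """Direct 1/r summation carried out entirely in the given precision."""
    s = sources.astype(dtype)
    q = charges.astype(dtype)
    r = receivers.astype(dtype)
    diff = r[:, None, :] - s[None, :, :]
    dist = np.sqrt(np.sum(diff * diff, axis=2))
    return (q[None, :] / dist).sum(axis=1)


def single_vs_double_error(n_sources=2000, n_receivers=100, seed=0):
    """Error of float32 summation, taking float64 summation as exact."""
    rng = np.random.default_rng(seed)
    sources = rng.random((n_sources, 3))
    charges = rng.random(n_sources)
    # Receivers are shifted away from the source cube so no distance is zero.
    receivers = rng.random((n_receivers, 3)) + 2.0
    phi32 = potential(sources, charges, receivers, np.float32)
    phi64 = potential(sources, charges, receivers, np.float64)
    return relative_l2_error(phi32, phi64)


if __name__ == "__main__":
    print(single_vs_double_error())
```

Repeating the measurement for increasing numbers of sources shows how accumulated roundoff in the summation depends on N for a given precision.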

48 Performance. N=1,048,576 (potential only): serial CPU time, GPU time, and their ratio for three values of p; the ratios are 33, 56, and 48. N=1,048,576 (potential+forces (gradient)): serial CPU time, GPU time, and their ratio for p=4, p=8, and p=12.

49 Performance. [Figure: three panels (p=4, p=8, p=12), 3D Laplace; run time (s) versus number of sources for CPU Direct (y=ax^2), GPU Direct (y=bx^2), CPU FMM (y=cx), and GPU FMM (y=dx); a/b = 600 in all panels; c/d = 30 for p=4 and c/d = 50 for p=8 and p=12; each panel is annotated with the FMM error.]

50 Performance. Computations of the potential and forces: the peak performance of the GPU for direct summation is 290 Gigaflops, while for the FMM on GPU effective rates in the Teraflops range are observed (following the citation below). M.S. Warren, J.K. Salmon, D.J. Becker, M.P. Goda, T. Sterling & G.S. Winckelmans. Pentium Pro inside: I. a treecode at 430 Gigaflops on ASCI Red, Bell Prize winning paper at SC'97.

51 Conclusions What do we have What is next

52 What do we have: Some insight into and methods of programming on GPU; a high-performance FMM for the 3D Laplace kernel running in a 1 CPU - 1 GPU configuration; results that encourage us to continue.

53 What is next: Applications of the FMM matrix-vector multiplier to the solution of physics-based and engineering problems (particle/molecular dynamics, boundary element methods, RBF interpolation, etc.); mapping the algorithms that generate FMM data structures onto the GPU; developing the FMM for larger CPU/GPU clusters and tackling big problems; some research to adjust the algorithms for the GPU, particularly more efficient use of shared/constant memory; FMM on GPU for different kernels (we see a lot of applications); continued work towards simplifying GPU programming for scientists; upgrades in hardware (double precision, larger memory, etc.) are expected.

54 Thank you!

55 Outline. Introduction: Large computational tasks; Moore's law; Graphical processors (GPU). General purpose programming on GPU: Challenges; NVIDIA and CUDA; Math libraries (CUBLAS, CUFFT); Middleware libraries. Fast Multipole Method (FMM): Algorithm; Data structures; Translation theory; Complexity and optimizations. FMM on GPU: Challenges; Effect of GPU architecture on FMM complexity and optimization; Accuracy; Performance. Conclusions: What do we have; What is next.
