Scientific Computing on Graphical Processors: FMM, Flagon, Signal Processing, Plasma and Astrophysics


1 Scientific Computing on Graphical Processors: FMM, Flagon, Signal Processing, Plasma and Astrophysics. Ramani Duraiswami, Computer Science & UMIACS, University of Maryland, College Park. Joint work with Nail Gumerov, Yuancheng Luo, Adam O'Donovan, Bill Dorland, Kate Despain. Partially supported by NASA, DOE, NSF, UMD, NVIDIA.

2 Problem sizes in simulation/assimilation are increasing. Change in paradigm in science: simulate, then test. Fidelity demands larger simulations. Problems being simulated are also much more. Sensors are getting varied and cheaper, and storage is getting cheaper: cameras, microphones. Other large data: text (all the newspapers, books, technical papers); genome data; medical/biological data (X-ray, PET, MRI, ultrasound, electron microscopy); climate (temperature, salinity, pressure, wind, oxygen content).

3 Need fast algorithms, parallel processing, better software. Fast algorithms that improve the asymptotic complexity of operations: FFT, FMM, NUFFT, preconditioned Krylov iterations. Parallel processing can divide the time needed by the number of processors: GPUs, multicore CPUs. Partitioning problems across heterogeneous computing environments: cloud computing. Architecture-aware programming: data structures for parallel architectures and cache optimization.

4 Fast Multipole Methods. Follows from the seminal work of Rokhlin and Greengard (1987). General method for accelerating large classes of dense matrix-vector products. Solve systems, compute eigenvalues, etc. in combination with iterative algorithms. Allows reduction of O(N^2) and O(N^3) operations to linear order. Dr. Gumerov and I are applying it to many areas: acoustics, synthetic beamforming; fluid mechanics (vortex methods, potential flow, Stokes flow); electromagnetic scattering and Maxwell's equations; fast statistics, similarity measures, image processing, segmentation, tracking, learning; non-uniform fast Fourier transforms and reconstruction; elastic registration, fitting thin-plate splines.

5 Decompose the matrix-vector product into a sparse part, taking care of local interactions, and a dense part. FMM replaces pairwise evaluations in the dense part with an upward and downward pass via a hierarchy. Spatial data structures (octrees), associated lists of particles. Figure: MLFMM, source data hierarchy and evaluation data hierarchy, levels 2 to 5.

6 RBF/FMM interpolation to regular spatial grid

7 Helmholtz equation (some other scattering problems were solved). Performance tests. Mesh: 249856 vertices/ elements. kd=29, Neumann problem; kd=144, Robin problem (impedance, sigma=1). Gumerov & Duraiswami, 2006.

8 FMM on GPU. N.A. Gumerov and R. Duraiswami, Fast multipole methods on graphics processors. Journal of Computational Physics, 227. N-body problems: several papers implement them on GPU (but restricted to O(10^5)). To go to O(10^6) and beyond we need the FMM. Challenges: effect of GPU architecture on FMM complexity and optimization; accuracy; performance.

9 Basic FMM flow chart Gumerov & Duraiswami, 2006

10 Direct summation on GPU (final step in the FMM). Computations of potential, optimal settings for CPU. CPU: Time = C N s, s = 8^(-l_max) N; b/c=16. GPU: Time = A_1 N + B_1 N/s + C_1 N s, where the three terms account for read/write, access to box data, and float computations, respectively. These parameters depend on the hardware.

11 Direct summation on GPU. FMM requires a balance between direct summation and the rest of the algorithm. Compare GPU final summation complexity, Cost = A_1 N + B_1 N/s + C_1 N s, and total FMM complexity, Cost = A N + B N/s + C N s. Optimal cluster size for the direct summation step of the FMM: s_opt = (B_1/C_1)^(1/2). This leads to Cost = (A + A_1) N + (B + B_1) N/s + C_1 N s, and s_opt = ((B + B_1)/C_1)^(1/2).
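The balance described on this slide can be checked numerically. The following is a minimal Python sketch of the cost model Cost = a N + b N/s + c N s; the coefficient values are hypothetical and chosen only for illustration, since the real constants depend on the hardware. The function names are likewise illustrative and not taken from the talk.

```python
import math


def fmm_cost(n, s, a, b, c):
    """Cost model from the slide: a*N + b*N/s + c*N*s."""
    return a * n + b * n / s + c * n * s


def optimal_cluster_size(b, c):
    """Setting d/ds (b*N/s + c*N*s) = 0 gives s_opt = sqrt(b/c)."""
    return math.sqrt(b / c)


# Hypothetical coefficients (illustrative only, not measured values).
A, B = 10.0, 300.0               # rest of the FMM
A1, B1, C1 = 2.0, 100.0, 0.25    # direct summation on the GPU

N = 1_048_576


def total_cost(s):
    """Whole FMM with the direct summation step running on the GPU."""
    return fmm_cost(N, s, A + A1, B + B1, C1)


s_direct_only = optimal_cluster_size(B1, C1)   # optimum for the direct step alone
s_total = optimal_cluster_size(B + B1, C1)     # optimum for the whole algorithm

print("s_opt, direct summation alone:", s_direct_only)
print("s_opt, whole FMM:", s_total)
print("total cost at s_opt:", total_cost(s_total))
```

With these made-up coefficients the cluster size that minimizes the whole algorithm is larger than the one that minimizes the direct summation step alone, because the B term of the rest of the FMM also favors larger clusters.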

12 Direct summation on GPU (final step in the FMM). Computations of potential, optimal settings for GPU: b/c=300.

13 Other steps of the FMM on GPU. Accelerations in range 5-60. Effective accelerations for N=1,048,576 (taking into account max level reduction):

14 Accuracy. Relative L2 norm error measure: CPU single precision direct summation was taken as "exact"; 100 sampling points were used.
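The formula for the error measure did not survive on this slide. A minimal Python sketch follows, assuming the standard definition of the relative L2 norm error over the sampling points, sqrt(sum |phi - phi_ref|^2 / sum |phi_ref|^2). The function name and the sample values are illustrative assumptions.

```python
import math


def relative_l2_error(approx, reference):
    """Relative L2 norm error of `approx` against `reference` (assumed definition)."""
    if len(approx) != len(reference):
        raise ValueError("both sequences must have the same length")
    num = sum(abs(a - r) ** 2 for a, r in zip(approx, reference))
    den = sum(abs(r) ** 2 for r in reference)
    if den == 0.0:
        raise ValueError("reference has zero norm")
    return math.sqrt(num / den)


# Illustrative values only: potentials at a few sampling points.
reference = [1.0, 2.0, 3.0, 4.0]
approx = [1.0, 2.0, 3.0, 4.0004]
print(relative_l2_error(approx, reference))
```

Because `abs` is used, the same function also works for complex-valued potentials.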

15 What is more accurate for solution of large problems on GPU: direct summation or FMM? Error computed over a grid of 729 sampling points, relative to exact solution, which is direct summation with double precision. Possible reason why the GPU error in direct summation grows: systematic roundoff error in computation of function 1/sqrt(x). (still a question).

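16 Performance. Table: N=1,048,576 (potential only); columns: serial CPU time (s), GPU time (s), ratio; the ratios are 33, 56 and 48 for the three values of p. Table: N=1,048,576 (potential + forces (gradient)); columns: p=4, p=8, p=12; rows: serial CPU time (s), GPU time (s), ratio.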

17 Performance. Figure panels: p=4, p=8, p=12.

18 Performance. FMM computations of the potential and forces on GPU. Peak performance of the GPU for direct summation is 290 Gigaflops, while for the FMM on GPU effective rates in the range 25-50 Teraflops are observed (following the citation below). M.S. Warren, J.K. Salmon, D.J. Becker, M.P. Goda, T. Sterling & G.S. Winckelmans. Pentium Pro inside: I. A treecode at 430 Gigaflops on ASCI Red. Gordon Bell Prize winning paper at SC'97.

19 Introduction. GPUs are great, as all the previous talks have said, but they require you to program in an extended version of C and need the NVIDIA toolchain. What if you have an application that is in Fortran 9x/2003, Matlab, C/C++, or is too large to fit on the GPU and needs to use the CPU cores, MPI, etc. as part of a larger application, but should still take advantage of the GPU? Offload the computations that have good speedups on the GPU to it, using library calls in your programming environment. Enter the FLAGON: an extensible open source library and a middleware framework that allows use of the GPU. Implemented currently for Fortran-9X, and preliminarily for C++ and MATLAB.

20 Programming on the GPU. The GPU is organized as 2-30 groups of multiprocessors (8 relatively slow processors each) with a small amount of own memory and access to common shared memory. Factor of 100s difference in speed as one goes up the memory hierarchy. To achieve gains, problems must fit the SPMD paradigm and manage memory. Fortunately, many practically important tasks do map well, and we are working on converting others: image and audio processing; some types of linear algebra cores; many machine learning algorithms. Research issues: identifying important tasks and mapping them to the architecture; making it convenient for programmers to call GPU code from host code. Memory hierarchy: local memory ~50 kB; GPU shared memory ~1 GB; host memory ~2-32 GB.

21 Approach to use GPU: Flagon Middleware. Programming from a higher-level language on the CPU (Fortran/C++/Matlab). Defines a module/class that provides pointers on the CPU to device variables on the GPU. Executes small, well-written CU functions to perform primitive operations on the device, avoiding data transfer overhead. Provides wrappers to BLAS, FFT, and other software (random numbers, sort, screen dump, etc.). Allows incorporation of existing mechanisms for distributed programming (OpenMP, MPI, etc.) to handle clusters. Allows relatively easy conversion of existing code.

22 Sample scientific computing applications: radial basis function fitting; plasma turbulence computations; fast multipole force calculation in particle systems; numerical relativity; signal processing; integral equations.

23 FLAGON Framework. Fortran layer: device variables (devvar) communicate with lower levels; Fortran interfaces and wrappers pass parameters to the C/C++ level; may directly call CUBLAS/CUFFT library functions. C/C++ layer: communicates with CUDA kernels; sets up function calls and parameter passing to kernels; module management of external functions. CUDA layer: performs operations on the device. Diagram layers: Fortran level; Fortran-C wrappers/interfaces; C/CUDA (FLAGON functionality, CUBLAS/CUFFT); device kernels.

24 FLAGON Principles. Build a module/class that defines device variables and host pointers to them, and allows their manipulation via functions and overloaded FORTRAN 95 operators. Extensible via CUDA kernels that work with the module: use external CUDA kernel loaders and generic kernel callers. Efficient memory management: data is stored on the device and managed by the host; asynchronous operations are continuously performed on the device; data transfers between host and device are minimized. Integrated libraries: CUBLAS/CUFFT, CUDPP, some new linear algebra cores, small FFT code, random numbers.

25 FLAGON Device Variables. User instantiates device variables in Fortran. Encapsulates parameters and attributes of the data structure transferred between host and device. Tracks (via pointers) allocated memory on the device. Stores data attributes (type and dimensions) on the host and device. FLAGON structure devvar: device pointer (pointer to device memory address); device data type (data type stored on device); device status (allocation status on device); device dimensions (X, Y, Z dimensions of vector or matrix on host); device leading dimensions (X_L, XY_L leading dimensions of vector or matrix on device).

26 FLAGON Work-Cycle. Compile and link the library to user Fortran code. Load the library into memory. Allocate device variables and copy host data to the device. The work-cycle allows subsequent computations to be performed solely on the device. Transfer data from device to host when done. Discard/free data on the device. FLAGON work cycle: (1) Load FLAGON library: specify the GPU device, load the CUBLAS library. (2) Allocate device variable(s): allocates and pads memory on the GPU device. (3) Memory transfer host to device: transfer host data from Fortran to CUDA global memory. (4) Work: call CUBLAS, CUFFT, CUDPP, CUDA functions and perform all calculations on the GPU. (5) Memory transfer device to host: transfer data back from device to host.

27 FLAGON Functions
Initialization functions: open_devobjects, close_devobjects
Memory functions, allocation/deallocation: allocate_dv(chartype, nx, ny, nz), deallocate_dv(devvar)
Memory functions, memory transfer: transfer_[i, r, c]4(hostvar, devvar, c2g), transfer_[i, r, c] (hostvar, devvar, c2g)
Memory functions, memory copy: copy(devvar1,devvar2), function clonedeepwdata(devvara), function clonedeepwodata(devvara)
Memory functions, misc.: swap(devvar1, devvar2), part(devicevariable,i1,i2,j1,j2,k1,k2), get_[i, s, c], set_[i, s, c]
Point-wise functions, arithmetic: devf_[hadamardf, divide, addition, subtraction](devvar3, devvar1, devvar2, option)
Point-wise functions, scaling: devf_[i,s,c]scal(devicevariable, a, b), devf_cscalconj(devicevariable, a, b)
Point-wise functions, misc.: devf_zeros(devicevariable), devf_conjugate(devicevariable), devf_partofcmplx(whichpart,devicevariable)
CUBLAS functions: BLAS 1, BLAS 2, BLAS 3 (with shorter call strings)
CUFFT functions, FFT plans: devf_fftplan(devvariable, fft_type, batch), devf_destroyfftplan(plan)
CUFFT functions, FFT functions: devf_fft(input, plan, output), devf_bfft(input, plan, output), devf_ifft(input, plan, output), devf_fftr2c(input, plan, output), devf_fftc2r(input, plan, output)
CUDPP functions: devf_anccudppsortscan(devvarin, devvarout, operation, datatype, algorithm, option), devf_anccudppsortsimple(devvarin, devvarout)
Ancillary functions: devf_ancmatrixtranspose(devvarin, devvarout), devf_ancbitonicsort(devvar1)

28 Example of code conversion

29 Plasma turbulence computations. Spectral code, solved via a standard Runge-Kutta time advance, coupled with a pseudo-spectral evaluation of the nonlinear (NL) terms. Derivatives are evaluated in k-space, while multiplications in Eq. (2) are carried out in real space. The standard 2/3 rule for dealiasing is applied, and small hyperviscous damping terms are added to provide stability at the grid scale. Results agree with analytic expectations and are the same on both CPU and GPU. 32x speedup!
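The 2/3 dealiasing rule mentioned on this slide can be illustrated in one dimension. This is a minimal Python sketch, not the code used in the talk: it assumes wavenumbers stored in the usual FFT ordering and keeps only the modes with |k| < n/3, so that the product of two retained modes cannot alias back into the retained band. The function names are illustrative.

```python
def fft_wavenumbers(n):
    """Integer wavenumbers in standard FFT order: 0, 1, ..., then negatives."""
    return [i if i < (n + 1) // 2 else i - n for i in range(n)]


def dealias_mask(n):
    """2/3 rule: keep mode k only if |k| < n/3 (two thirds of k_max = n/2)."""
    return [3 * abs(k) < n for k in fft_wavenumbers(n)]


def apply_mask(coeffs, mask):
    """Zero the spectral coefficients of all discarded modes."""
    return [c if keep else 0.0 for c, keep in zip(coeffs, mask)]


n = 12
mask = dealias_mask(n)
kept = [k for k, keep in zip(fft_wavenumbers(n), mask) if keep]
print("retained wavenumbers:", kept)
print(apply_mask([1.0] * n, mask))
```

In a pseudo-spectral step the mask would typically be applied to the spectral coefficients before transforming to real space for the multiplication, and again after transforming the product back.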

30 Figure labels: device memory, multiprocessors, screen, camera. A 64-microphone spherical array forms an audio camera.

31 Audio Camera. Spherical array of microphones. Using beamforming algorithms we developed, we can find sounds coming from particular directions. Run several beamformers, one per look direction, and assign each output to an audio pixel. Compose the audio image. Transform the spherical array into a camera for audio images. Requires significant processing to form pixels from all directions in a frame before the next frame is ready. Figure axes: elevation θ, azimuth φ.

32 O'Donovan et al.: several papers in IEEE CVPR, IEEE ICASSP, WASPAA.

33 Plasma Computations via PIC

34 Data structures for coalesced access. Particles model a density or real particles. The right-hand side of the evolution equation is controlled by a PDE for the field, solved on a regular grid either spectrally or via finite differences. Before/after each time step, field quantities must be interpolated at grid nodes to/from particles. Particles are organized in a box using octrees created via bit interleaving, resulting in a Morton curve layout. Update procedures at the end of each time step. George Stantchev, William Dorland, Nail Gumerov, Fast parallel particle-to-grid interpolation for plasma PIC simulations on the GPU, J. Parallel Distrib. Comput., 2008.
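Bit interleaving, as mentioned on this slide, can be sketched in a few lines. The Python example below is an illustration under stated assumptions, not the implementation from the cited paper: particle coordinates are assumed to lie in the unit cube [0, 1), each axis is quantized to `bits` bits, and the function names are hypothetical.

```python
def morton_encode(ix, iy, iz, bits=10):
    """Interleave the bits of three integer cell indices into one Morton code."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)        # x bit goes to position 3b
        code |= ((iy >> b) & 1) << (3 * b + 1)    # y bit goes to position 3b+1
        code |= ((iz >> b) & 1) << (3 * b + 2)    # z bit goes to position 3b+2
    return code


def box_index(code, level, bits=10):
    """Octree box containing the cell at a given level (level 0 is the root)."""
    return code >> (3 * (bits - level))


def sort_particles(particles, bits=10):
    """Sort (x, y, z) points in [0, 1)^3 along the Morton curve."""
    n = 1 << bits

    def key(p):
        cells = (min(int(c * n), n - 1) for c in p)
        return morton_encode(*cells, bits=bits)

    return sorted(particles, key=key)


points = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.6, 0.1, 0.1)]
print(sort_particles(points))
```

After sorting, particles that share an octree box at any level occupy a contiguous range of the array, since their codes share the same leading bits. That contiguity is what makes the layout attractive for coalesced memory access.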

35 Numerical relativity. Beginning collaboration with Prof. Tiglio's group. Hope to report more later.
