Sampling Using GPU Accelerated Sparse Hierarchical Models

1 Sampling Using GPU Accelerated Sparse Hierarchical Models
Miroslav Stoyanov
Oak Ridge National Laboratory
Supported by the Exascale Computing Project (ECP), exascaleproject.org
April 9, 28

2 Sparse polynomial model
Sparse grid, without the grid

Given a sequence of 1-D basis functions $\phi_i(x) : I \to \mathbb{R}$ on a 1-D interval $I$, for $i = 0, 1, \dots$ (and maybe associated interpolation nodes $\xi_i$), consider all possible $d$-dimensional tensors:
$$\phi_{\mathbf{i}}(x) = \prod_{k=1}^{d} \phi_{i_k}(x_k), \qquad \phi_{\mathbf{i}}(x) : \bigotimes_{k=1}^{d} I \to \mathbb{R}$$
A sparse polynomial basis is defined by a finite multi-index set
$$\Lambda \subset \mathbb{N}^d = \{(i_1, i_2, \dots, i_d) : i_k \in \mathbb{N}\}$$
A sparse model is any basis combined with a set of coefficients $C = \{c_{\mathbf{i}}\}_{\mathbf{i} \in \Lambda} \subset \mathbb{R}^m$:
$$G_{\Lambda,C}(x) = \sum_{\mathbf{i} \in \Lambda} c_{\mathbf{i}} \phi_{\mathbf{i}}(x)$$
The choice of $\Lambda$ and $C$ is usually made so that
$$G_{\Lambda,C}(x) \approx f(x), \qquad f(x) : \bigotimes_{k=1}^{d} I \to \mathbb{R}^m$$
For example, any sparse grid constructed from a nested set of nodes and basis functions follows this framework.
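The definition of the sparse model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1-D basis is taken to be the monomials x**i, the multi-index set is a total-degree set, and the coefficient values are made up. None of these choices come from the slides, and the helper names (`phi_1d`, `total_degree_set`, `evaluate_model`) are hypothetical.

```python
from itertools import product
from math import prod


def phi_1d(i, x):
    """Assumed 1-D basis function: the monomial x**i."""
    return x ** i


def phi(multi_index, x):
    """Tensor basis: product over dimensions of phi_1d(i_k, x_k)."""
    return prod(phi_1d(i_k, x_k) for i_k, x_k in zip(multi_index, x))


def total_degree_set(d, level):
    """One possible finite multi-index set: all i in N^d with sum(i) <= level."""
    return [i for i in product(range(level + 1), repeat=d) if sum(i) <= level]


def evaluate_model(index_set, coeffs, x):
    """G(x) = sum over i of c_i * phi_i(x); each c_i is a list of m outputs."""
    m = len(coeffs[index_set[0]])
    out = [0.0] * m
    for i in index_set:
        b = phi(i, x)
        for k in range(m):
            out[k] += coeffs[i][k] * b
    return out


Lambda = total_degree_set(d=2, level=2)            # 6 multi-indexes
C = {i: [1.0, float(sum(i))] for i in Lambda}      # m = 2 outputs, made-up values
print(evaluate_model(Lambda, C, (0.5, 0.25)))      # [2.1875, 1.625]
```

Any other multi-index set or 1-D basis family can be swapped in without changing `evaluate_model`, which is the point of defining the model by the pair (Λ, C) rather than by a grid.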

3 Sparse polynomial basis
[Figure: hierarchical 1-D polynomial basis functions with labeled nodes.]

4 Piece-wise constant hierarchy
[Figure: hierarchy of piece-wise constant basis functions with labeled nodes.]

5 Motivation: Exascale Models of Stellar Explosions
Fast surrogate models

Simulation of neutrino radiation in core collapse supernovae: the evolution equation for $u$ has a right-hand side given by a collision integral over energy,
$$\int_E u(E', \cdot)\, R(E, E', T, \eta)\, n(E')\, dE'$$
- The integral has to be evaluated for each cell, and the collision kernel $R(E, E')$ is computed from a separate expensive model.
- Standard practice is to approximate the kernel model using dense grids.
- Sparse grid surrogates dramatically reduce the construction cost and the memory footprint, but evaluations are more expensive than a simple table lookup.
- Consider thousands of evaluations or more per time step, per discretization cell.

6 Motivating: Tomography in 4D
Sparse data in high dimensions
- There are many advanced image processing techniques for 2D and 3D
- The Spallation Neutron Source at ORNL produces data on the order of 2GB
- Data lives in 4D space, which means information is very sparse
- But the data has lots of structure

7 Motivating: Bayesian inference
Large number of outputs

Bayesian inference:
$$p(x \mid D) \propto L(u(x), D)\, p(x)$$
- $u(x)$ is some model,
- $D$ is the observation data,
- $p(x)$ is the prior, the initial guess regarding $x$,
- $L(u(x), D)$ is the likelihood, the probability of the difference between $u(x)$ and $D$,
- $p(x \mid D)$ is the posterior, the informed distribution of $x$.
Analyzing $p(x \mid D)$ is challenging:
- a Markov chain Monte Carlo method is needed
- high dimensionality, low acceptance rate, irregular structure of $p(x \mid D)$
Brute force solution: DiffeRential Evolution Adaptive Metropolis (DREAM)
- run multiple chains, huge number of batches of samples
- the model $u(x)$ can have a large number of outputs
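To make the sampling cost concrete, here is a minimal sketch of posterior sampling. It deliberately uses a plain single-chain random-walk Metropolis sampler, not DREAM, and it assumes a Gaussian likelihood, a uniform prior on [0, 4], and a trivial model u(x) = x. All of these choices, and the function names, are illustrative assumptions rather than anything taken from the slides.

```python
import math
import random


def make_log_posterior(model, data, sigma, lower, upper):
    """Unnormalized log-posterior: uniform prior on [lower, upper] (assumed)
    plus an assumed Gaussian log-likelihood of the misfit model(x) - data."""
    def log_posterior(x):
        if not (lower <= x <= upper):
            return -math.inf                      # outside the prior support
        return -((model(x) - data) ** 2) / (2.0 * sigma ** 2)
    return log_posterior


def metropolis(log_post, x0, steps, step_size, rng):
    """Single-chain random-walk Metropolis; every step costs one model call."""
    x, lp = x0, log_post(x0)
    chain, accepted = [], 0
    for _ in range(steps):
        y = x + rng.gauss(0.0, step_size)         # symmetric proposal
        lq = log_post(y)
        if lq >= lp or rng.random() < math.exp(lq - lp):
            x, lp = y, lq
            accepted += 1
        chain.append(x)
    return chain, accepted / steps


log_post = make_log_posterior(model=lambda x: x, data=2.0, sigma=0.5,
                              lower=0.0, upper=4.0)
chain, rate = metropolis(log_post, x0=1.0, steps=20000, step_size=0.5,
                         rng=random.Random(0))
kept = chain[2000:]                               # drop burn-in samples
mean = sum(kept) / len(kept)
print(f"acceptance rate {rate:.2f}, posterior mean {mean:.2f}")
```

Each proposed sample requires a fresh evaluation of the model, which is why a fast surrogate for u(x) matters when many chains and many batches are run.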

8 GPU Acceleration
GPU accelerators: Nvidia K20, K40, P100, V100
- massive boost of flops and mops compared to CPUs
- better energy efficiency and low cost
- massive concurrency, thousands of simultaneous operations
Challenges in working with accelerators:
- massive concurrency, cannot handle sequential algorithms
- memory management, GPUs have separate, limited memory

9 Split evaluations
We want the model output at a set of points $\{x_j\}_{j=1}^{n}$, i.e.,
$$G_{\Lambda,C}(x_j) = \sum_{i \in \Lambda} c_i \phi_i(x_j)$$
Let $C$ be the matrix with columns $\{c_i\}_{i \in \Lambda}$, and let $B$ be the matrix
$$B = [b_{i,j}], \qquad b_{i,j} = \phi_i(x_j)$$
Then we want the answer
$$A = CB$$
where $A \in \mathbb{R}^{m \times n}$, $C \in \mathbb{R}^{m \times |\Lambda|}$, and $B \in \mathbb{R}^{|\Lambda| \times n}$. Matrix $C$ is given, thus we need to compute matrix $B$ and then multiply by $C$.

The splitting approach allows us to exploit fast linear algebra libraries for GPU computations, e.g.,
- Nvidia: cuBLAS and cuSPARSE
- University of Tennessee at Knoxville: Matrix Algebra on GPU and Multicore Architectures (MAGMA)
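The two stages of the split evaluation can be sketched on the CPU as follows. The monomial tensor basis, the tiny multi-index set, and the coefficient values are assumptions made for illustration; on a GPU the second stage would be handed to a dense or sparse matrix-multiply routine from one of the libraries named above rather than written by hand.

```python
from math import prod


def phi(multi_index, x):
    """Assumed tensor basis: product of 1-D monomials x_k ** i_k."""
    return prod(x_k ** i_k for i_k, x_k in zip(multi_index, x))


def basis_matrix(index_set, points):
    """Stage 1: B[i][j] = phi_i(x_j), shape |Lambda| x n."""
    return [[phi(i, x) for x in points] for i in index_set]


def multiply(C, B):
    """Stage 2: A = C B, shape m x n, written out with plain lists."""
    n = len(B[0])
    return [[sum(c * b_row[j] for c, b_row in zip(c_row, B)) for j in range(n)]
            for c_row in C]


index_set = [(0, 0), (1, 0), (0, 1)]     # |Lambda| = 3
points = [(0.5, 0.5), (1.0, 0.0)]        # n = 2
C = [[1.0, 2.0, 3.0],                    # m = 2 outputs, made-up coefficients
     [0.0, 1.0, 0.0]]

B = basis_matrix(index_set, points)
A = multiply(C, B)
print(A)                                 # [[3.5, 3.0], [0.5, 1.0]]
```

Column j of A is the model output at x_j, so a whole batch of points is evaluated with a single matrix product once B is available.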

10 Sparsity discussion
Basis functions $\phi_i(x)$ have local support, which means that there is sparsity in $B$.
The sparsity pattern cannot be predicted for a general multi-index set; however, for testing purposes, we consider the multi-index set associated with a standard sparse grid, in which case the number of non-zero entries is
$$O\left(n \log_2^d(|\Lambda|)\right)$$
where $|\Lambda|$ is the number of multi-indexes.

Two potential algorithms for constructing $B$:
- Dense method: compute all $n |\Lambda|$ entries, which is very parallelizable
- Sparse method: convert the multidimensional hierarchical structure (DAG) into a tree and traverse the tree, computing only the non-zeros of $B$
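The sparsity of B is easy to see in one dimension. The sketch below assumes the standard hierarchical hat functions on [0, 1] (an illustrative choice of local basis, not necessarily the one used in the talk), builds B with the dense method, and counts the non-zeros in each column. With d = 1 at most one function per level is non-zero at a point, so each column holds at most log2(|Λ| + 1) non-zeros.

```python
def hat(level, j, x):
    """Assumed 1-D hierarchical hat function on [0, 1]:
    centered at j / 2**level with half-width 2**-level."""
    h = 2.0 ** -level
    return max(0.0, 1.0 - abs(x - j * h) / h)


def hierarchy(max_level):
    """All (level, j) pairs with odd j: 2**max_level - 1 functions in total."""
    return [(l, j) for l in range(1, max_level + 1) for j in range(1, 2 ** l, 2)]


def dense_basis(functions, points):
    """Dense method: evaluate every function at every point."""
    return [[hat(l, j, x) for x in points] for (l, j) in functions]


functions = hierarchy(max_level=5)       # |Lambda| = 31
points = [0.1, 0.37, 0.5, 0.83]
B = dense_basis(functions, points)

nonzeros_per_point = [sum(1 for row in B if row[c] != 0.0)
                      for c in range(len(points))]
print(len(functions) * len(points), "entries computed")
print(nonzeros_per_point)                # [5, 5, 1, 5]
```

The dense method computes 124 entries here while at most 5 per column are non-zero, which is the waste the sparse method avoids at the price of a less regular access pattern.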

11 Basis evaluations
Dense method parallelizes across the multi-indexes
- each block of CUDA threads is assigned 32 multi-indexes
- the associated nodes and the basis support are stored in shared memory
- there is opportunity for some reuse of data
- many of the evaluated functions are zero, and the memory footprint is larger
Sparse method parallelizes across the number of evaluations $n$
- each CUDA thread is assigned a single $x_j$
- the hierarchy of multi-indexes is converted to a tree
- each thread independently traverses the tree
- lots of uncoalesced memory access, and there is no opportunity for reuse
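The per-point tree traversal of the sparse method can be illustrated in one dimension, where the hierarchy of hat functions on [0, 1] is already a binary tree. This is a simplified CPU sketch under that assumption: the function names are hypothetical, the loop over points stands in for one GPU thread per x_j, and the multidimensional DAG-to-tree conversion is not shown.

```python
def hat(level, j, x):
    """Assumed 1-D hierarchical hat function on [0, 1]:
    centered at j / 2**level with half-width 2**-level."""
    h = 2.0 ** -level
    return max(0.0, 1.0 - abs(x - j * h) / h)


def sparse_column(x, max_level):
    """Walk the hierarchy as a binary tree, visiting only functions whose
    support contains x. Returns the non-zeros as [((level, j), value), ...]."""
    entries = []
    level, j = 1, 1                              # root of the tree
    while level <= max_level:
        value = hat(level, j, x)
        if value == 0.0:
            break                                # x is on a support boundary; descendants vanish too
        entries.append(((level, j), value))
        node = j * 2.0 ** -level
        j = 2 * j - 1 if x < node else 2 * j + 1  # descend toward x
        level += 1
    return entries


points = [0.1, 0.37, 0.5, 0.83]
columns = [sparse_column(x, max_level=5) for x in points]   # one "thread" per point
for x, col in zip(points, columns):
    print(x, [index for index, _ in col])
```

Each traversal touches at most one function per level, but different points follow different branches, which is the source of the uncoalesced memory access noted above.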

12 Basis evaluations: 4D
[Figure: "Basis evaluation 4D" for the K20, K40, Pascal, and Volta GPUs; legend entries: Sparse (2.5K batch), Dense, Sparse (5K batch).]

13 Basis evaluations: 8D
[Figure: "Basis evaluation 8D" for the K20, K40, Pascal, and Volta GPUs; legend entries: Sparse (2.5K batch), Dense, Sparse (5K batch).]

14 Model evaluations: Diverse methods
[Figure: "Evaluations 4D" for the K20, K40, Pascal, and Volta GPUs; legend entries: Sparse-Sparse, Dense-Dense, Sparse-Dense, Dense-Sparse.]

15 Model evaluations: 4D and few outputs
[Figure: "Evaluations 4D" for the K20, K40, Pascal, and Volta GPUs; legend entries: Sparse and Dense, each for two small output counts.]

16 Model evaluations: 4D and many outputs
[Figure: "Evaluations 4D" for the K20, K40, Pascal, and Volta GPUs; legend entries: Sparse and Dense, each for two large output counts.]

17 Model evaluations: 8D and few outputs
[Figure: "Evaluations 8D" for the K20, K40, Pascal, and Volta GPUs; legend entries: Sparse and Dense, each for two small output counts.]

18 Model evaluations: 8D and many outputs
[Figure: "Evaluations 8D" for the K20, K40, Pascal, and Volta GPUs; legend entries: Sparse and Dense, each for two large output counts.]

19 Observations
- Small multi-index sets and few $x_j$ favor the dense algorithm
- Large batches favor the sparse algorithm
- Newer hardware architectures handle the uncoalesced memory access better and favor the sparse algorithm
- High dimensions and complex tree structure favor the dense algorithm
- A high number of outputs washes out the difference between the methods
- The dense algorithm always uses more memory

20 A more mainstream architecture
Intel 6-core: i7-3930K CPU (Sandy-Bridge-E) coupled with Nvidia GTX 1080.
Intel 4-core: i7-6700K CPU (Skylake) coupled with Nvidia GTX 980 Ti.
Dimensions, level 7, number of multi-indexes.8m, outputs.

Stage                   i7-3930K + GTX 1080    i7-6700K + GTX 980 Ti
Select multi-indexes    7s                     5s
Compute coefficients    45s                    23s
CPU evaluate M          797s                   772s
GPU evaluate M          28s                    24s

21 Results: Neutrino collision kernel model
[Figure: neutrino opacity kernels for different temperatures.]
Evaluations of millions of opacity values can be performed in < 2s on Pascal and < 6s on K40.

22 Results: Tomography
[Figure: original, classical tomography, and multi-index reconstructions.] Simplified Shepp-Logan example, using 26 angle measurements.
[Figure: original, classical tomography, and multi-index reconstructions.] Neutron imaging example, using 2 angle measurements.

23 Examples: Bayesian inference
Simple model:
$$u(x_1, x_2) = \sin(x_1 \pi t) + \sin(x_2 \pi t)$$
with $x_1, x_2 \in$ [, 2]
Data (assuming no noise):
$$D = \sin(5 \pi t) + \sin(\pi t)$$
Conformal likelihood, for a fixed scaling constant $c > 0$:
$$L(x) = \exp\left(-c \int (u(x) - D)^2 \, dt\right)$$
[Figure: PDF of the model parameters; labels: exact solution, likelihood, log-likelihood, posterior.]
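A likelihood that exponentiates the negative integrated squared misfit between u(x) and D can be approximated with a simple quadrature rule. The sketch below is an assumption-laden illustration: the time window [0, 1], the trapezoidal rule, the scaling constant, and the "true" frequencies (2, 3) are all chosen for the example and are not the values used on the slide.

```python
import math


def model(x1, x2, t):
    """Two superimposed sine waves with frequencies x1 and x2."""
    return math.sin(x1 * math.pi * t) + math.sin(x2 * math.pi * t)


def likelihood(x1, x2, data, ts, scale):
    """exp(-scale * integral of (u - D)^2 dt), trapezoidal rule on the grid ts."""
    sq = [(model(x1, x2, t) - d) ** 2 for t, d in zip(ts, data)]
    integral = sum(0.5 * (sq[k] + sq[k + 1]) * (ts[k + 1] - ts[k])
                   for k in range(len(ts) - 1))
    return math.exp(-scale * integral)


ts = [k / 200 for k in range(201)]               # assumed time window [0, 1]
data = [model(2.0, 3.0, t) for t in ts]          # noise-free data, illustrative frequencies

print(likelihood(2.0, 3.0, data, ts, scale=10.0))   # 1.0 at the true parameters
print(likelihood(3.0, 2.0, data, ts, scale=10.0))   # 1.0 again: the model is symmetric
print(likelihood(2.5, 3.0, data, ts, scale=10.0))   # strictly below 1
```

Because the model is symmetric in its two parameters, the likelihood has two equal peaks, which gives a sampler an irregular, multi-modal target even for this simple problem.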

24 Tasmanian
Toolkit for Adaptive Stochastic Modeling and Non-Intrusive ApproximatioN
github.com/ornl/tasmanian
tasmanian.ornl.gov
current version: 5.
Supported interfaces: C/C++, Python, MATLAB/Octave, CLI, Fortran 90/95

25 Tasmanian
Supported: Linux, OSX, Windows (VC++)
Build system with cmake, install script (bash and batch), GNU-Make
BSD License (with UT-Battelle clause)
No external dependence (good to have CUDA and BLAS)
Global polynomial based refinement
Large number of global 1-D rules with different growth
5 Leja-type rules with different growth
ASKEY quadrature rules
Chebyshev-type 1-D rules balancing the Lebesgue constant and the number of nodes
Several local 1-D rules
- arbitrary order piece-wise polynomials
- linear and cubic wavelets
- locally anisotropic refinement approach
Main focus is on surrogate modeling (Sparse Grids)
DiffeRential Evolution Adaptive Metropolis (DREAM) method
