Sparse Multifrontal Performance Gains via NVIDIA GPU January 16, 2009


1 Sparse Multifrontal Performance Gains via NVIDIA GPU
January 16, 2009
Dan'l Pierce, PhD, MBA, CEO & President, AAI
Joint with: Yukai Hung, Chia-Chi Liu, Yao-Hung Tsai, Weichung Wang, and David Yu
Access Analytics Int'l, LLC, P.O. Box 981, Redmond, WA 98073

2 Outline
- Motivation
- Problem
- Software: BCSLIB-EXT
- Algorithm Overview
- GPU Implementation Highlights
- Timing results
21-Jan-09

3 Motivation
Finite Element Analysis of Structures
- Displacements given a set of loads: $Ax = b$
  - Factor $A = LDL^T$
- Modes and mode shapes: $Kx = \lambda Mx$
  - Solve for specific intervals
  - Block Shift Inverted Lanczos
  - Repeated solves for $(K - \sigma M)x = b$
  - Factor $(K - \sigma M) = LDL^T$

4 Motivation
- The common computational workhorse in commercial FEA products is BCSLIB-EXT
  - Typically ~80% of the time in commercial code is spent in BCSLIB-EXT and similar codes
- BCSLIB-EXT
  - Developed by The Boeing Company
  - Acquired by Access Analytics Int'l in 2007
- All work presented is on BCSLIB-EXT and represents a new product, BCSLIB-GPU
- Inexpensive academic licensing is available for BCSLIB-EXT and BCSLIB-GPU
- Real commercial software

6 Algorithm Overview
- BCSLIB-EXT implements a sparse symmetric indefinite multifrontal factorization.
- It is a direct, complete matrix factorization.
- It is most easily understood as a tree-based method with dense partial factorizations performed at each node in the tree.
- No significant parts of the algorithm are embarrassingly parallel.

7 Algorithm Overview

8 Algorithm Overview
[Figure: blocks $A_1$, $A_2$, $A_3$]

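9 Algorithm Overview
- A leaf node gets nonzeroes from columns and performs a partial factorization.
- A nonleaf node gets nonzeroes from columns and updates from its children, then performs a partial factorization.

$$
\begin{pmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & I \end{pmatrix}
\begin{pmatrix} D & 0 \\ 0 & \tilde{A}_{22} \end{pmatrix}
\begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & I \end{pmatrix}
$$

- Factor: $A_{11} = L_{11} D L_{11}^T$
- Scale: $L_{21} = A_{21} L_{11}^{-T} D^{-1}$
- Update (Schur complement): $\tilde{A}_{22} = A_{22} - L_{21} D L_{21}^T$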
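The partial factorization at a tree node can be illustrated with a small dense sketch. This is not BCSLIB-EXT code: the function name `partial_factor`, the full (non-triangular) storage, and the absence of pivoting are all assumptions made for readability. It eliminates the first `k` variables of a frontal matrix one column at a time and returns the same blocks as the Factor, Scale, and Update steps.

```python
def partial_factor(F, k):
    """Eliminate the first k variables of a symmetric frontal matrix F.

    Returns (L11, d, L21, S) with F11 = L11*diag(d)*L11^T,
    F21 = L21*diag(d)*L11^T and S = F22 - L21*diag(d)*L21^T.
    Assumption: no pivoting, so every pivot d[j] must be nonzero.
    """
    n = len(F)
    W = [row[:] for row in F]  # working copy in full symmetric storage
    L = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(n)]
    d = [0.0] * k
    for j in range(k):
        d[j] = W[j][j]                      # pivot
        for i in range(j + 1, n):
            L[i][j] = W[i][j] / d[j]        # column of L11 and L21
        for i in range(j + 1, n):           # rank-1 update of what remains
            for c in range(j + 1, i + 1):
                W[i][c] -= L[i][j] * d[j] * L[c][j]
                W[c][i] = W[i][c]
    L11 = [row[:] for row in L[:k]]
    L21 = [row[:] for row in L[k:]]
    S = [row[k:] for row in W[k:]]          # Schur complement for the parent
    return L11, d, L21, S


# A small symmetric indefinite example (hypothetical data).
F = [[4.0, 2.0, -2.0],
     [2.0, -3.0, 1.0],
     [-2.0, 1.0, 5.0]]
L11, d, L21, S = partial_factor(F, 2)
print(d, L21, S)
```

Here `S` plays the role of the update matrix that a node passes up to its parent.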

10 Algorithm Overview - Review
- Assemble a matrix at a tree node
  - Get matrix elements from the original matrix
  - Get update elements from the children
- Perform partial factorization
  - Factor
  - Scale
  - Update (this gets passed up to the parent)
- Since the matrix is symmetric, we use only the lower triangular portion
- Intermediate matrices to be partially factored are called Frontal Matrices (hence "Multi-Frontal")
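The assembly step in the review above can be sketched as follows. The function `assemble_front`, its argument layout, and the use of full symmetric storage (rather than the lower triangle only) are illustrative assumptions, not the library's interface. Each front carries a list of global variable indices; entries of the original matrix and the children's update matrices are scattered into it by index.

```python
def assemble_front(indices, original_entries, child_updates):
    """Build a dense frontal matrix for one tree node.

    indices          -- global variable indices covered by this front
    original_entries -- {(i, j): value}, lower-triangle entries (i >= j)
                        of the original matrix, keyed by global index
    child_updates    -- list of (child_indices, U) pairs, where U is the
                        child's dense symmetric update matrix
    """
    pos = {g: p for p, g in enumerate(indices)}   # global -> local position
    n = len(indices)
    F = [[0.0] * n for _ in range(n)]
    for (i, j), val in original_entries.items():
        r, c = pos[i], pos[j]
        F[r][c] += val
        if r != c:
            F[c][r] += val                        # mirror to keep F symmetric
    for child_idx, U in child_updates:
        for a, gi in enumerate(child_idx):
            for b, gj in enumerate(child_idx):
                F[pos[gi]][pos[gj]] += U[a][b]    # add the child's update
    return F


# Hypothetical example: a parent front over variables 2 and 3, with one
# child that eliminated variable 1 and passed up a 2x2 update.
front = assemble_front(
    indices=[2, 3],
    original_entries={(2, 2): -3.0, (3, 2): 1.0, (3, 3): 5.0},
    child_updates=[([2, 3], [[-1.0, 1.0], [1.0, -1.0]])],
)
print(front)
```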

11 GPU Implementation
- Only those frontal matrices that are large enough are factored on the GPU
- Transfer the frontal matrix to the GPU
- Factor and Scale
  - Done together with a left-looking factorization, 64 columns at a time
  - Parallelism is along the columns, then down the rows of the matrix
- Update
  - Note this is just an outer-product matrix multiplication
  - Blocked to take advantage of shared memory
  - Performed after each 64 columns of the factorization
- Pass back to the CPU
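The schedule described on this slide (a left-looking Factor and Scale on a panel of columns, then an outer-product Update of the trailing matrix) can be mimicked on the CPU with a short sequential sketch. The function name `blocked_ldlt`, the dense list-of-lists storage, and the lack of pivoting are assumptions; the real implementation runs these phases as GPU kernels, which this sketch does not attempt to reproduce.

```python
def blocked_ldlt(A, block=64):
    """LDL^T of a symmetric matrix, processed `block` columns at a time.

    Only the lower triangle of A is read. Assumption: no pivoting.
    """
    n = len(A)
    W = [row[:] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for s in range(0, n, block):
        e = min(s + block, n)
        # Factor and Scale: left-looking inside the panel s..e-1.
        # Earlier panels are already folded into W by the Update below.
        for j in range(s, e):
            v = [L[j][k] * d[k] for k in range(s, j)]
            d[j] = W[j][j] - sum(L[j][s + t] * v[t] for t in range(j - s))
            for i in range(j + 1, n):
                acc = sum(L[i][s + t] * v[t] for t in range(j - s))
                L[i][j] = (W[i][j] - acc) / d[j]
        # Update: outer product on the trailing lower triangle.
        for i in range(e, n):
            for c in range(e, i + 1):
                W[i][c] -= sum(L[i][k] * d[k] * L[c][k] for k in range(s, e))
    return L, d


A = [[4.0, 2.0, -2.0],
     [2.0, -3.0, 1.0],
     [-2.0, 1.0, 5.0]]
L_b2, d_b2 = blocked_ldlt(A, block=2)    # two panels
L_b64, d_b64 = blocked_ldlt(A, block=64)  # one panel covers everything
print(d_b2, d_b64)
```

The panel width changes only the order of the arithmetic, so both calls should agree up to rounding.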

12 Factor and Scale
for j = 1:n {
    v(1:j-1) = L(j, 1:j-1) .* d(1:j-1)
    v(j) = A(j, j) - L(j, 1:j-1) v(1:j-1)
    d(j) = v(j)
    L(j+1:n, j) = (A(j+1:n, j) - L(j+1:n, 1:j-1) v(1:j-1)) / v(j)
}
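A direct transcription of this loop into Python may help when checking the indices. It uses 0-based indexing and plain lists, and the name `ldlt` is an assumption; as in the pseudocode, there is no pivoting, so each `v(j)` must be nonzero.

```python
def ldlt(A):
    """Column-by-column LDL^T of a symmetric matrix (lower triangle read)."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # v(1:j-1) = L(j, 1:j-1) .* d(1:j-1)
        v = [L[j][k] * d[k] for k in range(j)]
        # v(j) = A(j, j) - L(j, 1:j-1) v(1:j-1);  d(j) = v(j)
        d[j] = A[j][j] - sum(L[j][k] * v[k] for k in range(j))
        # L(j+1:n, j) = (A(j+1:n, j) - L(j+1:n, 1:j-1) v(1:j-1)) / v(j)
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * v[k] for k in range(j))) / d[j]
    return L, d


A = [[4.0, 2.0, -2.0],
     [2.0, -3.0, 1.0],
     [-2.0, 1.0, 5.0]]
L, d = ldlt(A)

# Rebuild A from the factors as a sanity check.
n = len(A)
rebuilt = [[sum(L[i][k] * d[k] * L[j][k] for k in range(n)) for j in range(n)]
           for i in range(n)]
print(d, rebuilt)
```

The example matrix is indefinite (one pivot is negative), which the $LDL^T$ form handles where a Cholesky factorization would not.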

13 Update
$\tilde{A}_{22} = A_{22} - L_{21} D L_{21}^T$
- 40% of computation in this kernel
- 16 x 16 blocks in shared memory
- % penalty cost in kernel for using full grid
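The tiling of the Update can be sketched sequentially. The function `tiled_update` and the `full_grid` flag are hypothetical; reading "full grid" as covering every tile of the square, rather than only tiles on or below the diagonal, is an interpretation of the slide and not something it states. The sketch counts how many tiles each choice touches.

```python
def tiled_update(A22, L21, d, tile=16, full_grid=False):
    """Compute A22 - L21*diag(d)*L21^T tile by tile.

    With full_grid=False only tiles on or below the diagonal are processed,
    which is enough because the result is symmetric.
    Returns (updated matrix, number of tiles processed).
    """
    m, k = len(A22), len(d)
    out = [row[:] for row in A22]
    tiles = 0
    nt = (m + tile - 1) // tile               # tiles per side
    for bi in range(nt):
        for bj in range(nt):
            if bj > bi and not full_grid:
                continue                      # skip strictly upper tiles
            tiles += 1
            for i in range(bi * tile, min((bi + 1) * tile, m)):
                for j in range(bj * tile, min((bj + 1) * tile, m)):
                    out[i][j] -= sum(L21[i][p] * d[p] * L21[j][p]
                                     for p in range(k))
    return out, tiles


# Hypothetical data; tile=2 keeps the example small (the slide uses 16).
L21 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
d = [2.0, -1.0]
A22 = [[0.0] * 5 for _ in range(5)]
lower, lower_tiles = tiled_update(A22, L21, d, tile=2)
full, full_tiles = tiled_update(A22, L21, d, tile=2, full_grid=True)
print(lower_tiles, full_tiles)
```

Both variants produce the same lower triangle; the full grid simply does redundant work on the upper tiles.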

14 Data Structures & Bank Conflicts
- Frontal matrix: compact triangular storage
  - Results in bank conflicts
- Dense panel for Factor and Scale
- Dense panel used in Update
- Additional overall 15% speedup with dense panel
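The difference between the two layouts shows up in the index arithmetic. The sketch below assumes column-major packed lower-triangular storage and a column-major dense panel with leading dimension `ld`; the slides do not specify the exact layout, so both helper functions are illustrative. In packed storage the distance between neighbouring columns of the same row changes from column to column, while in a dense panel it is constant, which is one plausible reason the dense panel gives more regular memory access.

```python
def packed_index(i, j, n):
    """Offset of entry (i, j), i >= j, in column-major packed lower storage."""
    return j * n - j * (j - 1) // 2 + (i - j)


def dense_index(i, j, ld):
    """Offset of entry (i, j) in a column-major dense panel."""
    return i + j * ld


n = 4
row = 3
packed = [packed_index(row, j, n) for j in range(n)]
dense = [dense_index(row, j, n) for j in range(n)]
packed_strides = [b - a for a, b in zip(packed, packed[1:])]
dense_strides = [b - a for a, b in zip(dense, dense[1:])]
print(packed, packed_strides)
print(dense, dense_strides)
```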

15 Additional Issues
- Some frontal matrices are very large
  - Order 10K to 20K
  - Can't fit on CPU
  - Assembled on GPU directly
  - Compact storage is currently used

16 Performance
Matrix order | CPU Time (secs) | GPU Time (secs) | Speedup

17 Ongoing developments
- Have whole factorization in a single 2D storage
  - Update panels separately from updating trailing triangular update portion
- Exploit more non-compact memory storage
  - Reduce bank conflicts
- Compute factorization of diagonal block
- Overlap data movement with multiple streams
- Overlap CPU and GPU computations

18 Thanks
Any questions, or interest in using BCSLIB-GPU or BCSLIB-EXT? Contact us.
