Parallel multi-frontal solver for isogeometric finite element methods on GPU

Maciej Paszyński
Department of Computer Science, AGH University of Science and Technology, Kraków, Poland
email: paszynsk@agh.edu.pl

Collaborators:
Maciej Woźniak, Krzysztof Kuźnik (AGH University of Science and Technology, Kraków, Poland)
Victor Calo (King Abdullah University of Science and Technology, Thuwal, Saudi Arabia)
David Pardo (The University of the Basque Country, Basque Center for Applied Mathematics and IKERBASQUE (Basque Foundation of Science), Bilbao, Spain)

AGH UNIVERSITY OF SCIENCE AND TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE, KRAKOW, POLAND

OUTLINE
1. Introduction
2. Graph grammar model for 1D
   - Grammar productions expressing the generation of the element elimination tree
   - Grammar productions expressing the solver algorithm
3. Numerical results for 1D
   - Execution time for p=1,2,3,4,5
   - Comparison with MUMPS for p=1,2,3,4,5
4. Generalization to 2D
5. C^(p-1) vs C^0
6. Numerical results for 2D
   - Execution time for p=1,2,3
   - Comparison with MUMPS for p=1,2,3
7. Conclusions

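INTRODUCTION B-SPLINES BASED FINITE ELEMENT METHOD

Strong formulation
\[
-\frac{d}{dx}\left(A(x)\frac{du}{dx}\right) + B(x)\frac{du}{dx} + C(x)\,u = f(x), \qquad u(0) = 0, \qquad A(1)\frac{du}{dx}(1) + \beta\,u(1) = \gamma
\]

Weak formulation
Find $u \in V = \{u \in H^1(0,1) : u(0) = 0\}$ such that
\[
b(v,u) = l(v) \quad \forall v \in V = \{v \in H^1(0,1) : v(0) = 0\}
\]
\[
b(v,u) = \int_0^1 \left( A(x)\frac{dv}{dx}\frac{du}{dx} + B(x)\,v(x)\frac{du}{dx} + C(x)\,v(x)\,u(x) \right) dx + \beta\,v(1)\,u(1)
\]
\[
l(v) = \int_0^1 f(x)\,v(x)\,dx + \gamma\,v(1)
\]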

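INTRODUCTION B-SPLINES BASED FINITE ELEMENT METHOD
Using B-splines as basis functions

Find $u \in V = \{u \in H^1(0,1) : u(0) = 0\}$ such that $b(v,u) = l(v)$ for all $v \in V = \{v \in H^1(0,1) : v(0) = 0\}$, with
\[
u \approx \sum_i N_{i,p}(x)\,a_i, \qquad v \in \{N_{j,p}(x)\}_j
\]
\[
\sum_i b\big(N_{j,p}(x), N_{i,p}(x)\big)\,a_i = l\big(N_{j,p}(x)\big) \quad \forall j
\]

Linear B-splines: contribution of $b(N_{2,1}, N_{3,1})$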
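The basis functions N_{i,p} used above can be evaluated with the standard Cox-de Boor recursion. The following is a minimal sketch, not the authors' implementation; the function names, the zero-based indexing, and the open uniform knot vector on [0, 1] are assumptions made for illustration.

```python
def bspline_basis(i, p, knots, x):
    """Value of the i-th B-spline of order p at x (Cox-de Boor recursion)."""
    if p == 0:
        # Piecewise constant: 1 on the half-open knot span [knots[i], knots[i+1]).
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    value = 0.0
    left_span = knots[i + p] - knots[i]
    if left_span > 0.0:  # skip the 0/0 terms produced by repeated knots
        value += (x - knots[i]) / left_span * bspline_basis(i, p - 1, knots, x)
    right_span = knots[i + p + 1] - knots[i + 1]
    if right_span > 0.0:
        value += ((knots[i + p + 1] - x) / right_span
                  * bspline_basis(i + 1, p - 1, knots, x))
    return value


def open_uniform_knots(n_elements, p):
    """Open knot vector on [0, 1]: both end knots are repeated p + 1 times."""
    interior = [k / n_elements for k in range(1, n_elements)]
    return [0.0] * (p + 1) + interior + [1.0] * (p + 1)


# Quadratic B-splines on 4 elements (hypothetical example values).
p = 2
knots = open_uniform_knots(4, p)
n_basis = len(knots) - p - 1
values = [bspline_basis(i, p, knots, 0.3) for i in range(n_basis)]
nonzero = [i for i, v in enumerate(values) if v > 0.0]
```

At any point inside an element only p + 1 basis functions are nonzero and they sum to one, which is why each element contributes a (p + 1) x (p + 1) block to the system.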

GRAPH GRAMMAR GENERATION OF 1D ELIMINATION TREE
1D elimination tree obtained by executing productions (P1)-(P2)^2-(P2)^2-(P3)^6

GRAPH GRAMMAR PRODUCTIONS AS ATOMIC TASKS
We assign indices to the grammar productions in order to identify the places where each production was fired:
(P1)-(P2)_1-(P2)_2-(P2)_3-(P2)_4-(P3)_1-(P3)_2-(P3)_3-(P3)_4-(P3)_5-(P3)_6

TRACE THEORY BASED SCHEDULER
Alphabet:
A = {(P1), (P2)_1, (P2)_2, (P2)_3, (P2)_4, (P3)_1, (P3)_2, (P3)_3, (P3)_4, (P3)_5, (P3)_6}

Dependency relation for construction of the elimination tree:
(P1) D {(P2)_1, (P2)_2}
(P2)_1 D {(P2)_3, (P2)_4}
(P2)_3 D {(P3)_1, (P3)_2}
(P2)_4 D {(P3)_3, (P3)_4}
(P2)_2 D {(P3)_5, (P3)_6}

TRACE THEORY BASED SCHEDULER
Dependency graph

TRACE THEORY BASED SCHEDULER
(P1)-(P2)_1-(P2)_2-(P2)_3-(P2)_4-(P3)_1-(P3)_2-(P3)_3-(P3)_4-(P3)_5-(P3)_6

Foata Normal Form
\[
[a^1_1 a^1_2 \dots a^1_{l_1}]\,[a^2_1 a^2_2 \dots a^2_{l_2}] \dots [a^n_1 a^n_2 \dots a^n_{l_n}]
\]
where
\[
a^k_i \in A \ \text{(alphabet)}, \qquad k = 1,\dots,n, \quad i = 1,\dots,l_k
\]
\[
a^k_i \, I \, a^k_j, \qquad k = 1,\dots,n, \quad i,j = 1,\dots,l_k, \quad i \neq j, \qquad \text{where } I = A \times A \setminus D
\]
\[
\forall\, a^{k+1}_i \ \exists\, a^k_j : \ a^{k+1}_i \, D \, a^k_j, \qquad k = 1,\dots,n-1
\]

Scheduling according to Foata Normal Form:
[(P1)] [(P2)_1 (P2)_2] [(P2)_3 (P2)_4 (P3)_5 (P3)_6] [(P3)_1 (P3)_2 (P3)_3 (P3)_4]

Thus, the execution of the solver consists of several steps in which independent tasks are executed concurrently, separated by synchronization barriers.
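The Foata classes for the elimination tree productions can be computed directly from the dependency relation: a task belongs to the class one past the latest class of any task it depends on. The sketch below illustrates this under assumed names (`foata_normal_form`, `enables`, and string task labels such as "P2_1" are hypothetical, not taken from the slides).

```python
def foata_normal_form(tasks, depends_on):
    """Group tasks into Foata classes.

    A task's class index is one more than the largest class index among the
    tasks it depends on; tasks with no dependencies go into class 1.
    """
    level = {}

    def level_of(task):
        if task not in level:
            preds = depends_on.get(task, [])
            level[task] = 1 + max((level_of(q) for q in preds), default=0)
        return level[task]

    classes = {}
    for task in tasks:
        classes.setdefault(level_of(task), []).append(task)
    return [classes[k] for k in sorted(classes)]


# "X D {Y, Z}" on the slide: Y and Z can only fire after X.
enables = {
    "P1": ["P2_1", "P2_2"],
    "P2_1": ["P2_3", "P2_4"],
    "P2_3": ["P3_1", "P3_2"],
    "P2_4": ["P3_3", "P3_4"],
    "P2_2": ["P3_5", "P3_6"],
}

# Invert the relation so each task lists the tasks it must wait for.
depends_on = {}
for parent, children in enables.items():
    for child in children:
        depends_on.setdefault(child, []).append(parent)

tasks = (["P1"]
         + [f"P2_{i}" for i in range(1, 5)]
         + [f"P3_{i}" for i in range(1, 7)])
schedule = foata_normal_form(tasks, depends_on)
```

Each inner list of `schedule` is one set of mutually independent productions; consecutive lists are separated by a barrier.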

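GRAMMAR BASED NUMERICAL INTEGRATION
\[
b\big(N_{j,p}(x), N_{i,p}(x)\big) = \int_0^1 \left( A(x)\frac{dN_{j,p}(x)}{dx}\frac{dN_{i,p}(x)}{dx} + B(x)\,N_{j,p}(x)\frac{dN_{i,p}(x)}{dx} + C(x)\,N_{j,p}(x)\,N_{i,p}(x) \right) dx + \beta\,N_{j,p}(1)\,N_{i,p}(1)
\]
\[
l\big(N_{j,p}(x)\big) = \int_0^1 f(x)\,N_{j,p}(x)\,dx + \gamma\,N_{j,p}(1)
\]

Using Gaussian quadrature, the integration over the domain can be replaced by a weighted summation over Gauss points $x_l$ with weights $w_l$:
\[
\int A(x)\frac{dN_{i,p}(x)}{dx}\frac{dN_{j,p}(x)}{dx}\,dx \approx \sum_l w_l\,A(x_l)\,\frac{dN_{i,p}}{dx}(x_l)\,\frac{dN_{j,p}}{dx}(x_l)
\]
\[
\int B(x)\frac{dN_{i,p}(x)}{dx}\,N_{j,p}(x)\,dx \approx \sum_l w_l\,B(x_l)\,\frac{dN_{i,p}}{dx}(x_l)\,N_{j,p}(x_l)
\]
\[
\int C(x)\,N_{j,p}(x)\,N_{i,p}(x)\,dx \approx \sum_l w_l\,C(x_l)\,N_{j,p}(x_l)\,N_{i,p}(x_l)
\]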
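As an illustration of the weighted summation over Gauss points, the sketch below integrates the domain part of b for two neighbouring linear B-splines on a uniform mesh using a two-point Gauss-Legendre rule per element. It is a plain sequential sketch under assumptions: the helper names, the node-centred indexing of the hat functions, and the constant coefficients are chosen for the example, and the boundary term at x = 1 is left out.

```python
import math

# Two-point Gauss-Legendre rule on the reference interval [-1, 1].
GAUSS_POINTS = (-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0))
GAUSS_WEIGHTS = (1.0, 1.0)


def hat(i, h):
    """Linear B-spline centred at node i*h, as a (value, derivative) pair."""
    def value(x):
        return max(0.0, 1.0 - abs(x - i * h) / h)

    def derivative(x):
        if (i - 1) * h < x < i * h:
            return 1.0 / h
        if i * h < x < (i + 1) * h:
            return -1.0 / h
        return 0.0

    return value, derivative


def bilinear_entry(test, trial, n_elements, A, B, C):
    """Quadrature approximation of the integral part of b(test, trial) on [0, 1]."""
    h = 1.0 / n_elements
    (v, dv), (u, du) = test, trial
    total = 0.0
    for e in range(n_elements):
        mid = (e + 0.5) * h
        for xi, w in zip(GAUSS_POINTS, GAUSS_WEIGHTS):
            x = mid + 0.5 * h * xi            # map reference point into element e
            integrand = (A(x) * dv(x) * du(x)
                         + B(x) * v(x) * du(x)
                         + C(x) * v(x) * u(x))
            total += 0.5 * h * w * integrand  # 0.5 * h is the Jacobian
    return total


n = 4
h = 1.0 / n
one = lambda x: 1.0
zero = lambda x: 0.0
# Hypothetical coefficient choices: pure diffusion, then pure reaction.
stiff_23 = bilinear_entry(hat(2, h), hat(3, h), n, one, zero, zero)
mass_23 = bilinear_entry(hat(2, h), hat(3, h), n, zero, zero, one)
```

With these choices the diffusion entry equals -1/h and the reaction entry equals h/6, the familiar off-diagonal values for linear elements; the two-point rule is exact here because the integrands are at most quadratic on each element.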


PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS
Generation of frontal matrices at the leaves of the elimination tree, expressed as the execution of graph grammar productions (A1)-(A)^4-(AN)

PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS Graph grammar productions generating local frontal matrices for left boundary, interior and right boundary nodes for linear B-splines

PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS Graph grammar productions merging element frontal matrices at parent level

PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS Graph grammar production eliminating fully assembled row at parent level

PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS
Graph grammar production for merging element frontal matrices at root level
Graph grammar production for solution at root level

PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS Graph grammar production for recursive backward substitution

PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS
Expression of the solver execution by graph grammar productions:
- (A1)-(A)^4-(AN): generation of frontal matrices at the leaves of the elimination tree
- (A2)^3: merging contributions at parent nodes
- (E2)^3: elimination of fully assembled nodes
- (A2) (E2): merging at the parent node followed by elimination
- (Aroot) (Eroot): merging at the root node followed by full forward elimination
- (BS)^4: backward substitutions
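The merge and elimination steps listed above can be sketched for the simplest case of linear B-splines, where every leaf front is a 2x2 matrix and two neighbouring fronts share exactly one node. This is a sketch under assumptions, not the productions from the slides: the function names, the ordering of unknowns, and the sample element matrices are hypothetical.

```python
def merge_fronts(left, right):
    """Merge two 2x2 element fronts that share one node into a 3x3 front.

    Each front is (matrix, rhs) in local order (left node, right node).
    The merged front is ordered (shared, outer-left, outer-right) so that
    the fully assembled shared row comes first.
    """
    (Kl, fl), (Kr, fr) = left, right
    K = [
        [Kl[1][1] + Kr[0][0], Kl[1][0], Kr[0][1]],
        [Kl[0][1],            Kl[0][0], 0.0],
        [Kr[1][0],            0.0,      Kr[1][1]],
    ]
    f = [fl[1] + fr[0], fl[0], fr[1]]
    return K, f


def eliminate_first_row(K, f):
    """Eliminate the fully assembled first unknown and return the Schur
    complement, which is the front passed up to the parent node."""
    n = len(f)
    S = [[K[r][c] - K[r][0] / K[0][0] * K[0][c] for c in range(1, n)]
         for r in range(1, n)]
    g = [f[r] - K[r][0] / K[0][0] * f[0] for r in range(1, n)]
    return S, g


# Hypothetical leaf fronts: two elements of length 0.5 for -u'' = 1.
element = ([[2.0, -2.0], [-2.0, 2.0]], [0.25, 0.25])
K, f = merge_fronts(element, element)
S, g = eliminate_first_row(K, f)
```

For this example the Schur complement is [[1, -1], [-1, 1]] with right-hand side [0.5, 0.5], which is the front of a single element of length 1: eliminating the interior node leaves a problem on the two outer nodes only.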

TRACE THEORY BASED SCHEDULER
Alphabet:
A = {(A1), (A)_1, (A)_2, (A)_3, (A)_4, (AN), (A2)_1, (A2)_2, (A2)_3, (E2)_1, (E2)_2, (E2)_3, (A2)_4, (E2)_4, (Aroot), (Eroot), (BS)_1, (BS)_2, (BS)_3, (BS)_4}

Dependency relation for the solver algorithm:
{(A1), (A)_1} D (A2)_1
{(A)_2, (A)_3} D (A2)_2
{(A)_4, (AN)} D (A2)_3
(A2)_1 D (E2)_1
(A2)_2 D (E2)_2
(A2)_3 D (E2)_3
{(E2)_1, (E2)_2} D (A2)_4
(A2)_4 D (E2)_4
{(E2)_3, (E2)_4} D (Aroot)
(Aroot) D (Eroot)
(Eroot) D {(BS)_1, (BS)_2}
(BS)_1 D {(BS)_3, (BS)_4}

TRACE THEORY BASED SCHEDULER
Dependency graph

TRACE THEORY BASED SCHEDULER
(A1)-(A)_1-(A)_2-(A)_3-(A)_4-(AN)-(A2)_1-(A2)_2-(A2)_3-(E2)_1-(E2)_2-(E2)_3-(A2)_4-(E2)_4-(Aroot)-(Eroot)-(BS)_1-(BS)_2-(BS)_3-(BS)_4

Foata Normal Form
\[
[a^1_1 a^1_2 \dots a^1_{l_1}]\,[a^2_1 a^2_2 \dots a^2_{l_2}] \dots [a^n_1 a^n_2 \dots a^n_{l_n}]
\]
where
\[
a^k_i \in A \ \text{(alphabet)}, \qquad k = 1,\dots,n, \quad i = 1,\dots,l_k
\]
\[
a^k_i \, I \, a^k_j, \qquad k = 1,\dots,n, \quad i,j = 1,\dots,l_k, \quad i \neq j, \qquad \text{where } I = A \times A \setminus D
\]
\[
\forall\, a^{k+1}_i \ \exists\, a^k_j : \ a^{k+1}_i \, D \, a^k_j, \qquad k = 1,\dots,n-1
\]

Scheduling according to Foata Normal Form:
[(A1) (A)_1 (A)_2 (A)_3 (A)_4 (AN)] [(A2)_1 (A2)_2 (A2)_3] [(E2)_1 (E2)_2 (E2)_3] [(A2)_4] [(E2)_4] [(Aroot)] [(Eroot)] [(BS)_1 (BS)_2] [(BS)_3 (BS)_4]

Thus, the execution of the solver consists of several steps in which independent tasks are executed concurrently, separated by synchronization barriers.
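The execution model of concurrent steps separated by barriers can be sketched with a thread pool standing in for the GPU. This is an illustrative assumption, not the implementation behind the reported results: `run_schedule`, `task_body`, and the string task labels are hypothetical, and the task body only records that it ran. The schedule is written out from the dependency relation of the solver algorithm.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_schedule(schedule, task_body, max_workers=4):
    """Run each Foata class concurrently, with a barrier between classes."""
    log = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for foata_class in schedule:
            # list() drains the result iterator, so every task of this class
            # has finished before the next class starts (implicit barrier).
            log.append(list(pool.map(task_body, foata_class)))
    return log


schedule = [
    ["A1", "A_1", "A_2", "A_3", "A_4", "AN"],  # leaf frontal matrices
    ["A2_1", "A2_2", "A2_3"],                  # merge at parent nodes
    ["E2_1", "E2_2", "E2_3"],                  # eliminate assembled rows
    ["A2_4"],
    ["E2_4"],
    ["Aroot"],
    ["Eroot"],
    ["BS_1", "BS_2"],                          # backward substitutions
    ["BS_3", "BS_4"],
]

finished = []
lock = threading.Lock()


def task_body(name):
    """Placeholder task: record completion under a lock and return the name."""
    with lock:
        finished.append(name)
    return name


log = run_schedule(schedule, task_body)
```

Because the pool is drained once per class, no task of a later class can start before all tasks of the earlier classes have completed, which is the only ordering guarantee the schedule relies on.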

GRAPH GRAMMAR PRODUCTIONS EXPRESSING THE SOLVER ALGORITHM Linear B-splines

GRAPH GRAMMAR PRODUCTIONS EXPRESSING THE SOLVER ALGORITHM Quadratic B-splines

1D NUMERICAL RESULTS LINEAR B-SPLINES
NVIDIA Tesla C2070, 6GB memory, 448 CUDA cores, each with a 1.15GHz clock

1D NUMERICAL RESULTS QUADRATIC B-SPLINES
NVIDIA Tesla C2070, 6GB memory, 448 CUDA cores, each with a 1.15GHz clock

1D NUMERICAL RESULTS CUBIC B-SPLINES
NVIDIA Tesla C2070, 6GB memory, 448 CUDA cores, each with a 1.15GHz clock

1D NUMERICAL RESULTS QUINTIC B-SPLINES
NVIDIA Tesla C2070, 6GB memory, 448 CUDA cores, each with a 1.15GHz clock

COMPARISON WITH CPU MUMPS SOLVER
NVIDIA Tesla C2070, 6GB memory, 448 CUDA cores, each with a 1.15GHz clock
Intel(R) Core(TM)2 Quad CPU Q9400 with 2.66GHz clock, 8GB of memory

GENERATION OF 2D ELIMINATION TREE

2D NUMERICAL INTEGRATION

2D ELIMINATION

C^0 COST
Size of the top dense problem: M = O(N^0.5)
Cost of the sequential dense solver: O(M^3)
Computational cost of the sequential C^0 solver: O((N^0.5)^3) = O(N^(3/2))
Cost of the parallel dense solver: O(M^2)
Computational cost of the parallel C^0 solver: O((N^0.5)^2) = O(N)

C^(p-1) COST
Size of the top dense problem: M = O(pN^0.5)
Cost of the sequential dense solver: O(M^3)
Computational cost of the sequential C^(p-1) solver: O((pN^0.5)^3) = O(p^3 N^(3/2))
Cost of the parallel dense solver: O(M^2)
Computational cost of the parallel C^(p-1) solver: O((pN^0.5)^2) = O(p^2 N)
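Comparing the two cost slides term by term shows how the higher continuity enters the estimates. The short derivation below only restates the slide estimates with M the size of the top dense problem; treating the hidden constants as equal for both discretizations is an assumption.

```latex
\[
M_{C^0} = N^{1/2}, \qquad M_{C^{p-1}} = p\,N^{1/2}
\]
\[
\text{sequential: } \frac{M_{C^{p-1}}^3}{M_{C^0}^3} = \frac{p^3 N^{3/2}}{N^{3/2}} = p^3,
\qquad
\text{parallel: } \frac{M_{C^{p-1}}^2}{M_{C^0}^2} = \frac{p^2 N}{N} = p^2
\]
```

Under that assumption, the parallel dense solver lowers the dependence on p from cubic to quadratic and the dependence on N from N^(3/2) to N.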

2D NUMERICAL RESULTS LINEAR B-SPLINES GeForce GTX 780 with 2304 cores, 3GB of memory

2D NUMERICAL RESULTS QUADRATIC B-SPLINES GeForce GTX 780 with 2304 cores, 3GB of memory

2D NUMERICAL RESULTS CUBIC B-SPLINES GeForce GTX 780 with 2304 cores, 3GB of memory

CONCLUSIONS
- We have developed a formal graph grammar model for B-spline based finite element method computations in one and two dimensions.
- The multi-frontal direct solver algorithm has been expressed as a set of basic indivisible tasks, and the partial order of their execution has been expressed by a dependency graph.
- The tasks can be scheduled as a sequence of sets of independent tasks, based on the coloring of the dependency graph.
- The model allows for an efficient implementation of the generation and solution of the problem on a shared memory architecture.
- The isogeometric shared memory C^(p-1) continuity parallel direct solver scales like O(p^2 log(n/p)) for one-dimensional problems, and like O(N p^2) for two-dimensional problems.