Parallel multi-frontal solver for isogeometric finite element methods on GPU

Similar documents
HYBRID DIRECT AND ITERATIVE SOLVER FOR H ADAPTIVE MESHES WITH POINT SINGULARITIES

Fast solvers for mesh-based computations

A direct solver with reutilization of previously-computed LU factorizations for h-adaptive finite element grids with point singularities

Optimization of Execution Time, Energy Consumption, and Accuracy during Finite Element Method Simulations

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

APPLICATION OF PROJECTION-BASED INTERPOLATION ALGORITHM FOR NON-STATIONARY PROBLEM

AMS527: Numerical Analysis II

IN FINITE element simulations usually two computation

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Fast Tridiagonal Solvers on GPU

Algorithms for construction of Element Partition Trees for Direct Solver executed over h refined grids with B-splines and C 0 separators

A Scalable, Numerically Stable, High- Performance Tridiagonal Solver for GPUs. Li-Wen Chang, Wen-mei Hwu University of Illinois

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

Transfinite Interpolation Based Analysis

Fast Multipole and Related Algorithms

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Mathematical computations with GPUs

GPU Acceleration of Matrix Algebra. Dr. Ronald C. Young Multipath Corporation. fmslib.com

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

FMM implementation on CPU and GPU. Nail A. Gumerov (Lecture for CMSC 828E)

3D Helmholtz Krylov Solver Preconditioned by a Shifted Laplace Multigrid Method on Multi-GPUs

Iterative Sparse Triangular Solves for Preconditioning

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

Scan Primitives for GPU Computing

Introduction to GPGPU and GPU-architectures

Center for Computational Science

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Vector Quantization. A Many-Core Approach

Automated Finite Element Computations in the FEniCS Framework using GPUs

2009: The GPU Computing Tipping Point. Jen-Hsun Huang, CEO

Dense matching GPU implementation

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

Modern GPUs (Graphics Processing Units)

A Simulated Annealing algorithm for GPU clusters

Applications of Berkeley s Dwarfs on Nvidia GPUs

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Accelerating Polynomial Homotopy Continuation on a Graphics Processing Unit with Double Double and Quad Double Arithmetic

CS 179: GPU Programming LECTURE 5: GPU COMPUTE ARCHITECTURE FOR THE LAST TIME

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE

CUDA Architecture & Programming Model

J. Blair Perot. Ali Khajeh-Saeed. Software Engineer CD-adapco. Mechanical Engineering UMASS, Amherst

Dynamic Programming Algorithm for Generation of Optimal Elimination Trees for Multi-Frontal Direct Solver over h-refined Grids

Concurrent Manipulation of Dynamic Data Structures in OpenCL

Interpolation & Polynomial Approximation. Cubic Spline Interpolation II

Device Memories and Matrix Multiplication

PetIGA. A Framework for High Performance Isogeometric Analysis. Santa Fe, Argentina. Knoxville, United States. Thuwal, Saudi Arabia

First Swedish Workshop on Multi-Core Computing MCC 2008 Ronneby: On Sorting and Load Balancing on Graphics Processors

Multiple-Choice Test Spline Method Interpolation COMPLETE SOLUTION SET

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve

Determinant Computation on the GPU using the Condensation AMMCS Method / 1

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

CMSC 858M/AMSC 698R. Fast Multipole Methods. Nail A. Gumerov & Ramani Duraiswami. Lecture 20. Outline

Lecture 1: Introduction and Computational Thinking

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Overview of Parallel Computing. Timothy H. Kaiser, PH.D.

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms

State of Art and Project Proposals Intensive Computation

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

GPU-Accelerated Algebraic Multigrid for Commercial Applications. Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics

High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran. G. Ruetsch, M. Fatica, E. Phillips, N.

General Purpose GPU Computing in Partial Wave Analysis

ME964 High Performance Computing for Engineering Applications

THE USE of graphics processing units (GPUs) in scientific

Don t reinvent the wheel. BLAS LAPACK Intel Math Kernel Library

A GPU-based Approximate SVD Algorithm Blake Foster, Sridhar Mahadevan, Rui Wang

What is linear programming (LP)? NATCOR Convex Optimization Linear Programming 1. Solving LP problems: The standard simplex method

ANSYS Improvements to Engineering Productivity with HPC and GPU-Accelerated Simulation

GPU Concurrency: Weak Behaviours and Programming Assumptions

A MATRIX FORMULATION OF THE CUBIC BÉZIER CURVE

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.

B. Tech. Project Second Stage Report on

High performance computing and the simplex method

Overlapping Computation and Communication for Advection on Hybrid Parallel Computers

Programmable Graphics Hardware (GPU) A Primer

Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors

L10 Layered Depth Normal Images. Introduction Related Work Structured Point Representation Boolean Operations Conclusion

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Contents. I The Basic Framework for Stationary Problems 1

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms

Online Document Clustering Using the GPU

Sparse Multifrontal Performance Gains via NVIDIA GPU January 16, 2009

CUDA C Programming Mark Harris NVIDIA Corporation

The Fermi GPU and HPC Application Breakthroughs

A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection

Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement

A Parallel Implementation of the BDDC Method for Linear Elasticity

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009

CME 213 S PRING Eric Darve

High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs

Nostradamus Grand Cross. - N-body simulation!

HPC with Multicore and GPUs

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34

Transcription:

Parallel multi-frontal solver for isogeometric finite element methods on GPU Maciej Paszyński Department of Computer Science, AGH University of Science and Technology, Kraków, Poland email: paszynsk@agh.edu.pl Collaborators: Maciej Woźniak Krzysztof Kuźnik AGH University of Science and Technology, Kraków, Poland Victor Calo King Abdullah University of Science and Technology, Thuwal, Saudi Arabia David Pardo The University of the Basque Country, Basque Center for Applied Mathematics and IKERBASQUE (Basque Foundation of Science), Bilbao, Spain

AGH UNIVERSITY OF SCIENCE AND TECHNOLOGY DEPARTMENT OF COMPUTER SCIENCE, KRAKOW, POLAND

OUTLINE 1. Introduction 2. Graph grammar model for 1D Grammar productions expressing the generation of element elimination tree Grammar productions expressing the solver algorithm 3. Numerical results for 1D Execution time for p=1,2,3,4,5 Comparision with MUMPS for p=1,2,3,4,5 4. Generalization to 2D 5. C p-1 vs C 0 6. Numerical results for 2D Execution time for p=1,2,3 Comparision with MUMPS for p=1,2,3 7. Conclusions

INTRODUCTION B-SPLINES BASED FINITE ELEMENT METHOD Strong formulation d dx A x d u d x B x u0 0 d u 1 A 1 d x u1 d u d x C x u f x Weak formulation Find u V u H 1 0,1:u 0 0s.t. bv,u lv, v V v H 1 0,1:v 0 0 1 bv,u Ax d v d u d x d x B x v x d u d x C x v x u x d x v1u 1 0 lv v 1

INTRODUCTION B-SPLINES BASED FINITE ELEMENT METHOD Using B-splines as basis functions Find u V u H 1 0,1:u 0 0s.t. bv,u lv, v V v H 1 0,1:v 0 0 u i x N i,p x d v i x N j,p i bn j,p x,n i,p x a i ln j,p x, j x Linear B-splines contribution of b(n2,1;n3,1)

INTRODUCTION B-SPLINES BASED FINITE ELEMENT METHOD Using B-splines as basis functions Find u V u H 1 0,1:u 0 0s.t. bv,u lv, v V v H 1 0,1:v 0 0 u i x N i,p x d v i x N j,p i bn j,p x,n i,p x a i ln j,p x, j x Quadratic B-splines contribution of b(n2,2;n3,2)

GRAPH GRAMMAR GENERATION OF 1D ELIMINATION TREE 1D elimination tree obtained by executing productions (P1)-(P2) 2 -(P2) 2 -(P3) 6

GRAPH GRAMMAR PRODUCTIONS AS ATOMIC TASKS We assign indices to grammar productions in order to localize the places where the graph grammar productions were fired (P1)-(P2) 1 -(P2) 2 -(P2) 3 -(P2) 4 -(P3) 1 -(P3) 2 -(P3) 3 -(P3) 4 -(P3) 5 -(P3) 6

TRACE THEORY BASED SCHEDULER Alphabet: A = {(P1), (P2) 1, (P2) 2, (P2) 3, (P2) 4, (P3) 1, (P3) 2, (P3) 3, (P3) 4, (P3) 5, (P3) 6 } Dependency relation for construction of the elimination tree (P1)D{(P2) 1,(P2) 2 } (P2) 1 D{(P2) 3,(P2) 4 } (P2) 3 D{(P3) 1,(P3) 2 } (P2) 4 D{(P3) 3,(P3) 4 } (P2) 2 D{(P3) 5,(P3) 6 }

Dependency graph TRACE THEORY BASED SCHEDULER

Dependency graph TRACE THEORY BASED SCHEDULER

TRACE THEORY BASED SCHEDULER (P1)-(P2) 1 -(P2) 2 -(P2) 3 -(P2) 4 - (P3) 1 -(P3) 2 -(P3) 3 -(P3) 4 -(P3) 5 -(P3) 6 Foata Normal Form 1 1 1 2 2 2 n n n a a... a a a... a... a a... a a 1 k i 2 A k i, j l k i 1 1 k k 1,..., lk ai Ia j k 1 k 1,..., l j 1,..., l a Da k 2 (alphabet) l 1 1 2 k 1 l n i<>j where I=AxA\D i j Scheduling according to Foata Normal Form: [(P1)][(P2) 1 (P2) 2 ][(P2) 3 (P2) 4 (P3) 5 (P3) 6 ][(P3) 1 (P3) 2 (P3) 3 (P3) 4 ] Thus, the execution of the solver consists of several steps, where independent tasks are executed in concurrent, interchanged with the synchronization barriers.

GRAMMAR BASED NUMERICAL INTEGRATION b N j,p x,n i,p x 1 Ax d N x j,p d x 0 N j,p 1N i,p 1 x ln i,p N i,p 1 1 0 d N i,p x d x Bx N j,p x d N x i,p d x Cx N j,p x N i,p x d x using Gaussian quadrature the integration over the domain can be substituted by a weighted summation over Gauss points l d N x i,p A x w l A x l d x d N j,p x d x d N x i,p l d x d N j,p d x Bx d N x i,p d x x l B x l d N i,p N j,p d x x C x l N j,p x N i,p x l C x N j,p x l N i,p x d x x l N j,p x l

GRAMMAR BASED NUMERICAL INTEGRATION

GRAMMAR BASED NUMERICAL INTEGRATION

PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS Generation of frontal matrices at leaves of the eliminaton tree expressed as the execution of graph grammar productions (A1)-(A) 4 -(AN)

PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS Graph grammar productions generating local frontal matrices for left boundary, interior and right boundary nodes for linear B-splines

PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS Graph grammar productions merging element frontal matrices at parent level

PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS Graph grammar production eliminating fully assembled row at parent level

PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS Graph grammar production for merging element frontal matrices at root level Graph grammar production for solution at root level

PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS Graph grammar production for recursive backward substitution

PROCESS OF THE ELIMINATION EXPRESSED BY GRAPH GRAMMAR PRODUCTIONS Expression of the solver execution by graph grammar productions (A1)-(A) 4 -(AN) (generation of frontal matrices at leaves of the elimination trees) (A2) 3 (merging contributions at father nodes) (E2) 3 (elimination of fully assembled nodes) (A2) (E2) (merging at parent node followed by elimination) (Aroot) (Eroot) (merging at root node followed by full forward elimination) (BS) 4 (backward substitutions)

Alphabet: TRACE THEORY BASED SCHEDULER A={(A1), (A) 1, (A) 2, (A) 3, (A) 4, (AN), (A2) 1, (A2) 2, (A2) 3, (E2) 1, (E2) 2, (E2) 3, (A2) 4, (E2) 4, (Aroot), (Eroot), (BS) 1, (BS) 2, (BS) 3, (BS) 4 } Dependency relation for the solver algorithm {(A1),(A) 1 }D(A2) 1 {(A) 2,(A) 3 }D(A2) 2 {(A) 4,(AN)}D(A2) 3 (A2) 1 D(E2) 1 (A2) 2 D(E2) 2 (A2) 3 D(E2) 3 {(E2) 1,(E2) 2 }D(A2) 4 (A2) 4 D(E2) 4 {(E2) 3 (E2) 4 }D(Aroot) (Aroot)D(Eroot) (Eroot)D{(BS) 1,(BS) 2 (BS) 1 D{(BS) 3,(BS) 4 }

Dependency graph TRACE THEORY BASED SCHEDULER

Dependency graph TRACE THEORY BASED SCHEDULER

TRACE THEORY BASED SCHEDULER (A1)-(A) 1 -(A) 2 -(A) 3 -(A) 4 - (AN)-(A2) 1 -(A2) 2 - (A2) 3 -(E2) 1 -(E2) 2 -(E2) 3 - (A2) 4 - (E2) 4 - (Aroot)-(Eroot)-(BS) 1 -(BS) 2 -(BS) 3 -(BS) 4 Foata Normal Form 1 1 1 2 2 2 n n n a a... a a a... a... a a... a a 1 k i 2 A k i, j l k i 1 1 k k 1,..., lk ai Ia j k 1 k 1,..., l j 1,..., l a Da k 2 (alphabet) l 1 1 2 k 1 l n i j Scheduling according to Foata Normal Form: [(A1)(A) 1 (A) 2 (A) 3 (A) 4 (AN)][(A2) 1 (A2) 2 (A2) 3 ][(E2) 1 (E2) 2 (E2) 3 ] [(A2) 4 ][(E2) 4 ] [(Eroot)][(Aroot)][(Eroot)][(BS) 1 (BS) 2 ][(BS) 3 (BS) 4 ] Thus, the execution of the solver consists of several steps, where independent tasks are executed in concurrent, interchanged with the synchronization barriers.

GRAPH GRAMMAR PRODUCTIONS EXPRESSING THE SOLVER ALGORITHM Linear B-splines

GRAPH GRAMMAR PRODUCTIONS EXPRESSING THE SOLVER ALGORITHM Quadratic B-splines

1D NUMERICAL RESULTS LINEAR B-SPLINES NVidia Tesla c2070, 6GB memory, 448 CUDA cores, each one with 1.15GHz clock

1D NUMERICAL RESULTS QUADRATIC B-SPLINES NVidia Tesla c2070, 6GB memory, 448 CUDA cores, each one with 1.15GHz clock

1D NUMERICAL RESULTS CUBIC B-SPLINES NVidia Tesla c2070, 6GB memory, 448 CUDA cores, each one with 1.15GHz clock

1D NUMERICAL RESULTS QINTIC B-SPLINES NVidia Tesla c2070, 6GB memory, 448 CUDA cores, each one with 1.15GHz clock

COMPARISON WITH CPU MUMPS SOLVER NVidia Tesla c2070, 6GB memory, 448 CUDA cores, each one with 1.15GHz clock Intel(R) Core(TM)2 Quad CPU Q9400 with 2.66GHz clock, 8GB of memory

GENERATION OF 2D ELIMINATION TREE

GENERATION OF 2D ELIMINATION TREE

GENERATION OF 2D ELIMINATION TREE

GENERATION OF 2D ELIMINATION TREE

GENERATION OF 2D ELIMINATION TREE

GENERATION OF 2D ELIMINATION TREE

GENERATION OF 2D ELIMINATION TREE

2D NUMERICAL INTEGRATION

2D ELIMINATION

2D ELIMINATION

2D ELIMINATION

Size of the top dense problem O(N 0.5 ) C 0 COST Cost of the sequential dense solver O(M 3 ) Computational cost of sequential C 0 solver O((N 0.5 ) 3 )=O(N 3/2 ) Cost of the parallel dense solver O(M 2 ) Computational cost of sequential C 0 solver O((N 0.5 ) 2 )=O(N)

Size of the top dense problem O(pN 0.5 ) C p-1 COST Cost of the sequential dense solver O(M 3 ) Computational cost of sequential C 0 solver O((pN 0.5 ) 3 )=O(p 3 N 3/2 ) Cost of the parallel dense solver O(N 2 ) Computational cost of sequential C 0 solver O((pN 0.5 ) 2 )=O(p 2 N)

2D NUMERICAL RESULTS LINEAR B-SPLINES GeForce GTX 780 with 2304 cores, 3GB of memory

2D NUMERICAL RESULTS QUADRATIC B-SPLINES GeForce GTX 780 with 2304 cores, 3GB of memory

2D NUMERICAL RESULTS CUBIC B-SPLINES GeForce GTX 780 with 2304 cores, 3GB of memory

CONCLUSIONS We have developed the formal graph grammar model for B-splines based finite element method computations in one and two dimensions. The multi-frontal direct solver algorithm has been expressed by basic undividable tasks, and the partial order of execution has been expressed by a dependency graph The tasks can be scheduled in a sequence of sets with independent tasks based on the coloring of the dependency graph The model allows for an efficient implementation of the generation and solution of the problem in shared memory architecture The isogeometric shared memory C p-1 continuity parallel direct solver scales like O(p 2 log(n/p)) for one dimensional problems, and like O(Np 2 ) for two dimensional problems