Scheduling Strategies for Parallel Sparse Backward/Forward Substitution


Scheduling Strategies for Parallel Sparse Backward/Forward Substitution

J. I. Aliaga, M. Bollhöfer, A. F. Martín, E. S. Quintana-Ortí
Department of Computer Science and Engineering, Univ. Jaume I (Spain), {aliaga,martina,quintana}@icc.uji.es
Institute of Computational Mathematics, TU Braunschweig (Germany), m.bollhoefer@tu-braunschweig.de
PARA'08, Trondheim, May 2008

Motivation and Introduction

Many numerical applications require the solution of LARGE and SPARSE linear systems, hence preconditioned iterative solvers. ILUPACK is a (serial) numerical package to solve Ax = b:
- Incomplete LU decompositions (ILU): A ≈ M = LU
- Preconditioned Krylov solvers: solve M^{-1} A x = M^{-1} b

Motivation and Introduction

Mid-term goal: develop a parallel package to solve Ax = b on shared-memory multiprocessors using ILUPACK techniques. Already done: parallel ILU preconditioners for s.p.d. systems.

Focus: the parallel forward substitution (PFS) and backward substitution (PBS) stages of the iterative solution of the linear system.

Preconditioned Krylov solver:
for j = 0, 1, 2, ..., until convergence do
  ...
  Solve L y_j = b_j
  Solve U x_j = y_j
  ...
end for
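The two triangular solves above are the kernels this talk parallelizes. As a point of reference, a minimal serial sketch of both kernels and of one preconditioner application follows; the csr_t type and all routine names are illustrative assumptions, not ILUPACK's actual API.

```c
/* Minimal serial sketch of sparse forward and backward substitution for
 * CSR-stored factors, plus one preconditioner application.  Types and
 * names are assumptions for illustration, not ILUPACK's actual API. */

typedef struct {
    int n;          /* number of rows                        */
    int *rowptr;    /* row pointers, size n + 1              */
    int *colind;    /* column indices, sorted within a row   */
    double *val;    /* numerical values                      */
} csr_t;

/* Solve L y = b, with L lower triangular in CSR and the diagonal
 * stored as the last entry of each row. */
void forward_subst(const csr_t *L, const double *b, double *y)
{
    for (int i = 0; i < L->n; i++) {
        double s = b[i];
        int k;
        for (k = L->rowptr[i]; k < L->rowptr[i + 1] - 1; k++)
            s -= L->val[k] * y[L->colind[k]];
        y[i] = s / L->val[k];       /* divide by the diagonal entry */
    }
}

/* Solve U x = y, with U upper triangular in CSR and the diagonal
 * stored as the first entry of each row. */
void backward_subst(const csr_t *U, const double *y, double *x)
{
    for (int i = U->n - 1; i >= 0; i--) {
        double s = y[i];
        for (int k = U->rowptr[i] + 1; k < U->rowptr[i + 1]; k++)
            s -= U->val[k] * x[U->colind[k]];
        x[i] = s / U->val[U->rowptr[i]];
    }
}

/* One application of the preconditioner M = LU: z = U^{-1} (L^{-1} r). */
void apply_precond(const csr_t *L, const csr_t *U,
                   const double *r, double *z, double *work)
{
    forward_subst(L, r, work);     /* solve L w = r  (forward)  */
    backward_subst(U, work, z);    /* solve U z = w  (backward) */
}
```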

Outline

1. Motivation and Introduction
2. Parallel ILU Preconditioners: Data Decomposition; Parallel ILU Computations; Parallel ILU Execution
3. Parallel Forward/Backward Substitution: PFS Computations; PBS Computations; PFS and PBS Task Mapping; PFS and PBS Task Scheduling
4. Experimental Results
5. Conclusions

Parallel ILU Preconditioners: Data Decomposition

[Figure: sparsity pattern of A under the natural ordering vs. the multilevel nested dissection (MLND) ordering. The MLND ordering induces a task tree with leaves (1,1), (1,2), (1,3), (1,4), inner tasks (2,1), (2,2) and root (3,1), which mirrors the elimination tree.]

Parallel ILU Preconditioners: Data Decomposition

The task tree yields a block partitioning of A.

[Figure: the blocks of A associated with tasks (1,1) through (1,4), (2,1), (2,2) and (3,1).]

How does our approach decompose A?

Parallel ILU Preconditioners: Data Decomposition

The path from task (1,i) to the root maps A^{(1,i)} into A:

\[
A = M^{(1,1)} A^{(1,1)} (M^{(1,1)})^T + M^{(1,2)} A^{(1,2)} (M^{(1,2)})^T + M^{(1,3)} A^{(1,3)} (M^{(1,3)})^T + M^{(1,4)} A^{(1,4)} (M^{(1,4)})^T.
\]

Parallel ILU Preconditioners: Parallel ILU Computations

First-level tasks compute ILUPACK partial ILUs in parallel:

\[
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}^{(1,i)}
\approx
\begin{bmatrix} L_{11} & & \\ L_{21} & I & \\ L_{31} & & I \end{bmatrix}^{(1,i)}
\begin{bmatrix} U_{11} & U_{12} & U_{13} \\ & S_{22} & S_{23} \\ & S_{32} & S_{33} \end{bmatrix}^{(1,i)},
\qquad i = 1, \dots, 4.
\]

Parallel ILU Preconditioners: Parallel ILU Computations

Second-level tasks merge the Schur complements of their children:

\[
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{(2,i)}
=
\begin{bmatrix} S_{22} & S_{23} \\ S_{32} & S_{33} \end{bmatrix}^{(1,2i-1)}
+
\begin{bmatrix} S_{22} & S_{23} \\ S_{32} & S_{33} \end{bmatrix}^{(1,2i)},
\qquad i = 1, 2.
\]

Parallel ILU Preconditioners: Parallel ILU Computations

Second-level tasks compute ILUPACK partial ILUs in parallel:

\[
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{(2,i)}
\approx
\begin{bmatrix} L_{11} & \\ L_{21} & I \end{bmatrix}^{(2,i)}
\begin{bmatrix} U_{11} & U_{12} \\ & S_{22} \end{bmatrix}^{(2,i)},
\qquad i = 1, 2.
\]

Parallel ILU Preconditioners: Parallel ILU Computations

The root task merges the Schur complements of its children:

\[ A^{(3,1)} = S^{(2,1)} + S^{(2,2)}. \]

Parallel ILU Preconditioners: Parallel ILU Computations

Finally, the root task completes the parallel ILU:

\[ A^{(3,1)} \approx L^{(3,1)} U^{(3,1)}. \]

Parallel ILU Preconditioners: Parallel ILU Execution

The task trees are constructed before the parallel ILU commences. The execution is scheduled via a dynamic load-balancing strategy that:
- always prioritizes leaves over inner tasks;
- among leaves, prioritizes those with higher estimated cost.

The parallel execution results in a mapping of tasks to processors.

[Figure: a task tree with f leaves executed on p = 4 processors; each task T1, T2, ... is annotated with the processor that executed it.]

Remark: excellent results on shared-memory multiprocessors.

Parallel Forward/Backward Substitution

Solve Ly = b and Ux = y, respectively, where L and U are the sparse triangular factors obtained from the parallel multilevel ILU of A. Assume b, y and x are split conformally with A:

\[ b = M^{(1,1)} b^{(1,1)} + M^{(1,2)} b^{(1,2)} + M^{(1,3)} b^{(1,3)} + M^{(1,4)} b^{(1,4)}. \]
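In code, the operators M^{(1,i)} need not be formed explicitly: they can be realized as index maps between task-local and global vector entries. A minimal sketch under that assumption follows; the map array describing the layout is illustrative, not ILUPACK's actual data structure.

```c
/* Sketch of the selection operators M^{(1,i)} realized as index maps;
 * map[j] is assumed to hold the global index of local entry j (an
 * illustrative layout, not ILUPACK's actual one). */

/* b^{(1,i)} = (M^{(1,i)})^T b : gather the task-local part. */
void gather(const double *b_global, const int *map, int n_loc,
            double *b_local)
{
    for (int j = 0; j < n_loc; j++)
        b_local[j] = b_global[map[j]];
}

/* b += M^{(1,i)} b^{(1,i)} : scatter-add the local contribution back. */
void scatter_add(const double *b_local, const int *map, int n_loc,
                 double *b_global)
{
    for (int j = 0; j < n_loc; j++)
        b_global[map[j]] += b_local[j];
}
```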

Parallel Forward Substitution: PFS Computations

First-level tasks perform partial forward substitutions in parallel:

\[
\begin{bmatrix} L_{11} & & \\ L_{21} & I & \\ L_{31} & & I \end{bmatrix}^{(1,i)}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}^{(1,i)}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}^{(1,i)},
\qquad i = 1, \dots, 4:
\]

1. Solve L_{11}^{(1,i)} y_1^{(1,i)} = b_1^{(1,i)}  [SpTR: forward substitution]
2. Update \[ \begin{bmatrix} \hat b_2 \\ \hat b_3 \end{bmatrix}^{(1,i)} = \begin{bmatrix} b_2 \\ b_3 \end{bmatrix}^{(1,i)} - \begin{bmatrix} L_{21} \\ L_{31} \end{bmatrix}^{(1,i)} y_1^{(1,i)} \]  [SpMxV]
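A sketch of one such first-level task, reusing the csr_t type and forward_subst routine from the earlier sketch; the block names, and the assumption that the coupling blocks [L21; L31] are stacked into a single CSR matrix, are illustrative.

```c
/* Sketch of one first-level PFS task (1,i).  L11 is the local triangular
 * block; Lc stacks the coupling blocks [L21; L31] row-wise; bhat holds
 * [b2; b3] on entry and the updated right-hand side on exit. */
void pfs_leaf_task(const csr_t *L11, const csr_t *Lc,
                   const double *b1, double *y1, double *bhat)
{
    /* 1) SpTR: solve L11 * y1 = b1 (local forward substitution). */
    forward_subst(L11, b1, y1);

    /* 2) SpMxV: bhat <- bhat - Lc * y1, i.e. remove the contribution
     *    of y1 from the right-hand side rows owned by the ancestors. */
    for (int i = 0; i < Lc->n; i++)
        for (int k = Lc->rowptr[i]; k < Lc->rowptr[i + 1]; k++)
            bhat[i] -= Lc->val[k] * y1[Lc->colind[k]];
}
```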

Parallel Forward Substitution: PFS Computations

Second-level tasks merge the updates resulting from their children:

\[
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}^{(2,i)}
=
\begin{bmatrix} \hat b_2 \\ \hat b_3 \end{bmatrix}^{(1,2i-1)}
+
\begin{bmatrix} \hat b_2 \\ \hat b_3 \end{bmatrix}^{(1,2i)},
\qquad i = 1, 2.
\]

Parallel Forward Substitution: PFS Computations

Second-level tasks perform partial forward substitutions in parallel:

\[
\begin{bmatrix} L_{11} & \\ L_{21} & I \end{bmatrix}^{(2,i)}
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}^{(2,i)}
=
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}^{(2,i)},
\qquad i = 1, 2:
\]

1. Solve L_{11}^{(2,i)} y_1^{(2,i)} = b_1^{(2,i)}  [SpTR: forward substitution]
2. Update \hat b_2^{(2,i)} = b_2^{(2,i)} - L_{21}^{(2,i)} y_1^{(2,i)}  [SpMxV]

Parallel Forward Substitution: PFS Computations

The root task merges the updates resulting from its children:

\[ b^{(3,1)} = \hat b^{(2,1)} + \hat b^{(2,2)}. \]

Parallel Forward Substitution: PFS Computations

The root task completes the parallel forward substitution:

Solve L^{(3,1)} y^{(3,1)} = b^{(3,1)}  [SpTR: forward substitution]

Parallel Backward Substitution: PBS Computations

The root task starts the parallel backward substitution:

Solve U^{(3,1)} x^{(3,1)} = y^{(3,1)}  [SpTR: backward substitution]

Parallel Backward Substitution: PBS Computations

The root task provides copies of x^{(3,1)} to its children (2,1) and (2,2).

Parallel Backward Substitution: PBS Computations

Second-level tasks perform partial backward substitutions in parallel:

\[
\begin{bmatrix} U_{11} & U_{12} \\ & I \end{bmatrix}^{(2,i)}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^{(2,i)}
=
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}^{(2,i)},
\qquad i = 1, 2,
\]

where x_2^{(2,i)} holds the copy of x^{(3,1)} received from the root:

1. Update \hat y_1^{(2,i)} = y_1^{(2,i)} - U_{12}^{(2,i)} x_2^{(2,i)}  [SpMxV]
2. Solve U_{11}^{(2,i)} x_1^{(2,i)} = \hat y_1^{(2,i)}  [SpTR: backward substitution]

Parallel Backward Substitution: PBS Computations

Second-level tasks provide copies to their children: task (2,1) passes (x_1^{(2,1)}, x_2^{(2,1)}) to (1,1) and (1,2), and task (2,2) passes (x_1^{(2,2)}, x_2^{(2,2)}) to (1,3) and (1,4).

Parallel Backward Substitution: PBS Computations

First-level tasks compute partial backward substitutions in parallel:

\[
\begin{bmatrix} U_{11} & U_{12} & U_{13} \\ & I & \\ & & I \end{bmatrix}^{(1,i)}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}^{(1,i)}
=
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}^{(1,i)},
\qquad i = 1, \dots, 4:
\]

1. Update \hat y_1^{(1,i)} = y_1^{(1,i)} - \begin{bmatrix} U_{12} & U_{13} \end{bmatrix}^{(1,i)} \begin{bmatrix} x_2 \\ x_3 \end{bmatrix}^{(1,i)}  [SpMxV]
2. Solve U_{11}^{(1,i)} x_1^{(1,i)} = \hat y_1^{(1,i)}  [SpTR: backward substitution]
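A sketch of one first-level PBS task, mirroring the PFS sketch above; again, stacking [U12 U13] into one CSR matrix and the routine names are assumptions for illustration.

```c
/* Sketch of one first-level PBS task (1,i).  U11 is the local upper
 * triangular block, Uc stacks [U12 U13] column-wise, and xanc holds the
 * copies [x2; x3] received from the ancestors. */
void pbs_leaf_task(const csr_t *U11, const csr_t *Uc,
                   const double *y1, const double *xanc, double *x1,
                   double *yhat)
{
    /* 1) SpMxV: yhat <- y1 - Uc * xanc. */
    for (int i = 0; i < Uc->n; i++) {
        double s = y1[i];
        for (int k = Uc->rowptr[i]; k < Uc->rowptr[i + 1]; k++)
            s -= Uc->val[k] * xanc[Uc->colind[k]];
        yhat[i] = s;
    }

    /* 2) SpTR: solve U11 * x1 = yhat (local backward substitution). */
    backward_subst(U11, yhat, x1);
}
```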

PFS and PBS Task Mapping

How should the tasks be distributed among the processors? There is a wide range of solutions:
- Redistribute the tasks for each PFS and PBS execution (dynamic load balancing, ...).
- Maintain the mapping resulting from the parallel ILU for the whole solution process.

PFS and PBS Task Mapping

Redistribute the tasks for each PFS and PBS execution:
- May entail successively moving the data structures, on cc-NUMA and even with cc-UMA or multicore processors.
- Our experimental analysis reveals that this data movement outweighs the other advantages of redistributing: the SpMxV and SpTR kernels exhibit very restricted temporal locality.

PFS and PBS Task Mapping

Maintain the mapping resulting from the parallel ILU:
- It can provide acceptable solutions if there are only moderate variations between the relative costs of the tasks in the parallel ILU and those of the same tasks in the PFS and the PBS.

[Figure: pie charts of the relative cost of each task in the parallel ILU vs. the PFS. Are the variations moderate?]

We consider the mapping problem for s.p.d. matrices, for which the costs of the PFS and the PBS are closely similar.

PFS and PBS Task Mapping

Maintain the mapping resulting from the parallel ILU. For each task T_i we define the relative cost ratio as

\[ r_i^{PF} = \frac{\text{relative cost of } T_i \text{ in the ILU}}{\text{relative cost of } T_i \text{ in the FS}}, \]

so that values of r_i^{PF} close to 1 indicate moderate variations (see the sketch below).
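A small sketch of this computation: each task's cost is normalized by the total cost of its phase before the two relative costs are divided. The per-task cost arrays are assumed to be measured or estimated elsewhere.

```c
/* Sketch of the relative cost ratio r_i^{PF}.  cost_ilu and cost_fs are
 * assumed per-task cost estimates for the ILU and FS phases. */
void cost_ratios(const double *cost_ilu, const double *cost_fs,
                 int ntasks, double *r)
{
    double tot_ilu = 0.0, tot_fs = 0.0;
    for (int i = 0; i < ntasks; i++) {
        tot_ilu += cost_ilu[i];
        tot_fs  += cost_fs[i];
    }
    for (int i = 0; i < ntasks; i++)
        r[i] = (cost_ilu[i] / tot_ilu) / (cost_fs[i] / tot_fs);
    /* r[i] close to 1 for all i  =>  the ILU mapping transfers well. */
}
```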

PFS and PBS Task Mapping: What do we get?

[Figure: relative cost ratio r_i^{PF} (in %) vs. task identifier i for the G_circuit matrix, distinguishing leaf tasks from inner tasks.]

Task identifiers are assigned in descending order of relative cost.

PFS and PBS Task Scheduling: PFS Task Scheduling

For the task scheduling of the PFS:
- A thread can only execute tasks mapped to it.
- Threads always prioritize leaves over inner tasks.
- Among leaves, threads prioritize those with higher nnz(L^{(1,i)}).
- Initially, we provide the leaves to their corresponding threads.
- When a thread completes a task, it checks the dependencies of the parent task and, if they are resolved, provides the parent task to the corresponding thread (see the sketch below).
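A sketch of the last rule, using a per-task dependency counter; the task type, the queue_push routine and the use of C11 atomics are assumptions for illustration, not ILUPACK's actual implementation.

```c
/* Dependency-driven hand-off for the PFS under a fixed task-to-thread
 * mapping: the thread that resolves the last dependency of a parent
 * task enqueues it to its owner thread. */
#include <stdatomic.h>

typedef struct task {
    struct task *parent;   /* NULL for the root task          */
    atomic_int pending;    /* children not yet completed      */
    int owner;             /* thread the task is mapped to    */
    double nnz_L;          /* key to prioritize among leaves  */
} task_t;

/* Assumed available: per-thread priority queue keyed by nnz_L. */
void queue_push(int thread, task_t *t);

/* Called by a thread right after it completes task t. */
void task_completed(task_t *t)
{
    task_t *p = t->parent;
    if (p == NULL)
        return;                       /* root: the PFS is finished */
    /* Resolve one dependency of the parent; whoever resolves the
     * last one provides the parent to its owner thread's queue.   */
    if (atomic_fetch_sub(&p->pending, 1) == 1)
        queue_push(p->owner, p);
}
```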

PFS and PBS Task Scheduling: PBS Task Scheduling

Threads always prioritize inner tasks over leaves.

[Figure: task tree annotated with the task-to-thread mapping inherited from the parallel ILU.]

PFS and PBS Task Scheduling: PBS Task Scheduling

The PBS execution uncovers some pitfalls.

[Figure: execution of the task tree illustrating the pitfall.]

PFS and PBS Task Scheduling: PBS Task Scheduling

The thread resolving the inner task becomes responsible for it.

[Figure: the task tree with the responsible thread annotated on each inner task.]

PFS and PBS Task Scheduling: PBS Task Scheduling

We allow some flexibility in the mapping of inner tasks.

[Figure: two mappings of the same task tree, before and after relaxing the mapping of the inner tasks.]

Experimental Results: Experimental Framework

- SGI Altix 350 CC-NUMA shared-memory multiprocessor: 8 nodes, 2 processors per node (Intel Itanium 2), 32 GBytes of RAM shared via an SGI NUMAlink interconnect.
- Intel compiler with OpenMP 2.5 compliance; -O3 optimization level.
- One thread bound per physical processor; whenever possible, one thread per node.
- IEEE double-precision arithmetic.

Experimental Results: Benchmark Matrices

Benchmark matrices from the UF sparse matrix collection (row/column and nonzero counts omitted here):
- M1: GHS_psdef/bmwcra_1
- M2: Wissgott/parabolic_fem
- M3: Schmid/thermal
- M4: AMD/G_circuit

Experimental Results

- p = 1, 2, 4, 8, 16 processors, and f = p, 2p, 4p leaves.
- Average parallel time in ms for executions with the mapping resulting from the same parallel ILU.
- Speed-up measured with respect to the parallel algorithm executing the same task tree on a single processor.
- Different values of f lead to different task trees.
- p = 1 / f = 1 refers to the ILUPACK serial routines.

Experimental Results

[Tables: average time T (in ms) and speed-up Sp of the PFS and the PBS for the benchmark matrices, including M4 (G_circuit), for each combination of f and p.]

Conclusions

- We have presented two parallel algorithms to compute the forward and backward substitution stages of the iterative solution of sparse linear systems on shared-memory multiprocessors.
- The mapping resulting from the parallel ILU provides acceptable solutions for the PFS and the PBS.
- The task scheduling strategies take care of some pitfalls that can significantly hurt the performance attained by the PBS.
- Remarkable performance reported on a CC-NUMA platform with 16 processors.

Questions?
