Scheduling Strategies for Parallel Sparse Backward/Forward Substitution
1 Scheduling Strategies for Parallel Sparse Backward/Forward Substitution. J.I. Aliaga, M. Bollhöfer, A.F. Martín, E.S. Quintana-Ortí. Department of Computer Science and Engineering, Univ. Jaume I (Spain), {aliaga,martina,quintana}@icc.uji.es. Institute of Computational Mathematics, TU Braunschweig (Germany), m.bollhoefer@tu-braunschweig.de. May 2008. J. I. Aliaga et al., PARA'08 @ Trondheim.
2 Motivation and Introduction. Many numerical applications require the solution of LARGE and SPARSE linear systems, hence preconditioned iterative solvers. ILUPACK is a (serial) numerical package to solve Ax = b: incomplete LU decompositions (ILU), A ≈ M = LU, and preconditioned Krylov solvers, which solve M^-1 A x = M^-1 b.
3 Motivation and Introduction. Mid-term goal: develop a parallel package to solve Ax = b on shared-memory multiprocessors using ILUPACK techniques. Already available: parallel ILU preconditioners for s.p.d. systems. Focus: the parallel forward substitution (PFS) and parallel backward substitution (PBS) stages of the iterative solution of the linear system. Preconditioned Krylov solver: for j = 1, 2, ..., until convergence do ... Solve L y_j = b_j; Solve U x_j = y_j ... end for.
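Each iteration of the Krylov loop above applies the preconditioner through one forward solve (L y_j = b_j) and one backward solve (U x_j = y_j). A minimal dense sketch of those two kernels (illustration only, not ILUPACK code, which works on sparse factors):

```python
def forward_substitution(L, b):
    """Solve L y = b for lower-triangular L, one row at a time."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y

def backward_substitution(U, y):
    """Solve U x = y for upper-triangular U, last row first."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x
```

The data dependencies visible in the loops (row i needs all earlier, respectively later, entries of the solution) are exactly what makes these stages hard to parallelize for sparse factors.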
4 Outline. 1 Motivation and Introduction. 2 Parallel ILU Preconditioners: Data Decomposition; Parallel ILU Computations; Parallel ILU Execution. 3 Parallel Forward/Backward Substitution: PFS Computations; PBS Computations; PFS and PBS Task Mapping; PFS and PBS Task Scheduling. 4 Experimental Results. 5 Conclusions.
5 Parallel ILU Preconditioners: Data Decomposition. Natural ordering vs. MLND ordering. [Figure: task tree with leaf tasks (1,1)-(1,4), inner tasks (2,1), (2,2) and root (3,1), together with the corresponding elimination tree.]
6 Parallel ILU Preconditioners: Data Decomposition. The task tree yields a block partitioning of A. [Figure: sparsity pattern of the reordered matrix A and the block partitioning induced by the task tree.] How does our approach decompose A?
7 Parallel ILU Preconditioners: Data Decomposition. The path from task (1,i) to the root maps the local block A^(1,i) into A:
A = M^(1,1) A^(1,1) (M^(1,1))^T + M^(1,2) A^(1,2) (M^(1,2))^T + M^(1,3) A^(1,3) (M^(1,3))^T + M^(1,4) A^(1,4) (M^(1,4))^T.
[Figure: mapping of A^(1,1) into A via M^(1,1).]
8 Parallel ILU Preconditioners: Data Decomposition. The same decomposition, illustrated for task (1,4). [Figure: mapping of A^(1,4) into A via M^(1,4).]
9 Parallel ILU Preconditioners: Parallel ILU Computations. First-level tasks compute ILUPACK partial ILUs in parallel:
[ A_11 A_12 A_13 ; A_21 A_22 A_23 ; A_31 A_32 A_33 ]^(1,i) ≈ [ L_11 0 0 ; L_21 I 0 ; L_31 0 I ]^(1,i) [ U_11 U_12 U_13 ; 0 S_22 S_23 ; 0 S_32 S_33 ]^(1,i), where i = 1, ..., 4.
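The partial factorization above eliminates only the leading diagonal block and leaves a Schur complement to be passed up the tree. A dense miniature of that step, under the assumption of an exact in-place LU without pivoting (ILUPACK's actual kernel is an incomplete LU with dropping):

```python
def partial_factor(A, k):
    """LU-factor the leading k x k block of the dense matrix A in place
    (no pivoting) and return the Schur complement of the trailing block."""
    n = len(A)
    for p in range(k):                        # eliminate the first k columns
        for i in range(p + 1, n):
            A[i][p] /= A[p][p]                # multiplier, stored in the L part
            for j in range(p + 1, n):
                A[i][j] -= A[i][p] * A[p][j]
    # the trailing (n-k) x (n-k) block now holds S = A22 - A21 * inv(A11) * A12
    return [row[k:] for row in A[k:]]
```

The returned block plays the role of S^(1,i): it is the only data a leaf task must hand to its parent.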
10 Parallel ILU Preconditioners: Parallel ILU Computations. Second-level tasks merge the Schur complements of their children:
[ A_11 A_12 ; A_21 A_22 ]^(2,i) = [ S_22 S_23 ; S_32 S_33 ]^(1,2i-1) + [ S_22 S_23 ; S_32 S_33 ]^(1,2i), where i = 1, 2.
11 Parallel ILU Preconditioners: Parallel ILU Computations. Second-level tasks compute ILUPACK partial ILUs in parallel:
[ A_11 A_12 ; A_21 A_22 ]^(2,i) ≈ [ L_11 0 ; L_21 I ]^(2,i) [ U_11 U_12 ; 0 S_22 ]^(2,i), where now i = 1, 2.
12 Parallel ILU Preconditioners: Parallel ILU Computations. The root task merges the Schur complements of its children: A^(3,1) = S^(2,1) + S^(2,2).
13 Parallel ILU Preconditioners: Parallel ILU Computations. Finally, the root task completes the parallel ILU: A^(3,1) ≈ L^(3,1) U^(3,1).
14 Parallel ILU Preconditioners: Parallel ILU Execution. The task trees are constructed before the parallel ILU commences. The execution is scheduled via a dynamic load-balancing strategy: it always prioritizes leaves over inner tasks and, among leaves, prioritizes those with higher estimated cost. The parallel execution results in a mapping of tasks to processors. [Figure: example execution of a task tree (tasks T1, T2, ...) and the resulting task-to-processor mapping for p = 4 processors.] Remark: excellent results on shared-memory multiprocessors.
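The priority rule stated above can be sketched with a heap; the task names and cost estimates below are hypothetical, and this is only an illustration of the ordering, not ILUPACK's actual scheduler:

```python
import heapq

def schedule_order(tasks):
    """tasks: list of (name, is_leaf, est_cost) tuples.
    Return the order in which a single worker would pick them:
    leaves always outrank inner tasks; among leaves, higher
    estimated cost wins."""
    # heapq pops the smallest tuple, so encode the rules by negation:
    # inner tasks sort after leaves, and costs sort descending.
    heap = [(not is_leaf, -cost, name) for name, is_leaf, cost in tasks]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

In the real scheduler an inner task only becomes eligible once both of its children have finished; the heap shown here captures just the priority among currently ready tasks.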
15 Parallel Forward/Backward Substitution. Solve Ly = b and Ux = y, respectively, where L and U are the sparse triangular factors obtained from the parallel multilevel ILU of A. Assume we split b, y, x according to A: b = M^(1,1) b^(1,1) + M^(1,2) b^(1,2) + M^(1,3) b^(1,3) + M^(1,4) b^(1,4).
16 Parallel Forward Substitution: PFS Computations. First-level tasks perform partial forward substitutions in parallel:
[ L_11 0 0 ; L_21 I 0 ; L_31 0 I ]^(1,i) [ y_1 ; b̂_2 ; b̂_3 ]^(1,i) = [ b_1 ; b_2 ; b_3 ]^(1,i), i = 1, ..., 4:
1) Solve L_11^(1,i) y_1^(1,i) = b_1^(1,i) -- SpTR (forward substitution)
2) Update [ b̂_2 ; b̂_3 ]^(1,i) = [ b_2 ; b_3 ]^(1,i) - [ L_21 ; L_31 ]^(1,i) y_1^(1,i) -- SpMxV
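The two kernels of a leaf task, SpTR followed by SpMxV, in dense miniature (an illustrative sketch with dense blocks standing in for the sparse blocks of the slide; `leaf_forward_task` is a name introduced here):

```python
def leaf_forward_task(L11, L21, b1, b2):
    """Solve L11 y1 = b1 (the SpTR kernel), then compute the update
    b2_hat = b2 - L21 y1 (the SpMxV kernel) that the parent task
    will accumulate."""
    n = len(b1)
    y1 = [0.0] * n
    for i in range(n):                                   # forward solve
        s = sum(L11[i][j] * y1[j] for j in range(i))
        y1[i] = (b1[i] - s) / L11[i][i]
    b2_hat = [b2[i] - sum(L21[i][j] * y1[j] for j in range(n))
              for i in range(len(b2))]                   # update for parent
    return y1, b2_hat
```

Only `b2_hat` crosses the tree edge; `y1` stays local until the backward sweep returns to this task.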
17 Parallel Forward Substitution: PFS Computations. Second-level tasks merge the updates resulting from their children:
[ b_1 ; b_2 ]^(2,i) = [ b̂_2 ; b̂_3 ]^(1,2i-1) + [ b̂_2 ; b̂_3 ]^(1,2i), where i = 1, 2.
18 Parallel Forward Substitution: PFS Computations. Second-level tasks perform partial forward substitutions in parallel:
[ L_11 0 ; L_21 I ]^(2,i) [ y_1 ; b̂_2 ]^(2,i) = [ b_1 ; b_2 ]^(2,i), i = 1, 2:
1) Solve L_11^(2,i) y_1^(2,i) = b_1^(2,i) -- SpTR (forward substitution)
2) Update b̂_2^(2,i) = b_2^(2,i) - L_21^(2,i) y_1^(2,i) -- SpMxV
19 Parallel Forward Substitution: PFS Computations. The root task merges the updates resulting from its children: b^(3,1) = b̂^(2,1) + b̂^(2,2).
20 Parallel Forward Substitution: PFS Computations. The root task completes the parallel forward substitution: Solve L^(3,1) y^(3,1) = b^(3,1) -- SpTR (forward substitution).
21 Parallel Backward Substitution: PBS Computations. The root task starts the parallel backward substitution: Solve U^(3,1) x^(3,1) = y^(3,1) -- SpTR (backward substitution).
22 Parallel Backward Substitution: PBS Computations. The root task provides copies of x^(3,1) to its children (2,1) and (2,2).
23 Parallel Backward Substitution: PBS Computations. Second-level tasks perform partial backward substitutions in parallel:
[ U_11 U_12 ; 0 I ]^(2,i) [ x_1 ; x_2 ]^(2,i) = [ y_1 ; y_2 ]^(2,i), i = 1, 2:
1) Update ŷ_1^(2,i) = y_1^(2,i) - U_12^(2,i) x_2^(2,i) -- SpMxV
2) Solve U_11^(2,i) x_1^(2,i) = ŷ_1^(2,i) -- SpTR (backward substitution)
24 Parallel Backward Substitution: PBS Computations. Second-level tasks provide copies to their children: task (2,1) passes (x^(2,1), x^(3,1)) to tasks (1,1) and (1,2), and task (2,2) passes (x^(2,2), x^(3,1)) to tasks (1,3) and (1,4).
25 Parallel Backward Substitution: PBS Computations. First-level tasks compute partial backward substitutions in parallel:
[ U_11 U_12 U_13 ; 0 I 0 ; 0 0 I ]^(1,i) [ x_1 ; x_2 ; x_3 ]^(1,i) = [ y_1 ; y_2 ; y_3 ]^(1,i), i = 1, ..., 4:
1) Update ŷ_1^(1,i) = y_1^(1,i) - [ U_12 U_13 ]^(1,i) [ x_2 ; x_3 ]^(1,i) -- SpMxV
2) Solve U_11^(1,i) x_1^(1,i) = ŷ_1^(1,i) -- SpTR (backward substitution)
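A leaf's backward step mirrors the forward one with the kernel order reversed: first the SpMxV update with the ancestor solution blocks, then the SpTR solve. A dense miniature for the two-block case (`leaf_backward_task` is a name introduced here, and dense blocks again stand in for sparse ones):

```python
def leaf_backward_task(U11, U12, y1, x2):
    """Update y1_hat = y1 - U12 x2 (SpMxV), then solve
    U11 x1 = y1_hat (SpTR, backward substitution)."""
    n = len(y1)
    y1_hat = [y1[i] - sum(U12[i][j] * x2[j] for j in range(len(x2)))
              for i in range(n)]                         # update with parent data
    x1 = [0.0] * n
    for i in range(n - 1, -1, -1):                       # backward solve
        s = sum(U11[i][j] * x1[j] for j in range(i + 1, n))
        x1[i] = (y1_hat[i] - s) / U11[i][i]
    return x1
```

Note the data-flow asymmetry with the PFS: here the task must wait for `x2` to arrive from its ancestors before it can start, which is why the PBS traverses the tree from the root down.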
26 PFS and PBS Task Mapping. How should the tasks be distributed among the processors? There is a wide range of solutions: redistribute the tasks for each PFS and PBS execution (dynamic load balancing, ...), or maintain the mapping resulting from the parallel ILU for the whole solution process.
27 PFS and PBS Task Mapping. Redistribute the tasks for each PFS and PBS execution: this may entail successively moving the data structures, on cc-NUMA and even on cc-UMA or multicore processors. Our experimental analysis reveals that this data movement outweighs the other advantages of redistributing, since the SpMxV and SpTR kernels exhibit very restricted temporal locality.
28 PFS and PBS Task Mapping. Maintain the mapping resulting from the parallel ILU: it can provide acceptable solutions if there are only moderate variations between the relative costs of the tasks of the parallel ILU and those of the tasks of the PFS and the PBS. [Figure: relative cost (%) of each task of the parallel ILU.] We consider the mapping problem for s.p.d. matrices, for which the costs of the PFS and the PBS are closely similar.
29 PFS and PBS Task Mapping. [Figure: relative task costs (%) of the parallel ILU vs. those of the PFS.] Moderate variations?
30 PFS and PBS Task Mapping. Maintain the mapping resulting from the parallel ILU. For each task T_i we define the relative cost ratio as: r_i^PF = (rel. cost of T_i in the ILU) / (rel. cost of T_i in the FS). Values of r_i^PF close to 1 imply moderate variations.
31 PFS and PBS Task Mapping. What do we get? [Figure: relative cost ratio r_i^PF against task identifier i for G_circuit, distinguishing leaf tasks from inner tasks.] Task identifiers are assigned in descending order of relative cost.
32 PFS and PBS Task Scheduling. For the task scheduling of the PFS: a thread can only execute tasks mapped to it; threads always prioritize leaves over inner tasks; among leaves, threads prioritize those with higher nnz(L^(1,i)). Initially, we provide the leaves to their corresponding threads. When a thread completes a task, it checks the dependencies of the parent task and, if they are resolved, provides the parent task to the corresponding thread.
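The dependency-driven hand-off just described can be sketched as follows. This is a single-threaded simulation of the rule (tasks pinned by a given mapping, a finished task decrementing its parent's dependency counter); the tree and mapping structures are assumptions for illustration, not ILUPACK data structures:

```python
from collections import deque

def run_pfs(tree, mapping, nthreads):
    """tree: {task: parent or None}; mapping: {task: thread id}.
    Simulate the PFS scheduling rule by sweeping the per-thread
    ready queues round-robin, and return the execution order."""
    children = {}
    for t, p in tree.items():
        children.setdefault(p, []).append(t)
    pending = {t: len(children.get(t, [])) for t in tree}
    queues = [deque() for _ in range(nthreads)]
    for t, deps in pending.items():
        if deps == 0:                       # leaves are ready from the start
            queues[mapping[t]].append(t)
    order = []
    while any(queues):
        for q in queues:                    # one task per thread per sweep
            if q:
                t = q.popleft()
                order.append(t)
                parent = tree[t]
                if parent is not None:
                    pending[parent] -= 1    # a child of the parent finished
                    if pending[parent] == 0:
                        queues[mapping[parent]].append(t := parent) if False else queues[mapping[parent]].append(parent)
    return order
```

In the real implementation each queue is owned by one thread and the dependency counters are updated atomically; the round-robin sweep here merely serializes that behaviour deterministically.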
33 PFS and PBS Task Scheduling. For the PBS, threads always prioritize inner tasks over leaves. [Figure: task tree with the task-to-thread mapping inherited from the parallel ILU.]
34 PFS and PBS Task Scheduling. The PBS execution uncovers some pitfalls. [Figure: task tree illustrating a pitfall of the inherited mapping.]
35 PFS and PBS Task Scheduling. The thread resolving the inner task becomes responsible for it. [Figure: reassignment of an inner task to the thread that resolved its dependencies.]
36 PFS and PBS Task Scheduling. We allow some flexibility in the mapping of inner tasks. [Figure: two alternative task-to-thread mappings of the same task tree.]
37 Experimental Results: Experimental Framework. SGI Altix 350 CC-NUMA shared-memory multiprocessor: 8 nodes, 2 processors per node (Intel Itanium 2), with main memory shared via an SGI NUMAlink interconnect. Intel compiler, OpenMP 2.5 compliance, with optimization enabled. One thread was bound per physical processor; whenever possible, one thread per node. IEEE double precision.
38 Experimental Results: Benchmark Matrices. Benchmark matrices from the UF sparse matrix collection:
Code  Group/Name
M1    GHS_psdef/bmwcra_1
M2    Wissgott/parabolic_fem
M3    Schmid/thermal
M4    AMD/G_circuit
39 Experimental Results. p = 1, 2, 4, 8, 16 processors, and f = p, 2p, 4p. Average parallel time in ms for executions with the mapping resulting from the same parallel ILU. Speed-up measured with respect to the parallel algorithm executing the same task tree on a single processor. Different values of f lead to different task trees. p = 1 / f = 1 refers to the ILUPACK serial routines.
40 Experimental Results. [Table: execution time T (ms) and speed-up Sp of the PFS and the PBS for matrices M1 and M2, for the tested values of f and p.]
41 Experimental Results. [Table: execution time T (ms) and speed-up Sp of the PFS and the PBS for matrices M3 and M4, for the tested values of f and p.]
42 Conclusions. We have presented two parallel algorithms to compute the forward and backward substitutions for the iterative solution of sparse linear systems on shared-memory multiprocessors. The mapping resulting from the parallel ILU provides acceptable solutions for the PFS and the PBS. The task scheduling strategies take care of some pitfalls which could otherwise significantly hurt the performance attained by the PBS. Remarkable performance is reported on a CC-NUMA platform with 16 processors.
43 Conclusions. Questions?
CS 267 Applications of Parallel Computers Lecture 23: Load Balancing and Scheduling James Demmel http://www.cs.berkeley.edu/~demmel/cs267_spr99 CS267 L23 Load Balancing and Scheduling.1 Demmel Sp 1999
More informationHigh-Performance Out-of-Core Sparse LU Factorization
High-Performance Out-of-Core Sparse LU Factorization John R. Gilbert Sivan Toledo Abstract We present an out-of-core sparse nonsymmetric LU-factorization algorithm with partial pivoting. We have implemented
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor
More informationLecture 15: More Iterative Ideas
Lecture 15: More Iterative Ideas David Bindel 15 Mar 2010 Logistics HW 2 due! Some notes on HW 2. Where we are / where we re going More iterative ideas. Intro to HW 3. More HW 2 notes See solution code!
More informationAll routines were built with VS2010 compiler, OpenMP 2.0 and TBB 3.0 libraries were used to implement parallel versions of programs.
technologies for multi-core numeric computation In order to compare ConcRT, OpenMP and TBB technologies, we implemented a few algorithms from different areas of numeric computation and compared their performance
More informationA comparison of parallel rank-structured solvers
A comparison of parallel rank-structured solvers François-Henry Rouet Livermore Software Technology Corporation, Lawrence Berkeley National Laboratory Joint work with: - LSTC: J. Anton, C. Ashcraft, C.
More informationHarnessing CUDA Dynamic Parallelism for the Solution of Sparse Linear Systems
Harnessing CUDA Dynamic Parallelism for the Solution of Sparse Linear Systems José ALIAGA, a,1 Davor DAVIDOVIĆ b, Joaquín PÉREZ a, and Enrique S. QUINTANA-ORTÍ a, a Dpto. Ingeniería Ciencia de Computadores,
More informationParallel Threshold-based ILU Factorization
A short version of this paper appears in Supercomputing 997 Parallel Threshold-based ILU Factorization George Karypis and Vipin Kumar University of Minnesota, Department of Computer Science / Army HPC
More informationPARALUTION - a Library for Iterative Sparse Methods on CPU and GPU
- a Library for Iterative Sparse Methods on CPU and GPU Dimitar Lukarski Division of Scientific Computing Department of Information Technology Uppsala Programming for Multicore Architectures Research Center
More informationSparse Multifrontal Performance Gains via NVIDIA GPU January 16, 2009
Sparse Multifrontal Performance Gains via NVIDIA GPU January 16, 2009 Dan l Pierce, PhD, MBA, CEO & President AAI Joint with: Yukai Hung, Chia-Chi Liu, Yao-Hung Tsai, Weichung Wang, and David Yu Access
More informationConstruction and application of hierarchical matrix preconditioners
University of Iowa Iowa Research Online Theses and Dissertations 2008 Construction and application of hierarchical matrix preconditioners Fang Yang University of Iowa Copyright 2008 Fang Yang This dissertation
More informationLecture 27: Fast Laplacian Solvers
Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall
More informationPerformance Evaluation of a New Parallel Preconditioner
Performance Evaluation of a New Parallel Preconditioner Keith D. Gremban Gary L. Miller Marco Zagha School of Computer Science Carnegie Mellon University 5 Forbes Avenue Pittsburgh PA 15213 Abstract The
More informationIntel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equation
Intel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equation Alexander Kalinkin Anton Anders Roman Anders 1 Legal Disclaimer INFORMATION IN
More informationHartwig Anzt, Edmond Chow, Daniel Szyld, and Jack Dongarra. Report Novermber 2015
Domain Overlap for Iterative Sparse Triangular Solves on GPUs Hartwig Anzt, Edmond Chow, Daniel Szyld, and Jack Dongarra Report 15-11-24 Novermber 2015 Department of Mathematics Temple University Philadelphia,
More informationswsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu
swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture 1,3 2 1,3 1,3 Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu 1 2 3 Outline 1. Background 2. Sunway architecture
More informationSparse Matrices Direct methods
Sparse Matrices Direct methods Iain Duff STFC Rutherford Appleton Laboratory and CERFACS Summer School The 6th de Brùn Workshop. Linear Algebra and Matrix Theory: connections, applications and computations.
More informationApplying Multi-Core Model Checking to Hardware-Software Partitioning in Embedded Systems
V Brazilian Symposium on Computing Systems Engineering Applying Multi-Core Model Checking to Hardware-Software Partitioning in Embedded Systems Alessandro Trindade, Hussama Ismail, and Lucas Cordeiro Foz
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationMultigrid Method using OpenMP/MPI Hybrid Parallel Programming Model on Fujitsu FX10
Multigrid Method using OpenMP/MPI Hybrid Parallel Programming Model on Fujitsu FX0 Kengo Nakajima Information Technology enter, The University of Tokyo, Japan November 4 th, 0 Fujitsu Booth S Salt Lake
More informationMetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores
MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores Presented by Xiaohui Chen Joint work with Marc Moreno Maza, Sushek Shekar & Priya Unnikrishnan University of Western Ontario,
More informationMathematics and Computer Science
Technical Report TR-2006-010 Revisiting hypergraph models for sparse matrix decomposition by Cevdet Aykanat, Bora Ucar Mathematics and Computer Science EMORY UNIVERSITY REVISITING HYPERGRAPH MODELS FOR
More informationGPU-Accelerated Algebraic Multigrid for Commercial Applications. Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA
GPU-Accelerated Algebraic Multigrid for Commercial Applications Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA ANSYS Fluent 2 Fluent control flow Accelerate this first Non-linear iterations Assemble
More informationCOSC6365. Introduction to HPC. Lecture 21. Lennart Johnsson Department of Computer Science
Introduction to HPC Lecture 21 Department of Computer Science Most slides from UC Berkeley CS 267 Spring 2011, Lecture 12, Dense Linear Algebra (part 2), Parallel Gaussian Elimination. Jim Demmel Dense
More informationIterative Sparse Triangular Solves for Preconditioning
Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt 1(B), Edmond Chow 2, and Jack Dongarra 1 1 University of Tennessee, Knoxville, TN, USA hanzt@icl.utk.edu, dongarra@eecs.utk.edu 2 Georgia
More informationToward robust hybrid parallel sparse solvers for large scale applications
Toward robust hybrid parallel sparse solvers for large scale applications Luc Giraud (INPT/INRIA) joint work with Azzam Haidar (CERFACS-INPT/IRIT) and Jean Roman (ENSEIRB, LaBRI and INRIA) 1st workshop
More informationThe Fast Multipole Method on NVIDIA GPUs and Multicore Processors
The Fast Multipole Method on NVIDIA GPUs and Multicore Processors Toru Takahashi, a Cris Cecka, b Eric Darve c a b c Department of Mechanical Science and Engineering, Nagoya University Institute for Applied
More informationLecture 17: More Fun With Sparse Matrices
Lecture 17: More Fun With Sparse Matrices David Bindel 26 Oct 2011 Logistics Thanks for info on final project ideas. HW 2 due Monday! Life lessons from HW 2? Where an error occurs may not be where you
More informationCSCE 411 Design and Analysis of Algorithms
CSCE 411 Design and Analysis of Algorithms Set 4: Transform and Conquer Slides by Prof. Jennifer Welch Spring 2014 CSCE 411, Spring 2014: Set 4 1 General Idea of Transform & Conquer 1. Transform the original
More informationOutline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency
1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming
More informationMatrix-free IPM with GPU acceleration
Matrix-free IPM with GPU acceleration Julian Hall, Edmund Smith and Jacek Gondzio School of Mathematics University of Edinburgh jajhall@ed.ac.uk 29th June 2011 Linear programming theory Primal-dual pair
More information