PARALUTION - a Library for Iterative Sparse Methods on CPU and GPU


1 PARALUTION - a Library for Iterative Sparse Methods on CPU and GPU. Dimitar Lukarski, Division of Scientific Computing, Department of Information Technology, Uppsala Programming for Multicore Architectures Research Center (UPMARC), Uppsala University, Sweden. SC13, Nov 20, 2013

2 Goal: Create a library for iterative methods (linear and non-linear systems). Math: linear operators (sparse/dense matrices, stencils); vector routines; extendable. Software: current multi/many-core devices, new hardware; portable code.

3 PARALUTION - Library for Iterative Methods. Hardware abstraction: targeting devices (CPUs + accelerators (GPUs, ...)); run-time type identification (RTTI); portable results and code. Easy to use: no knowledge of OpenMP/CUDA/OpenCL required; no special library/hardware required. Open source.

4 Middle-ware. [Diagram: the library as middle-ware between applications (C++ code, FORTRAN, OpenFOAM, Deal II, scientific libraries/packages) and hardware (multi/many-core CPU, GPU, new upcoming technology).]

5 Operators and Vectors. [Diagram: operators/vectors with a dynamic switch between the host (OpenMP) and accelerator backends: Intel MIC (OpenMP); GPU, accelerators (OpenCL); GPU (CUDA); new backend.]

6 Vector and Matrix Routines: Init, Clear; file formats; permutation; copy functions; some wrappers to Intel MKL; dot product; vector updates; norm; ...; various formats; matrix-vector multiplication; graph analyzer; factorization.

7 Source Example
    LocalMatrix<double> A;
    LocalVector<double> x, y;
    // Read the matrix from a Matrix Market file
    A.ReadFileMTX("my_matrix.mtx");
    // x has one entry per column of A, y one entry per row
    x.Allocate("vector1", A.get_ncol());
    y.Allocate("vector2", A.get_nrow());
    // y = A x
    A.Apply(x, &y);
    // Print the dot product of x and y
    std::cout << x.Dot(y) << std::endl;
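To make the two operations on this slide concrete, here is a minimal, library-independent sketch of what a sparse matrix-vector product and a dot product compute. It assumes CSR storage; `CsrMatrix`, `spmv`, `dot` and `laplace1d` are hypothetical names for illustration only and are not part of the PARALUTION API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal CSR (compressed sparse row) matrix. Hypothetical stand-in,
// not PARALUTION's LocalMatrix.
struct CsrMatrix {
    std::size_t nrow = 0;
    std::size_t ncol = 0;
    std::vector<std::size_t> row_ptr;  // size nrow + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double> val;           // size nnz
};

// y = A x: each row accumulates val * x[col] over its stored entries.
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == A.ncol);
    y.assign(A.nrow, 0.0);
    for (std::size_t i = 0; i < A.nrow; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
            sum += A.val[k] * x[A.col_idx[k]];
        }
        y[i] = sum;
    }
}

// Dot product of two vectors of equal length.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Assemble the 1D Laplace matrix tridiag(-1, 2, -1) of size n in CSR.
CsrMatrix laplace1d(std::size_t n) {
    CsrMatrix A;
    A.nrow = A.ncol = n;
    A.row_ptr.push_back(0);
    for (std::size_t i = 0; i < n; ++i) {
        if (i > 0) { A.col_idx.push_back(i - 1); A.val.push_back(-1.0); }
        A.col_idx.push_back(i); A.val.push_back(2.0);
        if (i + 1 < n) { A.col_idx.push_back(i + 1); A.val.push_back(-1.0); }
        A.row_ptr.push_back(A.col_idx.size());
    }
    return A;
}
```

For a 4x4 matrix from `laplace1d(4)` and a vector of ones, `spmv` gives (1, 0, 0, 1), and the dot product of input and output is 2.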

8 Source Example
    LocalMatrix<double> A;
    LocalVector<double> x, y;
    A.ReadFileMTX("my_matrix.mtx");
    x.Allocate("vector1", A.get_ncol());
    y.Allocate("vector2", A.get_nrow());
    // Move all objects to the accelerator; the rest of the code is unchanged
    A.MoveToAccelerator();
    x.MoveToAccelerator();
    y.MoveToAccelerator();
    // y = A x, now computed on the accelerator
    A.Apply(x, &y);
    std::cout << x.Dot(y) << std::endl;

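9 Initialization/Shutdown
    #include <paralution.hpp>
    using namespace paralution;
    int main(int argc, char* argv[]) {
        init_paralution();   // initialize the library and its backends
        info_paralution();   // print information about the platform in use
        //...
        stop_paralution();   // shut the library down
    }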


10 Solvers and Preconditioners. Solvers: CG, BiCGStab, GMRES, IDR; Multigrid AMG/GMG; Mixed-Precision Defect-Correction; Chebyshev Iteration; Fixed-Iteration Schemes; Iteration Control. Preconditioners: Incomplete LU; MultiColored ILU; MultiElimination ILU; MultiColored GS/SGS/SOR/SSOR; FSAI; SPAI; Chebyshev; Saddle-Point. All solvers can be used as preconditioners in a nested way.

11 CG Solver
    CG<LocalMatrix<double>, LocalVector<double>, double> cg;
    cg.SetOperator(mat);
    cg.Build();
    cg.Solve(rhs, &x);

12 CG Solver
    ...
    cg.MoveToAccelerator();
    mat.MoveToAccelerator();
    rhs.MoveToAccelerator();
    x.MoveToAccelerator();
    cg.Solve(rhs, &x);

13 Preconditioned CG Solver
    CG<LocalMatrix<double>, LocalVector<double>, double> cg;
    MultiColoredILU<LocalMatrix<double>, LocalVector<double>, double> p;
    cg.SetOperator(mat);
    cg.SetPreconditioner(p);
    cg.Build();
    cg.Solve(rhs, &x);
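As a rough illustration of what a preconditioned CG solve does behind the calls on slides 11 to 13, the following self-contained sketch implements the textbook PCG iteration with a relative residual stopping criterion. It is not PARALUTION's implementation: `pcg`, `Op`, `laplace1d` and `jacobi` are hypothetical names, the operator is a matrix-free 1D Laplace stencil, and a simple Jacobi (diagonal) preconditioner is swapped in for MultiColoredILU to keep the example short.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
// An operator maps an input vector to an output vector of the same size.
using Op = std::function<void(const Vec&, Vec&)>;

double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Preconditioned CG for a symmetric positive definite operator A.
// M_inv applies the preconditioner: z = M^{-1} r.
// x holds the initial guess on entry and the approximate solution on exit.
// Returns the number of iterations performed (max_iter if not converged).
int pcg(const Op& A, const Op& M_inv, const Vec& b, Vec& x,
        double rel_tol, int max_iter) {
    const std::size_t n = b.size();
    assert(x.size() == n);
    Vec r(n), z(n), p(n), q(n);
    A(x, q);
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - q[i];
    const double r0_norm = std::sqrt(dot(r, r));
    if (r0_norm == 0.0) return 0;  // initial guess already solves the system
    M_inv(r, z);
    p = z;
    double rz = dot(r, z);
    for (int it = 1; it <= max_iter; ++it) {
        A(p, q);
        const double alpha = rz / dot(p, q);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * q[i];
        }
        // Relative residual stopping criterion: ||r|| <= rel_tol * ||r0||
        if (std::sqrt(dot(r, r)) <= rel_tol * r0_norm) return it;
        M_inv(r, z);
        const double rz_new = dot(r, z);
        const double beta = rz_new / rz;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        rz = rz_new;
    }
    return max_iter;
}

// Matrix-free 1D Laplace operator tridiag(-1, 2, -1).
void laplace1d(const Vec& x, Vec& y) {
    const std::size_t n = x.size();
    y.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        double s = 2.0 * x[i];
        if (i > 0) s -= x[i - 1];
        if (i + 1 < n) s -= x[i + 1];
        y[i] = s;
    }
}

// Jacobi preconditioner for the operator above: z = r / diag(A) = r / 2.
void jacobi(const Vec& r, Vec& z) {
    z.resize(r.size());
    for (std::size_t i = 0; i < r.size(); ++i) z[i] = r[i] / 2.0;
}
```

A call such as `pcg(laplace1d, jacobi, b, x, 1e-6, 1000)` with `b` filled with ones and `x` with zeros mirrors the setup used later in the PCG test. Because the preconditioner is just another operator callback, any routine that approximately applies the inverse, including another iterative solver, could be passed in its place.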

14 Design and Concepts. Hardware decision: at run time; no template parameter; internal check. MoveToHost/Accelerator: all objects (matrices, vectors, solvers, preconditioners, etc.); always a CPU implementation. Template: ValueType - float, double, int; Solvers - Operator/Vector type.
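One way to picture a hardware decision made at run time, with no template parameter for the backend, is an object that holds a pointer to a backend implementation and swaps it when moved. The sketch below is only an illustration under that assumption and does not reflect PARALUTION's internals: `VectorBackend`, `HostBackend`, `FakeAcceleratorBackend` and `Vector` are hypothetical names, and the "accelerator" here runs on the CPU so the example stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Backend interface: every backend implements the same vector kernels.
class VectorBackend {
public:
    virtual ~VectorBackend() = default;
    virtual std::string name() const = 0;
    virtual double dot(const std::vector<double>& a,
                       const std::vector<double>& b) const = 0;
};

// The CPU implementation that is always available.
class HostBackend : public VectorBackend {
public:
    std::string name() const override { return "host"; }
    double dot(const std::vector<double>& a,
               const std::vector<double>& b) const override {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }
};

// Stand-in for a CUDA/OpenCL backend. A real one would copy the data to
// device memory and launch a kernel; this one computes on the CPU.
class FakeAcceleratorBackend : public VectorBackend {
public:
    std::string name() const override { return "accelerator"; }
    double dot(const std::vector<double>& a,
               const std::vector<double>& b) const override {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }
};

// The user-facing type has no backend template parameter; the backend is
// chosen and switched while the program runs.
class Vector {
public:
    explicit Vector(std::vector<double> data)
        : data_(std::move(data)), backend_(new HostBackend()) {}

    void MoveToAccelerator() { backend_.reset(new FakeAcceleratorBackend()); }
    void MoveToHost() { backend_.reset(new HostBackend()); }
    std::string where() const { return backend_->name(); }

    double Dot(const Vector& other) const {
        // Internal check: both operands must live on the same backend.
        assert(where() == other.where());
        assert(data_.size() == other.data_.size());
        return backend_->dot(data_, other.data_);
    }

private:
    std::vector<double> data_;
    std::unique_ptr<VectorBackend> backend_;
};
```

Calling code looks the same before and after `MoveToAccelerator()`: `x.Dot(y)` returns the same value on either backend, and only the object's internal state decides where the work happens.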

15 Library Performance

16 PCG Test. Preconditioned CG for solving Ax = b: Laplace matrix, 4M unknowns; relative residual stopping criterion of 1e-6; b = 1.0 and x = 0.0. Hardware configuration: 2x Intel Xeon E; Intel Xeon Phi (MIC) 5110 (ECC); AMD FirePro (Tahiti) W8000 (ECC); NVIDIA K20c (ECC); NVIDIA K20X (ECC).

17 PCG on Various Hardware. [Bar chart: PCG(MCILU0), 4M 2D Laplace, eps=1e-6. Vertical axis: time [sec]. Series: CSR, ELL. Hardware: 2x Xeon, MIC 5110, FirePro, K20c, K20X. Values not recoverable.]
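CSR and ELL in this comparison are two sparse matrix storage formats. As a hedged sketch of how they differ, the code below converts CSR arrays to ELL, where every row is padded to the length of the longest row and entries are stored column-major, and then applies the ELL matrix to a vector. `EllMatrix`, `csr_to_ell` and `ell_spmv` are hypothetical names for illustration and are not PARALUTION functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// ELL storage: every row holds exactly max_row entries; shorter rows are
// padded with value 0.0 and column index 0.
struct EllMatrix {
    std::size_t nrow = 0;
    std::size_t max_row = 0;           // stored entries per row, with padding
    std::vector<std::size_t> col_idx;  // size nrow * max_row, column-major
    std::vector<double> val;           // size nrow * max_row, column-major
};

// Convert CSR arrays (row_ptr, col_idx, val) to ELL.
EllMatrix csr_to_ell(const std::vector<std::size_t>& row_ptr,
                     const std::vector<std::size_t>& col_idx,
                     const std::vector<double>& val) {
    assert(!row_ptr.empty());
    assert(col_idx.size() == val.size());
    EllMatrix E;
    E.nrow = row_ptr.size() - 1;
    for (std::size_t i = 0; i < E.nrow; ++i) {
        E.max_row = std::max(E.max_row, row_ptr[i + 1] - row_ptr[i]);
    }
    E.col_idx.assign(E.nrow * E.max_row, 0);
    E.val.assign(E.nrow * E.max_row, 0.0);
    for (std::size_t i = 0; i < E.nrow; ++i) {
        std::size_t slot = 0;
        for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k, ++slot) {
            // Column-major layout: slot s of all rows is stored contiguously.
            E.col_idx[slot * E.nrow + i] = col_idx[k];
            E.val[slot * E.nrow + i] = val[k];
        }
    }
    return E;
}

// y = E x. Padded entries contribute 0.0 * x[0] and do not change the result.
void ell_spmv(const EllMatrix& E, const std::vector<double>& x,
              std::vector<double>& y) {
    y.assign(E.nrow, 0.0);
    for (std::size_t slot = 0; slot < E.max_row; ++slot) {
        for (std::size_t i = 0; i < E.nrow; ++i) {
            y[i] += E.val[slot * E.nrow + i] * x[E.col_idx[slot * E.nrow + i]];
        }
    }
}
```

The regular, padded layout means every row does the same amount of work, which tends to suit wide SIMD units and GPUs when row lengths are similar, as for a Laplace stencil. The cost is wasted storage and arithmetic when a few rows are much longer than the rest.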

18 CFD Problem. OpenFOAM: incompressible NS; 6.8M 3D cavity (190x190x190); icoFoam; delta t = 0.25e-5; pressure solver only; absolute tolerance of 1e-6 based on L1 and L2 norm. Hardware configuration: 2x Intel Xeon E; NVIDIA K20c (ECC).

19 CFD Problem: GAMG vs PCG-AMG*. [Table: columns # iter, Time [s], Speed-up; rows OpenFOAM, OpenFOAM MPI, OMP L, OMP L, GPU L, GPU L. Values not recoverable.] *Preliminary results (still under development)!

20 and Ongoing Work. Various iterative solvers and preconditioners; backends for OpenMP, CUDA, OpenCL; open for new hardware; easy integration (plug-ins); portable results and code. Ongoing work: MPI support; AMG.

21 Thank You for Your Attention

Paralution & ViennaCL

Paralution & ViennaCL. Clemens Schiffer (Uni Graz), June 12, 2014. Introduction