Task based parallelization of recursive linear algebra routines using Kaapi


1 Task based parallelization of recursive linear algebra routines using Kaapi Clément PERNET joint work with Jean-Guillaume DUMAS and Ziad SULTAN Université Grenoble Alpes, LJK-CASYS January 20, 2017 Journée Runtime, Paris. Supported by OpenDreamKit Horizon 2020 European Research Infrastructures project (# ) 1 / 28

2 High performance algebraic computing
Domain of computation: Z, Q (variable size, multi-precision); Z_p, GF(p^k) (fixed size, specific arithmetic).
Common belief: slow, with terrible complexities, no need for all the precision.
Example (linear system solving over Q):
Naive Gaussian elimination over Q: O(2^n)
Gauss mod det: O(n^5)
Gauss mod p + CRT: O(n^4), O(n^(ω+1))
p-adic lifting: O(n^3), O(n^ω)
And fast software: LU over Z/65521Z in 3.8s (21.8 Gfops on one Haswell core). 2 / 28

3 Gaussian elimination in computer algebra
Applications:
Algebraic cryptanalysis (RSA, DLP): LinSys, Krylov over F_q
Computational number theory (modular forms databases): echelon forms over F_q
Exact mixed-integer linear programming: LinSys over Q
Formal proof (sums of squares): Cholesky over Q
HPC building block: dense linear algebra over Z/pZ with log₂ p-bit primes; MatMul (fgemm) and Gaussian elimination (PLUQ); triangular decomposition PLUQ (for LinSys, Det); linear dependencies (Krylov, Gröbner basis). 3 / 28

4 FFLAS-FFPACK library
FFLAS-FFPACK features: high-performance implementation of basic linear algebra routines over word-size prime fields; exact alternative to the numerical BLAS library; exact triangularization, system solving, Det, Inv, CharPoly.
[Figure: matrix multiplication mod a word-size prime on Intel Sandy Bridge 2.6GHz with OpenBLAS; speed n³/10⁹/t vs. n; curves: fgemm+strassen, dgemm, fgemm.] 4 / 28
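Not from the slides: a minimal usage sketch of the fgemm routine advertised above, multiplying two matrices over Z/65521Z. The header path, allocator names and exact signature are reproduced from memory of the FFLAS-FFPACK interface and should be treated as assumptions.

// Sketch: C <- A*B over Z/65521Z with FFLAS::fgemm, the exact
// analogue of BLAS dgemm (interface details assumed, not verified).
#include <fflas-ffpack/fflas-ffpack.h>

int main() {
    Givaro::Modular<double> F(65521);            // the prime field Z/65521Z
    const size_t n = 512;
    double* A = FFLAS::fflas_new(F, n, n);       // field-aware allocation
    double* B = FFLAS::fflas_new(F, n, n);
    double* C = FFLAS::fflas_new(F, n, n);
    for (size_t i = 0; i < n * n; ++i) { F.init(A[i], i); F.init(B[i], i + 1); }
    FFLAS::fgemm(F, FFLAS::FflasNoTrans, FFLAS::FflasNoTrans,
                 n, n, n, F.one, A, n, B, n, F.zero, C, n);
    FFLAS::fflas_delete(A); FFLAS::fflas_delete(B); FFLAS::fflas_delete(C);
    return 0;
}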

5 Exact vs numerical Gaussian elimination
Similarities:
Reduction to the gemm kernel (matrix multiplication).
Blocking: slab/tiled, iterative/recursive.
Parallel blocking is constrained by pivoting; numeric: ensuring numerical stability; exact: able to reveal the rank profile.
Specificities:
Recursive tasks (vs. block iterative in numeric).
Modular reduction efficiency increases with the granularity.
Strassen's algorithm: trade-off between total work and fine granularity.
Pivoting strategies: no stability constraints, but rank profiles.
Rank deficiencies: blocks have unpredictable sizes (and positions), hence unbalanced task loads. 5 / 28

10 Block algorithms
Three blockings: Tiled Iterative, Slab Recursive, Tiled Recursive.
Each step of the elimination applies three kernels:
getrf: A → L, U
trsm: B ← BU⁻¹, B ← L⁻¹B
gemm: C ← C − A·B
(Animation: the successive getrf/trsm/gemm updates for each of the three blockings.) 6 / 28

27 Need for a high level parallel programming environment
Features required: portability, performance and scalability. More precisely:
Runtime system with good performance for recursive tasks.
Dataflow task synchronization.
[Figure: task DAGs of a tiled LU, comparing explicit synchronization (waiting for all tasks between panels) with dataflow dependencies between the LU, ApplyP, FTRSM and FGEMM tasks over time.]
Handle unbalanced workloads efficiently.
Efficient range cutting for parallel for.
Wish: design the code independently of the runtime system, using runtime systems as plugins. 7 / 28

29 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 8 / 28

30 Runtime systems to be supported
OpenMP 3.x and 4.0 supported directives (using libgomp):
Data-sharing attributes: shared (OMP3): data visible and accessible by all threads; firstprivate (OMP3): local copy of the original value; depend (OMP4): set data dependencies.
Synchronization clauses: #pragma omp taskwait.
XKaapi, via the libkomp [BDG12] library: OpenMP directives mapped to XKaapi tasks; re-implementation of task handling and management; better execution of recursive tasks.
TBB: designed for nested and recursive parallelism; parallel_for; tbb::task_group, wait(), run() using C++11 lambda functions. 9 / 28
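Not from the slides: a minimal, self-contained sketch of the OMP4 depend clause and the taskwait listed above, chaining two tasks on a small hypothetical array (the data and the doubling kernel are illustrative only).

#include <cstdio>

int main() {
    double a[4] = {1, 2, 3, 4};
    #pragma omp parallel
    #pragma omp single
    {
        // producer: writes a, declared with depend(out: ...)
        #pragma omp task shared(a) depend(out: a)
        for (int i = 0; i < 4; ++i) a[i] *= 2.0;
        // consumer: depend(in: a) forces it to run after the producer
        #pragma omp task shared(a) depend(in: a)
        printf("a[0] after doubling: %g\n", a[0]);
        #pragma omp taskwait   // explicit synchronization point
    }
    return 0;
}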

31 PALADIn: Parallel Algebraic Linear Algebra Dedicated Interface
Mainly macro-based keywords: no function-call runtime overhead when using macros; no important modifications to be made to the original program; macros can also be used from C-based libraries.
Complementary C++ template functions: implement the different cutting strategies; store the iterators. 10 / 28

32 PALADIn description: task parallelism
Task parallelization: fork-join and dataflow models. Example keywords:
PAR_BLOCK: opens a parallel region.
SYNCH_GROUP: group of tasks synchronized upon exit.
TASK: creates a task.
REFERENCE(args...): specify variables captured by reference (by default all variables are captured by value).
READ(args...): variables that are read only.
WRITE(args...): variables that are written only.
READWRITE(args...): variables that are read then written.

void axpy(const Element a, const Element x, Element &y) { y += a * x; }

SYNCH_GROUP(
  TASK(MODE(READ(a, x) READWRITE(y)),
       axpy(a, x, y));
);
11 / 28
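For comparison with the TBB backend of slide 30, here is a minimal sketch (an assumption about the mapping, not PALADIn's actual macro expansion) of the same axpy task written directly with tbb::task_group and a C++11 lambda:

#include <tbb/task_group.h>
#include <cstdio>

using Element = double;

void axpy(const Element a, const Element x, Element& y) { y += a * x; }

int main() {
    Element a = 2.0, x = 3.0, y = 1.0;
    tbb::task_group g;
    g.run([&] { axpy(a, x, y); });  // spawn the task; y captured by reference
    g.wait();                       // synchronize on exit, like SYNCH_GROUP
    printf("y = %g\n", y);          // prints 7
    return 0;
}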

33 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 12 / 28

34 Parallel matrix multiplication
Iterative variants:
Fixed block size (FIXED, GRAIN): better control of data mapping in memory; complexity O(n³).
Fixed number of tasks (THREADS): less control of data mapping in memory; complexity O(n^ω).
Recursive variants: almost no control of data mapping in memory; complexity O(n^ω) or O(n³).
[Figure: iterative splitting of A into panels A₁, A₂ against B and C, vs. recursive 2×2 splitting into blocks A₁₁..A₂₂, B₁₁..B₂₂, C₁₁..C₂₂.] 13 / 28
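Not from the slides: a minimal sketch of the recursive two-d splitting with OpenMP tasks, on plain row-major doubles (the cutoff and the power-of-two assumption are illustrative). The four quadrants of C are updated in two rounds so that concurrent tasks never write the same block; call it from inside a parallel/single region.

#include <cstddef>

// Sequential base case: C += A*B, all n x n with leading dimension ld.
void gemm_base(const double* A, const double* B, double* C,
               std::size_t n, std::size_t ld) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
}

void gemm_rec(const double* A, const double* B, double* C,
              std::size_t n, std::size_t ld, std::size_t cutoff) {
    if (n <= cutoff) { gemm_base(A, B, C, n, ld); return; }
    const std::size_t h = n / 2;               // assume n is a power of two
    const double *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h*ld, *B22 = B + h*ld + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h*ld, *C22 = C + h*ld + h;
    // Round 1: four products writing to pairwise distinct C quadrants.
    #pragma omp task
    gemm_rec(A11, B11, C11, h, ld, cutoff);
    #pragma omp task
    gemm_rec(A11, B12, C12, h, ld, cutoff);
    #pragma omp task
    gemm_rec(A21, B11, C21, h, ld, cutoff);
    #pragma omp task
    gemm_rec(A21, B12, C22, h, ld, cutoff);
    #pragma omp taskwait
    // Round 2: accumulate the remaining four products into the same quadrants.
    #pragma omp task
    gemm_rec(A12, B21, C11, h, ld, cutoff);
    #pragma omp task
    gemm_rec(A12, B22, C12, h, ld, cutoff);
    #pragma omp task
    gemm_rec(A22, B21, C21, h, ld, cutoff);
    #pragma omp task
    gemm_rec(A22, B22, C22, h, ld, cutoff);
    #pragma omp taskwait
}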

35 Performance of pfgemm
[Figure: Speed of MatMul variants using OpenMP tasks; pfgemm on 32 cores Xeon E GHz; Gfops vs. matrix dimension; curves: iter(block-threads), rec(two-d), rec(two-d-adapt), rec(three-d), rec(three-d-adapt).] 14 / 28

36 Performance of pfgemm
[Figure: Speed of MatMul variants using Intel TBB tasks; pfgemm on 32 cores Xeon E GHz; Gfops vs. matrix dimension; same curves as above.] 14 / 28

37 Performance of pfgemm
[Figure: Speed of MatMul variants using XKaapi tasks (libkomp); pfgemm on 32 cores Xeon E GHz; Gfops vs. matrix dimension; same curves as above.] 14 / 28

38 Parallel Matrix Multiplication: state of the art
HPAC server: 32 cores Xeon E GHz (4 NUMA sockets).
Effective Gfops = (# of field ops using the classic matrix product) / time.
Comparison of our best implementations with state-of-the-art numerical libraries.
[Figure: effective Gfops vs. matrix dimension; n³ peak performance on 32 cores; curves: WinogradPar->ClassicPar<double>, ClassicPar->WinogradSeq<double>, MKL dgemm, OpenBLAS dgemm, PLASMA-QUARK dgemm, BensonBallard (Strassen).] 15 / 28

40 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 16 / 28

41 Parallel triangular solving with matrix right-hand side
Iterative variant: [X₁ ... X_k] ← L⁻¹[B₁ ... B_k]; the computation of each Xᵢ ← L⁻¹Bᵢ is independent: k sequential tasks, with k set as the number of available threads.
Recursive variant:
1: Split [X₁; X₂] = [L₁ 0; L₂ L₃]⁻¹ [B₁; B₂]
2: X₁ ← L₁⁻¹B₁
3: X₂ ← B₂ − L₂X₁ // parallel MatMul
4: X₂ ← L₃⁻¹X₂
Hybrid PFTRSM: when the column dimension of B is small, use the iterative splitting in priority; when #cols(X) < #proc, save some threads for the recursive calls. 17 / 28
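Not from the slides: a sequential sketch of the recursive variant just described, over plain doubles, overwriting B with X = L⁻¹B (step 3 is the update that the slide parallelizes as a MatMul; the names and cutoff are illustrative).

#include <cstddef>

// Base case: forward substitution solving L*X = B, L lower triangular
// (n x n, leading dimension ldl), B overwritten by X (n x m, ld ldb).
void trsm_base(const double* L, double* B, std::size_t n, std::size_t m,
               std::size_t ldl, std::size_t ldb) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < i; ++k)
            for (std::size_t j = 0; j < m; ++j)
                B[i*ldb + j] -= L[i*ldl + k] * B[k*ldb + j];
        for (std::size_t j = 0; j < m; ++j)
            B[i*ldb + j] /= L[i*ldl + i];
    }
}

void trsm_rec(const double* L, double* B, std::size_t n, std::size_t m,
              std::size_t ldl, std::size_t ldb, std::size_t cutoff) {
    if (n <= cutoff) { trsm_base(L, B, n, m, ldl, ldb); return; }
    const std::size_t h = n / 2;
    const double* L1 = L;               // top-left triangle
    const double* L2 = L + h*ldl;       // bottom-left rectangle
    const double* L3 = L + h*ldl + h;   // bottom-right triangle
    double* B1 = B;
    double* B2 = B + h*ldb;
    trsm_rec(L1, B1, h, m, ldl, ldb, cutoff);       // 2: X1 <- L1^{-1} B1
    for (std::size_t i = 0; i < n - h; ++i)         // 3: X2 <- B2 - L2*X1,
        for (std::size_t k = 0; k < h; ++k)         //    the parallel MatMul
            for (std::size_t j = 0; j < m; ++j)
                B2[i*ldb + j] -= L2[i*ldl + k] * B1[k*ldb + j];
    trsm_rec(L3, B2, n - h, m, ldl, ldb, cutoff);   // 4: X2 <- L3^{-1} X2
}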

43 Parallel triangular solving with matrix right-hand side: experiments
pftrsm on 32 cores Xeon E GHz, solving LX = B.
[Figure: Comparing the Iterative and the Hybrid (threshold = 256) variants for parallel FTRSM using libkomp and libgomp, against Intel MKL dtrsm and PLASMA-QUARK dtrsm; effective Gfops vs. n, the column dimension of B. Only the outer dimension varies.] 18 / 28

44 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 19 / 28

45 Gaussian elimination design
Reducing to MatMul: block versions are asymptotically faster (O(n^ω)) and have better cache efficiency.
Variants of block versions:
Split on one dimension (row or column slab cutting): slab iterative, slab recursive.
Split on two dimensions (tile cutting): tile iterative, tile recursive.
Iterative: static, hence better data-mapping control; dataflow parallel model, hence less synchronization.
Recursive: adaptive, with sub-cubic complexity; no dataflow, hence more synchronization. 20 / 28

48 Parallel tile recursive PLUQ algorithm: 2×2 block splitting. 21 / 28

49 Parallel tile recursive PLUQ algorithm Recursive call 21 / 28

50 Parallel tile recursive PLUQ algorithm
ptrsm: B ← BU⁻¹
TASK(MODE(READ(A) READWRITE(B)), pftrsm(..., A, lda, B, ldb)); 21 / 28

51 Parallel tile recursive PLUQ algorithm
ptrsm: B ← L⁻¹B
TASK(MODE(READ(A) READWRITE(B)), pftrsm(..., A, lda, B, ldb)); 21 / 28

52 Parallel tile recursive PLUQ algorithm
pfgemm: C ← C − A·B (the same update task is issued for each of the three remaining blocks)
TASK(MODE(READ(A, B) READWRITE(C)), pfgemm(..., A, lda, B, ldb)); 21 / 28

55 Parallel tile recursive PLUQ algorithm
2 independent recursive calls (concurrent tasks)
TASK(MODE(READWRITE(A)), ppluq(..., A, lda)); 21 / 28

56 Parallel tile recursive PLUQ algorithm
ptrsm: B ← BU⁻¹
TASK(MODE(READ(A) READWRITE(B)), pftrsm(..., A, lda, B, ldb)); 21 / 28

57 Parallel tile recursive PLUQ algorithm
ptrsm: B ← L⁻¹B
TASK(MODE(READ(A) READWRITE(B)), pftrsm(..., A, lda, B, ldb)); 21 / 28

58 Parallel tile recursive PLUQ algorithm
pfgemm: C ← C − A·B (again issued for each of the three remaining blocks)
TASK(MODE(READ(A, B) READWRITE(C)), pfgemm(..., A, lda, B, ldb)); 21 / 28

61 Parallel tile recursive PLUQ algorithm Recursive call 21 / 28

62 Parallel tile recursive PLUQ algorithm
Puzzle game (block permutations). Tile recursive: better data locality and more square blocks for matrix multiplication. 21 / 28
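As a summary of the walk-through above, here is one recursion level written out for the full-rank case, with the permutations omitted; this is a simplifying assumption, since in the actual rank-revealing PLUQ the block sizes are data-dependent and the rank-deficient remainders are what gives rise to the two independent recursive calls of slide 55.

% One level of the 2x2 tile recursive elimination (full-rank sketch)
\begin{align*}
A &= \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22} \end{pmatrix}
   = \begin{pmatrix} L_{1} & 0\\ L_{21} & L_{2} \end{pmatrix}
     \begin{pmatrix} U_{1} & U_{12}\\ 0 & U_{2} \end{pmatrix},\\
A_{11} &= L_{1}U_{1} \quad\text{(recursive call)},\\
U_{12} &= L_{1}^{-1}A_{12}, \qquad L_{21} = A_{21}U_{1}^{-1} \quad\text{(ptrsm)},\\
A_{22} &\leftarrow A_{22} - L_{21}U_{12} \quad\text{(pfgemm)},\\
A_{22} &= L_{2}U_{2} \quad\text{(recursive call)}.
\end{align*}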

63 State of the art: exact vs numerical linear algebra
State-of-the-art comparison: exact PLUQ using the PALADIn language (best performance obtained with XKaapi) vs. the numerical LU (dgetrf) of PLASMA-Quark and MKL.
[Figure: parallel dgetrf vs. parallel PLUQ on full-rank matrices; effective Gfops vs. matrix dimension; curves: explicit synch pluq<double>, MKL dgetrf, PLASMA-Quark dgetrf tiled storage (k=212), PLASMA-Quark dgetrf (k=212).] 22 / 28

64 Performance of parallel PLUQ decomposition
Low impact of the modular reductions in parallel; efficient SIMD implementation.
[Figure: tile recursive PLUQ on full-rank matrices; effective Gfops vs. matrix dimension; curves: explicit synch pluq rec<double>, explicit synch pluq rec<131071>.] 23 / 28

65 Performance of task parallelism: dataflow model
[Figure: tile PLUQ, recursive vs. iterative, on full-rank matrices; effective Gfops vs. matrix dimension; curves: explicit synch PLUQ rec<131071>, dataflow synch PLUQ rec<131071>, dataflow synch PLUQ iter<131071>, explicit synch PLUQ iter<131071>.]
Possible improvement: implementing the delegation of recursive-task dependencies (a Postpone access mode in the parallel programming environments). 24 / 28

68 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 25 / 28

69 Conclusion
Lessons learnt for the parallelization of LU over Z/pZ:
Blocking impacts the arithmetic cost: fine granularity hurts.
Rank deficiency can offer more parallelism (cf. separators).
Sub-cubic performance in parallel requires a runtime that is efficient for recursive tasks (XKaapi). 26 / 28

70 Perspectives
Dataflow task dependencies [figure: task DAG of a tiled LU, with LU, ApplyP, FTRSM and FGEMM tasks ordered over time]: already in use in tiled iterative algorithms (XKaapi).
New challenges for recursive tasks: recursive inclusion of sub-matrices; postponed modes (removing fake dependencies); distributed execution on small-sized clusters. 27 / 28

71 Thank you 28 / 28
