Task based parallelization of recursive linear algebra routines using Kaapi
1 Task based parallelization of recursive linear algebra routines using Kaapi Clément PERNET joint work with Jean-Guillaume DUMAS and Ziad SULTAN Université Grenoble Alpes, LJK-CASYS January 20, 2017 Journée Runtime, Paris. Supported by OpenDreamKit Horizon 2020 European Research Infrastructures project (# ) 1 / 28
2 High performance algebraic computing
Domain of computation: Z, Q: variable size, multi-precision; Z_p, GF(p^k): fixed size, specific arithmetic.
Common belief: slow, terrible complexities, no need for all the precision.
Example (linear system solving over Q):
Method                        | Complexity
Naive Gaussian elim. over Q   | O(2^n)
Gaussian elim. mod det        | O(n^5)
Gaussian elim. mod p + CRT    | O(n^4), O(n^(omega+1))
p-adic lifting                | O(n^3), O(n^omega)
And fast software: LU over Z/65521Z in 3.8 s (21.8 Gfops on 1 Haswell core). 2 / 28
3 Gaussian elimination in computer algebra
Applications:
Algebraic cryptanalysis (RSA, DLP): LinSys, Krylov over F_q.
Computational number theory (modular forms databases): Echelon over F_q.
Exact mixed-integer linear programming: LinSys over Q.
Formal proof (sums of squares): Cholesky over Q.
HPC building block: dense linear algebra over Z/pZ with log2(p)-bit entries; MatMul (fgemm) and Gaussian elimination (PLUQ); triangular decomposition PLUQ (for LinSys, Det); linear dependencies (Krylov, Gröbner basis). 3 / 28
4 FFLAS-FFPACK library
FFLAS-FFPACK features:
High performance implementation of basic linear algebra routines over word-size prime fields.
Exact alternative to the numerical BLAS library.
Exact triangularization, system solving, Det, Inv, CharPoly.
[Figure: matrix multiplication mod a word-size prime on Intel Sandy Bridge 2.6 GHz with OpenBLAS; speed in n^3/10^9/t vs matrix dimension n, for fgemm+strassen, fgemm and dgemm] 4 / 28
9 Exact vs numerical Gaussian elimination
Similarities:
Reduction to the gemm kernel (matrix multiplication).
Blocking: slab/tiled, iterative/recursive.
Parallel blocking is constrained by pivoting: numeric must ensure numerical stability; exact must be able to reveal the rank profile.
Specificities:
Recursive tasks (vs block iterative in numeric): modular reduction efficiency increases with the granularity, and Strassen's algorithm trades total work against fine granularity.
Pivoting strategies: no stability constraints, but rank profiles to reveal.
Rank deficiencies: blocks have unpredictable sizes (and positions), hence unbalanced task loads. 5 / 28
10 Block algorithms: Tiled Iterative, Slab Recursive, Tiled Recursive
getrf: A -> L, U
trsm: B <- B U^{-1}, B <- L^{-1} B
gemm: C <- C - A B
6 / 28
28 Need for a high-level parallel programming environment
Features required: portability, performance and scalability. More precisely:
A runtime system with good performance for recursive tasks.
Dataflow task synchronization.
[Figure: task DAGs for tiled LU, comparing explicit synchronization (waiting for all tasks between the LU(A11), LU(A22), LU(A33) panel steps) with dataflow synchronization of the ApplyP, FTRSM and FGEMM tasks]
Handle unbalanced workloads efficiently.
Efficient range cutting for parallel loops.
We wish to design the code independently from the runtime system, using runtime systems as plugins. 7 / 28
29 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 8 / 28
30 Runtime systems to be supported
OpenMP 3.x and 4.0, supported directives (using libgomp):
Data-sharing attributes: shared (OMP3): data visible and accessible by all threads; firstprivate (OMP3): local copy of the original value; depend (OMP4): set data dependencies.
Synchronization clauses: #pragma omp taskwait.
xkaapi, via the libkomp [BDG12] library: OpenMP directives mapped to xkaapi tasks; re-implementation of task handling and management; better execution of recursive tasks.
TBB: designed for nested and recursive parallelism; parallel_for, tbb::task_group, wait(), run(), using C++11 lambda functions. 9 / 28
31 PALADIn Parallel Algebraic Linear Algebra Dedicated Interface Mainly macro-based keywords No function call runtime overhead when using macros. No important modifications to be done to original program. Macros can be used also for C-based libraries. Complementary C++ template functions Implement the different cutting strategies. Store the iterators 10 / 28
32 PALADIn description: task parallelism
Task parallelization: fork-join and dataflow models.
PAR_BLOCK: opens a parallel region.
SYNCH_GROUP: group of tasks synchronized upon exit.
TASK: creates a task.
REFERENCE(args...): specifies variables captured by reference; by default all variables are passed by value.
READ(args...): variables that are read only.
WRITE(args...): variables that are written only.
READWRITE(args...): variables that are read then written.

void axpy(const Element a, const Element x, Element& y) { y += a * x; }

SYNCH_GROUP(
    TASK(MODE(READ(a, x) READWRITE(y)), axpy(a, x, y));
);
11 / 28
33 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 12 / 28
34 Parallel matrix multiplication
Iterative variants:
Fixed block size (FIXED, GRAIN): better control of data mapping in memory; complexity O(n^3).
Fixed number of tasks (THREADS): less control of data mapping in memory; complexity O(n^omega).
Recursive variants: almost no control of data mapping in memory; complexity O(n^omega) or O(n^3).
[Figure: iterative cutting of A into row blocks A1, A2 against column blocks B1, B2 of B, vs recursive quadrant splitting of A, B and C into 2x2 blocks] 13 / 28
35 Performance of pfgemm
[Figure: speed of MatMul variants (iter(block-threads), rec(two-d), rec(two-d-adapt), rec(three-d), rec(three-d-adapt)) in Gfops vs matrix dimension, on a 32-core Xeon server, using OpenMP tasks] 14 / 28
36 Performance of pfgemm
[Figure: same MatMul variants using Intel TBB tasks] 14 / 28
37 Performance of pfgemm
[Figure: same MatMul variants using XKaapi tasks] 14 / 28
39 Parallel Matrix Multiplication: State of the art
HPAC server: 32-core Xeon (4 NUMA sockets).
Effective Gfops = (# of field ops using the classic matrix product) / time.
[Figure: comparison of our best implementations (WinogradPar->ClassicPar<double>, ClassicPar->WinogradSeq<double>) with state-of-the-art numerical libraries (MKL dgemm, OpenBLAS dgemm, PLASMA-QUARK dgemm, BensonBallard (Strassen)) in effective Gfops vs matrix dimension, against the n^3 peak performance on 32 cores] 15 / 28
40 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 16 / 28
42 Parallel Triangular Solving, Matrix right-hand side
Iterative variant: [X1 ... Xk] <- L^{-1} [B1 ... Bk]; the computation of each Xi <- L^{-1} Bi is independent; k sequential tasks, with k set to the number of available threads.
Recursive variant:
1: Split [X1; X2] = [[L1, 0], [L2, L3]]^{-1} [B1; B2]
2: X1 <- L1^{-1} B1
3: X2 <- B2 - L2 X1   // parallel MatMul
4: X2 <- L3^{-1} X2
Hybrid PFTRSM: when the column dimension of B is small, use the iterative splitting in priority; when #cols(X) < #procs, save some threads for the recursive calls. 17 / 28
43 Parallel Triangular Solving: Experiments
[Figure: pftrsm on a 32-core Xeon, solving LX = B; effective Gfops vs n, the column dimension of B; curves: Hybrid (threshold = 256) using libkomp and libgomp, Iterative using libkomp and libgomp, Intel MKL dtrsm, PLASMA-QUARK dtrsm. Comparing the Iterative and the Hybrid variants for parallel FTRSM; only the outer dimension varies.] 18 / 28
44 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 19 / 28
45 Gaussian elimination design
Reducing to MatMul: block versions are asymptotically faster (O(n^omega)) and have better cache efficiency.
Variants of block versions:
Split on one dimension (row or column slab cutting): slab iterative, slab recursive.
Split on two dimensions (tile cutting): tile iterative, tile recursive. 20 / 28
46 Gaussian elimination design
Iterative: static cutting gives better control of data mapping, and allows the dataflow parallel model (less synchronization).
Recursive: adaptive cutting gives sub-cubic complexity, but no dataflow (more synchronization).
Variants: slab iterative, tile iterative, slab recursive, tile recursive. 20 / 28
48 Parallel tile recursive PLUQ algorithm
1. 2x2 block splitting.
2. Recursive call on the top-left block.
3. ptrsm: B <- B U^{-1} and B <- L^{-1} B:
   TASK(MODE(READ(A) READWRITE(B)), pftrsm(..., A, lda, B, ldb));
4. pfgemm updates C <- C - A B on the three remaining blocks:
   TASK(MODE(READ(A, B) READWRITE(C)), pfgemm(..., A, lda, B, ldb));
5. 2 independent recursive calls (concurrent tasks):
   TASK(MODE(READWRITE(A)), ppluq(..., A, lda));
6. Again ptrsm (B <- B U^{-1}, B <- L^{-1} B) and three pfgemm updates on the trailing blocks.
7. Final recursive call.
8. Puzzle game (block permutations).
Tile recursion: better data locality and more square blocks for matrix multiplication. 21 / 28
63 State of the art: exact vs numerical linear algebra
State of the art comparison: exact PLUQ using the PALADIn language (best performance with xkaapi) vs numerical LU (dgetrf) from PLASMA-Quark and MKL.
[Figure: parallel dgetrf vs parallel PLUQ on full-rank matrices; effective Gfops vs matrix dimension; curves: explicit-synch pluq<double>, MKL dgetrf, PLASMA-Quark dgetrf with tiled storage (k=212), PLASMA-Quark dgetrf (k=212)] 22 / 28
64 Performance of parallel PLUQ decomposition
Low impact of modular reductions in parallel; efficient SIMD implementation.
[Figure: tile recursive PLUQ on full-rank matrices; effective Gfops vs matrix dimension; explicit-synch pluq rec<double> vs explicit-synch pluq rec<131071>] 23 / 28
67 Performance of task parallelism: dataflow model
[Figure: tile PLUQ recursive vs iterative on full-rank matrices; effective Gfops vs matrix dimension; curves: explicit-synch PLUQ rec<131071>, dataflow-synch PLUQ iter<131071>, dataflow-synch PLUQ rec<131071>, explicit-synch PLUQ iter<131071>]
Possible improvement: implement delegation of recursive task dependencies (a postponed access mode in the parallel programming environments). 24 / 28
68 Outline 1 Runtime systems 2 Matrix Multiplication 3 TRSM 4 Parallel exact Gaussian elimination 25 / 28
69 Conclusion
Lessons learnt from the parallelization of LU over Z/pZ:
Blocking impacts the arithmetic cost: fine granularity hurts.
Rank deficiency can offer more parallelism (cf. separators).
Sub-cubic performance in parallel requires a runtime that is efficient for recursive tasks (XKaapi). 26 / 28
70 Perspectives
Dataflow task dependencies:
[Figure: dataflow DAG of the tiled LU tasks LU(A11), ApplyP, FTRSM, FGEMM, LU(A22), LU(A33)]
Already in use in tiled iterative algorithms (XKaapi).
New challenges for recursive tasks: recursive inclusion of sub-matrices; postponed modes (removing fake dependencies); distributed execution on small clusters. 27 / 28
71 Thank you 28 / 28
More informationAutomatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University
Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning
More informationLecture 4: OpenMP Open Multi-Processing
CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP
More informationRuntime Support for Scalable Task-parallel Programs
Runtime Support for Scalable Task-parallel Programs Pacific Northwest National Lab xsig workshop May 2018 http://hpc.pnl.gov/people/sriram/ Single Program Multiple Data int main () {... } 2 Task Parallelism
More informationOpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen
OpenMP I Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS16/17 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press,
More informationPlacement de processus (MPI) sur architecture multi-cœur NUMA
Placement de processus (MPI) sur architecture multi-cœur NUMA Emmanuel Jeannot, Guillaume Mercier LaBRI/INRIA Bordeaux Sud-Ouest/ENSEIRB Runtime Team Lyon, journées groupe de calcul, november 2010 Emmanuel.Jeannot@inria.fr
More informationPrinciple Of Parallel Algorithm Design (cont.) Alexandre David B2-206
Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction
More informationGenerating Optimized Sparse Matrix Vector Product over Finite Fields
Generating Optimized Sparse Matrix Vector Product over Finite Fields Pascal Giorgi 1 and Bastien Vialla 1 LIRMM, CNRS, Université Montpellier 2, pascal.giorgi@lirmm.fr, bastien.vialla@lirmm.fr Abstract.
More informationAchieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (213) Published online in Wiley Online Library (wileyonlinelibrary.com)..311 Achieving numerical accuracy and high
More informationECE 574 Cluster Computing Lecture 10
ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular
More informationOutline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency
1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming
More informationDepartment of Informatics V. HPC-Lab. Session 2: OpenMP M. Bader, A. Breuer. Alex Breuer
HPC-Lab Session 2: OpenMP M. Bader, A. Breuer Meetings Date Schedule 10/13/14 Kickoff 10/20/14 Q&A 10/27/14 Presentation 1 11/03/14 H. Bast, Intel 11/10/14 Presentation 2 12/01/14 Presentation 3 12/08/14
More informationExperimenting with the MetaFork Framework Targeting Multicores
Experimenting with the MetaFork Framework Targeting Multicores Xiaohui Chen, Marc Moreno Maza & Sushek Shekar University of Western Ontario 26 January 2014 1 Introduction The work reported in this report
More informationOnline Course Evaluation. What we will do in the last week?
Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do
More informationPORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune
PORTING CP2K TO THE INTEL XEON PHI ARCHER Technical Forum, Wed 30 th July Iain Bethune (ibethune@epcc.ed.ac.uk) Outline Xeon Phi Overview Porting CP2K to Xeon Phi Performance Results Lessons Learned Further
More informationParallel Programming. Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops
Parallel Programming Exploring local computational resources OpenMP Parallel programming for multiprocessors for loops Single computers nowadays Several CPUs (cores) 4 to 8 cores on a single chip Hyper-threading
More informationMaking Dataflow Programming Ubiquitous for Scientific Computing
Making Dataflow Programming Ubiquitous for Scientific Computing Hatem Ltaief KAUST Supercomputing Lab Synchronization-reducing and Communication-reducing Algorithms and Programming Models for Large-scale
More informationHigh Performance Linear Algebra
High Performance Linear Algebra Hatem Ltaief Senior Research Scientist Extreme Computing Research Center King Abdullah University of Science and Technology 4th International Workshop on Real-Time Control
More informationToward a supernodal sparse direct solver over DAG runtimes
Toward a supernodal sparse direct solver over DAG runtimes HOSCAR 2013, Bordeaux X. Lacoste Xavier LACOSTE HiePACS team Inria Bordeaux Sud-Ouest November 27, 2012 Guideline Context and goals About PaStiX
More informationDense LU Factorization
Dense LU Factorization Dr.N.Sairam & Dr.R.Seethalakshmi School of Computing, SASTRA Univeristy, Thanjavur-613401. Joint Initiative of IITs and IISc Funded by MHRD Page 1 of 6 Contents 1. Dense LU Factorization...
More informationDouble-precision General Matrix Multiply (DGEMM)
Double-precision General Matrix Multiply (DGEMM) Parallel Computation (CSE 0), Assignment Andrew Conegliano (A0) Matthias Springer (A00) GID G-- January, 0 0. Assumptions The following assumptions apply
More informationOpenMP - II. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16. HPAC, RWTH Aachen
OpenMP - II Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT
More informationParallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors
Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Andrés Tomás 1, Zhaojun Bai 1, and Vicente Hernández 2 1 Department of Computer
More informationBLAS. Christoph Ortner Stef Salvini
BLAS Christoph Ortner Stef Salvini The BLASics Basic Linear Algebra Subroutines Building blocks for more complex computations Very widely used Level means number of operations Level 1: vector-vector operations
More informationDense Matrix Multiplication
Dense Matrix Multiplication Abhishek Somani, Debdeep Mukhopadhyay Mentor Graphics, IIT Kharagpur October 7, 2015 Abhishek, Debdeep (IIT Kgp) Matrix Mult. October 7, 2015 1 / 56 Overview 1 The Problem 2
More informationOMPlab on Sun - Lab Programs
IWOMP 2005 - OMPlab on Sun 1/8 OMPlab on Sun - Lab Programs Introduction We have prepared a set of lab programs for you. We have an exercise to implement OpenMP in a simple algorithm and also various programs
More informationAutomatic Development of Linear Algebra Libraries for the Tesla Series
Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source
More informationNUMA-aware Multicore Matrix Multiplication
Parallel Processing Letters c World Scientific Publishing Company NUMA-aware Multicore Matrix Multiplication WAIL Y. ALKOWAILEET Department of Computer Science (Systems), University of California, Irvine,
More informationSparse Direct Solvers for Extreme-Scale Computing
Sparse Direct Solvers for Extreme-Scale Computing Iain Duff Joint work with Florent Lopez and Jonathan Hogg STFC Rutherford Appleton Laboratory SIAM Conference on Computational Science and Engineering
More informationCUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation
CUDA Accelerated Linpack on Clusters E. Phillips, NVIDIA Corporation Outline Linpack benchmark CUDA Acceleration Strategy Fermi DGEMM Optimization / Performance Linpack Results Conclusions LINPACK Benchmark
More informationGPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU
April 4-7, 2016 Silicon Valley GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU Steve Rennich, Darko Stosic, Tim Davis, April 6, 2016 OBJECTIVE Direct sparse methods are among the most widely
More informationOverview of research activities Toward portability of performance
Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into
More informationComparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015
Comparing OpenACC 2.5 and OpenMP 4.1 James C Beyer PhD, Sept 29 th 2015 Abstract As both an OpenMP and OpenACC insider I will present my opinion of the current status of these two directive sets for programming
More informationOverview: The OpenMP Programming Model
Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP
More informationParallel Linear Algebra in Julia
Parallel Linear Algebra in Julia Britni Crocker and Donglai Wei 18.337 Parallel Computing 12.17.2012 1 Table of Contents 1. Abstract... 2 2. Introduction... 3 3. Julia Implementation...7 4. Performance...
More informationLecture 16: Recapitulations. Lecture 16: Recapitulations p. 1
Lecture 16: Recapitulations Lecture 16: Recapitulations p. 1 Parallel computing and programming in general Parallel computing a form of parallel processing by utilizing multiple computing units concurrently
More informationOpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means
High Performance Computing: Concepts, Methods & Means OpenMP Programming Prof. Thomas Sterling Department of Computer Science Louisiana State University February 8 th, 2007 Topics Introduction Overview
More informationScheduling of QR Factorization Algorithms on SMP and Multi-core Architectures
Scheduling of Algorithms on SMP and Multi-core Architectures Gregorio Quintana-Ortí Enrique S. Quintana-Ortí Ernie Chan Robert A. van de Geijn Field G. Van Zee quintana@icc.uji.es Universidad Jaime I de
More informationOpenMP Examples - Tasking
Dipartimento di Ingegneria Industriale e dell Informazione University of Pavia December 4, 2017 Outline 1 2 Assignment 2: Quicksort Assignment 3: Jacobi Outline 1 2 Assignment 2: Quicksort Assignment 3:
More informationBarbara Chapman, Gabriele Jost, Ruud van der Pas
Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology
More informationCellSs Making it easier to program the Cell Broadband Engine processor
Perez, Bellens, Badia, and Labarta CellSs Making it easier to program the Cell Broadband Engine processor Presented by: Mujahed Eleyat Outline Motivation Architecture of the cell processor Challenges of
More informationOptimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators. Enrique S. Quintana-Ortí
Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators Enrique S. Quintana-Ortí Disclaimer Not a course on how to program dense linear algebra kernels on s Where have you
More informationSymmetric Indefinite Linear Solver using OpenMP Task on Multicore Architectures
1 Symmetric Indefinite Linear Solver using OpenMP Task on Multicore Architectures Ichitaro Yamazaki, Jakub Kurzak, Panruo Wu, Mawussi Zounon, and Jack Dongarra Abstract Recently, the Open Multi-Processing
More informationExam Design and Analysis of Algorithms for Parallel Computer Systems 9 15 at ÖP3
UMEÅ UNIVERSITET Institutionen för datavetenskap Lars Karlsson, Bo Kågström och Mikael Rännar Design and Analysis of Algorithms for Parallel Computer Systems VT2009 June 2, 2009 Exam Design and Analysis
More information