Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends


Paolo Bientinesi, AICES, RWTH Aachen
ComplexHPC Spring School 2013: Heterogeneous computing, impact on algorithms
June 7th, 2013, Uppsala University, Sweden

Outline
1. Setting the stage
2. Part 1: blocked algorithms
3. Part 2: multithreading, fork-join
4. Part 3: multithreading, algorithms-by-blocks
5. Part 4: streaming

Dense Linear Algebra
[figure: a dense matrix M, shown filling up over several animation steps]

Dense Linear Algebra
- Matrix equations: AX + XB = C, A := (A + A^{-1})/2, ...
- Linear systems: Ax = b, AX = B, least squares, ...
- Eigenproblems: Ax = λx, AX = BXΛ, SVD, ...
- Support routines: factorizations, reductions, ...

Organization in layers

Other libraries: ScaLAPACK, Elemental, PETSc, ...
LAPACK:  LU = A,  LL^T = A,  QR = A,  QTQ^T = A,  ...
BLAS-3:  C := C + AB     A, B, C, L ∈ R^{n×n}          also C := L^{-1}B
BLAS-2:  y := y + Ax     A, L ∈ R^{n×n}, x, y ∈ R^n    also y := L^{-1}x
BLAS-1:  y := y + αx     x, y ∈ R^n, α ∈ R             also dot := α + x^T y
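To make the layering concrete, here is a hedged sketch in C with one call per layer, using the standard CBLAS/LAPACKE interfaces (the function name, sizes, and leading dimensions are illustrative; error checking omitted):

```c
#include <cblas.h>
#include <lapacke.h>

void layers_demo(int n, double *A, double *B, double *C,
                 double *x, double *y, lapack_int *ipiv) {
    /* BLAS-1: y := y + 1.0 * x  (axpy: O(n) flops on O(n) data) */
    cblas_daxpy(n, 1.0, x, 1, y, 1);

    /* BLAS-2: y := y + A x  (gemv: O(n^2) flops on O(n^2) data) */
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                1.0, A, n, x, 1, 1.0, y, 1);

    /* BLAS-3: C := C + A B  (gemm: O(n^3) flops on O(n^2) data) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 1.0, C, n);

    /* LAPACK: LU = A with partial pivoting, built on top of BLAS-3 */
    LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, A, n, ipiv);
}
```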

Example: AX = B (full A)

AX = B (linear system)
  → LU = A (LU factorization) + two triangular systems (LY = B, then UX = Y)
  → each of these, in turn, is cast in terms of GEMM (C := AB + C)
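In library terms the whole chain is a single driver call; a minimal sketch (LAPACKE interface, column-major, assuming the helper name solve_axb), where dgesv internally performs the LU factorization and the triangular solves:

```c
#include <stdlib.h>
#include <lapacke.h>

/* Solve AX = B for a full n x n matrix A and nrhs right-hand sides.
 * On exit, A holds its LU factors (PA = LU) and B is overwritten with X. */
int solve_axb(int n, int nrhs, double *A, double *B) {
    lapack_int *ipiv = malloc(n * sizeof(lapack_int));  /* pivot indices */
    int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs, A, n, ipiv, B, n);
    free(ipiv);
    return info;   /* 0 on success; >0 if U is exactly singular */
}
```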

Why BLAS-3? Why GEMM?

  BLAS level                                           #FLOPS   Mem. refs.   Ratio
  Level 3:  C := C + AB   (A, B, C ∈ R^{n×n})          2n^3     4n^2         n/2
  Level 2:  y := y + Ax   (A ∈ R^{n×n}, x, y ∈ R^n)    2n^2     n^2          2
  Level 1:  y := y + αx   (x, y ∈ R^n, α ∈ R)          2n       3n           2/3

Morale
BLAS-3: the larger the problem, the better, as long as it fits in memory.
GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK.
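As a worked check of the level-3 row (under the usual convention of counting one memory reference per matrix element read or written):

```latex
\text{flops}(C := C + AB) = 2n^3
  \quad\text{(one multiply and one add per } (i,j,k) \text{ triple)},
\qquad
\text{mem.\ refs} = \underbrace{3n^2}_{\text{read } A,\,B,\,C}
                  + \underbrace{n^2}_{\text{write } C} = 4n^2,
\qquad
\text{ratio} = \frac{2n^3}{4n^2} = \frac{n}{2}.
```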

Part 1: Blocked algorithms

Simple example: Cholesky factorization
Input: matrix A, symmetric and positive definite.
Goal: determine L (lower triangular) such that LL^T = A.

L = [ L_TL    0   ]  = ?
    [ L_BL  L_BR ]

Cholesky factorization, iteration i
[figure: partitioned matrix; the already-factored top-left part is marked DONE]

Cholesky factorization, iteration i + 1

unblocked algorithm: sqrt, trsv, syr (scalar and BLAS-1/2 operations)
blocked algorithm:   CHOL, TRSM, SYRK (BLAS-3 operations)
[figure: in both variants, the already-factored parts are marked DONE]

Cholesky: unblocked vs. blocked algorithms
[figure: performance comparison]
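The blocked variant maps each iteration onto exactly the three kernels above; a hedged C sketch of a right-looking blocked Cholesky (column-major storage; the block size nb and the function name are illustrative assumptions):

```c
#include <cblas.h>
#include <lapacke.h>

/* LL^T = A, lower triangle of A overwritten with L. Returns 0 on success. */
int blocked_cholesky(int n, int nb, double *A, int lda) {
    for (int k = 0; k < n; k += nb) {
        int b = (n - k < nb) ? n - k : nb;        /* current block size */
        /* CHOL: factor the b x b diagonal block (unblocked kernel) */
        int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', b,
                                  &A[k + k * lda], lda);
        if (info != 0) return info;               /* not positive definite */
        if (k + b < n) {
            int m = n - k - b;                    /* rows below the block */
            /* TRSM: L_BL := A_BL * L_TL^{-T} */
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,
                        CblasTrans, CblasNonUnit, m, b, 1.0,
                        &A[k + k * lda], lda, &A[k + b + k * lda], lda);
            /* SYRK: A_BR := A_BR - L_BL * L_BL^T  (trailing update, BLAS-3) */
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        m, b, -1.0, &A[k + b + k * lda], lda,
                        1.0, &A[k + b + (k + b) * lda], lda);
        }
    }
    return 0;
}
```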

Part 2: Parallelism? Fork-join

Solution #1: multithreaded BLAS (+ vector instructions)
[diagram: Chol/LU decomposed into TRSM and GEMM calls; parallelism opens and closes inside each call]

Advantage: ease of use. Legacy code!
Drawback: unnecessary synchronization points.
OpenBLAS, ATLAS, BLIS, old versions of MKL, ...
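The appeal is that unchanged sequential code becomes parallel just by linking a threaded BLAS; a small sketch, assuming OpenBLAS (openblas_set_num_threads is OpenBLAS-specific; MKL, ATLAS, and BLIS expose analogous controls or environment variables):

```c
#include <cblas.h>

/* Declared in OpenBLAS's cblas.h; repeated here for self-containment. */
extern void openblas_set_num_threads(int num_threads);

void update(int n, const double *A, const double *B, double *C) {
    openblas_set_num_threads(8);   /* threads are forked inside the library */
    /* One ordinary GEMM call: all parallelism (and the implicit join
     * at the end of the call) happens inside the BLAS, invisible here. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 1.0, C, n);
}
```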

Multithreaded BLAS: Xeon, 32 physical cores, MKL
[figure: efficiency of GEMM vs. matrix dimension, for 1, 2, 4, 8, 16, and 32 threads]

Development of LA libraries

Whenever a new architecture / new architectural features appear, support is built up in stages:
- GEMM                                      → peak performance [cf. FFT, SpMV]
- BLAS-3, factorizations, AX = B            → LINPACK benchmark
- factorizations, AX = B
- factorizations, AX = B, support routines
- ...
- eigenproblems

Examples: multicores
GEMM, multithreaded BLAS
HPL:
  2004: Blue Gene/L, dual-core nodes (PowerPC)
  2008: Roadrunner, dual-core Opterons
  2009: Jaguar, 6-core Opterons
  2011: K Computer, 8-core SPARC64
  ...
libflame, PLASMA, MKL
Eigensolvers: 2010

Examples: Cell
GEMM: 99% of peak; no full BLAS [FFT]
HPL:
  2008: Roadrunner, >1 PetaFLOP, 6k * (Opteron + 2 Cell)
No libraries.
2009: discontinued
Eigensolvers: (none)

Examples: GPGPUs (since 2005)
cuBLAS (*)
HPL:
  2010: Tianhe-1A, 2.5k dual-GPU nodes (ATI Radeon)
CULA (*); libflame, MAGMA (multi-GPU)
Eigensolvers: (none)

Present: Intel Xeon Phi (MIC)
GEMM (*), MKL
HPL (predicted)
2013: Tianhe-2, 16k * (2 Ivy Bridge + 3 Phi)

Also ARM, FPGAs, DSPs, ...

Future
2017: Exascale (?); heterogeneous architecture (???)

Back to the algorithms: Cholesky factorization

Fork-join: unnecessary synchronizations
[diagram: iteration 1 (CHOL, then TRSMs, then SYRKs) followed by iteration 2, with a full join between stages]

Part 3: Parallelism? Algorithms-by-blocks

Idea: decompose the tasks. Solution #2: dependent tasks + scheduling.

Partition the operands into blocks, turning one large operation into many small, schedulable tasks:

A = [A_0; A_1; A_2],  B = [B_0; B_1; B_2]   ⇒   A_0 X = B_0,  A_1 X = B_1,  A_2 X = B_2

Storage by blocks
[diagram: the Cholesky kernels of iterations 1 and 2 (CHOL, TRSM, SYRK), now operating on individually stored tiles]

Creating tasks

Iteration 1: CHOL, TRSM, SYRK, TRSM, GEMM, SYRK, TRSM, GEMM, GEMM, SYRK
Iteration 2: CHOL, TRSM, SYRK, TRSM, GEMM, SYRK
Iteration 3: CHOL, TRSM, SYRK
Iteration 4: CHOL
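The task-creation loop can be written directly on top of a task-based runtime; a hedged sketch using OpenMP task dependencies, from which the runtime derives the DAG shown on the next slides (the tile size B, tile count NT, and the pointer-per-tile layout are illustrative assumptions; libflame/SuperMatrix and PLASMA use their own runtimes):

```c
#include <cblas.h>
#include <lapacke.h>

#define NT 4     /* number of tile rows/cols (4 x 4-tile matrix) */
#define B  256   /* tile size; illustrative */

/* Tiled Cholesky: tasks + dependencies instead of fork-join.
 * T[i][j] points to a column-major B x B tile. */
void chol_by_blocks(double *T[NT][NT]) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        #pragma omp task depend(inout: T[k][k][0])
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', B, T[k][k], B);     /* CHOL */
        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: T[k][k][0]) depend(inout: T[i][k][0])
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,    /* TRSM */
                        CblasTrans, CblasNonUnit, B, B, 1.0,
                        T[k][k], B, T[i][k], B);
        }
        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: T[i][k][0]) depend(inout: T[i][i][0])
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,  /* SYRK */
                        B, B, -1.0, T[i][k], B, 1.0, T[i][i], B);
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: T[i][k][0], T[j][k][0]) \
                                 depend(inout: T[i][j][0])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, /* GEMM */
                            B, B, B, -1.0, T[i][k], B, T[j][k], B,
                            1.0, T[i][j], B);
            }
        }
    }   /* tasks complete at the implicit barrier of the parallel region */
}
```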

Dependencies
[diagram, shown step by step: for the iteration-1 tasks (CHOL, 3 TRSM, 3 SYRK, 3 GEMM), arrows indicate which tasks consume each task's output]

DAG of dependencies, 4 x 4-tile matrix
[figure: the complete task DAG: CHOL feeding 3 TRSMs, feeding the SYRKs/GEMMs; then the second-iteration CHOL, 2 TRSMs, SYRKs/GEMMs; then CHOL, TRSM, SYRK; and the final CHOL]

Task execution, 4 x 4-tile matrix

Stage   Scheduled tasks
 1      CHOL
 2      TRSM TRSM TRSM TRSM
 3      SYRK GEMM SYRK GEMM
 4      GEMM SYRK GEMM GEMM
 5      GEMM SYRK CHOL
 6      TRSM TRSM TRSM
 7      SYRK GEMM SYRK GEMM
 8      GEMM SYRK CHOL
 9      TRSM TRSM
10      SYRK GEMM SYRK
11      CHOL
12      TRSM
13      SYRK
14      CHOL

SPD Inverse: Chol + Inv + GEMM, 5 x 5-tile matrix

Stage   Scheduled tasks
 1      CHOL
 2      TRSM TRSM TRSM TRSM
 3      SYRK GEMM SYRK GEMM
 4      GEMM SYRK GEMM GEMM
 5      GEMM SYRK CHOL TRSM
 6      TRSM TRSM TRSM TRSM
 7      TRSM TRSM TRINV SYRK
 8      GEMM SYRK GEMM GEMM
 9      SYRK TTMM CHOL TRSM
10      TRSM TRSM TRSM TRSM
11      GEMM GEMM GEMM SYRK
12      GEMM SYRK TRSM CHOL
13      TRSM TRSM TRINV SYRK
14      TRSM GEMM GEMM GEMM
15      GEMM TRMM SYRK TRSM
16      TRSM TTMM CHOL TRSM
17      SYRK TRINV GEMM SYRK
18      GEMM GEMM GEMM TRMM
19      TRMM TRSM TRSM TRSM
20      TRSM TRSM TRSM TRSM
21      TTMM SYRK GEMM SYRK
22      TRINV GEMM GEMM TRINV
23      SYRK SYRK GEMM SYRK
24      TRMM GEMM TRMM GEMM
25      TRMM SYRK GEMM GEMM
26      TTMM GEMM TRMM TRMM
27      SYRK TRMM
28      TRMM
29      TTMM

Blocked vs. by-blocks: crossover
[figure: performance of Cholesky (LL^T = A), GFLOPS vs. matrix dimension, comparing SuperMatrix (algorithms-by-blocks) against blocked FLAME]

Multithreaded BLAS vs. algorithms-by-blocks

- No absolute winner: crossover!
- Ease of use
- Synchronization
- Out-of-order execution
- Parallelism dictated by data dependencies
- Plateaus
- libflame, MKL, PLASMA, ...
- Heterogeneous systems: schedulers

Part 4: Streaming

The entire problem does not fit even in main memory.

Example: b := (X^T M^{-1} X)^{-1} X^T M^{-1} y
Linear regression with non-independent outcomes.

Inputs:  M ∈ R^{n×n}, SPD, n ∈ [10^3, ..., 10^4]
         X ∈ R^{n×p}, full rank, p ∈ [1, ..., 20]
         y ∈ R^n
Output:  b ∈ R^p
A sequence of thousands of such problems.

Algorithm (where is the bottleneck?)

LL^T = M
X := L^{-1} X     ← bottleneck; offload to the accelerator (TRSM)
S := X^T X
GG^T = S
y := L^{-1} y
b := X^T y
b := G^{-1} b
b := G^{-T} b
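Spelled out in BLAS/LAPACK calls, the algorithm is a short sequence of kernels; a hedged C sketch (column-major, error handling omitted, function name illustrative; the fixed S buffer relies on the slide's bound p ≤ 20):

```c
#include <cblas.h>
#include <lapacke.h>

void gls_solve(int n, int p, double *M, double *X, double *y, double *b) {
    /* LL^T = M (L overwrites the lower triangle of M) */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, M, n);
    /* X := L^{-1} X  -- the bottleneck; this is the TRSM to offload */
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                CblasNonUnit, n, p, 1.0, M, n, X, n);
    /* S := X^T X (SYRK), p x p */
    double S[20 * 20];                 /* p <= 20 per the slides */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasTrans, p, n,
                1.0, X, n, 0.0, S, p);
    /* GG^T = S */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', p, S, p);
    /* y := L^{-1} y */
    cblas_dtrsv(CblasColMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                n, M, n, y, 1);
    /* b := X^T y */
    cblas_dgemv(CblasColMajor, CblasTrans, n, p, 1.0, X, n, y, 1, 0.0, b, 1);
    /* b := G^{-1} b, then b := G^{-T} b */
    cblas_dtrsv(CblasColMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                p, S, p, b, 1);
    cblas_dtrsv(CblasColMajor, CblasLower, CblasTrans, CblasNonUnit,
                p, S, p, b, 1);
}
```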

Double + triple buffering

[diagram: pipeline across HDD, CPU, and GPU. While the GPU runs the TRSM for problem b, problem b+1 is being sent to the GPU; meanwhile the CPU receives the result of problem b-1 and finishes its post-processing, and the HDD prefetches problem b+2 while writing back result b-1.]
Data dependencies; asynchronous dispatch.
[diagram: the GPU double-buffers (α, β); the CPU/RAM triple-buffers (A, B, C) between the data X_b streamed from the HDD and the results r_b written back]
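A minimal sketch of the CPU-GPU double-buffering level, in C with the CUDA runtime (the problem size SZ, count NP, and the kernel compute_on_gpu are hypothetical; for real overlap the host buffers would also need to be pinned, and the HDD stage would add the third buffering level):

```c
#include <cuda_runtime.h>

enum { SZ = 1 << 20, NP = 1000 };   /* illustrative problem size / count */

/* Hypothetical GPU kernel launcher (e.g., wrapping cublasDtrsm). */
extern void compute_on_gpu(double *dev_buf, int size, cudaStream_t s);

void stream_problems(double *host_data /* NP problems of SZ doubles */) {
    double *dev[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) {
        cudaMalloc(&dev[i], SZ * sizeof(double));
        cudaStreamCreate(&s[i]);
    }
    for (int b = 0; b < NP; b++) {
        int cur = b & 1;            /* alternate between the two buffers */
        /* Asynchronous H2D copy of problem b on its own stream ... */
        cudaMemcpyAsync(dev[cur], host_data + (size_t)b * SZ,
                        SZ * sizeof(double), cudaMemcpyHostToDevice, s[cur]);
        /* ... compute enqueued behind it on the same stream, while the
         * other stream is still transferring/computing problem b-1. */
        compute_on_gpu(dev[cur], SZ, s[cur]);
        cudaMemcpyAsync(host_data + (size_t)b * SZ, dev[cur],
                        SZ * sizeof(double), cudaMemcpyDeviceToHost, s[cur]);
    }
    cudaDeviceSynchronize();        /* drain both pipelines (cleanup omitted) */
}
```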

Summary

- Dense linear algebra: it's a matter of layers.
- GEMM = the foundation, THE building block.
- Irrespective of the architecture, there is a need for a highly optimized BLAS.
- How to use threads/cores? Multithreaded BLAS vs. algorithms-by-blocks.
- Large/many problems but limited memory: streaming.

References
MKL, cuBLAS, CULA, libflame, BLIS, OpenBLAS, MAGMA, PLASMA
