Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends
|
|
- Emmeline Gardner
- 6 years ago
- Views:
Transcription
1 Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen ComplexHPC Spring School 2013 Heterogeneous computing - Impact on algorithms June 7th, 2013 Uppsala University, Sweden Paolo Bientinesi (AICES, RWTH Aachen) 1 / 34
2 1 Setting the stage 2 Part 1: blocked algorithms 3 Part 2: multithreading, fork-join 4 Part 3: multihtreading, algorithms-by-blocks 5 Part 4: streaming Paolo Bientinesi (AICES, RWTH Aachen) 2 / 34
3 Dense Linear Algebra M = Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34
4 Dense Linear Algebra M = Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34
5 Dense Linear Algebra M = Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34
6 Dense Linear Algebra M = Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34
7 Dense Linear Algebra M = Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34
8 Dense Linear Algebra Linear systems Ax = b, AX = B, least squares,... Eigenproblems Ax = λx, AX = BXΛ, SVD,... Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34
9 Dense Linear Algebra Linear systems Ax = b, AX = B, least squares,... Eigenproblems Ax = λx, AX = BXΛ, SVD,... Support routines factorizations, reductions,... Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34
10 Dense Linear Algebra Matrix equations AX + XB = C, A = A+A 1 2,... Linear systems Ax = b, AX = B, least squares,... Eigenproblems Ax = λx, AX = BXΛ, SVD,... Support routines factorizations, reductions,... Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34
11 Organization in layers Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
12 Organization in layers BLAS Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
13 Organization in layers BLAS BLAS-1: y := y + αx x, y R n, α R dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
14 Organization in layers BLAS BLAS-2: y := y + Ax A, L R n n, x, y R n y := L 1 x BLAS-1: y := y + αx x, y R n, α R dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
15 Organization in layers BLAS BLAS-3: C := C + AB A, B, C, L R n n C := L 1 B BLAS-2: y := y + Ax A, L R n n, x, y R n y := L 1 x BLAS-1: y := y + αx x, y R n, α R dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
16 Organization in layers LAPACK LU = A, LL T = A, QR = A, Q T T Q = A,... BLAS BLAS-3: C := C + AB A, B, C, L R n n C := L 1 B BLAS-2: y := y + Ax A, L R n n, x, y R n y := L 1 x BLAS-1: y := y + αx x, y R n, α R dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
17 Organization in layers other libraries ScaLAPACK, Elemental, PETSc,... LAPACK LU = A, LL T = A, QR = A, Q T T Q = A,... BLAS BLAS-3: C := C + AB A, B, C, L R n n C := L 1 B BLAS-2: y := y + Ax A, L R n n, x, y R n y := L 1 x BLAS-1: y := y + αx x, y R n, α R dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
18 Example: AX = B (full A) AX = B Linear System LU = A LU Factorization LX = B Triangular System LX = B Triangular System C = AB + C Gemm C = AB + C Gemm C = AB + C Gemm Paolo Bientinesi (AICES, RWTH Aachen) 6 / 34
19 Why BLAS-3? Why GEMM? Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
20 Why BLAS-3? Why GEMM? BLAS #FLOPS Mem. refs. Ratio BLAS-1: y := y + αx x, y R n, α R Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
21 Why BLAS-3? Why GEMM? BLAS #FLOPS Mem. refs. Ratio Level 1 2n 3n 2/3 BLAS-1: y := y + αx x, y R n, α R Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
22 Why BLAS-3? Why GEMM? BLAS #FLOPS Mem. refs. Ratio Level 1 2n 3n 2/3 BLAS-2: y := y + Ax A R n n, x, y R n BLAS-1: y := y + αx x, y R n, α R Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
23 Why BLAS-3? Why GEMM? BLAS #FLOPS Mem. refs. Ratio Level 2 2n 2 n 2 2 Level 1 2n 3n 2/3 BLAS-2: y := y + Ax A R n n, x, y R n BLAS-1: y := y + αx x, y R n, α R Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
24 Why BLAS-3? Why GEMM? BLAS #FLOPS Mem. refs. Ratio Level 2 2n 2 n 2 2 Level 1 2n 3n 2/3 BLAS-3: C := C + AB A, B, C, R n n BLAS-2: y := y + Ax A R n n, x, y R n BLAS-1: y := y + αx x, y R n, α R Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
25 Why BLAS-3? Why GEMM? BLAS #FLOPS Mem. refs. Ratio Level 3 2n 3 4n 2 n/2 Level 2 2n 2 n 2 2 Level 1 2n 3n 2/3 BLAS-3: C := C + AB A, B, C, R n n BLAS-2: y := y + Ax A R n n, x, y R n BLAS-1: y := y + αx x, y R n, α R Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
26 Why BLAS-3? Why GEMM? BLAS #FLOPS Mem. refs. Ratio Level 3 2n 3 4n 2 n/2 Level 2 2n 2 n 2 2 Level 1 2n 3n 2/3 BLAS-3: C := C + AB A, B, C, R n n BLAS-2: y := y + Ax A R n n, x, y R n BLAS-1: y := y + αx x, y R n, α R Morale BLAS-3: The larger the problem the better, as long as it fits in memory. GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK. Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
27 Part 1: Blocked algorithms Simple example: Cholesky factorization Input: Goal: Matrix A, symmetric and positive definite. Determine L (lower triangular matrix) such that LL T = A L = ( LT L L BL L BR ) =? Paolo Bientinesi (AICES, RWTH Aachen) 8 / 34
28 Cholesky factorization iteration i DONE DONE Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34
29 Cholesky factorization iteration i + 1 unblocked algorithm blocked algorithm sqrt CHOL trsv syr TRSM SYRK Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34
30 Cholesky factorization iteration i + 1 unblocked algorithm blocked algorithm DONE DONE DONE DONE Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34
31 Cholesky: unblocked vs. blocked algorithms Paolo Bientinesi (AICES, RWTH Aachen) 10 / 34
32 Part 2: Parallelism? fork-join Solution #1: Multithreaded BLAS (+ vector instructions) Chol LU TRSM TRSM GEMM TRSM GEMM Paolo Bientinesi (AICES, RWTH Aachen) 11 / 34
33 Part 2: Parallelism? fork-join Solution #1: Multithreaded BLAS (+ vector instructions) Chol LU TRSM TRSM GEMM TRSM GEMM Advantage: ease of use. Legacy code! Drawback: unnecessary synchronization points OpenBLAS, ATLAS, BLIS, old versions of MKL,... Paolo Bientinesi (AICES, RWTH Aachen) 11 / 34
34 Multithreaded BLAS Xeon, 32 physical cores, MKL 1 Efficiency of GEMM 0.8 Efficiency Matrix dimension 1 thread 2 threads 4 threads 8 threads 16 threads 32 threads Paolo Bientinesi (AICES, RWTH Aachen) 12 / 34
35 Development of LA libraries - New architecture / new architectural features Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
36 Development of LA libraries - New architecture / new architectural features - GEMM Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
37 Development of LA libraries - New architecture / new architectural features - GEMM peak performance [FFT, SpMV] Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
38 Development of LA libraries - New architecture / new architectural features - GEMM peak performance [FFT, SpMV] - BLAS3, factorizations, AX=B Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
39 Development of LA libraries - New architecture / new architectural features - GEMM peak performance [FFT, SpMV] - BLAS3, factorizations, AX=B LINPACK benchmark Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
40 Development of LA libraries - New architecture / new architectural features - GEMM peak performance [FFT, SpMV] - BLAS3, factorizations, AX=B LINPACK benchmark - BLAS3, factorizations, AX=B Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
41 Development of LA libraries - New architecture / new architectural features - GEMM peak performance [FFT, SpMV] - BLAS3, factorizations, AX=B LINPACK benchmark - BLAS3, factorizations, AX=B - factorizations, AX=B Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
42 Development of LA libraries - New architecture / new architectural features - GEMM peak performance [FFT, SpMV] - BLAS3, factorizations, AX=B LINPACK benchmark - BLAS3, factorizations, AX=B - factorizations, AX=B - factorizations, AX=B, support routines Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
43 Development of LA libraries - New architecture / new architectural features - GEMM peak performance [FFT, SpMV] - BLAS3, factorizations, AX=B LINPACK benchmark - BLAS3, factorizations, AX=B - factorizations, AX=B - factorizations, AX=B, support routines. Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
44 Development of LA libraries - New architecture / new architectural features - GEMM peak performance [FFT, SpMV] - BLAS3, factorizations, AX=B LINPACK benchmark - BLAS3, factorizations, AX=B - factorizations, AX=B - factorizations, AX=B, support routines.. - eigenproblems Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
45 Examples : multicores GEMM, mtblas HPL : Blue Gene L dual cores (PowerPC A2) : Roadrunner dual cores (Opteron) 2009: Jaguar 6-cores (Opteron) : K Computer 8-cores SPARC64... libflame, PLASMA, MKL Eigensolvers: 2010 Paolo Bientinesi (AICES, RWTH Aachen) 14 / 34
46 Examples : Cell GEMM: 99% no BLAS [FFT] HPL : Roadrunner >1 PetaFLOP 6k * (Opteron + 2 Cell) no libs 2009: discontinued Eigensolvers: Paolo Bientinesi (AICES, RWTH Aachen) 15 / 34
47 Examples 2005: GPGPUs cublas (*) HPL 2010: Tianhe-1A 2.5k dual-gpu (ATI Radeon) CULA (*) libflame, MAGMA (multigpu) Eigensolvers: Paolo Bientinesi (AICES, RWTH Aachen) 16 / 34
48 Present Intel Xeon Phi (MIC) GEMM (*), MKL HPL (predicted) 2013: Tianhe-2 16k * (2 Ivy Bridge + 3 Phi) Paolo Bientinesi (AICES, RWTH Aachen) 17 / 34
49 Present Intel Xeon Phi (MIC) GEMM (*), MKL HPL (predicted) 2013: Tianhe-2 16k * (2 Ivy Bridge + 3 Phi) Also ARM, FPGAs, DSPs,... Paolo Bientinesi (AICES, RWTH Aachen) 17 / 34
50 Future 2017: Exascale (?)??? heterogeneous architecture Paolo Bientinesi (AICES, RWTH Aachen) 18 / 34
51 Back to the algorithms: Chol. fact Fork-join unnecessary synchronizations CHOL CHOL TRSM SYRK TRSM SYRK Iteration 1 Iteration 2 Paolo Bientinesi (AICES, RWTH Aachen) 19 / 34
52 Part 3: Parallelism? algorithms-by-blocks Idea: decompose the tasks Solution #2: dependent tasks + scheduling X A = B Paolo Bientinesi (AICES, RWTH Aachen) 20 / 34
53 Part 3: Parallelism? algorithms-by-blocks Idea: decompose the tasks Solution #2: dependent tasks + scheduling X X A 0 B 0 A = B = A 1 B 1 A 2 B 2 Paolo Bientinesi (AICES, RWTH Aachen) 20 / 34
54 Part 3: Parallelism? algorithms-by-blocks Idea: decompose the tasks Solution #2: dependent tasks + scheduling X X A 0 B 0 A 0 X = B 0 A = B = A 1 B 1 A 1 X = B 1 A 2 B 2 A 2 X = B 2 Paolo Bientinesi (AICES, RWTH Aachen) 20 / 34
55 Storage by blocks CHOL CHOL TRSM SYRK TRSM SYRK Iteration 1 Iteration 2 Paolo Bientinesi (AICES, RWTH Aachen) 21 / 34
56 Creating tasks Iteration 1 CHOL TRSM SYRK TRSM GEMM SYRK TRSM GEMM GEMM SYRK Paolo Bientinesi (AICES, RWTH Aachen) 22 / 34
57 Creating tasks Iteration 2 CHOL TRSM SYRK TRSM GEMM SYRK Paolo Bientinesi (AICES, RWTH Aachen) 22 / 34
58 Creating tasks Iteration 3 CHOL TRSM SYRK Paolo Bientinesi (AICES, RWTH Aachen) 22 / 34
59 Creating tasks Iteration 4 CHOL Paolo Bientinesi (AICES, RWTH Aachen) 22 / 34
60 Dependencies CHOL TRSM SYRK TRSM GEMM SYRK TRSM GEMM GEMM SYRK Paolo Bientinesi (AICES, RWTH Aachen) 23 / 34
61 Dependencies CHOL TRSM SYRK TRSM GEMM SYRK TRSM GEMM GEMM SYRK Paolo Bientinesi (AICES, RWTH Aachen) 23 / 34
62 Dependencies CHOL TRSM SYRK TRSM GEMM SYRK TRSM GEMM GEMM SYRK Paolo Bientinesi (AICES, RWTH Aachen) 23 / 34
63 Dependencies CHOL TRSM SYRK TRSM GEMM SYRK TRSM GEMM GEMM SYRK Paolo Bientinesi (AICES, RWTH Aachen) 23 / 34
64 Dependencies CHOL TRSM SYRK TRSM GEMM SYRK TRSM GEMM GEMM SYRK Paolo Bientinesi (AICES, RWTH Aachen) 23 / 34
65 DAG - Dependencies 4 4-tile matrix CHOL TRSM TRSM TRSM SYRK GEMM GEMM SYRK GEMM SYRK CHOL TRSM TRSM 3 3 SYRK GEMM SYRK CHOL TRSM SYRK CHOL Paolo Bientinesi (AICES, RWTH Aachen) 24 / 34
66 Task Execution 4 4-tile matrix Stage Scheduled Tasks 1 CHOL 2 TRSM TRSM TRSM TRSM 3 SYRK GEMM SYRK GEMM 4 GEMM SYRK GEMM GEMM 5 GEMM SYRK CHOL 6 TRSM TRSM TRSM 7 SYRK GEMM SYRK GEMM 8 GEMM SYRK CHOL 9 TRSM TRSM 10 SYRK GEMM SYRK 11 CHOL 12 TRSM 13 SYRK 14 CHOL Paolo Bientinesi (AICES, RWTH Aachen) 25 / 34
67 SPD Inverse: Chol+Inv+GEMM 5 5-tile matrix Paolo Bientinesi (AICES, RWTH Aachen) 26 / 34
68 SPD Inverse: Chol+Inv+GEMM 5 5-tile matrix Stage Scheduled Tasks 1 CHOL 2 TRSM TRSM TRSM TRSM 3 SYRK GEMM SYRK GEMM 4 GEMM SYRK GEMM GEMM 5 GEMM SYRK CHOL TRSM 6 TRSM TRSM TRSM TRSM 7 TRSM TRSM TRINV SYRK 8 GEMM SYRK GEMM GEMM 9 SYRK TTMM CHOL TRSM 10 TRSM TRSM TRSM TRSM 11 GEMM GEMM GEMM SYRK 12 GEMM SYRK TRSM CHOL 13 TRSM TRSM TRINV SYRK 14 TRSM GEMM GEMM GEMM 15 GEMM TRMM SYRK TRSM 16 TRSM TTMM CHOL TRSM 17 SYRK TRINV GEMM SYRK 18 GEMM GEMM GEMM TRMM 19 TRMM TRSM TRSM TRSM 20 TRSM TRSM TRSM TRSM 21 TTMM SYRK GEMM SYRK 22 TRINV GEMM GEMM TRINV 23 SYRK SYRK GEMM SYRK 24 TRMM GEMM TRMM GEMM 25 TRMM SYRK GEMM GEMM 26 TTMM GEMM TRMM TRMM 27 SYRK TRMM 28 TRMM 29 TTMM Paolo Bientinesi (AICES, RWTH Aachen) 26 / 34
69 Blocked vs. By-blocks Paolo Bientinesi (AICES, RWTH Aachen) 27 / 34
70 Blocked vs. By-blocks: crossover 80 Performance of Cholesky ( LL T = A ) GFLOPS SuperMatrix FLAME Matrix dimension Paolo Bientinesi (AICES, RWTH Aachen) 28 / 34
71 Multithreaded BLAS vs. Algorithms-by-blocks No absolute winner: crossover! Ease of use Synchronization Out of order execution Parallelism dictated by data dependencies Plateux libflame, MKL, PLASMA,... Heterogeneous systems schedulers Paolo Bientinesi (AICES, RWTH Aachen) 29 / 34
72 Part 4: Streaming The entire problem does not fit even in main memory Example b := ( X T M 1 X ) 1 X T M 1 y Paolo Bientinesi (AICES, RWTH Aachen) 30 / 34
73 Part 4: Streaming The entire problem does not fit even in main memory Example b := ( X T M 1 X ) 1 X T M 1 y Linear regression with non-independent outcomes Inputs M R n n, SP D(M), n [10 3,..., 10 4 ] X R n p, p [1,..., 20], full rank y R n Output b R p Sequence of thousands of problems Paolo Bientinesi (AICES, RWTH Aachen) 30 / 34
74 Algorithm LL T = M X := L 1 X S := X T X GG T = S y := L 1 y b := X T y b := G 1 b b := G T b Paolo Bientinesi (AICES, RWTH Aachen) 31 / 34
75 Algorithm bottleneck? LL T = M X := L 1 X to accelerator (TRSM) S := X T X GG T = S y := L 1 y b := X T y b := G 1 b b := G T b Paolo Bientinesi (AICES, RWTH Aachen) 31 / 34
76 double+triple buffering GPU GPU trsm b GPU trsm b+1 Send b+1 Send b+2 CPU Recv b-1 CPU comp b-1 Recv b CPU b t HDD Read b+2 Write b-1 Read b+3 CPU GPU transfer HDD CPU transfer GPU computation CPU computation Data dependencies Asynchronous dispatch Paolo Bientinesi (AICES, RWTH Aachen) 32 / 34
77 double+triple buffering GPU b-1 β trsm b α b-1 CPU/RAM b+1 b-1 b+2 C B A HDD Data X b-3 b b-2 b-1 b+1 b+2 b+3 Results r b-3 b-2 b-1 Paolo Bientinesi (AICES, RWTH Aachen) 32 / 34
78 Summary Dense linear algebra: it s matter of layers GEMM = foundations, THE building block Irrespective of the architecture, need for highly optimized BLAS How to use threads/cores? Multithreaded BLAS vs. algorithms-by-blocks Large/many problems but limited memory streaming Paolo Bientinesi (AICES, RWTH Aachen) 33 / 34
79 References MKL cublas, CULA libflame BLIS OpenBLAS Magma Plasma Paolo Bientinesi (AICES, RWTH Aachen) 34 / 34
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationHierarchical DAG Scheduling for Hybrid Distributed Systems
June 16, 2015 Hierarchical DAG Scheduling for Hybrid Distributed Systems Wei Wu, Aurelien Bouteiller, George Bosilca, Mathieu Faverge, Jack Dongarra IPDPS 2015 Outline! Introduction & Motivation! Hierarchical
More informationSolving Dense Linear Systems on Platforms with Multiple Hardware Accelerators
Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Francisco D. Igual Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Universidad Jaime I de Castellón (Spain) Robert A. van de
More informationScheduling of QR Factorization Algorithms on SMP and Multi-core Architectures
Scheduling of Algorithms on SMP and Multi-core Architectures Gregorio Quintana-Ortí Enrique S. Quintana-Ortí Ernie Chan Robert A. van de Geijn Field G. Van Zee quintana@icc.uji.es Universidad Jaime I de
More informationA Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection
A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF
More informationMAGMA. Matrix Algebra on GPU and Multicore Architectures
MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/
More informationHigh Performance Linear Algebra
High Performance Linear Algebra Hatem Ltaief Senior Research Scientist Extreme Computing Research Center King Abdullah University of Science and Technology 4th International Workshop on Real-Time Control
More informationSuperMatrix on Heterogeneous Platforms. Jianyu Huang SHPC, UT Austin
SuperMatrix on Heterogeneous Platforms Jianyu Huang SHPC, U Austin 1 How Heterogeneous? 2 How Many Languages? 3 How Many Languages? 3 Question! 4 FLAME Answer: SuperMatrix libflame SuperMatrix clblas OpenCL
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationMAGMA: a New Generation
1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release
More informationNEW ADVANCES IN GPU LINEAR ALGEBRA
GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012 QUICK ABOUT US» HPC/GPU Consulting Firm» Specializations in:» Electromagnetics» Image Processing» Fluid Dynamics» Linear
More informationOptimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators. Enrique S. Quintana-Ortí
Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators Enrique S. Quintana-Ortí Disclaimer Not a course on how to program dense linear algebra kernels on s Where have you
More informationA Standard for Batching BLAS Operations
A Standard for Batching BLAS Operations Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 5/8/16 1 API for Batching BLAS Operations We are proposing, as a community
More informationMAGMA. LAPACK for GPUs. Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville
MAGMA LAPACK for GPUs Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville Keeneland GPU Tutorial 2011, Atlanta, GA April 14-15,
More informationCOMPUTATIONAL LINEAR ALGEBRA
COMPUTATIONAL LINEAR ALGEBRA Matrix Vector Multiplication Matrix matrix Multiplication Slides from UCSD and USB Directed Acyclic Graph Approach Jack Dongarra A new approach using Strassen`s algorithm Jim
More informationPower Profiling of Cholesky and QR Factorizations on Distributed Memory Systems
International Conference on Energy-Aware High Performance Computing Hamburg, Germany Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Sept Profiling, DLA Algorithms ENAHPC / 6 Power Profiling of Cholesky and
More informationThinking Outside of the Tera-Scale Box. Piotr Luszczek
Thinking Outside of the Tera-Scale Box Piotr Luszczek Brief History of Tera-flop: 1997 1997 ASCI Red Brief History of Tera-flop: 2007 Intel Polaris 2007 1997 ASCI Red Brief History of Tera-flop: GPGPU
More informationGPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU
April 4-7, 2016 Silicon Valley GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU Steve Rennich, Darko Stosic, Tim Davis, April 6, 2016 OBJECTIVE Direct sparse methods are among the most widely
More informationSome notes on efficient computing and high performance computing environments
Some notes on efficient computing and high performance computing environments Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public
More informationJack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester
Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 11/20/13 1 Rank Site Computer Country Cores Rmax [Pflops] % of Peak Power [MW] MFlops /Watt 1 2 3 4 National
More informationBLASFEO. Gianluca Frison. BLIS retreat September 19, University of Freiburg
University of Freiburg BLIS retreat September 19, 217 Basic Linear Algebra Subroutines For Embedded Optimization performance dgemm_nt 5 4 Intel Core i7 48MQ HP OpenBLAS.2.19 MKL 217.2.174 ATLAS 3.1.3 BLIS.1.6
More informationApproaches to acceleration: GPUs vs Intel MIC. Fabio AFFINITO SCAI department
Approaches to acceleration: GPUs vs Intel MIC Fabio AFFINITO SCAI department Single core Multi core Many core GPU Intel MIC 61 cores 512bit-SIMD units from http://www.karlrupp.net/ from http://www.karlrupp.net/
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationDealing with Asymmetry for Performance and Energy Efficiency
Dealing with Asymmetryfor Performance and Energy Efficiency Enrique S. QUINTANA-ORTÍ Motivation Moore s law is alive, but Dennard s scaling is over Motivation Welcome dark silicon and asymmetric architectures
More informationRuntime Data Flow Graph Scheduling of Matrix Computations with Multiple Hardware Accelerators
Runtime Data Flow Graph Scheduling of Matrix Computations with Multiple Hardware Accelerators FLAME Working Note #5 Ernie Chan Department of Computer Science The University of Texas at Austin Austin, Texas
More informationAutomatic Development of Linear Algebra Libraries for the Tesla Series
Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source
More informationIntel Performance Libraries
Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation
More informationHigh performance matrix inversion of SPD matrices on graphics processors
High performance matrix inversion of SPD matrices on graphics processors Peter Benner, Pablo Ezzatti, Enrique S. Quintana-Ortí and Alfredo Remón Max-Planck-Institute for Dynamics of Complex Technical Systems
More informationStreaming Data from HDD to GPUs for Sustained Peak Performance
Streaming Data from HDD to GPUs for Sustained Peak Performance Lucas Beyer and Paolo Bientinesi RWTH Aachen University, Aachen Institute for advanced study in Computational Engineering Science, Germany
More informationGTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013
GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»
More informationParallel Programming
Parallel Programming Timings, pt.3 Prof. Paolo Bientinesi HPAC, RWTH Aachen pauldj@aices.rwth-aachen.de WS17/18 Time Wall time or wall-clock time : real time between the beginning and the end of a computation
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationFaster Code for Free: Linear Algebra Libraries. Advanced Research Compu;ng 22 Feb 2017
Faster Code for Free: Linear Algebra Libraries Advanced Research Compu;ng 22 Feb 2017 Outline Introduc;on Implementa;ons Using them Use on ARC systems Hands on session Conclusions Introduc;on 3 BLAS Level
More informationAccelerating GPU kernels for dense linear algebra
Accelerating GPU kernels for dense linear algebra Rajib Nath, Stanimire Tomov, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville {rnath1, tomov,
More informationPRACE PATC Course: Intel MIC Programming Workshop, MKL. Ostrava,
PRACE PATC Course: Intel MIC Programming Workshop, MKL Ostrava, 7-8.2.2017 1 Agenda A quick overview of Intel MKL Usage of MKL on Xeon Phi Compiler Assisted Offload Automatic Offload Native Execution Hands-on
More informationFast and reliable linear system solutions on new parallel architectures
Fast and reliable linear system solutions on new parallel architectures Marc Baboulin Université Paris-Sud Chaire Inria Saclay Île-de-France Séminaire Aristote - Ecole Polytechnique 15 mai 2013 Marc Baboulin
More informationThe Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations.
TITLE Basic Linear Algebra Subprograms BYLINE Robert A. van de Geijn Department of Computer Science The University of Texas at Austin Austin, TX USA rvdg@cs.utexas.edu Kazushige Goto Texas Advanced Computing
More informationPRACE PATC Course: Intel MIC Programming Workshop, MKL LRZ,
PRACE PATC Course: Intel MIC Programming Workshop, MKL LRZ, 27.6-29.6.2016 1 Agenda A quick overview of Intel MKL Usage of MKL on Xeon Phi - Compiler Assisted Offload - Automatic Offload - Native Execution
More informationScientific Computing. Some slides from James Lambers, Stanford
Scientific Computing Some slides from James Lambers, Stanford Dense Linear Algebra Scaling and sums Transpose Rank-one updates Rotations Matrix vector products Matrix Matrix products BLAS Designing Numerical
More informationParallel Linear Algebra in Julia
Parallel Linear Algebra in Julia Britni Crocker and Donglai Wei 18.337 Parallel Computing 12.17.2012 1 Table of Contents 1. Abstract... 2 2. Introduction... 3 3. Julia Implementation...7 4. Performance...
More informationLAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1
LAPACK Linear Algebra PACKage 1 Janice Giudice David Knezevic 1 Motivating Question Recalling from last week... Level 1 BLAS: vectors ops Level 2 BLAS: matrix-vectors ops 2 2 O( n ) flops on O( n ) data
More informationLinear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA
Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA Shirley Moore svmoore@utep.edu CPS5401 Fall 2012 svmoore.pbworks.com November 8, 2012 1 Learning ObjecNves AOer complenng this lesson, you
More informationDAG-Scheduled Linear Algebra Using Template-Based Building Blocks
DAG-Scheduled Linear Algebra Using Template-Based Building Blocks Jonathan Hogg STFC Rutherford Appleton Laboratory 1 / 20 19 March 2015 GPU Technology Conference San Jose, California * Thanks also to
More informationAn Overview of High Performance Computing and Challenges for the Future
An Overview of High Performance Computing and Challenges for the Future Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 6/15/2009 1 H. Meuer, H. Simon, E. Strohmaier,
More informationJack Dongarra University of Tennessee Oak Ridge National Laboratory
Jack Dongarra University of Tennessee Oak Ridge National Laboratory 3/9/11 1 TPP performance Rate Size 2 100 Pflop/s 100000000 10 Pflop/s 10000000 1 Pflop/s 1000000 100 Tflop/s 100000 10 Tflop/s 10000
More informationLevel-3 BLAS on a GPU: Picking the Low Hanging Fruit
Level-3 BLAS on a GPU: Picking the Low Hanging Fruit Francisco D. Igual Gregorio Quintana-Ortí Depto. de Ingeniería y Ciencia de Computadores Universidad Jaume I 12.071 Castellón Spain {figualgquintan}@icc.uji.es
More informationTechnology on Dense Linear Algebra
Impact of Multi core and Many core Technology on Dense Linear Algebra Enrique S. Quintana-Ortí Berlin, September 2011 Berlin, September 2011 1 Multi-core and Many-core The free lunch is over (H. Sutter,
More informationPerformance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development M. Serdar Celebi
More informationStudy and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou
Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm
More informationMax Planck Institute Magdeburg Preprints
Peter Benner Pablo Ezzatti Enrique S. Quintana-Ortí Alfredo Remón Matrix Inversion on CPU-GPU Platforms with Applications in Control Theory MAX PLANCK INSTITUT FÜR DYNAMIK KOMPLEXER TECHNISCHER SYSTEME
More informationPerformance Modeling of Pipelined Linear Algebra Architectures on FPGAs
Performance Modeling of Pipelined Linear Algebra Architectures on FPGAs Sam Skalicky, Sonia López, Marcin Łukowiak, James Letendre, and Matthew Ryan Rochester Institute of Technology, Rochester NY 14623,
More informationUnleashing the Power of Multi-GPU Accelerators with FLAME
Unleashing the Power of Multi-GPU Accelerators with FLAME Enrique S. Quintana Ortí Universidad Jaume I quintana@icc.uji.es SIAM Conference on Parallel Processing for Scientific Computing Seattle, 2010
More informationMatrix Computations on GPUs, multiple GPUs and clusters of GPUs
Matrix Computations on GPUs, multiple GPUs and clusters of GPUs Francisco D. Igual Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón (Spain). Matrix Computations on
More informationCUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation
CUDA Accelerated Linpack on Clusters E. Phillips, NVIDIA Corporation Outline Linpack benchmark CUDA Acceleration Strategy Fermi DGEMM Optimization / Performance Linpack Results Conclusions LINPACK Benchmark
More informationAcceleration of Hessenberg Reduction for Nonsymmetric Matrix
Acceleration of Hessenberg Reduction for Nonsymmetric Matrix by Hesamaldin Nekouei Bachelor of Science Degree in Electrical Engineering Iran University of Science and Technology, Iran, 2009 A thesis presented
More informationPerformance of deal.ii on a node
Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions
More informationImplementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS
Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS Jianyu Huang, Leslie Rice Joint work with Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn BLIS Retreat 2016 *Overlook of
More informationBatch Linear Algebra for GPU-Accelerated High Performance Computing Environments
Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra SIAM Conference on Computational Science and Engineering
More informationOverview of research activities Toward portability of performance
Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into
More informationA Multi-Tiered Optimization Framework for Heterogeneous Computing
A Multi-Tiered Optimization Framework for Heterogeneous Computing IEEE HPEC 2014 Alan George Professor of ECE University of Florida Herman Lam Assoc. Professor of ECE University of Florida Andrew Milluzzi
More informationCUDA Accelerated Compute Libraries. M. Naumov
CUDA Accelerated Compute Libraries M. Naumov Outline Motivation Why should you use libraries? CUDA Toolkit Libraries Overview of performance CUDA Proprietary Libraries Address specific markets Third Party
More informationIntroduction to GPGPUs and to CUDA programming model: CUDA Libraries
Introduction to GPGPUs and to CUDA programming model: CUDA Libraries www.cineca.it Marzia Rivi m.rivi@cineca.it NVIDIA CUDA Libraries http://developer.nvidia.com/technologies/libraries CUDA Toolkit includes
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationBLAS. Basic Linear Algebra Subprograms
BLAS Basic opera+ons with vectors and matrices dominates scien+fic compu+ng programs To achieve high efficiency and clean computer programs an effort has been made in the last few decades to standardize
More informationIntel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager
Intel Math Kernel Library (Intel MKL) BLAS Victor Kostin Intel MKL Dense Solvers team manager Intel MKL BLAS/Sparse BLAS Original ( dense ) BLAS available from www.netlib.org Additionally Intel MKL provides
More informationA class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines
Available online at www.sciencedirect.com Procedia Computer Science 9 (2012 ) 17 26 International Conference on Computational Science, ICCS 2012 A class of communication-avoiding algorithms for solving
More informationIntel Math Kernel Library
Intel Math Kernel Library Release 7.0 March 2005 Intel MKL Purpose Performance, performance, performance! Intel s scientific and engineering floating point math library Initially only basic linear algebra
More informationRuntime Systems and Out-of-Core Cholesky Factorization on the Intel Xeon Phi System
Runtime Systems and Out-of-Core Cholesky Factorization on the Intel Xeon Phi System Allan Richmond R. Morales, Chong Tian, Kwai Wong, Eduardo D Azevedo The George Washington University, The Chinese University
More informationTowards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures
Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures Emmanuel Agullo 1, Henricus Bouwmeester 2, Jack Dongarra 1, Jakub Kurzak 1, Julien Langou 2,andLeeRosenberg
More informationPerformance Modeling for Ranking Blocked Algorithms
Performance Modeling for Ranking Blocked Algorithms Elmar Peise Aachen Institute for Advanced Study in Computational Engineering Science 27.4.2012 Elmar Peise (AICES) Performance Modeling 27.4.2012 1 Blocked
More information*Yuta SAWA and Reiji SUDA The University of Tokyo
Auto Tuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS *Yuta SAWA and Reiji SUDA The University of Tokyo iwapt 29 October 1-2 *Now in Central Research Laboratory, Hitachi,
More informationA GPU Sparse Direct Solver for AX=B
1 / 25 A GPU Sparse Direct Solver for AX=B Jonathan Hogg, Evgueni Ovtchinnikov, Jennifer Scott* STFC Rutherford Appleton Laboratory 26 March 2014 GPU Technology Conference San Jose, California * Thanks
More informationCray Scientific Libraries: Overview and Performance. Cray XE6 Performance Workshop University of Reading Nov 2012
Cray Scientific Libraries: Overview and Performance Cray XE6 Performance Workshop University of Reading 20-22 Nov 2012 Contents LibSci overview and usage BFRAME / CrayBLAS LAPACK ScaLAPACK FFTW / CRAFFT
More informationArchitecture, Programming and Performance of MIC Phi Coprocessor
Architecture, Programming and Performance of MIC Phi Coprocessor JanuszKowalik, Piotr Arłukowicz Professor (ret), The Boeing Company, Washington, USA Assistant professor, Faculty of Mathematics, Physics
More informationToward a supernodal sparse direct solver over DAG runtimes
Toward a supernodal sparse direct solver over DAG runtimes HOSCAR 2013, Bordeaux X. Lacoste Xavier LACOSTE HiePACS team Inria Bordeaux Sud-Ouest November 27, 2012 Guideline Context and goals About PaStiX
More informationGPU Programming. Ringberg Theorie Seminar 2010
or How to tremendously accelerate your code? Michael Kraus, Christian Konz Max-Planck-Institut für Plasmaphysik, Garching Ringberg Theorie Seminar 2010 Introduction? GPU? GPUs can do more than just render
More informationTowards an efficient tile matrix inversion of symmetric positive definite matrices on multicore architectures
Towards an efficient tile matrix inversion of symmetric positive definite matrices on multicore architectures Emmanuel Agullo, Henricus Bouwmeester, Jack Dongarra, Jakub Kurzak, Julien Langou, Lee Rosenberg
More informationParallel Programming
Parallel Programming Introduction Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Acknowledgements Prof. Felix Wolf, TU Darmstadt Prof. Matthias
More informationProgramming Dense Linear Algebra Kernels on Vectorized Architectures
University of Tennessee, Knoxville Trace: Tennessee Research and Creative Exchange Masters Theses Graduate School 5-2013 Programming Dense Linear Algebra Kernels on Vectorized Architectures Jonathan Lawrence
More informationA scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. () Published online in Wiley Online Library (wileyonlinelibrary.com)..33 A scalable approach to solving dense linear
More informationIntel MIC Architecture. Dr. Momme Allalen, LRZ, PRACE PATC: Intel MIC&GPU Programming Workshop
Intel MKL @ MIC Architecture Dr. Momme Allalen, LRZ, allalen@lrz.de PRACE PATC: Intel MIC&GPU Programming Workshop 1 2 Momme Allalen, HPC with GPGPUs, Oct. 10, 2011 What is the Intel MKL? Math library
More informationMixed Precision Methods
Mixed Precision Methods Mixed precision, use the lowest precision required to achieve a given accuracy outcome " Improves runtime, reduce power consumption, lower data movement " Reformulate to find correction
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationLinear Algebra for Modern Computers. Jack Dongarra
Linear Algebra for Modern Computers Jack Dongarra Tuning for Caches 1. Preserve locality. 2. Reduce cache thrashing. 3. Loop blocking when out of cache. 4. Software pipelining. 2 Indirect Addressing d
More informationTask based parallelization of recursive linear algebra routines using Kaapi
Task based parallelization of recursive linear algebra routines using Kaapi Clément PERNET joint work with Jean-Guillaume DUMAS and Ziad SULTAN Université Grenoble Alpes, LJK-CASYS January 20, 2017 Journée
More informationHeterogeneous platforms
Heterogeneous platforms Systems combining main processors and accelerators e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC Any platform using a GPU is a heterogeneous platform! Further in this talk
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationHigh Performance Dense Linear Algebra in Intel Math Kernel Library (Intel MKL)
High Performance Dense Linear Algebra in Intel Math Kernel Library (Intel MKL) Michael Chuvelev, Intel Corporation, michael.chuvelev@intel.com Sergey Kazakov, Intel Corporation sergey.kazakov@intel.com
More informationQuantum ESPRESSO on GPU accelerated systems
Quantum ESPRESSO on GPU accelerated systems Massimiliano Fatica, Everett Phillips, Josh Romero - NVIDIA Filippo Spiga - University of Cambridge/ARM (UK) MaX International Conference, Trieste, Italy, January
More informationMAGMA Library. version 0.1. S. Tomov J. Dongarra V. Volkov J. Demmel
MAGMA Library version 0.1 S. Tomov J. Dongarra V. Volkov J. Demmel 2 -- MAGMA (version 0.1) -- Univ. of Tennessee, Knoxville Univ. of California, Berkeley Univ. of Colorado, Denver June 2009 MAGMA project
More informationCS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)
CS 470 Spring 2016 Mike Lam, Professor Other Architectures (with an aside on linear algebra) Parallel Systems Shared memory (uniform global address space) Primary story: make faster computers Programming
More informationDense Linear Algebra for Hybrid GPU-Based Systems. Stanimire Tomov Department of Electrical Engineering and Computer Science, University of Tennessee
Chapter 3 Dense Linear Algebra for Hybrid GPU-Based Systems Stanimire Tomov Department of Electrical Engineering and Computer Science, University of Tennessee Jack Dongarra Department of Electrical Engineering
More informationAccelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design
1 Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design Farhad Merchant, Tarun Vatwani, Anupam Chattopadhyay, Senior Member, IEEE, Soumyendu Raha, S K Nandy, Senior Member,
More informationSparse Matrix Algorithms
Sparse Matrix Algorithms combinatorics + numerical methods + applications Math + X Tim Davis University of Florida June 2013 contributions to the field current work vision for the future Outline Math+X
More informationTechnical Report Performance Analysis of CULA on different NVIDIA GPU Architectures. Prateek Gupta
Technical Report 2014-02 Performance Analysis of CULA on different NVIDIA GPU Architectures Prateek Gupta May 20, 2014 1 Spring 2014: Performance Analysis of CULA on different NVIDIA GPU Architectures
More informationAccelerating GPU Kernels for Dense Linear Algebra
Accelerating GPU Kernels for Dense Linear Algebra Rajib Nath, Stan Tomov, and Jack Dongarra Innovative Computing Lab University of Tennessee, Knoxville July 9, 21 xgemm performance of CUBLAS-2.3 on GTX28
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More information