SuperMatrix on Heterogeneous Platforms. Jianyu Huang SHPC, UT Austin
1 SuperMatrix on Heterogeneous Platforms. Jianyu Huang, SHPC, UT Austin
2 How Heterogeneous?
3 How Many Languages?
5 Question!
6 FLAME Answer: SuperMatrix. Software stack: libflame + SuperMatrix on top; underneath, clBLAS (OpenCL) and cuBLAS (CUDA) for the GPU; ACML, MKL, and BLIS (OpenMP/pthreads; C/assembly/Fortran) for the CPU/MIC; BLIS (OpenMP/pthreads; C/assembly) for accelerators and other platforms.
7 FLAME Answer: SuperMatrix. Programmability: use the tools provided by FLAME. Parallelism: directed acyclic graph (DAG) scheduling.
8 FLAME Answer: SuperMatrix
- Chan, E., Quintana-Ortí, E. S., Quintana-Ortí, G., and van de Geijn, R. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA'07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, San Diego, CA, USA, June 9-11, 2007.
- Chan, E., Van Zee, F. G., Quintana-Ortí, E. S., Quintana-Ortí, G., and van de Geijn, R. Satisfying your dependencies with SuperMatrix. In Cluster'07: Proceedings of the 2007 IEEE International Conference on Cluster Computing, pages 91-99, Austin, TX, USA, September 17-20, 2007.
- Chan, E., Van Zee, F. G., Bientinesi, P., Quintana-Ortí, E. S., Quintana-Ortí, G., and van de Geijn, R. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In PPoPP'08: Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Salt Lake City, UT, USA, February 20-23, 2008.
- Quintana-Ortí, G., Igual, F. D., Quintana-Ortí, E. S., and van de Geijn, R. Solving dense linear systems on platforms with multiple hardware accelerators. In PPoPP'09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009.
- Quintana-Ortí, G., Quintana-Ortí, E. S., van de Geijn, R., Van Zee, F. G., and Chan, E. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1-14:26, July 2009.
- Chan, E. Application of Dependence Analysis and Runtime Data Flow Graph Scheduling to Matrix Computations. Ph.D. dissertation, Department of Computer Science, The University of Texas at Austin.
- Quintana-Ortí, G., Igual, F. D., Marqués, M., Quintana-Ortí, E. S., and van de Geijn, R. "A Runtime System for Programming Out-of-Core Matrix Algorithms-by-Tiles on Multithreaded Architectures." ACM Transactions on Mathematical Software (TOMS), 38(4), Article 25, 2012.
9 Parallel?
S0: D := A * B
S1: A = L * L^T (Cholesky factorization of A)
S2: B := B * L^-T
S3: C := C - B * B^T
S4: T := L^-1 * T
Can the code be parallelized?
10 Parallel? Can the code be parallelized?
Write After Read: (S0, S1)
Read After Write: (S0, S1)
Read After Write: (S1, S2)
Read After Write: (S2, S3)
Read After Write: (S1, S4)
11 Parallel? Are you sure S1 and S2 cannot be parallelized?
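The dependence classification on the slides above can be sketched mechanically: record which operands each statement reads and writes, then compare every ordered pair. This is an illustrative sketch, not SuperMatrix code, and the operand sets below are my reading of the statements S0-S4; depending on how the slide treats each operand, the exact pairs it lists may differ slightly.

```python
# Detect flow (RAW), anti (WAR), and output (WAW) dependences between
# statements, given each statement's read set and write set.

def find_dependences(stmts):
    """stmts: list of (name, reads, writes). Returns list of (kind, a, b)."""
    deps = []
    for i, (a, ra, wa) in enumerate(stmts):
        for b, rb, wb in stmts[i + 1:]:
            if wa & rb:
                deps.append(("RAW", a, b))   # b reads what a wrote
            if ra & wb:
                deps.append(("WAR", a, b))   # b overwrites what a read
            if wa & wb:
                deps.append(("WAW", a, b))   # b overwrites what a wrote
    return deps

# Operand sets assumed from the slide's statements:
stmts = [
    ("S0", {"A", "B"}, {"D"}),   # D := A * B
    ("S1", {"A"},      {"A"}),   # A = L * L^T (L overwrites A)
    ("S2", {"A", "B"}, {"B"}),   # B := B * L^-T
    ("S3", {"B", "C"}, {"C"}),   # C := C - B * B^T
    ("S4", {"A", "T"}, {"T"}),   # T := L^-1 * T
]
deps = find_dependences(stmts)
```

With these sets, the sketch recovers the RAW chain S1→S2→S3 and S1→S4, plus the anti-dependence of S0 on S1, which is exactly what forces the sequential order in the unblocked code.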
12 Parallel?
S0: D := A * B; S1: A = L * L^T; S2: B := B * L^-T; S3: C := C - B * B^T; S4: T := L^-1 * T
[figure: each statement's operands partitioned into blocks]
How to parallelize?
13 Traditional Library Approach
S0: D := A * B; S1: A = L * L^T; S2: B := B * L^-T; S3: C := C - B * B^T; S4: T := L^-1 * T
How to parallelize?
14 Traditional Library Approach
S0: ParGemm(A, B, D)
S1: L = ParPotrf(A)
S2: ParTrsm(L, B)
S3: ParSyrk(B, C)
S4: ParTrsm(L, T)
How to parallelize?
15 Traditional Library Approach, implemented with libflame and BLIS
S0: D := A * B; S1: A = L * L^T; S2: B := B * L^-T; S3: C := C - B * B^T; S4: T := L^-1 * T
FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLA_Chol( FLA_LOWER_TRIANGULAR, A );
FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
Supported by parallel BLAS and LAPACK (multi-threaded BLIS)
16 Problem for Fine-grained Parallelism. libflame relies on fine-grained parallelism inside BLIS (pthreads/OpenMP).
17 Problem for Fine-grained Parallelism. Synchronization-point overhead; not a good fit for scenarios with multiple devices.
18 Coarse-grained Parallelism. libflame introduces parallelism across instructions, dispatching to multiple BLIS instances (pthreads/OpenMP); a good fit for platforms with multiple computation units.
21 Coarse-grained Parallelism. SuperMatrix sits between libflame and BLIS, introducing parallelism across instructions; a good fit for platforms with multiple computation units.
22 SuperMatrix Approach
S0: D := A * B; S1: A = L * L^T; S2: B := B * L^-T; S3: C := C - B * B^T; S4: T := L^-1 * T
[figure: each statement's operands partitioned into blocks]
How to parallelize?
23 SuperMatrix Approach. S0: D := A * B; S1: A = L * L^T; S2: B := B * L^-T; S3: C := C - B * B^T; S4: T := L^-1 * T. How to parallelize?
24 SuperMatrix Approach. How to parallelize? Partitioning/Algorithm-by-blocks!
25 SuperMatrix Approach. [figure: the statements decomposed into per-block tasks]
26 SuperMatrix Approach. Constructs the DAG across the instructions automatically: no need to annotate the task dependencies manually!
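The automatic DAG construction can be sketched as follows: as each task is enqueued with the blocks it reads and writes, the runtime adds edges against the last writer and the readers of each block. This is a minimal illustrative sketch of the idea, not libflame's actual data structures or API; all class and task names are invented for the example.

```python
# Sketch of runtime DAG construction from per-task read/write sets.
class DAG:
    def __init__(self):
        self.edges = {}          # task -> set of successor tasks
        self.last_writer = {}    # block -> task that last wrote it
        self.readers = {}        # block -> tasks reading it since last write

    def enqueue(self, task, reads, writes):
        self.edges.setdefault(task, set())
        for blk in reads:                      # flow (RAW) edge from last writer
            w = self.last_writer.get(blk)
            if w is not None:
                self.edges[w].add(task)
            self.readers.setdefault(blk, set()).add(task)
        for blk in writes:                     # anti (WAR) and output (WAW) edges
            for r in self.readers.get(blk, set()) - {task}:
                self.edges[r].add(task)
            w = self.last_writer.get(blk)
            if w is not None and w != task:
                self.edges[w].add(task)
            self.last_writer[blk] = task
            self.readers[blk] = set()          # reads before this write are stale

dag = DAG()
dag.enqueue("GEMM0", reads={"A", "B"}, writes={"D"})   # D := A * B
dag.enqueue("CHOL1", reads={"A"}, writes={"A"})        # A = L * L^T
dag.enqueue("TRSM2", reads={"A", "B"}, writes={"B"})   # B := B * L^-T
```

Here GEMM0 precedes CHOL1 (anti-dependence on A) and CHOL1 precedes TRSM2 (flow dependence on A), so the dependence edges fall out of the read/write sets with no manual annotation, which is the point of the slide.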
27 Traditional Library Approach, implemented with libflame and BLIS
S0: D := A * B; S1: A = L * L^T; S2: B := B * L^-T; S3: C := C - B * B^T; S4: T := L^-1 * T
FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLA_Chol( FLA_LOWER_TRIANGULAR, A );
FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
Supported by parallel BLAS and LAPACK (multi-threaded BLIS)
28 SuperMatrix Approach, implemented with libflame and BLIS
S0: D := A * B; S1: A = L * L^T; S2: B := B * L^-T; S3: C := C - B * B^T; S4: T := L^-1 * T
FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
29 Free Lunch for Both Programmability and Performance! (From the libflame manual.)
30 The original SuperMatrix primarily targets multi-core shared-memory systems.
32 HPC Heterogeneous Platforms. [figure: the matrix resident in host memory]
33 HPC Heterogeneous Platforms. [figure: host and accelerator devices connected over PCIe]
34 Challenges in Heterogeneous Platforms!
S0: D := A * A  →  ParGemm(A, A, D)
S1: A = L * L^T  →  L = ParPotrf(A)
S2: B := B * L^-T  →  ParTrsm(L, B)
S3: C := C - B * B^T  →  ParSyrk(B, C)
S4: T := L^-1 * T  →  ParTrsm(L, T)
What if there is one accelerator in your system?
37 Challenges in Heterogeneous Platforms!
/* host to device */ Memcpy(A, hA); Memcpy(D, hD); Memcpy(B, hB); Memcpy(C, hC); Memcpy(T, hT);
S0: ParGemm(A, A, D); S1: L = ParPotrf(A); S2: ParTrsm(L, B); S3: ParSyrk(B, C); S4: ParTrsm(L, T);
/* device to host */ Memcpy(hT, T);
What if there is one accelerator in your system?
38 Same code: what if there are 4 GPUs and 8 CPU cores in your system?
39 Adapting the Original SuperMatrix to Heterogeneous Platforms: Software Cache; Heterogeneous Scheduler; Asynchronous Memory Copy Worker; Task Performance Model.
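The "asynchronous memory copy" component can be illustrated with a small pipeline: while the worker computes on the current block, a helper thread stages the next task's input so the PCIe transfer overlaps with compute. This is a hedged sketch of the idea only; the function names and the use of Python threads are invented for the example and do not reflect the actual runtime.

```python
# Overlap data transfers with computation: while task i executes,
# a helper thread prefetches the input of task i+1.
import threading

log = []                                     # records transfer/compute events

def transfer(block):
    log.append(("xfer", block))              # stands in for a host->device copy

def compute(block):
    log.append(("comp", block))              # stands in for running the task

def pipeline(blocks):
    transfer(blocks[0])                      # stage the first block up front
    for i, blk in enumerate(blocks):
        t = None
        if i + 1 < len(blocks):              # prefetch next input concurrently
            t = threading.Thread(target=transfer, args=(blocks[i + 1],))
            t.start()
        compute(blk)                         # compute overlaps the prefetch
        if t:
            t.join()                         # next input is resident before use

pipeline(["B0", "B1", "B2"])
```

Each block is transferred exactly once and is resident before its compute step begins, but (after the first block) transfers no longer serialize with computation.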
40 Naïve Approach. [figure: host and device over PCIe; blocks C, A, B staged to the device]
Transfer data from host to device before execution.
Execute the task on the device.
Transfer data from device to host upon completion of execution.
No data reuse on the devices!
47 Software Cache. [figure: host and device over PCIe; block copies tracked on the device]
No need to transfer data from host to device before execution if the data is already on the device.
No need to transfer data from device to host after execution if the data is not required by the host immediately.
[Quintana-Ortí, G., et al. "Solving dense linear systems on platforms with multiple hardware accelerators." In PPoPP'09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009.]
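The two rules on the slide can be sketched as a tiny software cache: track which blocks hold a valid copy on each device, and whether the device copy is newer than the host's, so redundant PCIe transfers are skipped. This is an assumption-laden illustration, not the actual SuperMatrix implementation; all names are invented.

```python
# Minimal software-cache sketch: skip transfers when a valid copy exists,
# and defer write-backs until the host actually needs the block.
class SoftwareCache:
    def __init__(self):
        self.on_device = {}      # (device, block) -> device holds a valid copy
        self.dirty = {}          # (device, block) -> device copy newer than host
        self.transfers = 0       # counts host<->device copies actually issued

    def acquire(self, device, block):
        """Ensure `block` is resident on `device` before a task runs there."""
        if not self.on_device.get((device, block)):
            self.transfers += 1                  # host -> device copy
            self.on_device[(device, block)] = True

    def write(self, device, block):
        """A task on `device` updated `block`; defer the copy back to the host."""
        self.dirty[(device, block)] = True

    def flush(self, device, block):
        """The host needs `block`: copy back only if the device copy is dirty."""
        if self.dirty.get((device, block)):
            self.transfers += 1                  # device -> host copy
            self.dirty[(device, block)] = False

cache = SoftwareCache()
cache.acquire("gpu0", "A00")   # first touch: one host->device transfer
cache.acquire("gpu0", "A00")   # cache hit: no transfer
cache.write("gpu0", "A00")     # result stays on the device for now
cache.flush("gpu0", "A00")     # one device->host copy, only because it is dirty
```

Four cache operations cost only two actual transfers; a real runtime would additionally invalidate stale copies on other devices when a block is written.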
52 HEFT (Heterogeneous Earliest Finish Time). [figure: per-processor timelines already holding Tasks 1-5] Where should we place Task 6? On the processor that gives it the earliest finish time.
[Topcuoglu, H., Hariri, S., and Wu, M. "Performance-effective and low-complexity task scheduling for heterogeneous computing." IEEE Transactions on Parallel and Distributed Systems, 13(3), 2002.]
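The placement step of HEFT can be sketched in a few lines: for each processor, take the later of "processor free" and "inputs ready there" as the earliest start time, add the task's execution cost on that processor, and pick the minimum finish time. The cost and ready-time numbers below are invented for illustration, and the full algorithm in the cited paper also ranks tasks by upward rank before placing them; this sketch covers only the placement decision.

```python
# HEFT-style placement: choose the processor with the earliest finish time.
def heft_place(task, procs, avail, cost, ready):
    """avail[p]: time processor p becomes free; cost[(task, p)]: execution
    time of task on p; ready[(task, p)]: when task's inputs can be on p
    (including any data transfers)."""
    best_p, best_eft = None, float("inf")
    for p in procs:
        est = max(avail[p], ready[(task, p)])   # earliest start time on p
        eft = est + cost[(task, p)]             # earliest finish time on p
        if eft < best_eft:
            best_p, best_eft = p, eft
    avail[best_p] = best_eft                    # reserve the chosen slot
    return best_p, best_eft

procs = ["CPU", "GPU0"]
avail = {"CPU": 0.0, "GPU0": 5.0}               # GPU0 is busy until t = 5
cost = {("T6", "CPU"): 10.0, ("T6", "GPU0"): 2.0}
ready = {("T6", "CPU"): 0.0, ("T6", "GPU0"): 1.0}
p, eft = heft_place("T6", procs, avail, cost, ready)
```

Even though GPU0 is busy until t = 5, its much lower execution cost still gives the earlier finish (t = 7 versus t = 10 on the CPU), so Task 6 goes to GPU0; that trade-off is exactly what the timeline slide illustrates.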
55 3x3 Blocked Cholesky Decomposition
CHOL0: A00 := Chol(A00)
TRSM1: A10 := A10 * A00^-T
TRSM2: A20 := A20 * A00^-T
SYRK3: A11 := A11 - A10 * A10^T
GEMM4: A21 := A21 - A20 * A10^T
SYRK5: A22 := A22 - A20 * A20^T
CHOL6: A11 := Chol(A11)
TRSM7: A21 := A21 * A11^-T
SYRK8: A22 := A22 - A21 * A21^T
CHOL9: A22 := Chol(A22)
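The task list above is the right-looking blocked Cholesky factorization unrolled for a 3x3 grid of blocks. As a hedged sketch of the same algorithm, here it is in NumPy (not libflame); the panel solve plays the role of the TRSM tasks and the trailing update covers the SYRK/GEMM tasks of each iteration.

```python
# Right-looking blocked Cholesky: overwrite the lower triangle of SPD matrix A
# with L such that A = L * L^T, processing nb-wide block columns left to right.
import numpy as np

def blocked_cholesky(A, nb):
    n = A.shape[0]
    for k in range(0, n, nb):
        e = k + nb
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])      # CHOL: factor Akk
        L = A[k:e, k:e]
        if e < n:
            # TRSM: A[i,k] := A[i,k] * Akk^-T, i.e. solve X * L^T = A[i,k]
            A[e:, k:e] = np.linalg.solve(L, A[e:, k:e].T).T
            # SYRK/GEMM: trailing update A[i,j] -= A[i,k] * A[j,k]^T
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)
```

With n = 6 and nb = 2 this performs exactly the ten tasks CHOL0 through CHOL9 listed above, and the tasks within one trailing update (e.g. SYRK3, GEMM4, SYRK5) are independent of one another, which is what the runtime exploits.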
56 Data Distribution. [figure: the blocks of A distributed across CPU, GPU0, and GPU1]
The scheduler walks the task queue (CHOL0, TRSM1, TRSM2, SYRK3, GEMM4, SYRK5, CHOL6, TRSM7, SYRK8, CHOL9) and assigns each task to CPU, GPU0, or GPU1 via the HEFT assignment table, which tracks per-processor availability (Avail), earliest start time (EST), earliest finish time (EFT), and task priority.
67 SuperMatrix Approach on Heterogeneous Platforms
S0: D := A * B; S1: A = L * L^T; S2: B := B * L^-T; S3: C := C - B * B^T; S4: T := L^-1 * T
FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
68 Performance. [performance plots]
Config 1: 6-core single-socket Xeon E5649 CPU + 1 GTX 480 GPU card.
Config 2: 6-core single-socket Xeon E5649 CPU + 2 Tesla C2070 GPU cards.
69 Conclusion. libflame + SuperMatrix on top of: clBLAS (OpenCL) and cuBLAS (CUDA) for the GPU; ACML, MKL, and BLIS (OpenMP/pthreads; C/assembly/Fortran) for the CPU/MIC; BLIS for accelerators and other platforms.
70 SuperMatrix Approach on Heterogeneous Platforms
S0: D := A * B; S1: A = L * L^T; S2: B := B * L^-T; S3: C := C - B * B^T; S4: T := L^-1 * T
FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
71 Related Work
Target Platform                     | LAPACK Project     | FLAME Project
Sequential                          | LAPACK             | libflame
Sequential + multithreaded BLAS     | LAPACK             | libflame
Multicore/multithreaded             | PLASMA             | libflame + SuperMatrix
Multicore + out-of-order scheduling | PLASMA + QUARK     | libflame + SuperMatrix
CPU + single GPU                    | MAGMA              | libflame + SuperMatrix
Multicore + multi-GPU               | DAGuE/StarPU/Kaapi | libflame + SuperMatrix
73 Questions?
More informationProgramming Models for Multi- Threading. Brian Marshall, Advanced Research Computing
Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows
More informationA Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection
A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF
More informationThe Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations.
TITLE Basic Linear Algebra Subprograms BYLINE Robert A. van de Geijn Department of Computer Science The University of Texas at Austin Austin, TX USA rvdg@cs.utexas.edu Kazushige Goto Texas Advanced Computing
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationLinear Algebra libraries in Debian. DebConf 10 New York 05/08/2010 Sylvestre
Linear Algebra libraries in Debian Who I am? Core developer of Scilab (daily job) Debian Developer Involved in Debian mainly in Science and Java aspects sylvestre.ledru@scilab.org / sylvestre@debian.org
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationFLAME Working Note #23
SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures FLAME Working Note #23 Ernie Chan Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Robert van de Geijn Abstract
More informationUsing R for HPC Data Science. Session: Parallel Programming Paradigms. George Ostrouchov
Using R for HPC Data Science Session: Parallel Programming Paradigms George Ostrouchov Oak Ridge National Laboratory and University of Tennessee and pbdr Core Team Course at IT4Innovations, Ostrava, October
More informationFLAMES2S: From Abstraction to High Performance
FLAMESS: From Abstraction to High Performance RICHARD VERAS The University of Texas at Austin and JONATHAN MONETTE The University of Texas at Austin and FIELD G. VAN ZEE The University of Texas at Austin
More informationLevel-3 BLAS on the TI C6678 multi-core DSP
Level-3 BLAS on the TI C6678 multi-core DSP Murtaza Ali, Eric Stotzer Texas Instruments {mali,estotzer}@ti.com Francisco D. Igual Dept. Arquitectura de Computadores y Automática Univ. Complutense de Madrid
More informationAnalysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures
Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures Azzam Haidar, Hatem Ltaief, Asim YarKhan and Jack Dongarra Department of Electrical Engineering and
More informationlibflame The Complete Reference ( version ) Field G. Van Zee The University of Texas at Austin
libflame The Complete Reference ( version 5.1.0-46 ) Field G. Van Zee The University of Texas at Austin Copyright c 2011 by Field G. Van Zee. 10 9 8 7 6 5 4 3 2 1 All rights reserved. No part of this book
More informationA Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures
A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures E. Ayguade 1,2, R.M. Badia 2,4, D. Cabrera 2, A. Duran 2, M. Gonzalez 1,2, F. Igual 3, D. Jimenez 1, J. Labarta 1,2, X. Martorell
More informationTechnology on Dense Linear Algebra
Impact of Multi core and Many core Technology on Dense Linear Algebra Enrique S. Quintana-Ortí Berlin, September 2011 Berlin, September 2011 1 Multi-core and Many-core The free lunch is over (H. Sutter,
More informationMAGMA: a New Generation
1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release
More informationad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors
ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng
More informationSolving Dense Linear Systems on Graphics Processors
Informe Técnico ICC 2-2-28 Solving Dense Linear Systems on Graphics Processors Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí Febrero de 28 Departamento
More informationUnified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment
Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment Azzam Haidar 1, Chongxiao Cao 1, Asim YarKhan 1, Piotr Luszczek 1, Stanimire Tomov 1,
More informationWhy? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators
Remote CUDA (rcuda) Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators Better performance-watt, performance-cost
More informationAccelerating GPU kernels for dense linear algebra
Accelerating GPU kernels for dense linear algebra Rajib Nath, Stanimire Tomov, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville {rnath1, tomov,
More informationProgramming Algorithms-by-Blocks for Matrix Computations on Multithreaded Architectures FLAME Working Note #29
Programming Algorithms-by-Blocks for Matrix Computations on Multithreaded Architectures FLAME Working Note #29 Gregorio Quintana-Ortí Enrique S Quintana-Ortí Robert van de Geijn Field G Van Zee Ernie Chan
More informationImplementing Level-3 BLAS Routines in OpenCL on Different Processing Units
Technical Report 2014-001 Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units Kazuya Matsumoto, Naohito Nakasato, and Stanislav Sedukhin October 22, 2014 Graduate School of Computer
More informationSciDAC CScADS Summer Workshop on Libraries and Algorithms for Petascale Applications
Parallel Tiled Algorithms for Multicore Architectures Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou SciDAC CScADS Summer Workshop on Libraries and Algorithms for Petascale Applications
More informationTrends and Challenges in Multicore Programming
Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores
More informationA scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. () Published online in Wiley Online Library (wileyonlinelibrary.com)..33 A scalable approach to solving dense linear
More informationParallelism. CS6787 Lecture 8 Fall 2017
Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does
More informationAddressing Heterogeneity in Manycore Applications
Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction
More informationOpenMP for next generation heterogeneous clusters
OpenMP for next generation heterogeneous clusters Jens Breitbart Research Group Programming Languages / Methodologies, Universität Kassel, jbreitbart@uni-kassel.de Abstract The last years have seen great
More informationA Square Block Format for Symmetric Band Matrices
A Square Block Format for Symmetric Band Matrices Fred G. Gustavson 1, José R. Herrero 2, E. Morancho 2 1 IBM T.J. Watson Research Center, Emeritus, and Umeå University fg2935@hotmail.com 2 Computer Architecture
More informationParallel Tiled QR Factorization for Multicore Architectures LAPACK Working Note # 190
Parallel Tiled QR Factorization for Multicore Architectures Working Note # 190 Alfredo Buttari 1, Julien Langou 3, Jakub Kurzak 1, Jack Dongarra 12 1 Department of Electrical Engineering and Computer Science,
More informationarxiv: v1 [math.na] 24 Jul 2007
Parallel Tiled QR Factorization for Multicore Architectures Alfredo Buttari 1, Julien Langou 3, Jakub Kurzak 1, Jack Dongarra 12 arxiv:0707.3548v1 [math.na] 24 Jul 2007 1 Department of Electrical Engineering
More informationFlexible Linear Algebra Development and Scheduling with Cholesky Factorization
Flexible Linear Algebra Development and Scheduling with Cholesky Factorization Azzam Haidar haidar@eecs.utk.edu Chongxiao Cao ccao1@vols.utk.edu Stanimire Tomov tomov@eecs.utk.edu Asim YarKhan yarkhan@eecs.utk.edu
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationAperTO - Archivio Istituzionale Open Access dell'università di Torino
AperTO - Archivio Istituzionale Open Access dell'università di Torino An hybrid linear algebra framework for engineering This is the author's manuscript Original Citation: An hybrid linear algebra framework
More informationAn approach to provide remote access to GPU computational power
An approach to provide remote access to computational power University Jaume I, Spain Joint research effort 1/84 Outline computing computing scenarios Introduction to rcuda rcuda structure rcuda functionality
More informationStarPU: a runtime system for multigpu multicore machines
StarPU: a runtime system for multigpu multicore machines Raymond Namyst RUNTIME group, INRIA Bordeaux Journées du Groupe Calcul Lyon, November 2010 The RUNTIME Team High Performance Runtime Systems for
More informationAccelerating the reduction to upper Hessenberg form through hybrid GPU-based computing
Accelerating the reduction to upper Hessenberg form through hybrid GPU-based computing Stanimire Tomov 1 and Jack Dongarra 1,2,3 1 University of Tennessee (USA) 2 Oak Ridge National Laboratory (USA) 3
More informationBest Practice Guide to Hybrid PaRSEC + OpenMP Programming
Best Practice Guide to Hybrid PaRSEC + OpenMP Programming Version 1.0, 30 th September 2018 Jakub Šístek, Reazul Hoque and George Bosilca Table of Contents TABLE OF CONTENTS... 2 1 INTRODUCTION... 3 2
More informationarxiv: v2 [cs.ms] 11 Sep 2007
A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures LAPACK Working Note # 191 Alfredo Buttari 1, Julien Langou 4, Jakub Kurzak 1, Jack Dongarra 123 arxiv:0709.1272v2 [cs.ms]
More informationUltra Large-Scale FFT Processing on Graphics Processor Arrays. Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc.
Abstract Ultra Large-Scale FFT Processing on Graphics Processor Arrays Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc. Graphics Processor Unit (GPU) technology has been shown well-suited to efficient
More informationA Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures Buttari, Alfredo and Langou, Julien and Kurzak, Jakub and Dongarra, Jack 2007 MIMS EPrint: 2007.122 Manchester Institute
More informationMAGMA. LAPACK for GPUs. Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville
MAGMA LAPACK for GPUs Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville Keeneland GPU Tutorial 2011, Atlanta, GA April 14-15,
More informationSemCache: Semantics-Aware Caching for Efficient GPU Offloading
Purdue University Purdue e-pubs Open Access Dissertations Theses and Dissertations Spring 2015 SemCache: Semantics-Aware Caching for Efficient GPU Offloading Nabeel Al-Saber Purdue University Follow this
More informationHigh-Performance Libraries and Tools. HPC Fall 2012 Prof. Robert van Engelen
High-Performance Libraries and Tools HPC Fall 2012 Prof. Robert van Engelen Overview Dense matrix BLAS (serial) ATLAS (serial/threaded) LAPACK (serial) Vendor-tuned LAPACK (shared memory parallel) ScaLAPACK/PLAPACK
More information! Readings! ! Room-level, on-chip! vs.!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads
More informationMaking Dataflow Programming Ubiquitous for Scientific Computing
Making Dataflow Programming Ubiquitous for Scientific Computing Hatem Ltaief KAUST Supercomputing Lab Synchronization-reducing and Communication-reducing Algorithms and Programming Models for Large-scale
More informationAccelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing
Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing Stanimire Tomov,a, Rajib Nath a, Jack Dongarra a,b,c a University of Tennessee (USA)
More informationMax Planck Institute Magdeburg Preprints
Peter Benner Pablo Ezzatti Enrique S. Quintana-Ortí Alfredo Remón Matrix Inversion on CPU-GPU Platforms with Applications in Control Theory MAX PLANCK INSTITUT FÜR DYNAMIK KOMPLEXER TECHNISCHER SYSTEME
More informationChapter 3 Parallel Software
Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationTransforming Linear Algebra Libraries: From Abstraction to High Performance
Transforming Linear Algebra Libraries: From Abstraction to High Performance RICHARD M. VERAS and JONATHAN S. MONETTE and ROBERT A. VAN DE GEIJN The University of Texas at Austin and ENRIQUE S. QUINTANA-ORTí
More informationA Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004
A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into
More informationSolution of the Transport Equation Using Graphical Processing Units
Solution of the Transport Equation Using Graphical Processing Units Gil Gonçalves Brandão October - 2009 1 Introduction Computational Fluid Dynamics (CFD) always have struggled for faster computing resources
More informationCafeGPI. Single-Sided Communication for Scalable Deep Learning
CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks
More information