SuperMatrix on Heterogeneous Platforms. Jianyu Huang SHPC, UT Austin


1 SuperMatrix on Heterogeneous Platforms Jianyu Huang SHPC, UT Austin

2 How Heterogeneous?

3 How Many Languages?


5 Question!

6 FLAME Answer: SuperMatrix
[software stack: libflame + SuperMatrix on top of clBLAS (OpenCL) and cuBLAS (CUDA) for GPUs; ACML, MKL, and BLIS (OpenMP/pthreads, C/Fortran/assembly) for CPU/MIC; BLIS (OpenMP/pthreads, C/assembly) for accelerators and other platforms]

7 FLAME Answer: SuperMatrix
[same software stack]
Programmability: use tools provided by FLAME
Parallelism: directed acyclic graph (DAG) scheduling

8 FLAME Answer: SuperMatrix
Chan, E., Quintana-Ortí, E. S., Quintana-Ortí, G., and van de Geijn, R. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA'07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, San Diego, CA, USA, June 9-11, 2007.
Chan, E., Van Zee, F. G., Quintana-Ortí, E. S., Quintana-Ortí, G., and van de Geijn, R. Satisfying your dependencies with SuperMatrix. In Cluster'07: Proceedings of the 2007 IEEE International Conference on Cluster Computing, pages 91-99, Austin, TX, USA, September 17-20, 2007.
Chan, E., Van Zee, F. G., Bientinesi, P., Quintana-Ortí, E. S., Quintana-Ortí, G., and van de Geijn, R. SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks. In PPoPP'08: Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Salt Lake City, UT, USA, February 20-23, 2008.
Quintana-Ortí, G., Igual, F. D., Quintana-Ortí, E. S., and van de Geijn, R. Solving dense linear systems on platforms with multiple hardware accelerators. In PPoPP'09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009.
Quintana-Ortí, G., Quintana-Ortí, E. S., van de Geijn, R., Van Zee, F. G., and Chan, E. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1-14:26, July 2009.
Chan, E. Application of Dependence Analysis and Runtime Data Flow Graph Scheduling to Matrix Computations. Ph.D. dissertation, Department of Computer Science, The University of Texas at Austin.
Quintana-Ortí, G., Igual, F. D., Marqués, M., Quintana-Ortí, E. S., and van de Geijn, R. A runtime system for programming out-of-core matrix algorithms-by-tiles on multithreaded architectures. ACM Transactions on Mathematical Software (TOMS), 38(4):25, 2012.

9 Parallel?
S0: D := A * B
S1: A = L * L^T
S2: B := B * L^-T
S3: C := C - B * B^T
S4: T := L^-1 * T
Can the code be parallelized?

10 Parallel?
S0: D := A * B
S1: A = L * L^T
S2: B := B * L^-T
S3: C := C - B * B^T
S4: T := L^-1 * T
Can the code be parallelized?
Write After Read: (S0, S1)
Read After Write: (S0, S1)
Read After Write: (S1, S2)
Read After Write: (S2, S3)
Read After Write: (S1, S4)

11 Parallel?
S0: D := A * B          Write After Read: (S0, S1); Read After Write: (S0, S1)
S1: A = L * L^T         Read After Write: (S1, S2)
S2: B := B * L^-T       Read After Write: (S2, S3)
S3: C := C - B * B^T    Read After Write: (S1, S4)
S4: T := L^-1 * T
Can the code be parallelized? Are you sure S1 and S2 cannot be parallelized?

12 Parallel?
S0: D := A * B
S1: A = L * L^T
S2: B := B * L^-T
S3: C := C - B * B^T
S4: T := L^-1 * T
[figure: DAG with nodes S0..S4 and edges along the dependences]
How to parallelize?

13 Traditional Library Approach
S0: D := A * B
S1: A = L * L^T
S2: B := B * L^-T
S3: C := C - B * B^T
S4: T := L^-1 * T
How to parallelize?

14 Traditional Library Approach
S0: D := A * B         ->  S0: ParGemm(A, B, D)
S1: A = L * L^T        ->  S1: L = ParPotrf(A)
S2: B := B * L^-T      ->  S2: ParTrsm(L, B)
S3: C := C - B * B^T   ->  S3: ParSyrk(B, C)
S4: T := L^-1 * T      ->  S4: ParTrsm(L, T)
How to parallelize?

15 Traditional Library Approach
Implemented with libflame and BLIS
S0: D := A * B   S1: A = L * L^T   S2: B := B * L^-T   S3: C := C - B * B^T   S4: T := L^-1 * T
/* */
FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLA_Chol( FLA_LOWER_TRIANGULAR, A );
FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
/* */
Supported by parallel BLAS, LAPACK (multithreaded BLIS)

16 Problem for Fine-grained Parallelism
Fine-grained parallelism: libflame on top of BLIS, parallelized with pthreads/OpenMP

17 Problem for Fine-grained Parallelism
Fine-grained parallelism: libflame on top of BLIS, parallelized with pthreads/OpenMP
Synchronization-point overhead
Not suited to scenarios with multiple devices

18 Problem for Fine-grained Parallelism
Fine-grained parallelism: libflame on top of one multithreaded BLIS (pthreads/OpenMP); synchronization-point overhead; not suited to multiple-device scenarios
Coarse-grained parallelism: libflame (pthreads/OpenMP) dispatching to multiple BLIS instances; introduces parallelism across instructions; fits platforms with multiple computation units


21 Coarse-grained Parallelism
Coarse-grained parallelism: libflame + SuperMatrix dispatching to multiple BLIS instances
Fine-grained parallelism: libflame on top of one multithreaded BLIS (pthreads/OpenMP)
Introduces parallelism across instructions; fits platforms with multiple computation units

22 SuperMatrix Approach
S0: D := A * B
S1: A = L * L^T
S2: B := B * L^-T
S3: C := C - B * B^T
S4: T := L^-1 * T
[figure: DAG of the five tasks]
How to parallelize?

23 SuperMatrix Approach
S0: D := A * B
S1: A = L * L^T
S2: B := B * L^-T
S3: C := C - B * B^T
S4: T := L^-1 * T
How to parallelize?

24 SuperMatrix Approach
S0: D := A * B
S1: A = L * L^T
S2: B := B * L^-T
S3: C := C - B * B^T
S4: T := L^-1 * T
How to parallelize? Partitioning/Algorithm-by-blocks!

25 SuperMatrix Approach
S0: D := A * B
S1: A = L * L^T
S2: B := B * L^-T
S3: C := C - B * B^T
S4: T := L^-1 * T
How to parallelize?

26 SuperMatrix Approach
Constructs the DAG across the instructions automatically. No need to annotate the task dependencies manually!

27 Traditional Library Approach
Implemented with libflame and BLIS
S0: D := A * B   S1: A = L * L^T   S2: B := B * L^-T   S3: C := C - B * B^T   S4: T := L^-1 * T
/* */
FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLA_Chol( FLA_LOWER_TRIANGULAR, A );
FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
/* */
Supported by parallel BLAS, LAPACK (multithreaded BLIS)

28 SuperMatrix Approach
Implemented with libflame and BLIS
S0: D := A * B   S1: A = L * L^T   S2: B := B * L^-T   S3: C := C - B * B^T   S4: T := L^-1 * T
/* */
FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
/* */

29 Free Lunch for Both Programmability and Performance!
[figure: from the libflame manual]

30 The original SuperMatrix primarily targets multi-core shared-memory systems


32 HPC Heterogeneous Platforms [figure: matrix data on the host]

33 HPC Heterogeneous Platforms [figure: matrix data and devices connected over PCIe]

34 Challenges in Heterogeneous Platforms!
S0: D := A * A         ->  S0: ParGemm(A, A, D)
S1: A = L * L^T        ->  S1: L = ParPotrf(A)
S2: B := B * L^-T      ->  S2: ParTrsm(L, B)
S3: C := C - B * B^T   ->  S3: ParSyrk(B, C)
S4: T := L^-1 * T      ->  S4: ParTrsm(L, T)
What if there is one accelerator in your system?


36 Challenges in Heterogeneous Platforms!
S0: ParGemm(A, A, D)
S1: L = ParPotrf(A)
S2: ParTrsm(L, B)
S3: ParSyrk(B, C)
S4: ParTrsm(L, T)
What if there is one accelerator in your system?

37 Challenges in Heterogeneous Platforms!
/* copy the inputs to the device */
Memcpy(A, hA); Memcpy(D, hD); Memcpy(B, hB); Memcpy(C, hC); Memcpy(T, hT);
S0: ParGemm(A, A, D)
S1: L = ParPotrf(A)
S2: ParTrsm(L, B)
S3: ParSyrk(B, C)
S4: ParTrsm(L, T)
/* copy the result back to the host */
Memcpy(hT, T);
What if there is one accelerator in your system?

38 Challenges in Heterogeneous Platforms!
/* copy the inputs to the device */
Memcpy(A, hA); Memcpy(D, hD); Memcpy(B, hB); Memcpy(C, hC); Memcpy(T, hT);
S0: ParGemm(A, A, D)
S1: L = ParPotrf(A)
S2: ParTrsm(L, B)
S3: ParSyrk(B, C)
S4: ParTrsm(L, T)
/* copy the result back to the host */
Memcpy(hT, T);
What if there are 4 GPUs and 8 CPU cores in your system?

39 Adapting Original SuperMatrix to Heterogeneous Platforms
Software Cache
Heterogeneous Scheduler
Asynchronous Memory Copy Worker
Task Performance Model

40 Naïve Approach
[figure: blocks A, B, C moving between host and device over PCIe]
Transfer data from host to device before execution
Execute the task on the device
Transfer data from device to host upon execution
No data reuse on the devices!

47 Software Cache
[figure: blocks cached in device memory, connected to the host over PCIe]
No need to transfer data from host to device before execution if the data is already on the device
No need to transfer data from device to host after execution if the data is not required by the host immediately
Quintana-Ortí, G., et al. "Solving dense linear systems on platforms with multiple hardware accelerators." In PPoPP'09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009

52 HEFT (Heterogeneous Earliest Finish Time)
[figure: per-processor timelines with Tasks 1-5 already placed]
Where should we place Task 6? On the processor that gives it the earliest estimated finish time.
Topcuoglu, H., Hariri, S., and Wu, M. "Performance-effective and low-complexity task scheduling for heterogeneous computing." IEEE Transactions on Parallel and Distributed Systems, 13(3), 2002

55 3x3 Blocked Cholesky Decomposition
CHOL0:  A0,0 = Chol(A0,0)
TRSM1:  A1,0 := A1,0 * A0,0^-T
TRSM2:  A2,0 := A2,0 * A0,0^-T
SYRK3:  A1,1 := A1,1 - A1,0 * A1,0^T
GEMM4:  A2,1 := A2,1 - A2,0 * A1,0^T
SYRK5:  A2,2 := A2,2 - A2,0 * A2,0^T
CHOL6:  A1,1 = Chol(A1,1)
TRSM7:  A2,1 := A2,1 * A1,1^-T
SYRK8:  A2,2 := A2,2 - A2,1 * A2,1^T
CHOL9:  A2,2 = Chol(A2,2)

56 Data Distribution
[figure: the blocks of A distributed across CPU, GPU0, and GPU1; the scheduler steps through the task list]
Task list: CHOL0, TRSM1, TRSM2, SYRK3, GEMM4, SYRK5, CHOL6, TRSM7, SYRK8, CHOL9
HEFT assignment table (CPU | GPU0 | GPU1):
CHOL0 | TRSM1 | TRSM2
SYRK3 | GEMM4 | SYRK5
CHOL6 | TRSM7 |
SYRK8 |       |
CHOL9 |       |
Per-processor bookkeeping: Avail, EST, EFT, Priority

67 SuperMatrix Approach on Heterogeneous Platforms
S0: D := A * B   S1: A = L * L^T   S2: B := B * L^-T   S3: C := C - B * B^T   S4: T := L^-1 * T
/* */
FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
/* */

68 Performance
6-core single-socket Xeon E5649 CPU + 1 GTX 480 GPU card, BLOCK SIZE:
6-core single-socket Xeon E5649 CPU + 2 Tesla C2070 GPU cards, BLOCK SIZE:
[figure: performance plots]

69 Conclusion
[software stack: libflame + SuperMatrix on top of clBLAS (OpenCL) and cuBLAS (CUDA) for GPUs; ACML, MKL, and BLIS (OpenMP/pthreads, C/Fortran/assembly) for CPU/MIC; BLIS (OpenMP/pthreads, C/assembly) for accelerators and other platforms]

70 SuperMatrix Approach on Heterogeneous Platforms
S0: D := A * B   S1: A = L * L^T   S2: B := B * L^-T   S3: C := C - B * B^T   S4: T := L^-1 * T
/* */
FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
/* */

71 Related Work
Target Platform | LAPACK Project | FLAME Project
Sequential | LAPACK | libflame
Sequential + multithreaded BLAS | LAPACK | libflame
Multicore/multithreaded | PLASMA | libflame + SuperMatrix
Multicore + out-of-order scheduling | PLASMA + QUARK | libflame + SuperMatrix
CPU + single GPU | MAGMA | libflame + SuperMatrix
Multicore + multi-GPU | DAGuE/StarPU/Kaapi | libflame + SuperMatrix


73 Questions?

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Francisco D. Igual Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Universidad Jaime I de Castellón (Spain) Robert A. van de

More information

Scheduling of QR Factorization Algorithms on SMP and Multi-core Architectures

Scheduling of QR Factorization Algorithms on SMP and Multi-core Architectures Scheduling of Algorithms on SMP and Multi-core Architectures Gregorio Quintana-Ortí Enrique S. Quintana-Ortí Ernie Chan Robert A. van de Geijn Field G. Van Zee quintana@icc.uji.es Universidad Jaime I de

More information

Dealing with Asymmetry for Performance and Energy Efficiency

Dealing with Asymmetry for Performance and Energy Efficiency Dealing with Asymmetryfor Performance and Energy Efficiency Enrique S. QUINTANA-ORTÍ Motivation Moore s law is alive, but Dennard s scaling is over Motivation Welcome dark silicon and asymmetric architectures

More information

Unleashing the Power of Multi-GPU Accelerators with FLAME

Unleashing the Power of Multi-GPU Accelerators with FLAME Unleashing the Power of Multi-GPU Accelerators with FLAME Enrique S. Quintana Ortí Universidad Jaume I quintana@icc.uji.es SIAM Conference on Parallel Processing for Scientific Computing Seattle, 2010

More information

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen pauldj@aices.rwth-aachen.de ComplexHPC Spring School 2013 Heterogeneous computing - Impact

More information

Automatic Development of Linear Algebra Libraries for the Tesla Series

Automatic Development of Linear Algebra Libraries for the Tesla Series Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

Hierarchical DAG Scheduling for Hybrid Distributed Systems

Hierarchical DAG Scheduling for Hybrid Distributed Systems June 16, 2015 Hierarchical DAG Scheduling for Hybrid Distributed Systems Wei Wu, Aurelien Bouteiller, George Bosilca, Mathieu Faverge, Jack Dongarra IPDPS 2015 Outline! Introduction & Motivation! Hierarchical

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

Solving Large Dense Matrix Problems on Multi-Core Processors

Solving Large Dense Matrix Problems on Multi-Core Processors Solving Large Dense Matrix Problems on Multi-Core Processors Mercedes Marqués Gregorio Quintana-Ortí Enrique S. Quintana-Ortí Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071

More information

Matrix Computations on GPUs, multiple GPUs and clusters of GPUs

Matrix Computations on GPUs, multiple GPUs and clusters of GPUs Matrix Computations on GPUs, multiple GPUs and clusters of GPUs Francisco D. Igual Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón (Spain). Matrix Computations on

More information

Runtime Data Flow Graph Scheduling of Matrix Computations with Multiple Hardware Accelerators

Runtime Data Flow Graph Scheduling of Matrix Computations with Multiple Hardware Accelerators Runtime Data Flow Graph Scheduling of Matrix Computations with Multiple Hardware Accelerators FLAME Working Note #5 Ernie Chan Department of Computer Science The University of Texas at Austin Austin, Texas

More information

Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures

Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures Emmanuel Agullo 1, Henricus Bouwmeester 2, Jack Dongarra 1, Jakub Kurzak 1, Julien Langou 2,andLeeRosenberg

More information

Beautiful Parallel Code: Evolution vs. Intelligent Design

Beautiful Parallel Code: Evolution vs. Intelligent Design Beautiful Parallel Code: Evolution vs. Intelligent Design FLAME Working Note #34 Robert A. van de Geijn Department of Computer Sciences The University of Texas at Austin Austin, Texas 7872 rvdg@cs.utexas.edu

More information

Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS

Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS Jianyu Huang, Leslie Rice Joint work with Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn BLIS Retreat 2016 *Overlook of

More information

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent

More information

Level-3 BLAS on a GPU: Picking the Low Hanging Fruit

Level-3 BLAS on a GPU: Picking the Low Hanging Fruit Level-3 BLAS on a GPU: Picking the Low Hanging Fruit Francisco D. Igual Gregorio Quintana-Ortí Depto. de Ingeniería y Ciencia de Computadores Universidad Jaume I 12.071 Castellón Spain {figualgquintan}@icc.uji.es

More information

Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators. Enrique S. Quintana-Ortí

Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators. Enrique S. Quintana-Ortí Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators Enrique S. Quintana-Ortí Disclaimer Not a course on how to program dense linear algebra kernels on s Where have you

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems

Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems Mercedes Marqués Gregorio Quintana-Ortí Enrique S. Quintana-Ortí Depto. de Ingeniería y Ciencia de Computadores Universidad

More information

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators. FLAME Working Note #32

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators. FLAME Working Note #32 Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators FLAME Working Note #32 Gregorio Quintana-Ortí Francisco D. Igual Enrique S. Quintana-Ortí Robert van de Geijn Abstract In a

More information

Transforming Linear Algebra Libraries: From Abstraction to Parallelism

Transforming Linear Algebra Libraries: From Abstraction to Parallelism Transforming Linear Algebra Libraries: From Abstraction to Parallelism Ernie Chan, Robert van de Geijn, and Field G. Van Zee Department of Computer Sciences The University of Texas at Austin Austin, TX

More information

rcuda: an approach to provide remote access to GPU computational power

rcuda: an approach to provide remote access to GPU computational power rcuda: an approach to provide remote access to computational power Rafael Mayo Gual Universitat Jaume I Spain (1 of 60) HPC Advisory Council Workshop Outline computing Cost of a node rcuda goals rcuda

More information

Overview of research activities Toward portability of performance

Overview of research activities Toward portability of performance Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into

More information

Towards an efficient tile matrix inversion of symmetric positive definite matrices on multicore architectures

Towards an efficient tile matrix inversion of symmetric positive definite matrices on multicore architectures Towards an efficient tile matrix inversion of symmetric positive definite matrices on multicore architectures Emmanuel Agullo, Henricus Bouwmeester, Jack Dongarra, Jakub Kurzak, Julien Langou, Lee Rosenberg

More information

Matrix Computations. Enrique S. Quintana-Ortí. September 11, 2012, Hamburg, Germany

Matrix Computations. Enrique S. Quintana-Ortí. September 11, 2012, Hamburg, Germany Doing Nothing to Save Energy in Matrix Computations Enrique S. Quintana-Ortí quintana@icc.uji.esuji eeclust Workshop, Energy efficiency Motivation Doing nothing to save energy? Why at Ena-HPC then? Energy

More information

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers

More information

High performance matrix inversion of SPD matrices on graphics processors

High performance matrix inversion of SPD matrices on graphics processors High performance matrix inversion of SPD matrices on graphics processors Peter Benner, Pablo Ezzatti, Enrique S. Quintana-Ortí and Alfredo Remón Max-Planck-Institute for Dynamics of Complex Technical Systems

More information

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors Andrés Tomás 1, Zhaojun Bai 1, and Vicente Hernández 2 1 Department of Computer

More information

Transforming Linear Algebra Libraries: From Abstraction to Parallelism

Transforming Linear Algebra Libraries: From Abstraction to Parallelism Transforming Linear Algebra Libraries: From Abstraction to Parallelism FLAME Working Note #38 Ernie Chan, Robert van de Geijn, and Field G. Van Zee Department of Computer Sciences The University of Texas

More information

Resource aggregation for task-based Cholesky Factorization on top of heterogeneous machines

Terry Cojean, Abdou Guermouche, Andra Hugo, Raymond Namyst, and Pierre-André Wacrenier.

Making Programming Synonymous with Programming for Linear Algebra Libraries

Maribel Castillo, Ernie Chan, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn.

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators

Gregorio Quintana-Ortí, Francisco D. Igual, and Enrique S. Quintana-Ortí. Departamento de Ingeniería y Ciencia de Computadores, Universidad…

CPU-GPU Heterogeneous Computing

Steffen Lammel. Advanced Seminar "Computer Engineering", Winter Term 2015/16. Contents: introduction, motivation, characteristics of CPUs and GPUs, heterogeneous computing systems…

Toward Scalable Matrix Multiply on Multithreaded Architectures

Bryan Marker, Field G. Van Zee, Kazushige Goto, Gregorio Quintana-Ortí, and Robert A. van de Geijn. The University of Texas at Austin.

Thinking Outside of the Tera-Scale Box. Piotr Luszczek

Piotr Luszczek. A brief history of the teraflop: ASCI Red (1997), Intel Polaris (2007), GPGPU…

Parallel Tiled QR Factorization for Multicore Architectures

Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. Computer Science Dept., University of Tennessee, Knoxville, USA…

The StarPU Runtime System

The StarPU Runtime System The StarPU Runtime System A Unified Runtime System for Heterogeneous Architectures Olivier Aumage STORM Team Inria LaBRI http://starpu.gforge.inria.fr/ 1Introduction Olivier Aumage STORM Team The StarPU

More information

Introduction II. Overview

Introduction II. Overview Introduction II Overview Today we will introduce multicore hardware (we will introduce many-core hardware prior to learning OpenCL) We will also consider the relationship between computer hardware and

More information

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development

M. Serdar Celebi. Partnership for Advanced Computing in Europe (PRACE). Available online at www.prace-ri.eu…

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

More information

Bridging the Gap between Performance and Bounds of Cholesky Factorization on Heterogeneous Platforms

Emmanuel Agullo (INRIA Bordeaux Sud-Ouest, Talence, France, emmanuel.agullo@inria.fr), Julien Herrmann…

Scheduling of Linear Algebra Kernels on Multiple Heterogeneous Resources

Olivier Beaumont, Terry Cojean, Lionel Eyraud-Dubois, Abdou Guermouche, and Suraj Kumar.

Evaluation and Tuning of the Level 3 CUBLAS for Graphics Processors

Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, and Enrique S. Quintana-Ortí. Depto. de Ingeniería y Ciencia de Computadores…

SYCL-BLAS: Leveraging Expression Trees for Linear Algebra

Jose Aliaga (Universitat Jaume I, Castellon, Spain), Ruyman Reyes, Mehdi Goli (Codeplay Software). 2017 Codeplay Software Ltd.

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

Stan Tomov, Innovative Computing Laboratory, University of Tennessee, Knoxville. OLCF Seminar Series, ORNL, June 16, 2010.

Programming Models for Multi-Threading. Brian Marshall, Advanced Research Computing

Why do parallel computing? Limits of single-CPU computing: performance, available memory, I/O rates. Parallel computing allows…

A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection

Jack Dongarra, University of Tennessee / Oak Ridge National Laboratory. 11/24/2009.

The Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations.

Basic Linear Algebra Subprograms. Robert A. van de Geijn (Department of Computer Science, The University of Texas at Austin, Austin, TX, USA, rvdg@cs.utexas.edu) and Kazushige Goto (Texas Advanced Computing…).

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

Natalia Gimelshein (NVIDIA), Anshul Gupta (IBM), Steve Rennich (NVIDIA), Seid Koric (NCSA). Watson Sparse Matrix Package (WSMP): Cholesky, LDL^T, LU factorization…

Linear Algebra libraries in Debian. DebConf 10 New York 05/08/2010 Sylvestre

Linear Algebra libraries in Debian. DebConf 10 New York 05/08/2010 Sylvestre Linear Algebra libraries in Debian Who I am? Core developer of Scilab (daily job) Debian Developer Involved in Debian mainly in Science and Java aspects sylvestre.ledru@scilab.org / sylvestre@debian.org

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications

Ziming Zhong, Vladimir Rychkov, Alexey Lastovetsky. Heterogeneous Computing…

FLAME Working Note #23

FLAME Working Note #23 SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures FLAME Working Note #23 Ernie Chan Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Robert van de Geijn Abstract

More information

Using R for HPC Data Science. Session: Parallel Programming Paradigms. George Ostrouchov

George Ostrouchov, Oak Ridge National Laboratory and University of Tennessee, pbdR Core Team. Course at IT4Innovations, Ostrava, October…

FLAMES2S: From Abstraction to High Performance

FLAMES2S: From Abstraction to High Performance FLAMESS: From Abstraction to High Performance RICHARD VERAS The University of Texas at Austin and JONATHAN MONETTE The University of Texas at Austin and FIELD G. VAN ZEE The University of Texas at Austin

More information

Level-3 BLAS on the TI C6678 multi-core DSP

Murtaza Ali and Eric Stotzer, Texas Instruments ({mali,estotzer}@ti.com); Francisco D. Igual, Dept. Arquitectura de Computadores y Automática, Univ. Complutense de Madrid.

Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures

Azzam Haidar, Hatem Ltaief, Asim YarKhan, and Jack Dongarra. Department of Electrical Engineering and…

libflame: The Complete Reference (version 5.1.0-46). Field G. Van Zee, The University of Texas at Austin

Copyright © 2011 by Field G. Van Zee. All rights reserved. No part of this book…

A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures

E. Ayguade, R. M. Badia, D. Cabrera, A. Duran, M. Gonzalez, F. Igual, D. Jimenez, J. Labarta, X. Martorell…

Impact of Multi-core and Many-core Technology on Dense Linear Algebra

Enrique S. Quintana-Ortí. Berlin, September 2011. "The free lunch is over" (H. Sutter)…

MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

Jack Dongarra, T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki. University of Tennessee, Knoxville.

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

Weifeng Liu and Brian Vinter. Niels Bohr Institute, University of Copenhagen, Denmark. {weifeng, vinter}@nbi.dk. March 1, 2014.

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Informe Técnico ICC 2-2-28 Solving Dense Linear Systems on Graphics Processors Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí Febrero de 28 Departamento

More information

Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment

Azzam Haidar, Chongxiao Cao, Asim YarKhan, Piotr Luszczek, Stanimire Tomov…

Remote CUDA (rCUDA)

Why? High-performance clusters: fast interconnects; hundreds of nodes, with multiple cores per node; large storage systems; hardware accelerators. Better performance/watt, performance/cost…

Accelerating GPU kernels for dense linear algebra

Rajib Nath, Stanimire Tomov, and Jack Dongarra. Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville…

Programming Algorithms-by-Blocks for Matrix Computations on Multithreaded Architectures FLAME Working Note #29

Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Robert van de Geijn, Field G. Van Zee, Ernie Chan.

Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units

Technical Report 2014-001. Kazuya Matsumoto, Naohito Nakasato, and Stanislav Sedukhin. October 22, 2014. Graduate School of Computer…

SciDAC CScADS Summer Workshop on Libraries and Algorithms for Petascale Applications

SciDAC CScADS Summer Workshop on Libraries and Algorithms for Petascale Applications Parallel Tiled Algorithms for Multicore Architectures Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou SciDAC CScADS Summer Workshop on Libraries and Algorithms for Petascale Applications

More information

Trends and Challenges in Multicore Programming

Eva Burrows. Bergen Language Design Laboratory (BLDL), Department of Informatics, University of Bergen. March 17, 2010.

A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. () Published online in Wiley Online Library (wileyonlinelibrary.com)..33 A scalable approach to solving dense linear

More information

Parallelism. CS6787 Lecture 8 Fall 2017

Parallelism. CS6787 Lecture 8 Fall 2017 Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does

More information

Addressing Heterogeneity in Manycore Applications

RTM Simulation Use Case. stephane.bihan@caps-entreprise.com. Oil & Gas HPC Workshop, Rice University, Houston, March 2008. www.caps-entreprise.com

OpenMP for next generation heterogeneous clusters

Jens Breitbart. Research Group Programming Languages / Methodologies, Universität Kassel. jbreitbart@uni-kassel.de

A Square Block Format for Symmetric Band Matrices

Fred G. Gustavson (IBM T.J. Watson Research Center, Emeritus, and Umeå University, fg2935@hotmail.com), José R. Herrero, and E. Morancho (Computer Architecture…).

Parallel Tiled QR Factorization for Multicore Architectures LAPACK Working Note # 190

LAPACK Working Note #190. Alfredo Buttari, Julien Langou, Jakub Kurzak, Jack Dongarra. Department of Electrical Engineering and Computer Science…

arXiv:0707.3548v1 [math.NA] 24 Jul 2007

Parallel Tiled QR Factorization for Multicore Architectures. Alfredo Buttari, Julien Langou, Jakub Kurzak, Jack Dongarra. Department of Electrical Engineering…

Flexible Linear Algebra Development and Scheduling with Cholesky Factorization

Azzam Haidar (haidar@eecs.utk.edu), Chongxiao Cao (ccao1@vols.utk.edu), Stanimire Tomov (tomov@eecs.utk.edu), Asim YarKhan (yarkhan@eecs.utk.edu).

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Veerendra Allada, Troy Benjegerdes. Electrical and Computer Engineering, Ames Laboratory / Iowa State University…

AperTO - Archivio Istituzionale Open Access dell'università di Torino

AperTO, the open-access institutional repository of the University of Turin. "An hybrid linear algebra framework for engineering" (author's manuscript).

An approach to provide remote access to GPU computational power

University Jaume I, Spain. Joint research effort. Outline: GPU computing, GPU computing scenarios, introduction to rCUDA, rCUDA structure, rCUDA functionality…

StarPU: a runtime system for multigpu multicore machines

Raymond Namyst, RUNTIME group, INRIA Bordeaux. Journées du Groupe Calcul, Lyon, November 2010.

Accelerating the reduction to upper Hessenberg form through hybrid GPU-based computing

Stanimire Tomov and Jack Dongarra. University of Tennessee (USA), Oak Ridge National Laboratory (USA)…

Best Practice Guide to Hybrid PaRSEC + OpenMP Programming

Version 1.0, 30 September 2018. Jakub Šístek, Reazul Hoque, and George Bosilca.

arXiv:0709.1272v2 [cs.MS] 11 Sep 2007

A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures. LAPACK Working Note #191. Alfredo Buttari, Julien Langou, Jakub Kurzak, Jack Dongarra.

Ultra Large-Scale FFT Processing on Graphics Processor Arrays. Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc.

Ultra Large-Scale FFT Processing on Graphics Processor Arrays. Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc. Abstract Ultra Large-Scale FFT Processing on Graphics Processor Arrays Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc. Graphics Processor Unit (GPU) technology has been shown well-suited to efficient

More information

A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. 2007. MIMS EPrint 2007.122, Manchester Institute…

MAGMA. LAPACK for GPUs. Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville

MAGMA. LAPACK for GPUs. Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville MAGMA LAPACK for GPUs Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville Keeneland GPU Tutorial 2011, Atlanta, GA April 14-15,

More information

SemCache: Semantics-Aware Caching for Efficient GPU Offloading

SemCache: Semantics-Aware Caching for Efficient GPU Offloading Purdue University Purdue e-pubs Open Access Dissertations Theses and Dissertations Spring 2015 SemCache: Semantics-Aware Caching for Efficient GPU Offloading Nabeel Al-Saber Purdue University Follow this

More information

High-Performance Libraries and Tools. HPC Fall 2012 Prof. Robert van Engelen

High-Performance Libraries and Tools. HPC Fall 2012 Prof. Robert van Engelen High-Performance Libraries and Tools HPC Fall 2012 Prof. Robert van Engelen Overview Dense matrix BLAS (serial) ATLAS (serial/threaded) LAPACK (serial) Vendor-tuned LAPACK (shared memory parallel) ScaLAPACK/PLAPACK

More information

Suggested Readings

H&P: Chapter 7, especially 7.1-7.8 (over the next 2 weeks). Introduction to Parallel Computing: https://computing.llnl.gov/tutorials/parallel_comp/. POSIX Threads…

Making Dataflow Programming Ubiquitous for Scientific Computing

Hatem Ltaief, KAUST Supercomputing Lab. Synchronization-reducing and communication-reducing algorithms and programming models for large-scale…

Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing

Stanimire Tomov, Rajib Nath, and Jack Dongarra. University of Tennessee (USA)…

Max Planck Institute Magdeburg Preprints

Max Planck Institute Magdeburg Preprints Peter Benner Pablo Ezzatti Enrique S. Quintana-Ortí Alfredo Remón Matrix Inversion on CPU-GPU Platforms with Applications in Control Theory MAX PLANCK INSTITUT FÜR DYNAMIK KOMPLEXER TECHNISCHER SYSTEME

More information

Chapter 3 Parallel Software

Chapter 3 Parallel Software Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Transforming Linear Algebra Libraries: From Abstraction to High Performance

Richard M. Veras, Jonathan S. Monette, and Robert A. van de Geijn (The University of Texas at Austin), and Enrique S. Quintana-Ortí…

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

Michael Sullivan, TJHSST Class of 2004. June 2004. A supercomputer is a device for turning compute-bound problems into…

Solution of the Transport Equation Using Graphical Processing Units

Gil Gonçalves Brandão. October 2009. Computational Fluid Dynamics (CFD) has always struggled for faster computing resources…

CafeGPI. Single-Sided Communication for Scalable Deep Learning

CafeGPI. Single-Sided Communication for Scalable Deep Learning CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks

More information