Performance Strategies for Parallel Mathematical Libraries Based on Historical Knowledgebase

Size: px

Start display at page:

Download "Performance Strategies for Parallel Mathematical Libraries Based on Historical Knowledgebase"

Estella Jocelyn Austin
5 years ago
Views:

1 Performance Strategies for Parallel Mathematical Libraries Based on Historical Knowledgebase CScADS workshop 29 Eduardo Cesar, Anna Morajko, Ihab Salawdeh Universitat Autònoma de Barcelona

2 Objective Mathematical Library (PETSc) Performance Methodology Knowledgebase Data Analysis Load balancing Memory pre-allocation Summary What we have would like 2

3 Design a methodology, based on input data, that can automatically choose between existing solving parameters to improve PETSc application s performance 3

Algorithms KSP PC Draw Data Structure Matrices

4 Application Code Other higher Level solvers PDE Solvers SNES TS Time Stepping SLES Mathematical Algorithms KSP PC Draw Data Structure Matrices Vectors Index set Level of abstraction BLAS LAPACK MPI 4

Data Structure Matrices Vectors Index set Fundamental objects

5 Data Structure Matrices Vectors Index set Fundamental objects for storing linear operators There is no data structure appropriate for all problems. PETSc supports: dense storage where each process stores its entries in the usual Array style. Compressed sparse row storage, where only the nonzero values be stored. Block compressed Sparse row Storage Block Diagonal Storage 5

6 Application Code Other higher Level solvers PDE Solvers SNES TS Time Stepping SLES Mathematical Algorithms KSP PC Draw Data Structure Matrices Vectors Index set Level of abstraction BLAS LAPACK MPI 6

7 Mathematical Algorithms KSP PC To solve a linear system you need to calculate the matrix s preconditioner (PC) then to apply the Krylov Subspace method (KSP) Different mathematical algorithms exist either to calculate the preconditioner (jacobi, LU, etc.) or for the KSP (GMRES, CG, etc.) 7

4 5 7 4 2 3 4 5 6 7 7 6 4 2 7 9 5 7 8 9 9 5 7 5 8 6 8 7 5 5 8 9 9 8 7 Pattern recognition Data Mining Load

.. Dense cgs bjacobi 4 Mpibdaij chebychev as Mpiaij rechardson jaco Mpiaij chebychev asm Dense cgs bjacobi

8 Pattern recognition Data Mining Load balancing Knowledgebase Mathematical Constrains Mpiaij rechardson jaco Mpiaij chebychev asm... Dense cgs bjacobi 4 Mpibdaij chebychev as Mpiaij rechardson jaco Mpiaij chebychev asm Dense cgs bjacobi diag Solve problem Control Solution Memory preallocation

9 PETSc Solve Ax = b Main Routine Linear Solvers (KSP) PC Application Initialization Initialize A and b Post- Processing User code PETSc code 9

10 PETSc Solve Ax = b Pattern recognition Main Routine Knowledgebase Linear Solvers (KSP) PC Data Mining Solver Parameters Decision Application Initialization Initialize A and b Post- Processing User code PETSc code User code PETSc code Tuning Code

Main Routine 2 3 3 2 5 3 3 6 5 2 9 3 8 3 4 6 5 2 9 3 8 3 4 2 65 3 3 9 8 4

11 Main Routine Pattern recognition Knowledgebase Data Mining Solver Parameters Decision PETSc Solve Ax = b Linear Solvers (KSP) PC Application Initialization Initialize A and b Post- Processing User code PETSc code Tuning Code

12 Contains performance characterization of historical executions for different input data and hardware configurations It is organized according to the most common matrix patterns and the matrix size 2

13 Input Type Input Size No. Processors Memory Representation KSP PC Total Time (s) Matrix Pattern Around-Diagonal 952X952 8 mpiaij bcgs asm 3, Around-Diagonal 952X952 8 mpiaij bcgs bjacobi 25,5934 Around-Diagonal 952X952 8 mpiaij bcgs jacobi 35, Around-Diagonal 952X952 8 mpiaij bcgsl asm 222,55694 Around-Diagonal 952X952 8 mpiaij bcgsl bjacobi 89,236794,942,46,7576 2,63E-5,3553,489,896,2772 2,87E-5,7592,3492,298,72,275,8965,725,2237,86,767,78,94,427,57,245,23,756,866,373,3687,43,439,22,2387,59,965,852,569,7959,2629,457,4,52,246,3,846,2625,2327,7945,585,27,682,37,898,65,23,859,4,2374,298,642,925,423,2647,893,38,257,998,89,58,267,928,645,44,959,488,5944,624,623,5,885,2642,25,2499,359 5,25E-5,46,5852,4293,839,367,224,8769,9,9E-6,359,254,25,527,628,4946,623,873,258,224,2648,8342,2474 5,25E-5,52,6252,428,354,342,226,9,2455,8324,542,5666,632,234,375,6967,6452,69,2336,23866 Around-Diagonal 952X952 8 mpiaij bcgs asm 3, Distributed 9242X mpiaij bcgs asm 522,5523 Distributed 9242X mpiaij bcgs bjacobi 455,43445 Distributed 9242X mpiaij bcgs jacobi 459,62954 Distributed 9242X mpiaij bcgsl asm 457,2476,2539,2526,26,24,34,237,366,2525,2554,2556,8,5,333,368,2544,2546,2529,26 4,5E-5,39,379,376,47,759,6,763,62,25,6,326,35,29,45,6,38,56,26 7,2E-5,338,362,389,7,7 9,5E-5 2,7E-5,4E-6 5,9E-5,49,26,44,7,7,4 5,E-6 4,2E-5,28,5,44,26 2,8E-5,22,4 6,7E-5,37,27,9,45,74 4,E-6,6 9,8E-5,,34,26,48,5,5,26 8,9E-6 7E-6,,76,35,94,49,229,52,274,E-6 5,4E-7 8,2E-5,6,24,34,237,366,243,669,26,2,2,,8,5,333,368,368,37,692,263,22,2,22,2,2 4,5E-5,39,379,376,37,677,2546,2564,26,7 4,9E-6,7 5,E-6,99,25,6,326,35,396,37,685,2573,2569,2599,35,7,35,7,7 7,2E-5,338,362,389,397,684,2548,257,2548,2572,26,89,42,88,42,385,23,2,352,245,398,3,73,755,259,758,2586,82,2433,8,2427,82,6,52,26,253,2,263,3,744,6,75,2,768,92,26,2526,2542,2543,2535,76,54,2,7 3,4E-5 9,5E-6,9E-5,4E-6,2 4,9E-6,7,42,826,2559,258,2533,259,2534,5,9,3,8 2,4E-6,7E-5,E-6,2,22,6,35,86,2435,254,26 Distributed 9242X mpiaij bcgsl bjacobi 456,5858 Distributed 9242X mpiaij bcgsl jacobi 455,3952 3

14 4 25 Non-zero Blocks % Non-zero Diagonal

15 5 25 Non-zero Blocks % Non-zero Diagonal

16 6 25 Non-zero Blocks % Non-zero Diagonal 8 Non-zero Blocks 5% Non-zero Diagonal = 25 Diagonal distance = 5%

17 Number of Similar Entries = 39 Number of Different Entries = 25

18 Matrix No. Similar Entries (intersection) No. of Different Entries Union Matrix Similarity (Fitness) City Block (Manhattan) distance Diagonal Density Distance Mat % Mat % - Fitness Distance Max Distance (64)

19 Matrix No. Similar Entries (intersection) No. of Different Entries Union Matrix Similarity (Fitness) N. City Block (Manhattan) distance N. Diagonal Density Distance Mat Mat

20 Pattern recognition Data Mining Load balancing Knowledgebase Mathematical Constrains Mpiaij rechardson jaco Mpiaij chebychev asm... Dense cgs bjacobi 4 Mpibdaij chebychev as Mpiaij rechardson jaco Mpiaij chebychev asm Dense cgs bjacobi diag Solve problem Control Solution Memory preallocation

5 6 7 Uniform Matrix Distribution using PETSC_DECIDE 7 5 7 3 8 7 Non uniform Matrix

21 5 6 7 Uniform Matrix Distribution using PETSC_DECIDE Non uniform Matrix Distribution using PETSC_DECIDE Non uniform Matrix Distribution using Load Balancing Module

22 Mat A; Int column[3],i; double value[3]; MatCreate (PETSC_COMM_WORLD,&A); MatSetSizes (A, PETSC_DECIDE, PETSC_DECIDE, N, N); MatSetFromOption(A); Matrix global index value[] = -.; value[] = 2.; value[2] = -.; if (rank == ) { for (i=; i<n-2; i++) { column[] = i-; column[] = i; column[2] = i+; MatSetValues(A,,&i,3,column,value,INSERT_VALUES); } } // Must set also the and N- values PETSc determine the matrix cross-processor distribution MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY); MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);

23 Line-based distribution for a 9242x 9242 matrix (PETSC_DECIDE) Proc. Number of Lines Values Total

24 Value-based distribution for a 9242x 9242 matrix Proc. Number of Lines Values Total

25 After specifying the data structure properties and the way it will be distributed, it is needed to allocate the memory. Automatic: each SET_VALUE call PETSc allocates the memory that corresponds to this value. (nonzero values malloc) Each processor allocate a memory of the size of its corresponding rows. ( malloc) Allocate the memory for the local sub matrix depending on the most dense row. ( malloc) Calculate the number of nonzero values per row. And allocate the memory for the row. (No. rows malloc) Each local processor loads its own data from the file 29

Create Setting Reading Assembly Solve Create Setting Reading Assembly Solve 3 3 25 25

26 Create Setting Reading Assembly Solve Create Setting Reading Assembly Solve Execution time (s) 2 5 Executin time (s) Perfromance Performance 3

We have designed a methodology based on input data that can automatically choose between existing solving parameters to improve PETSc application s performance Our methodology follows data mining

27 We have designed a methodology based on input data that can automatically choose between existing solving parameters to improve PETSc application s performance Our methodology follows data mining techniques and is based on the comparison of the input matrix with a set of predefined patterns that are stored in a knowledgebase The load balancing method automatically hands data out to all processes based on the non zero-value distribution across the matrix Choosing the suitable memory pre-allocation method improves memory utilization and reduces the cost of memory allocation 33

28 Components for analyzing the input matrix Database with hundreds of executions Components for load balancing and memory preallocation We can easily produce a wrapper for PETSc calls A more detailed characterization of the hardware (cache, memory, FLOPS, #cores?) 34

Anna Morajko.

Anna Morajko. Performance analysis and tuning of parallel/distributed applications Anna Morajko Anna.Morajko@uab.es 26 05 2008 Introduction Main research projects Develop techniques and tools for application performance