Understanding the performances of sparse compression formats using data parallel programming model


2017 International Conference on High Performance Computing & Simulation

Ichrak MEHREZ, Olfa HAMDI-LARBI, Thomas DUFAUD, Nahid EMAD
Université de Versailles St-Quentin, Université Paris-Saclay, Li-Parad, 78035 Versailles, France
Université de Tunis El Manar, Faculté des Sciences de Tunis, URAPOP, 2092 Tunis, Tunisia
Maison de la Simulation, 91191 Saclay, France
m ichrak@hotmail.fr

Abstract: Several applications in numerical scientific computing involve very large sparse matrices with a regular or irregular sparse structure. These matrices can be stored using special compression formats (storing only the non-zero elements) to reduce memory space and processing time. The choice of the optimal format is a critical process that involves several criteria. The general context of this work is to propose an auto-tuner system that, given a sparse matrix, a numerical method, a parallel programming model and an architecture, can automatically select the Optimal Compression Format (OCF). In this paper we study the performance of two sparse compression formats, namely CSR (Compressed Sparse Row) and CSC (Compressed Sparse Column). We propose data parallel algorithms for the Sparse Matrix Vector Product in the case of each format, and we extract a set of parameters that help us select the more suitable compression format for a given sparse matrix.

Keywords: data parallel model; compression formats; sparse matrix; SMVP

I. INTRODUCTION

Several scientific applications such as fluid dynamics, circuit simulation, optimization problems, electromagnetics, image processing, etc. lead to sparse linear algebra problems. In most of these applications we handle large sparse matrices. A sparse matrix is generally defined as a matrix with a small number of non-zero elements. However, a matrix can be termed sparse whenever special techniques can be utilized to take advantage of the large number of zero elements and their locations [1]. To minimize memory space and optimize computational performance, users have to exploit the nature of these matrices by using particular data storage structures (storing only the non-zero elements). The chosen format highly depends on the sparse matrix structure, which may be either regular or irregular (according to the locations of its non-zero elements). A sparse matrix has a regular structure if all non-zero elements are grouped in the matrix according to a particular pattern (e.g. triangular, diagonal, band matrices, etc.). Conversely, a sparse matrix has an irregular (also called general) structure if its non-zero elements are scattered in the matrix (e.g. variable banded, random, etc.). For each sparse matrix structure, one or more dedicated compressed storage formats are known in the literature.

The chosen compression format may affect the performance of the sparse algorithm; thus, the choice of the most appropriate format is a fairly critical process that must be well studied. Thereby, the study of different compressed formats and their impact on sparse computations is the subject of many research projects. Some works focus on optimizing existing sparse compression formats or proposing new ones [2], [3], [4], [5]. These studies were carried out for a given architecture and in order to optimize a specific computing kernel (the Sparse Matrix Vector Product in most cases).
In [6], the authors propose the SPARSITY system, which is designed to help users obtain tuned kernels for sparse matrix computations without having to know the details of their machine memory hierarchy or the way the structures of their matrices will be mapped onto this hierarchy. In [7], Vuduc et al. proposed the OSKI (Optimized Sparse Kernel Interface) library, a collection of low-level primitives providing auto-tuned computing kernels for sparse matrices. However, the choice of the sparse matrix compression format in these libraries is made statically. We notice that current solutions generally focus on optimizing only one criterion, such as minimizing storage space or accelerating sparse computations.

In [8], we proposed the design of an auto-tuner system that, given a sparse matrix, a numerical method, a parallel programming model and an architecture, can automatically select the Optimal Compression Format (OCF) and propose its associated implementation. For this, we adopted a component-oriented approach to define an extensible system. Thus, the system can be completed progressively, without calling into question its basic principles. We aim for the proposed solution to reduce time, memory, communication and energy complexities (Fig. 1). Specifically, we proposed a first step towards realizing our design: we described the modeling of an auto-tuner system and pointed out that user expertise is one of the main advantages of our system. Indeed, the user (expert) can guide the decision process by providing information on the application domain as well as the programming and execution environment.

Fig. 1. Utilization component

Thus, we proposed data parallel algorithms for the Horner scheme when the CSR, CSC, COO and ELL formats are used. We studied two levels of data representation: scalar data and vector data. For each level, we defined metrics that allow us to make a choice among the studied formats.

In this paper we limit our study to two representative formats, namely the CSR and CSC formats. We start with a theoretical study where we apply a data parallel model to SMVP computing for both formats. To validate our study, we conduct experiments on a set of pathological matrices designed to comply with the theoretical study. We notice that the metrics deduced from the data parallel model are not sufficient to make a decision. Therefore, we propose a new decision process taking into account both the data parallel model analysis and an algorithm analysis.

This paper is organized as follows: in Section II, we present the data parallel programming model. In Section III, we present sparse compression formats. Section IV is devoted to the Sparse Matrix Vector Product computing kernel. In Section V we present the theoretical study. Section VI is devoted to the presentation and interpretation of our experimental results. Finally, in Section VII, we present concluding remarks and perspectives.

II. PARALLEL PROGRAMMING MODELS

A parallel programming model is linked to the expression of a parallel algorithm. It describes its tasks, their interactions and their effects on the computation state. There are two main parallel programming models: task (or control) parallelism and data parallelism. In the first one, the parallelism results from the application of different tasks to the data, whereas a data parallel algorithm is defined by a unique control flow of instructions, each of them executed on a set of data. In this paper, we present our study using the data parallel programming model. Indeed, we apply the same computing kernel to different matrix portions. In a data parallel algorithm, data are structured onto a virtual geometry representing the layout of virtual processors. In [9], some parameters are presented to evaluate and compare data parallel algorithms. Let:
  p be the number of virtual processors in the virtual geometry (1D, 2D, ...),
  alpha_ops be the number of virtual processors concerned by a data parallel operation ops,
  cx be the number of data parallel operations (other than communications).
The computation ratio for a data parallel operation ops is defined by alpha_ops / p; the average computation ratio is then:

  Phi = (1/cx) * Sum_{ops=1..cx} (alpha_ops / p)    (1)

The best data parallel algorithm to solve a scientific problem is the one with the smallest cx (the numerical aspect being supposed correct). Among such algorithms, the one that maximizes Phi is often chosen since it generates fewer communications [9].
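As a quick illustration of Equation 1 (the numbers here are chosen purely for exposition and are not taken from the paper): with a geometry of p = 4 virtual processors and cx = 2 data parallel operations involving respectively alpha_1 = 4 and alpha_2 = 2 processors, we get Phi = (1/2) * (4/4 + 2/4) = 0.75; an algorithm that keeps more virtual processors active per operation yields a Phi closer to 1.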

III. SPARSE COMPRESSION FORMATS

Several formats have been proposed to optimize sparse matrix storage. Due to its simplicity and effectiveness, the CSR (Compressed Sparse Row) format remains the most used [1]. To store an n x n sparse matrix A with nnz non-zero elements, the CSR format uses three arrays: val, of size nnz, to store the non-zero elements of A row-wise (from row 1 to row n); col_ind, of size nnz, to store the column index of each element stored in val; and row_ptr, of size n + 1, to store pointers to the head of each row in the arrays val and col_ind, with row_ptr(n) = nnz (Fig. 2).

Fig. 2. CSR format

In contrast to CSR, the Compressed Sparse Column (CSC) format uses the array val to store the non-zero elements of A column-wise [1]. By analogy, it uses two other arrays: row_ind, to store the row index of each element stored in val, and col_ptr, of size n + 1, to store pointers to the head of each column in the arrays val and row_ind, with col_ptr(n) = nnz (Fig. 3).

Fig. 3. CSC format

IV. SPARSE MATRIX VECTOR PRODUCT KERNEL

In this paper, we study the Sparse Matrix Vector Product (SMVP) since it is an important computing kernel in many scientific applications that involve iterative operations. Let A be an (n x n) sparse matrix with nnz non-zero elements. Our aim is to compute Y = A * X, where X is an input dense array with n elements and Y the result array. The SMVP algorithm depends on the compression format used for the sparse matrix. We present here two different SMVP algorithms corresponding to the CSR and CSC compression formats. To improve the performance of SMVP, two manual optimization techniques are applied:
  Scalar replacement: it consists of replacing an array element (indirect memory access) with a simple scalar variable in order to make access faster. This optimization is recommended when the instruction in question is within a loop.
  Removing invariant operations: it aims to avoid unnecessary work by hoisting loop-invariant computations out of the loop.
Thus, we evaluated two different optimized implementations of the SMVP kernel, represented by Algorithm 1 and Algorithm 2 and corresponding to SMVP when the matrix is stored, respectively, in the CSR and CSC formats.

Algorithm 1: SMVP for CSR format
  for i = 1 to n do
    s = 0
    e = row_ptr[i + 1] - 1
    for j = row_ptr[i] to e do
      indX = col_ind[j]
      s = s + val[j] * X[indX]
    end
    Y[i] = s
  end

Algorithm 2: SMVP for CSC format
  for i = 1 to n do
    x = X[i]
    e = col_ptr[i + 1] - 1
    for j = col_ptr[i] to e do
      indR = row_ind[j]
      Y[indR] = Y[indR] + val[j] * x
    end
  end

V. THEORETICAL STUDY

In this section we present the compression format study using the data parallel programming model. We start with a theoretical study where, given a matrix, we apply the data parallel model to each proposed format (CSR and CSC). We organize a virtual geometry, on which we map and distribute data according to the compressed format used. Based on the given target architecture and the metrics that we will determine, we can then, theoretically, select the OCF. In a second step, we conduct an experimental study where we compute the SMVP kernel for each compression format. This step helps us analyze the relevance of the theoretical study. We set the following hypotheses:
  The architecture is fixed (a cluster of multicore CPUs).
  We choose not to consider communication at this point, even though it has a significant impact on the OCF selection process; it will be incorporated into our auto-tuner system in future work.
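For concreteness, the following is a minimal C rendering of Algorithm 1 and Algorithm 2, with the two manual optimizations applied. It assumes 0-based indexing (so row_ptr and col_ptr have n + 1 entries with row_ptr[n] = col_ptr[n] = nnz), and the function names are illustrative only; this is a sketch, not the exact code used in our experiments.

    /* SMVP Y = A*X with A stored in CSR (0-based indexing).
     * Scalar replacement: the running sum is kept in the scalar s.
     * Invariant removal: the row end pointer is read once per row. */
    void smvp_csr(int n, const double *val, const int *col_ind,
                  const int *row_ptr, const double *X, double *Y)
    {
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            int e = row_ptr[i + 1];          /* loop invariant hoisted */
            for (int j = row_ptr[i]; j < e; j++) {
                int indX = col_ind[j];       /* indirect access */
                s += val[j] * X[indX];
            }
            Y[i] = s;
        }
    }

    /* SMVP Y = A*X with A stored in CSC (0-based indexing).
     * Scalar replacement: X[i] is read once per column.
     * Y must be zero-initialized by the caller. */
    void smvp_csc(int n, const double *val, const int *row_ind,
                  const int *col_ptr, const double *X, double *Y)
    {
        for (int i = 0; i < n; i++) {
            double xi = X[i];                /* loop invariant hoisted */
            int e = col_ptr[i + 1];
            for (int j = col_ptr[i]; j < e; j++) {
                int indR = row_ind[j];       /* indirect access */
                Y[indR] += val[j] * xi;
            }
        }
    }

Note that the CSR kernel writes each Y[i] exactly once, whereas the CSC kernel accumulates into Y and therefore also reads it, which is one source of the different access counts discussed later.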
In the following, we present how we extract the metrics relative to the data parallel programming model in the case of SMVP in order to select the OCF (between CSR and CSC). In our previous work [8], we studied two different models of data representation:
Scalar data parallel operation: we assumed that each non-zero element (scalar data) is mapped to a virtual processor. Therefore, we have a 1D virtual geometry G of p virtual processors.

Vector data parallel operation: we noticed that SMVP involves two main vector operations, the dot product (Algorithm 3) and the saxpy operation (Algorithm 4). Thus, we studied a second level of data parallel operation where each vector is mapped to a virtual processor.

Algorithm 3: Dot product alpha = dot(X, Y) = X^T Y
  alpha = 0
  for i = 1 to n do
    alpha = alpha + X(i) * Y(i)
  end

Algorithm 4: saxpy operation Y = saxpy(alpha, X, Y) = alpha X + Y
  for i = 1 to n do
    Y(i) = alpha * X(i) + Y(i)
  end

In [8], we noticed that we cannot make a choice between the CSR and CSC formats when we consider data as scalars. Indeed, the metrics deduced from [9] were not sufficient to compare the CSR and CSC formats. Thus, in this paper, we limit our study to the second level of data parallel operation, where we handle vectors instead of scalars. We organize a 1D virtual geometry G_vect of size p_vect = n; each vector is assigned to one virtual processor. Since we do not take communication into account in this work, the construction of the final result Y is not considered.

To implement the CSR format, we map the vector X and the sparse matrix rows onto the virtual geometry. First, each virtual processor i initializes the i-th component of Y with the i-th value of X. We compute the dot product (Algorithm 3) of each row i of the matrix (val(i, :)) with the vector X using only one data parallel vector operation, with complexity O(nzmaxR), where nzmaxR is the maximum number of non-zero elements per row. The i-th virtual processor of the geometry stores the i-th component of A * X.

To implement the CSC format, we map the vector Y and the sparse matrix columns onto the virtual geometry. First, each virtual processor j initializes the j-th component of Y with the j-th value of X. Each virtual processor j then updates the vector Y: it performs a saxpy operation (Algorithm 4) involving the j-th column of the matrix (val(:, j)) and the j-th element of the vector X (X(j)). Let nzmaxC be the maximum number of non-zero elements per column; the worst-case complexity of this operation is O(nzmaxC).

Let beta_Vops be the number of virtual processors concerned by a data parallel vector operation Vops, and cx_vect be the number of data parallel vector operations (other than communications). For SMVP, only one type of data parallel vector operation is involved: multiplications (since we do not consider the construction of the final result Y). Table I gives the vector data parallel metrics for the CSC and CSR formats. The number of data parallel vector operations cx_vect is the same for both formats. However, the cost cost_Vops of the multiplication vector operation is different: it depends on either nzmaxR or nzmaxC (since, theoretically, the makespan depends on the execution time of the most loaded process). We can conclude that the best performance is obtained when we use the format that minimizes the vector operation cost, i.e. nzmaxR or nzmaxC. We remark, from Table I, that the time complexity (cost_Vops) is smaller in the case of CSR (respectively CSC) when nzmaxR < nzmaxC (respectively nzmaxC < nzmaxR).

To conform to this study, we build a pathological case composed of two types of matrices: MatRow (respectively MatCol) with width contiguous dense rows (respectively columns). Figure 4 illustrates examples of pathological matrices. These matrices have the particularity of satisfying the bound criterion (nzmaxR and nzmaxC).
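For reference, the two bound quantities nzmaxR and nzmaxC can be read directly off the compressed pointer arrays. The helpers below are a minimal C sketch (hypothetical function names, 0-based indexing); they are not part of the system described in [8].

    /* Maximum number of non-zeros per row, from the CSR row pointers. */
    int nzmax_row(int n, const int *row_ptr)
    {
        int m = 0;
        for (int i = 0; i < n; i++) {
            int len = row_ptr[i + 1] - row_ptr[i];
            if (len > m) m = len;
        }
        return m;
    }

    /* Maximum number of non-zeros per column, from the CSC column pointers. */
    int nzmax_col(int n, const int *col_ptr)
    {
        int m = 0;
        for (int j = 0; j < n; j++) {
            int len = col_ptr[j + 1] - col_ptr[j];
            if (len > m) m = len;
        }
        return m;
    }

According to Table I, the format minimizing the corresponding bound is the one favored by the vector data parallel metrics.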
Indeed, when the matrix has a structure close to that of a pathological matrix, we directly choose the compression format based on the data parallel metrics. Otherwise, our system performs a self-update and the decision process is based on a rules database, i.e. we compute the SMVP for each compression format and select the one that minimizes the computing time. Algorithm 5 describes the decision process deduced from the theoretical study.

TABLE I. VECTOR DATA PARALLEL METRICS
              CSR      CSC
  cx_vect     1        1
  p_vect      n        n
  beta_Vops   n        n
  cost_Vops   nzmaxR   nzmaxC

Fig. 4. Example of pathological matrices used in our tests

VI. PERFORMANCE EVALUATION

Let PP be the number of physical processors. In this section, we evaluate the performance of SMVP when the pathological matrix is stored in CSC or CSR format. Two different cases are studied: PP = p_vect and PP < p_vect.

TABLE II. DATA ACCESS WHEN PP = p_vect
        #Indirect Access   #Read      #Write     #Operations
  CSR   3 nzmaxR           4 nzmaxR   2 nzmaxR   2 nzmaxR
  CSC   4 nzmaxC           4 nzmaxC   2 nzmaxC   2 nzmaxC

Fig. 5. Experimental results, N = 1152, PP = p_vect

TABLE III. DATA ACCESS WHEN PP < p_vect
        #Read                      #Write                     #Indirect Access           #Operations
  CSR   3 nrPPtmax + 6 nzPPtmax    4 nrPPtmax + 3 nzPPtmax    2 nrPPtmax + 4 nzPPtmax    2 nzPPtmax
  CSC   3 ncPPtmax + 6 nzPPtmax    3 ncPPtmax + 3 nzPPtmax    2 ncPPtmax + 5 nzPPtmax    2 nzPPtmax
  (nzPPtmax: number of non-zero elements in PPtmax; nrPPtmax: number of rows in PPtmax; ncPPtmax: number of columns in PPtmax)

A. First case: PP = p_vect

In this first case, we evaluate our solution using a number of physical processors PP equal to the size of the virtual geometry p_vect. We conduct our experiments on a multicore cluster of the Grid5000 platform. The cluster consists of 72 nodes, each node with two hexa-core Intel Xeon E v3 processors (PP = 1152 cores in total), which is the maximum number of homogeneous processors available within the Grid5000 platform. Consequently, we run our tests using pathological matrices with n = PP = p_vect = 1152. The experiments were conducted by measuring the total execution time of the SMVP kernel for each matrix. We note that when PP = p_vect the total execution time is bounded by (theoretically, equal to) the execution time of the most loaded processor. For the CSR (respectively CSC) format, the most loaded process is the one handling nzmaxR (respectively nzmaxC) elements. Based on the theoretical study (Algorithm 5), for matrices like MatRow (respectively MatCol) the CSC format (respectively CSR format) is optimal.

Figure 5 (a) (respectively (b)) illustrates the execution time of SMVP for MatRow (respectively MatCol) when the matrix A is stored in CSR and then in CSC. For MatCol, for any value of width, the experimental results validate the theoretical study. However, we note that in the case of MatRow the experimental and theoretical results are coherent only for width values less than 750. When the MatRow matrix is stored in the CSC format, the number of operations handled by the most loaded processor increases with width; however, it does not reach the number obtained when the same matrix is stored in the CSR format (Fig. 5 (f)).

Algorithm 5: Decision process from the theoretical study
  Extract_Matrix_Characteristics(A)
  if IsClose(Structure(A), Structure(pathological matrix)) then
    if nzmaxR < nzmaxC then
      format = CSR
    else
      format = CSC
    end
  else
    Update_tests_database()
    Update_rules_database()
    format = Select_format(A)
  end

Algorithm 6: CSR format: dot product handled by the most loaded processor when PP = p_vect
  for j = 0 to nzmaxR - 1 do
    indX = col_ind[j]
    s = s + val[j] * X[indX]
  end

Algorithm 7: CSC format: saxpy operation handled by the most loaded processor when PP = p_vect
  for j = 0 to nzmaxC - 1 do
    indR = row_ind[j]
    Y[indR] = Y[indR] + val[j] * x
  end

Algorithm 8: Decision process 2
  Extract_Matrix_Characteristics(A)
  if IsClose(Structure(A), Structure(pathological matrix)) then
    if CSR_cost < CSC_cost then
      format = CSR
    else
      format = CSC
    end
  else
    Update_tests_database()
    Update_rules_database()
    format = Select_format(A)
  end

Nevertheless, from a certain value of width (750), CSR provides better performance than CSC (Fig. 5 (a)). Thus, we can conclude that the metrics given by the data parallel model analysis alone do not lead to the right choice in certain cases. Such cases can be prevented by an analysis of the algorithm. In fact, when analyzing Algorithm 6 and Algorithm 7, we notice that another factor can impact the execution time: the number of data reads and writes, which depends on nzmaxR and nzmaxC. Table II presents the number of data accesses and operations for SMVP when the CSR and CSC formats are used. These values are associated with the most loaded process, since the total execution time is bounded by its time. We extract these metrics from Algorithm 6 and Algorithm 7. The most loaded process in the case of CSR (respectively CSC) handles one row (respectively column) with nzmaxR (respectively nzmaxC) elements. An indirect access corresponds to an array element access (e.g. val[j]).

We notice from Fig. 6 that the number of data reads (respectively writes) using CSC is, for any value of width, less than that using CSR. Although read/write data accesses influence the total execution time, they cannot explain the results presented in Fig. 5 (a). However, the number of indirect accesses to array elements can be the origin of this result. In fact, each indirect access requires additional time for address computation. In the rest of the paper, Indirect Access denotes an array element address computation (read or write). Indeed, we remark from Fig. 5 (c) that the number of indirect accesses using the CSC format becomes greater than that of the CSR format from width = 850. Notice that this is not the same value (750) detected in Fig. 5 (a). Thus, we conclude that the numbers of operations, read/write accesses and indirect accesses together affect the SMVP performance. Since each format has its own cost, we define:

  CSR_cost = 3 nzmaxR AC_cost + 4 nzmaxR R_cost + 2 nzmaxR W_cost + 2 nzmaxR OP_cost    (2)

  CSC_cost = 4 nzmaxC AC_cost + 4 nzmaxC R_cost + 2 nzmaxC W_cost + 2 nzmaxC OP_cost    (3)

where AC_cost is the cost of an array element address computation (indirect access), R_cost is the cost of a read access, W_cost is the cost of a write access, and OP_cost is the cost of an arithmetic operation (multiplication or addition). The weights of each component (AC_cost, R_cost, W_cost and OP_cost) in Equations 2 and 3 are determined from Table II.
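Assuming the four unit costs have been measured for a target architecture (they are architecture dependent, as noted below), Equations 2 and 3 can be evaluated directly. The following C sketch of the comparison at the heart of Algorithm 8 is illustrative only; the structure and function names are hypothetical, and the coefficients simply transcribe Table II.

    /* Unit costs (architecture dependent, assumed measured elsewhere). */
    typedef struct {
        double ac;   /* AC_cost: one indirect address computation */
        double r;    /* R_cost : one read access                  */
        double w;    /* W_cost : one write access                 */
        double op;   /* OP_cost: one arithmetic operation         */
    } unit_costs;

    /* Equation (2): cost of the most loaded process for CSR. */
    double csr_cost(int nzmaxR, unit_costs c)
    {
        return 3.0 * nzmaxR * c.ac + 4.0 * nzmaxR * c.r
             + 2.0 * nzmaxR * c.w  + 2.0 * nzmaxR * c.op;
    }

    /* Equation (3): cost of the most loaded process for CSC. */
    double csc_cost(int nzmaxC, unit_costs c)
    {
        return 4.0 * nzmaxC * c.ac + 4.0 * nzmaxC * c.r
             + 2.0 * nzmaxC * c.w  + 2.0 * nzmaxC * c.op;
    }

    /* Core test of the new decision process: pick the cheaper format. */
    const char *select_format(int nzmaxR, int nzmaxC, unit_costs c)
    {
        return (csr_cost(nzmaxR, c) < csc_cost(nzmaxC, c)) ? "CSR" : "CSC";
    }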
Therefore, we define a new decision process, presented in Algorithm 8. We note that R_cost, W_cost, AC_cost and OP_cost are architecture dependent. In this work, we only define the theoretical CSR_cost and CSC_cost; in future work, we will present how our system extracts these parameters.

B. Second case: PP < p_vect

SMVP performance does not depend only on the nnz/operation count. To validate this conclusion, we analyze the SMVP performance for MatRow matrices when PP < p_vect. In this case, each physical processor handles a set of rows (respectively columns) when the CSR (respectively CSC) format is used. For that, we use two different approaches for data distribution: MPI partitioning and the S-GBNZ algorithm.

1) MPI partitioning for data distribution: The MPI partitioning strategy consists in assigning processes to processors in turn. The first nb processes are assigned respectively to the nb nodes (one core each), and this step is repeated until all processes are placed. In our case, each row (respectively column) in the case of the CSR (respectively CSC) format is handled by one MPI process. We evaluate the performance of MatRow using the CSR and CSC formats when n = (we are limited by the number of MPI processes that can be handled by a core). Experiments were conducted on the same cluster used in Section VI-A: 72 nodes, each node with two hexa-core Intel Xeon E v3 processors.

2) S-GBNZ algorithm for data distribution: The S-GBNZ (Sorted Generalized fragmentation with Balanced Number of Non-Zeros) algorithm was proposed in [10] for SMVP partitioning on a large-scale distributed system. The objective of this algorithm is to decompose a sparse matrix into non-contiguous row blocks while balancing the number of non-zero elements between blocks (Fig. 7). The S-GBNZ approach consists of three phases:
Phase 0 is a sorting phase; it consists in sorting the rows of the input matrix by decreasing number of non-zero elements.
Phase 1 is based on the LS (List Scheduling) heuristic. In a first step, the first p rows of the input matrix are assigned respectively to the p partitions (p being the number of fragments): each row i (i = 1...p) is assigned to the i-th partition. Thereafter, the remaining rows are assigned to the fragments based on their load (number of non-zero elements): an unassigned row is always assigned to the least loaded partition, until all rows are placed.
Phase 2 is an improvement phase, an iterative heuristic that, through successive refinements, improves the previous partitioning and leads to a better balance [10].
Experiments were conducted on 10 nodes of the cluster used in Section VI-A (160 processors in total). We evaluate the performance of MatRow using the CSR and CSC formats when n = . In the case of the CSC format, the S-GBNZ algorithm handles columns instead of rows. A sketch of the first two phases is given below.

3) Results interpretation: The results presented in this part correspond to the processor with the maximum execution time, which we denote PPtmax. Table III gives the number of data accesses and operations for SMVP when PP < p_vect; these metrics are extracted from Algorithm 1 and Algorithm 2. The experimental study, using either MPI partitioning or the S-GBNZ algorithm, leads to the following conclusions:
In 50% of the cases, the process with the maximum execution time is not the one with the maximum load. This result proves that the operation count (deduced from the data parallel analysis) is not a sufficient metric to predict the SMVP performance and select the OCF.
We remark from Fig. 8 (b) and Fig. 9 (b) that the number of indirect accesses using the CSC format becomes greater than that of the CSR format from a certain value of width.
Using MPI partitioning, we notice from Fig. 8 (a) that when width < 1000, SMVP using the CSC format performs better than with the CSR format. Indeed, this is due to the lower numbers of operations and indirect accesses of the CSC format compared to CSR (see Fig. 8 (b) and 8 (c)). From width = 1000 onward, the variation in execution time is explained by the increasing numbers of operations and indirect accesses, which become greater for CSC than for CSR.
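The following C sketch illustrates phases 0 and 1 of S-GBNZ as described above; phase 2 (the iterative refinement) is omitted. It assumes a 0-based CSR row pointer array, and all names (row_load, sgbnz_phase01, part, load) are illustrative rather than taken from the implementation of [10].

    #include <stdlib.h>

    typedef struct { int row; int nnz; } row_load;

    /* Phase 0: sort rows by decreasing number of non-zeros. */
    static int cmp_desc(const void *a, const void *b)
    {
        return ((const row_load *)b)->nnz - ((const row_load *)a)->nnz;
    }

    /* Phase 1 (list scheduling): assign each row, in sorted order, to the
     * least loaded of the p partitions. part[i] receives the partition of
     * row i; the first p sorted rows naturally land on distinct partitions
     * when their non-zero counts are positive. */
    void sgbnz_phase01(int n, const int *row_ptr, int p, int *part)
    {
        row_load *rows = malloc(n * sizeof *rows);
        long *load = calloc(p, sizeof *load);

        for (int i = 0; i < n; i++) {
            rows[i].row = i;
            rows[i].nnz = row_ptr[i + 1] - row_ptr[i];
        }
        qsort(rows, n, sizeof *rows, cmp_desc);      /* phase 0 */

        for (int k = 0; k < n; k++) {                /* phase 1 */
            int best = 0;
            for (int q = 1; q < p; q++)
                if (load[q] < load[best]) best = q;
            part[rows[k].row] = best;
            load[best] += rows[k].nnz;
        }
        free(rows);
        free(load);
    }

For the CSC variant, the same scheme would be applied to the column pointer array instead of the row pointers.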
Using the S-GBNZ algorithm, we notice from Fig. 9 that when width < 500, the SMVP performances using the CSR and CSC formats are the same. This is due to the close numbers of operations and indirect accesses for both formats. From width = 1000 onward, we remark that the number of indirect accesses using the CSC format becomes slightly greater than that using CSR; however, there is no difference in performance, owing to the equal number of operations (Fig. 9 (c)). Indeed, as for the operation count, we can conclude that the indirect access count alone cannot help us make a choice.

Fig. 6. MatRow: number of read/write accesses in the most loaded process, PP = p_vect = n =
Fig. 7. Example of using the S-GBNZ algorithm for sparse matrix partitioning

Therefore, when PP < p_vect, we define more general values of CSR_cost and CSC_cost:

  CSR_cost = (3 nrPPtmax + 6 nzPPtmax) R_cost + (4 nrPPtmax + 3 nzPPtmax) W_cost + (2 nrPPtmax + 4 nzPPtmax) AC_cost + 2 nzPPtmax OP_cost    (4)

Fig. 8. Experimental results using MPI partitioning, MatRow matrices, PP = 1152 < p_vect = n =
Fig. 9. Experimental results using the S-GBNZ algorithm, MatRow matrices, PP = 160 < p_vect = n =

  CSC_cost = (3 ncPPtmax + 6 nzPPtmax) R_cost + (3 ncPPtmax + 3 nzPPtmax) W_cost + (2 ncPPtmax + 5 nzPPtmax) AC_cost + 2 nzPPtmax OP_cost    (5)

Thus, when PP < p_vect, the decision process is the one presented in Algorithm 8, where CSR_cost and CSC_cost are respectively defined by Equation 4 and Equation 5. In addition, we note that the partitioning method used can affect the performance. In future work, an analysis of the different partitioning methods can be produced and included in the data parallel model analysis.

VII. CONCLUSION

In this paper, we have studied the performance of SMVP using two different compression formats, namely CSR and CSC. The final aim of this work is to define parameters and metrics that, given a sparse matrix, a numerical method, a parallel programming model and an architecture, allow us to select the OCF for that matrix. Thus, we applied the data parallel model to parallelize the SMVP algorithm when the CSR and CSC formats are used to store the sparse matrix, and we defined metrics that allow us to make a choice among the studied formats. To validate our study, we conducted experiments on a set of pathological matrices designed to comply with the theoretical study. We concluded that the metrics extracted from the data parallel model analysis are not sufficient to make a decision. Therefore, we defined the new metrics CSR_cost and CSC_cost, involving the number of operations and the number of indirect data accesses, and we proposed a new decision process taking into account both the data parallel model analysis and the algorithm analysis. For future work, we aim for the CSR_cost and CSC_cost metrics to be automatically determined and integrated in the OCF selection process. Furthermore, an analysis of the impact of the different partitioning methods can be produced.

ACKNOWLEDGMENT

Experiments presented in this paper were carried out using the Grid 5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (see the Grid 5000 web site).

REFERENCES

[1] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed.
[2] A. Yzelman and R. Bisseling, Two-dimensional cache-oblivious sparse matrix-vector multiplication, Parallel Computing, vol. 37, December.
[3] R. Bisseling, Parallel scientific computation, a structured approach using BSP and MPI, Utrecht University, Oxford University Press, May.
[4] K. Kourtis, G. Goumas, and N. Koziris, Improving the performance of multithreaded sparse matrix-vector multiplication using index and value compression, in 37th International Conference on Parallel Processing, September.
[5] R. Shahnaz and A. Usman, Blocked-based sparse matrix-vector multiplication on distributed memory parallel computers, International Arab Journal of Information Technology, April.
[6] E. Im, K. Yelick, and R. Vuduc, Sparsity: Optimization framework for sparse matrix kernels, International Journal of High Performance Computing Applications, vol. 18, February.
[7] R. Vuduc, J. Demmel, and K. Yelick, OSKI: A library of automatically tuned sparse matrix kernels, in SciDAC 2005, June 2005.
[8] I. Mehrez, O. Hamdi-Larbi, T. Dufaud, and N.
Emad, Towards an auto-tuning system design for optimal sparse compression format selection with user expertise, in 13th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2016), November-December 2016.
[9] S. Petiton and N. Emad, A data parallel scientific computing introduction, Lecture Notes in Computer Science, vol. 1132.
[10] O. Hamdi-Larbi, Z. Mahjoub, and N. Emad, Load balancing in pre-processing of large-scale distributed sparse computation, in 8th LCI Conference on High Performance Clustered Computing, May.


More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Analyzing and Enhancing OSKI for Sparse Matrix-Vector Multiplication 1

Analyzing and Enhancing OSKI for Sparse Matrix-Vector Multiplication 1 Available on-line at www.prace-ri.eu Partnership for Advanced Computing in Europe Analyzing and Enhancing OSKI for Sparse Matrix-Vector Multiplication 1 Kadir Akbudak a, Enver Kayaaslan a, Cevdet Aykanat

More information

A cache-oblivious sparse matrix vector multiplication scheme based on the Hilbert curve

A cache-oblivious sparse matrix vector multiplication scheme based on the Hilbert curve A cache-oblivious sparse matrix vector multiplication scheme based on the Hilbert curve A. N. Yzelman and Rob H. Bisseling Abstract The sparse matrix vector (SpMV) multiplication is an important kernel

More information

Using recursion to improve performance of dense linear algebra software. Erik Elmroth Dept of Computing Science & HPC2N Umeå University, Sweden

Using recursion to improve performance of dense linear algebra software. Erik Elmroth Dept of Computing Science & HPC2N Umeå University, Sweden Using recursion to improve performance of dense linear algebra software Erik Elmroth Dept of Computing Science & HPCN Umeå University, Sweden Joint work with Fred Gustavson, Isak Jonsson & Bo Kågström

More information

BUILDING HIGH PERFORMANCE INPUT-ADAPTIVE GPU APPLICATIONS WITH NITRO

BUILDING HIGH PERFORMANCE INPUT-ADAPTIVE GPU APPLICATIONS WITH NITRO BUILDING HIGH PERFORMANCE INPUT-ADAPTIVE GPU APPLICATIONS WITH NITRO Saurav Muralidharan University of Utah nitro-tuner.github.io Disclaimers This research was funded in part by the U.S. Government. The

More information

A MRF Shape Prior for Facade Parsing with Occlusions Supplementary Material

A MRF Shape Prior for Facade Parsing with Occlusions Supplementary Material A MRF Shape Prior for Facade Parsing with Occlusions Supplementary Material Mateusz Koziński, Raghudeep Gadde, Sergey Zagoruyko, Guillaume Obozinski and Renaud Marlet Université Paris-Est, LIGM (UMR CNRS

More information

Parallel solution for finite element linear systems of. equations on workstation cluster *

Parallel solution for finite element linear systems of. equations on workstation cluster * Aug. 2009, Volume 6, No.8 (Serial No.57) Journal of Communication and Computer, ISSN 1548-7709, USA Parallel solution for finite element linear systems of equations on workstation cluster * FU Chao-jiang

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Application of GPU-Based Computing to Large Scale Finite Element Analysis of Three-Dimensional Structures

Application of GPU-Based Computing to Large Scale Finite Element Analysis of Three-Dimensional Structures Paper 6 Civil-Comp Press, 2012 Proceedings of the Eighth International Conference on Engineering Computational Technology, B.H.V. Topping, (Editor), Civil-Comp Press, Stirlingshire, Scotland Application

More information

An Asynchronous Algorithm on NetSolve Global Computing System

An Asynchronous Algorithm on NetSolve Global Computing System An Asynchronous Algorithm on NetSolve Global Computing System Nahid Emad S. A. Shahzadeh Fazeli Jack Dongarra Abstract The Explicitly Restarted Arnoldi Method (ERAM) allows to find a few eigenpairs of

More information

Fast and reliable linear system solutions on new parallel architectures

Fast and reliable linear system solutions on new parallel architectures Fast and reliable linear system solutions on new parallel architectures Marc Baboulin Université Paris-Sud Chaire Inria Saclay Île-de-France Séminaire Aristote - Ecole Polytechnique 15 mai 2013 Marc Baboulin

More information

Keywords - DWT, Lifting Scheme, DWT Processor.

Keywords - DWT, Lifting Scheme, DWT Processor. Lifting Based 2D DWT Processor for Image Compression A. F. Mulla, Dr.R. S. Patil aieshamulla@yahoo.com Abstract - Digital images play an important role both in daily life applications as well as in areas

More information

An Information Hiding Scheme Based on Pixel- Value-Ordering and Prediction-Error Expansion with Reversibility

An Information Hiding Scheme Based on Pixel- Value-Ordering and Prediction-Error Expansion with Reversibility An Information Hiding Scheme Based on Pixel- Value-Ordering Prediction-Error Expansion with Reversibility Ching-Chiuan Lin Department of Information Management Overseas Chinese University Taichung, Taiwan

More information

Particle Swarm Optimization Based Approach for Location Area Planning in Cellular Networks

Particle Swarm Optimization Based Approach for Location Area Planning in Cellular Networks International Journal of Intelligent Systems and Applications in Engineering Advanced Technology and Science ISSN:2147-67992147-6799 www.atscience.org/ijisae Original Research Paper Particle Swarm Optimization

More information

Parallel Hybrid Monte Carlo Algorithms for Matrix Computations

Parallel Hybrid Monte Carlo Algorithms for Matrix Computations Parallel Hybrid Monte Carlo Algorithms for Matrix Computations V. Alexandrov 1, E. Atanassov 2, I. Dimov 2, S.Branford 1, A. Thandavan 1 and C. Weihrauch 1 1 Department of Computer Science, University

More information

A substructure based parallel dynamic solution of large systems on homogeneous PC clusters

A substructure based parallel dynamic solution of large systems on homogeneous PC clusters CHALLENGE JOURNAL OF STRUCTURAL MECHANICS 1 (4) (2015) 156 160 A substructure based parallel dynamic solution of large systems on homogeneous PC clusters Semih Özmen, Tunç Bahçecioğlu, Özgür Kurç * Department

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

Assignment 1 (concept): Solutions

Assignment 1 (concept): Solutions CS10b Data Structures and Algorithms Due: Thursday, January 0th Assignment 1 (concept): Solutions Note, throughout Exercises 1 to 4, n denotes the input size of a problem. 1. (10%) Rank the following functions

More information

GraphBLAS Mathematics - Provisional Release 1.0 -

GraphBLAS Mathematics - Provisional Release 1.0 - GraphBLAS Mathematics - Provisional Release 1.0 - Jeremy Kepner Generated on April 26, 2017 Contents 1 Introduction: Graphs as Matrices........................... 1 1.1 Adjacency Matrix: Undirected Graphs,

More information