Understanding the performances of sparse compression formats using data parallel programming model


2017 International Conference on High Performance Computing & Simulation

Ichrak MEHREZ, Olfa HAMDI-LARBI, Thomas DUFAUD, Nahid EMAD
Université de Versailles St-Quentin, Université Paris-Saclay, Li-Parad, 78035 Versailles, France
Université de Tunis El Manar, Faculté des Sciences de Tunis, URAPOP, 2092 Tunis, Tunisia
Maison de la Simulation, 91191 Saclay, France
m ichrak@hotmail.fr

Abstract: Several applications in numerical scientific computing involve very large sparse matrices with a regular or irregular sparse structure. These matrices can be stored using special compression formats (storing only the non-zero elements) to reduce memory space and processing time. The choice of the optimal format is a critical process that involves several criteria. The general context of this work is to propose an auto-tuner system that, given a sparse matrix, a numerical method, a parallel programming model and an architecture, can automatically select the Optimal Compression Format (OCF). In this paper we study the performance of two sparse compression formats, namely CSR (Compressed Sparse Row) and CSC (Compressed Sparse Column). We propose data parallel algorithms for the Sparse Matrix Vector Product in the case of each format, and we extract a set of parameters that help us select the more suitable compression format for a given sparse matrix.

Keywords: data parallel model; compression formats; sparse matrix; SMVP

I. INTRODUCTION

Several scientific applications such as fluid dynamics, circuit simulation, optimization problems, electromagnetics, image processing, etc. lead to sparse linear algebra problems. In most of these applications we handle large sparse matrices. A sparse matrix is generally defined as a matrix with a small number of non-zero elements. However, a matrix can be termed sparse whenever special techniques can be utilized to take advantage of the large number of zero elements and their locations [1]. To minimize memory space and optimize computational performance, users have to exploit the nature of these matrices by using particular data storage structures (storing only the non-zero elements). The chosen format highly depends on the sparse matrix structure, which may be either regular or irregular (according to the locations of its non-zero elements). A sparse matrix has a regular structure if all non-zero elements are grouped in the matrix according to a particular pattern (e.g. triangular, diagonal, band matrices, etc.). Conversely, a sparse matrix has an irregular (also called general) structure if its non-zero elements are scattered in the matrix (e.g. variable banded, random, etc.). For each sparse matrix structure, one or more dedicated compressed storage formats are known in the literature.

The chosen compression format may affect the performance of the sparse algorithm; thus, the choice of the most appropriate format is a fairly critical process that must be well studied. Thereby, the study of different compressed formats and their impact on sparse computations is the subject of many research projects. Some works focus on optimizing existing sparse compression formats or proposing new ones [2], [3], [4], [5]. These studies were carried out for a given architecture and in order to optimize a specific computing kernel (the Sparse Matrix Vector Product in most cases).
In [6], the authors propose the SPARSITY system, which is designed to help users obtain tuned kernels for sparse matrix computations without having to know the details of their machine memory hierarchy or the way the structures of their matrices will be mapped onto this hierarchy. In [7], Vuduc et al. proposed the OSKI (Optimized Sparse Kernel Interface) library, a collection of low-level primitives providing auto-tuned computing kernels for sparse matrices. However, the choice of the sparse matrix compression format in these libraries is made statically. We notice that current solutions generally focus on optimizing only one criterion, such as minimizing storage space or accelerating sparse computations.

In [8], we proposed the design of an auto-tuner system that, given a sparse matrix, a numerical method, a parallel programming model and an architecture, can automatically select the Optimal Compression Format (OCF) and propose its associated implementation. For this, we adopted a component-oriented approach to define an extensible system. Thus, the system can be completed progressively, without calling into question its basic principles. We aim for the proposed solution to reduce time, memory, communication and energy complexities (Fig. 1). Specifically, we proposed a first step towards realizing our design: we described the modeling of an auto-tuner system and pointed out that user expertise is one of the main advantages of our system. Indeed, the user (expert) can guide the decision process by providing information on the application domain as well as the programming and execution environment.

Fig. 1. Utilization component

Thus, we proposed data parallel algorithms for the Horner scheme when the CSR, CSC, COO and ELL formats are used. We studied two levels of data representation: scalar data and vector data. For each level, we defined metrics that allow us to make a choice among the studied formats.

In this paper we limit our study to two representative formats, namely the CSR and CSC formats. We start with a theoretical study where we apply a data parallel model to SMVP computing for both formats. To validate our study, we conduct experiments on a set of pathological matrices designed to comply with the theoretical study. We notice that the metrics deduced from the data parallel model are not sufficient to make a decision. Therefore, we propose a new decision process taking into account both the data parallel model analysis and an algorithm analysis.

This paper is organized as follows: in Section II, we present the data parallel programming model. In Section III, we present sparse compression formats. Section IV is devoted to the Sparse Matrix Vector Product computing kernel. In Section V we present the theoretical study. Section VI is devoted to the presentation and interpretation of our experimental results. Finally, in Section VII, we present concluding remarks and perspectives.

II. PARALLEL PROGRAMMING MODELS

A parallel programming model is linked to the expression of a parallel algorithm. It describes its tasks, their interactions and their effects on the computation state. There are two main parallel programming models: task (or control) parallelism and data parallelism. In the first one, the parallelism results from the application of different tasks to the data, whereas a data parallel algorithm is defined by a unique control flow of instructions, each of them executed on a set of data. In this paper, we present our study using the data parallel programming model. Indeed, we apply the same computing kernel to different matrix portions. In a data parallel algorithm, data are structured onto a virtual geometry representing the layout of virtual processors. In [9], some parameters are presented to evaluate and compare data parallel algorithms. Let:
  p be the number of virtual processors in the virtual geometry (1D, 2D, ...),
  alpha_ops be the number of virtual processors concerned by a data parallel operation ops,
  cx be the number of data parallel operations (other than communications).
The computation ratio for a data parallel operation ops is defined by alpha_ops / p; the average computation ratio is then:

  Phi = (1/cx) * Sum_{ops=1..cx} (alpha_ops / p)    (1)

The best data parallel algorithm to solve a scientific problem is the one with the smallest cx (the numerical aspect being supposed correct). Among such algorithms, the one that maximizes Phi is often chosen since it generates fewer communications [9].
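As a quick illustration of Equation 1 (the numbers here are chosen purely for exposition and are not taken from the paper): with a geometry of p = 4 virtual processors and cx = 2 data parallel operations involving respectively alpha_1 = 4 and alpha_2 = 2 processors, we get Phi = (1/2) * (4/4 + 2/4) = 0.75; an algorithm that keeps more virtual processors active per operation yields a Phi closer to 1.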

III. SPARSE COMPRESSION FORMATS

Several formats have been proposed to optimize sparse matrix storage. Due to its simplicity and effectiveness, the CSR (Compressed Sparse Row) format remains the most used [1]. To store an n x n sparse matrix A with nnz non-zero elements, the CSR format uses three arrays: val, of size nnz, to store the non-zero elements of A row-wise (from row 1 to row n); col_ind, of size nnz, to store the column index of each element stored in val; and row_ptr, of size n + 1, to store pointers to the head of each row in the arrays val and col_ind, with row_ptr(n) = nnz (Fig. 2).

Fig. 2. CSR format

In contrast to CSR, the Compressed Sparse Column (CSC) format uses the array val to store the non-zero elements of A column-wise [1]. By analogy, it uses two other arrays: row_ind, to store the row index of each element stored in val, and col_ptr, of size n + 1, to store pointers to the head of each column in the arrays val and row_ind, with col_ptr(n) = nnz (Fig. 3).

Fig. 3. CSC format

IV. SPARSE MATRIX VECTOR PRODUCT KERNEL

In this paper, we study the Sparse Matrix Vector Product (SMVP) since it is an important computing kernel in many scientific applications that involve iterative operations. Let A be an (n x n) sparse matrix with nnz non-zero elements. Our aim is to compute Y = A * X, where X is an input dense array with n elements and Y the result array. The SMVP algorithm depends on the compression format used for the sparse matrix. We present here two different SMVP algorithms corresponding to the CSR and CSC compression formats. To improve the performance of SMVP, two manual optimization techniques are applied:
  Scalar replacement: it consists of replacing an array element (indirect memory access) with a simple scalar variable in order to make access faster. This optimization is recommended when the instruction in question is within a loop.
  Removing invariant operations: it aims to avoid unnecessary work by hoisting loop-invariant computations out of the loop.
Thus, we evaluated two different optimized implementations of the SMVP kernel, represented by Algorithm 1 and Algorithm 2 and corresponding to SMVP when the matrix is stored, respectively, in the CSR and CSC formats.

Algorithm 1: SMVP for CSR format
  for i = 1 to n do
    s = 0
    e = row_ptr[i + 1] - 1
    for j = row_ptr[i] to e do
      indX = col_ind[j]
      s = s + val[j] * X[indX]
    end
    Y[i] = s
  end

Algorithm 2: SMVP for CSC format
  for i = 1 to n do
    x = X[i]
    e = col_ptr[i + 1] - 1
    for j = col_ptr[i] to e do
      indR = row_ind[j]
      Y[indR] = Y[indR] + val[j] * x
    end
  end

V. THEORETICAL STUDY

In this section we present the compression format study using the data parallel programming model. We start with a theoretical study where, given a matrix, we apply the data parallel model to each proposed format (CSR and CSC). We organize a virtual geometry, on which we map and distribute data according to the compressed format used. Based on the given target architecture and the metrics that we will determine, we can then, theoretically, select the OCF. In a second step, we conduct an experimental study where we compute the SMVP kernel for each compression format. This step helps us analyze the relevance of the theoretical study. We set the following hypotheses:
  The architecture is fixed (a cluster of multicore CPUs).
  We choose not to consider communication at this point, even though it has a significant impact on the OCF selection process; it will be incorporated into our auto-tuner system in future work.
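For concreteness, the following is a minimal C rendering of Algorithm 1 and Algorithm 2, with the two manual optimizations applied. It assumes 0-based indexing (so row_ptr and col_ptr have n + 1 entries with row_ptr[n] = col_ptr[n] = nnz), and the function names are illustrative only; this is a sketch, not the exact code used in our experiments.

    /* SMVP Y = A*X with A stored in CSR (0-based indexing).
     * Scalar replacement: the running sum is kept in the scalar s.
     * Invariant removal: the row end pointer is read once per row. */
    void smvp_csr(int n, const double *val, const int *col_ind,
                  const int *row_ptr, const double *X, double *Y)
    {
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            int e = row_ptr[i + 1];          /* loop invariant hoisted */
            for (int j = row_ptr[i]; j < e; j++) {
                int indX = col_ind[j];       /* indirect access */
                s += val[j] * X[indX];
            }
            Y[i] = s;
        }
    }

    /* SMVP Y = A*X with A stored in CSC (0-based indexing).
     * Scalar replacement: X[i] is read once per column.
     * Y must be zero-initialized by the caller. */
    void smvp_csc(int n, const double *val, const int *row_ind,
                  const int *col_ptr, const double *X, double *Y)
    {
        for (int i = 0; i < n; i++) {
            double xi = X[i];                /* loop invariant hoisted */
            int e = col_ptr[i + 1];
            for (int j = col_ptr[i]; j < e; j++) {
                int indR = row_ind[j];       /* indirect access */
                Y[indR] += val[j] * xi;
            }
        }
    }

Note that the CSR kernel writes each Y[i] exactly once, whereas the CSC kernel accumulates into Y and therefore also reads it, which is one source of the different access counts discussed later.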
In the following, we present how we extract the metrics relative to the data parallel programming model in the case of SMVP in order to select the OCF (between CSR and CSC). In our previous work [8], we studied two different models of data representation:
Scalar data parallel operation: we assumed that each non-zero element (scalar data) is mapped to a virtual processor. Therefore, we have a 1D virtual geometry G of p virtual processors.

Vector data parallel operation: we noticed that SMVP involves two main vector operations, the dot product (Algorithm 3) and the saxpy operation (Algorithm 4). Thus, we studied a second level of data parallel operation where each vector is mapped to a virtual processor.

Algorithm 3: Dot product alpha = dot(X, Y) = X^T Y
  alpha = 0
  for i = 1 to n do
    alpha = alpha + X(i) * Y(i)
  end

Algorithm 4: saxpy operation Y = saxpy(alpha, X, Y) = alpha X + Y
  for i = 1 to n do
    Y(i) = alpha * X(i) + Y(i)
  end

In [8], we noticed that we cannot make a choice between the CSR and CSC formats when we consider data as scalars. Indeed, the metrics deduced from [9] were not sufficient to compare the CSR and CSC formats. Thus, in this paper, we limit our study to the second level of data parallel operation, where we handle vectors instead of scalars. We organize a 1D virtual geometry G_vect of size p_vect = n; each vector is assigned to one virtual processor. Since we do not take communication into account in this work, the construction of the final result Y is not considered.

To implement the CSR format, we map the vector X and the sparse matrix rows onto the virtual geometry. First, each virtual processor i initializes the i-th component of Y with the i-th value of X. We compute the dot product (Algorithm 3) of each row i of the matrix (val(i, :)) with the vector X using only one data parallel vector operation, with complexity O(nzmaxR), where nzmaxR is the maximum number of non-zero elements per row. The i-th virtual processor of the geometry stores the i-th component of A * X.

To implement the CSC format, we map the vector Y and the sparse matrix columns onto the virtual geometry. First, each virtual processor j initializes the j-th component of Y with the j-th value of X. Each virtual processor j then updates the vector Y: it performs a saxpy operation (Algorithm 4) involving the j-th column of the matrix (val(:, j)) and the j-th element of the vector X (X(j)). Let nzmaxC be the maximum number of non-zero elements per column; the worst-case complexity of this operation is O(nzmaxC).

Let beta_Vops be the number of virtual processors concerned by a data parallel vector operation Vops, and cx_vect be the number of data parallel vector operations (other than communications). For SMVP, only one type of data parallel vector operation is involved: multiplications (since we do not consider the construction of the final result Y). Table I gives the vector data parallel metrics for the CSC and CSR formats. The number of data parallel vector operations cx_vect is the same for both formats. However, the cost cost_Vops of the multiplication vector operation is different: it depends on either nzmaxR or nzmaxC (since, theoretically, the makespan depends on the execution time of the most loaded process). We can conclude that the best performance is obtained when we use the format that minimizes the vector operation cost, i.e. nzmaxR or nzmaxC. We remark, from Table I, that the time complexity (cost_Vops) is smaller in the case of CSR (respectively CSC) when nzmaxR < nzmaxC (respectively nzmaxC < nzmaxR).

To conform to this study, we build a pathological case composed of two types of matrices: MatRow (respectively MatCol) with width contiguous dense rows (respectively columns). Figure 4 illustrates examples of pathological matrices. These matrices have the particularity of satisfying the bound criterion (nzmaxR and nzmaxC).
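For reference, the two bound quantities nzmaxR and nzmaxC can be read directly off the compressed pointer arrays. The helpers below are a minimal C sketch (hypothetical function names, 0-based indexing); they are not part of the system described in [8].

    /* Maximum number of non-zeros per row, from the CSR row pointers. */
    int nzmax_row(int n, const int *row_ptr)
    {
        int m = 0;
        for (int i = 0; i < n; i++) {
            int len = row_ptr[i + 1] - row_ptr[i];
            if (len > m) m = len;
        }
        return m;
    }

    /* Maximum number of non-zeros per column, from the CSC column pointers. */
    int nzmax_col(int n, const int *col_ptr)
    {
        int m = 0;
        for (int j = 0; j < n; j++) {
            int len = col_ptr[j + 1] - col_ptr[j];
            if (len > m) m = len;
        }
        return m;
    }

According to Table I, the format minimizing the corresponding bound is the one favored by the vector data parallel metrics.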
Indeed, when the matrix has a structure close to that of a pathological matrix, we directly choose the compression format based on the data parallel metrics. Otherwise, our system performs a self-update and the decision process is based on a rules database, i.e. we compute the SMVP for each compression format and select the one that minimizes the computing time. Algorithm 5 describes the decision process deduced from the theoretical study.

TABLE I. VECTOR DATA PARALLEL METRICS
              CSR      CSC
  cx_vect     1        1
  p_vect      n        n
  beta_Vops   n        n
  cost_Vops   nzmaxR   nzmaxC

Fig. 4. Example of pathological matrices used in our tests

VI. PERFORMANCE EVALUATION

Let PP be the number of physical processors. In this section, we evaluate the performance of SMVP when the pathological matrix is stored in CSC or CSR format. Two different cases are studied: PP = p_vect and PP < p_vect.

TABLE II. DATA ACCESS WHEN PP = p_vect
        #Indirect Access   #Read      #Write     #Operations
  CSR   3 nzmaxR           4 nzmaxR   2 nzmaxR   2 nzmaxR
  CSC   4 nzmaxC           4 nzmaxC   2 nzmaxC   2 nzmaxC

Fig. 5. Experimental results, N = 1152, PP = p_vect

TABLE III. DATA ACCESS WHEN PP < p_vect
        #Read                      #Write                     #Indirect Access           #Operations
  CSR   3 nrPPtmax + 6 nzPPtmax    4 nrPPtmax + 3 nzPPtmax    2 nrPPtmax + 4 nzPPtmax    2 nzPPtmax
  CSC   3 ncPPtmax + 6 nzPPtmax    3 ncPPtmax + 3 nzPPtmax    2 ncPPtmax + 5 nzPPtmax    2 nzPPtmax
  (nzPPtmax: number of non-zero elements in PPtmax; nrPPtmax: number of rows in PPtmax; ncPPtmax: number of columns in PPtmax)

A. First case: PP = p_vect

In this first case, we evaluate our solution using a number of physical processors PP equal to the size of the virtual geometry p_vect. We conduct our experiments on a multicore cluster of the Grid5000 platform. The cluster consists of 72 nodes, each node with two hexa-core Intel Xeon E v3 processors (PP = 1152 cores in total), which is the maximum number of homogeneous processors available within the Grid5000 platform. Consequently, we run our tests using pathological matrices with n = PP = p_vect = 1152. The experiments were conducted by measuring the total execution time of the SMVP kernel for each matrix. We note that when PP = p_vect the total execution time is bounded by (theoretically, equal to) the execution time of the most loaded processor. For the CSR (respectively CSC) format, the most loaded process is the one handling nzmaxR (respectively nzmaxC) elements. Based on the theoretical study (Algorithm 5), for matrices like MatRow (respectively MatCol) the CSC format (respectively CSR format) is optimal.

Figure 5 (a) (respectively (b)) illustrates the execution time of SMVP for MatRow (respectively MatCol) when the matrix A is stored in CSR and then in CSC. For MatCol, for any value of width, the experimental results validate the theoretical study. However, we note that in the case of MatRow the experimental and theoretical results are coherent only for width values less than 750. When the MatRow matrix is stored in the CSC format, the number of operations handled by the most loaded processor increases with width; however, it does not reach the number obtained when the same matrix is stored in the CSR format (Fig. 5 (f)).

Algorithm 5: Decision process from the theoretical study
  Extract_Matrix_Characteristics(A)
  if IsClose(Structure(A), Structure(pathological matrix)) then
    if nzmaxR < nzmaxC then
      format = CSR
    else
      format = CSC
    end
  else
    Update_tests_database()
    Update_rules_database()
    format = Select_format(A)
  end

Algorithm 6: CSR format: dot product handled by the most loaded processor when PP = p_vect
  for j = 0 to nzmaxR - 1 do
    indX = col_ind[j]
    s = s + val[j] * X[indX]
  end

Algorithm 7: CSC format: saxpy operation handled by the most loaded processor when PP = p_vect
  for j = 0 to nzmaxC - 1 do
    indR = row_ind[j]
    Y[indR] = Y[indR] + val[j] * x
  end

Algorithm 8: Decision process 2
  Extract_Matrix_Characteristics(A)
  if IsClose(Structure(A), Structure(pathological matrix)) then
    if CSR_cost < CSC_cost then
      format = CSR
    else
      format = CSC
    end
  else
    Update_tests_database()
    Update_rules_database()
    format = Select_format(A)
  end

Nevertheless, from a certain value of width (750), CSR provides better performance than CSC (Fig. 5 (a)). Thus, we can conclude that the metrics given by the data parallel model analysis alone do not lead to the right choice in certain cases. Such cases can be prevented by an analysis of the algorithm. In fact, when analyzing Algorithm 6 and Algorithm 7, we notice that another factor can impact the execution time: the number of data reads and writes, which depends on nzmaxR and nzmaxC. Table II presents the number of data accesses and operations for SMVP when the CSR and CSC formats are used. These values are associated with the most loaded process, since the total execution time is bounded by its time. We extract these metrics from Algorithm 6 and Algorithm 7. The most loaded process in the case of CSR (respectively CSC) handles one row (respectively column) with nzmaxR (respectively nzmaxC) elements. An indirect access corresponds to an array element access (e.g. val[j]).

We notice from Fig. 6 that the number of data reads (respectively writes) using CSC is, for any value of width, less than that using CSR. Although read/write data accesses influence the total execution time, they cannot explain the results presented in Fig. 5 (a). However, the number of indirect accesses to array elements can be the origin of this result. In fact, each indirect access requires additional time for address computation. In the rest of the paper, Indirect Access denotes an array element address computation (read or write). Indeed, we remark from Fig. 5 (c) that the number of indirect accesses using the CSC format becomes greater than that of the CSR format from width = 850. Notice that this is not the same value (750) detected in Fig. 5 (a). Thus, we conclude that the numbers of operations, read/write accesses and indirect accesses together affect the SMVP performance. Since each format has its own cost, we define:

  CSR_cost = 3 nzmaxR AC_cost + 4 nzmaxR R_cost + 2 nzmaxR W_cost + 2 nzmaxR OP_cost    (2)

  CSC_cost = 4 nzmaxC AC_cost + 4 nzmaxC R_cost + 2 nzmaxC W_cost + 2 nzmaxC OP_cost    (3)

where AC_cost is the cost of an array element address computation (indirect access), R_cost is the cost of a read access, W_cost is the cost of a write access, and OP_cost is the cost of an arithmetic operation (multiplication or addition). The weights of each component (AC_cost, R_cost, W_cost and OP_cost) in Equations 2 and 3 are determined from Table II.
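Assuming the four unit costs have been measured for a target architecture (they are architecture dependent, as noted below), Equations 2 and 3 can be evaluated directly. The following C sketch of the comparison at the heart of Algorithm 8 is illustrative only; the structure and function names are hypothetical, and the coefficients simply transcribe Table II.

    /* Unit costs (architecture dependent, assumed measured elsewhere). */
    typedef struct {
        double ac;   /* AC_cost: one indirect address computation */
        double r;    /* R_cost : one read access                  */
        double w;    /* W_cost : one write access                 */
        double op;   /* OP_cost: one arithmetic operation         */
    } unit_costs;

    /* Equation (2): cost of the most loaded process for CSR. */
    double csr_cost(int nzmaxR, unit_costs c)
    {
        return 3.0 * nzmaxR * c.ac + 4.0 * nzmaxR * c.r
             + 2.0 * nzmaxR * c.w  + 2.0 * nzmaxR * c.op;
    }

    /* Equation (3): cost of the most loaded process for CSC. */
    double csc_cost(int nzmaxC, unit_costs c)
    {
        return 4.0 * nzmaxC * c.ac + 4.0 * nzmaxC * c.r
             + 2.0 * nzmaxC * c.w  + 2.0 * nzmaxC * c.op;
    }

    /* Core test of the new decision process: pick the cheaper format. */
    const char *select_format(int nzmaxR, int nzmaxC, unit_costs c)
    {
        return (csr_cost(nzmaxR, c) < csc_cost(nzmaxC, c)) ? "CSR" : "CSC";
    }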
Therefore, we define a new decision process, presented in Algorithm 8. We note that R_cost, W_cost, AC_cost and OP_cost are architecture dependent. In this work, we only define the theoretical CSR_cost and CSC_cost; in future work, we will present how our system extracts these parameters.

B. Second case: PP < p_vect

SMVP performance does not depend only on the nnz/operation count. To validate this conclusion, we analyze the SMVP performance for MatRow matrices when PP < p_vect. In this case, each physical processor handles a set of rows (respectively columns) when the CSR (respectively CSC) format is used. For that, we use two different approaches for data distribution: MPI partitioning and the S-GBNZ algorithm.

1) MPI partitioning for data distribution: The MPI partitioning strategy consists in assigning processes to processors in turn. The first nb processes are assigned respectively to the nb nodes (one core each), and this step is repeated until all processes are placed. In our case, each row (respectively column) in the case of the CSR (respectively CSC) format is handled by one MPI process. We evaluate the performance of MatRow using the CSR and CSC formats when n = (we are limited by the number of MPI processes that can be handled by a core). Experiments were conducted on the same cluster used in Section VI-A: 72 nodes, each node with two hexa-core Intel Xeon E v3 processors.

2) S-GBNZ algorithm for data distribution: The S-GBNZ (Sorted Generalized fragmentation with Balanced Number of Non-Zeros) algorithm was proposed in [10] for SMVP partitioning on a large-scale distributed system. The objective of this algorithm is to decompose a sparse matrix into non-contiguous row blocks while balancing the number of non-zero elements between blocks (Fig. 7). The S-GBNZ approach consists of three phases:
Phase 0 is a sorting phase; it consists in sorting the rows of the input matrix by decreasing number of non-zero elements.
Phase 1 is based on the LS (List Scheduling) heuristic. In a first step, the first p rows of the input matrix are assigned respectively to the p partitions (p being the number of fragments): each row i (i = 1...p) is assigned to the i-th partition. Thereafter, the remaining rows are assigned to the fragments based on their load (number of non-zero elements): an unassigned row is always assigned to the least loaded partition, until all rows are placed.
Phase 2 is an improvement phase, an iterative heuristic that, through successive refinements, improves the previous partitioning and leads to a better balance [10].
Experiments were conducted on 10 nodes of the cluster used in Section VI-A (160 processors in total). We evaluate the performance of MatRow using the CSR and CSC formats when n = . In the case of the CSC format, the S-GBNZ algorithm handles columns instead of rows. A sketch of the first two phases is given below.

3) Results interpretation: The results presented in this part correspond to the processor with the maximum execution time, which we denote PPtmax. Table III gives the number of data accesses and operations for SMVP when PP < p_vect; these metrics are extracted from Algorithm 1 and Algorithm 2. The experimental study, using either MPI partitioning or the S-GBNZ algorithm, leads to the following conclusions:
In 50% of the cases, the process with the maximum execution time is not the one with the maximum load. This result proves that the operation count (deduced from the data parallel analysis) is not a sufficient metric to predict the SMVP performance and select the OCF.
We remark from Fig. 8 (b) and Fig. 9 (b) that the number of indirect accesses using the CSC format becomes greater than that of the CSR format from a certain value of width.
Using MPI partitioning, we notice from Fig. 8 (a) that when width < 1000, SMVP using the CSC format performs better than with the CSR format. Indeed, this is due to the lower numbers of operations and indirect accesses of the CSC format compared to CSR (see Fig. 8 (b) and 8 (c)). From width = 1000 onward, the variation in execution time is explained by the increasing numbers of operations and indirect accesses, which become greater for CSC than for CSR.
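The following C sketch illustrates phases 0 and 1 of S-GBNZ as described above; phase 2 (the iterative refinement) is omitted. It assumes a 0-based CSR row pointer array, and all names (row_load, sgbnz_phase01, part, load) are illustrative rather than taken from the implementation of [10].

    #include <stdlib.h>

    typedef struct { int row; int nnz; } row_load;

    /* Phase 0: sort rows by decreasing number of non-zeros. */
    static int cmp_desc(const void *a, const void *b)
    {
        return ((const row_load *)b)->nnz - ((const row_load *)a)->nnz;
    }

    /* Phase 1 (list scheduling): assign each row, in sorted order, to the
     * least loaded of the p partitions. part[i] receives the partition of
     * row i; the first p sorted rows naturally land on distinct partitions
     * when their non-zero counts are positive. */
    void sgbnz_phase01(int n, const int *row_ptr, int p, int *part)
    {
        row_load *rows = malloc(n * sizeof *rows);
        long *load = calloc(p, sizeof *load);

        for (int i = 0; i < n; i++) {
            rows[i].row = i;
            rows[i].nnz = row_ptr[i + 1] - row_ptr[i];
        }
        qsort(rows, n, sizeof *rows, cmp_desc);      /* phase 0 */

        for (int k = 0; k < n; k++) {                /* phase 1 */
            int best = 0;
            for (int q = 1; q < p; q++)
                if (load[q] < load[best]) best = q;
            part[rows[k].row] = best;
            load[best] += rows[k].nnz;
        }
        free(rows);
        free(load);
    }

For the CSC variant, the same scheme would be applied to the column pointer array instead of the row pointers.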
Using the S-GBNZ algorithm, we notice from Fig. 9 that when width < 500, the SMVP performances using the CSR and CSC formats are the same. This is due to the close numbers of operations and indirect accesses for both formats. From width = 1000 onward, we remark that the number of indirect accesses using the CSC format becomes slightly greater than that using CSR; however, there is no difference in performance, owing to the equal number of operations (Fig. 9 (c)). Indeed, as for the operation count, we can conclude that the indirect access count alone cannot help us make a choice.

Fig. 6. MatRow: number of read/write accesses in the most loaded process, PP = p_vect = n =
Fig. 7. Example of using the S-GBNZ algorithm for sparse matrix partitioning

Therefore, when PP < p_vect, we define more general values of CSR_cost and CSC_cost:

  CSR_cost = (3 nrPPtmax + 6 nzPPtmax) R_cost + (4 nrPPtmax + 3 nzPPtmax) W_cost + (2 nrPPtmax + 4 nzPPtmax) AC_cost + 2 nzPPtmax OP_cost    (4)

Fig. 8. Experimental results using MPI partitioning, MatRow matrices, PP = 1152 < p_vect = n =
Fig. 9. Experimental results using the S-GBNZ algorithm, MatRow matrices, PP = 160 < p_vect = n =

  CSC_cost = (3 ncPPtmax + 6 nzPPtmax) R_cost + (3 ncPPtmax + 3 nzPPtmax) W_cost + (2 ncPPtmax + 5 nzPPtmax) AC_cost + 2 nzPPtmax OP_cost    (5)

Thus, when PP < p_vect, the decision process is the one presented in Algorithm 8, where CSR_cost and CSC_cost are respectively defined by Equation 4 and Equation 5. In addition, we note that the partitioning method used can affect the performance. In future work, an analysis of the different partitioning methods can be produced and included in the data parallel model analysis.

VII. CONCLUSION

In this paper, we have studied the performance of SMVP using two different compression formats, namely CSR and CSC. The final aim of this work is to define parameters and metrics that, given a sparse matrix, a numerical method, a parallel programming model and an architecture, allow us to select the OCF for that matrix. Thus, we applied the data parallel model to parallelize the SMVP algorithm when the CSR and CSC formats are used to store the sparse matrix, and we defined metrics that allow us to make a choice among the studied formats. To validate our study, we conducted experiments on a set of pathological matrices designed to comply with the theoretical study. We concluded that the metrics extracted from the data parallel model analysis are not sufficient to make a decision. Therefore, we defined the new metrics CSR_cost and CSC_cost, involving the number of operations and the number of indirect data accesses, and we proposed a new decision process taking into account both the data parallel model analysis and the algorithm analysis. For future work, we aim for the CSR_cost and CSC_cost metrics to be automatically determined and integrated in the OCF selection process. Furthermore, an analysis of the impact of the different partitioning methods can be produced.

ACKNOWLEDGMENT

Experiments presented in this paper were carried out using the Grid 5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (see the Grid 5000 web site).

REFERENCES

[1] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed.
[2] A. Yzelman and R. Bisseling, Two-dimensional cache-oblivious sparse matrix-vector multiplication, Parallel Computing, vol. 37, December.
[3] R. Bisseling, Parallel scientific computation, a structured approach using BSP and MPI, Utrecht University, Oxford University Press, May.
[4] K. Kourtis, G. Goumas, and N. Koziris, Improving the performance of multithreaded sparse matrix-vector multiplication using index and value compression, in 37th International Conference on Parallel Processing, September.
[5] R. Shahnaz and A. Usman, Blocked-based sparse matrix-vector multiplication on distributed memory parallel computers, International Arab Journal of Information Technology, April.
[6] E. Im, K. Yelick, and R. Vuduc, Sparsity: Optimization framework for sparse matrix kernels, International Journal of High Performance Computing Applications, vol. 18, February.
[7] R. Vuduc, J. Demmel, and K. Yelick, OSKI: A library of automatically tuned sparse matrix kernels, in SciDAC 2005, June 2005.
[8] I. Mehrez, O. Hamdi-Larbi, T. Dufaud, and N.
Emad, Towards an auto-tuning system design for optimal sparse compression format selection with user expertise, in 13th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2016), November-December 2016.
[9] S. Petiton and N. Emad, A data parallel scientific computing introduction, Lecture Notes in Computer Science, vol. 1132.
[10] O. Hamdi-Larbi, Z. Mahjoub, and N. Emad, Load balancing in pre-processing of large-scale distributed sparse computation, in 8th LCI Conference on High Performance Clustered Computing, May.


More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Analyzing and Enhancing OSKI for Sparse Matrix-Vector Multiplication 1

Analyzing and Enhancing OSKI for Sparse Matrix-Vector Multiplication 1 Available on-line at www.prace-ri.eu Partnership for Advanced Computing in Europe Analyzing and Enhancing OSKI for Sparse Matrix-Vector Multiplication 1 Kadir Akbudak a, Enver Kayaaslan a, Cevdet Aykanat

More information

A cache-oblivious sparse matrix vector multiplication scheme based on the Hilbert curve

A cache-oblivious sparse matrix vector multiplication scheme based on the Hilbert curve A cache-oblivious sparse matrix vector multiplication scheme based on the Hilbert curve A. N. Yzelman and Rob H. Bisseling Abstract The sparse matrix vector (SpMV) multiplication is an important kernel

More information

Using recursion to improve performance of dense linear algebra software. Erik Elmroth Dept of Computing Science & HPC2N Umeå University, Sweden

Using recursion to improve performance of dense linear algebra software. Erik Elmroth Dept of Computing Science & HPC2N Umeå University, Sweden Using recursion to improve performance of dense linear algebra software Erik Elmroth Dept of Computing Science & HPCN Umeå University, Sweden Joint work with Fred Gustavson, Isak Jonsson & Bo Kågström

More information

BUILDING HIGH PERFORMANCE INPUT-ADAPTIVE GPU APPLICATIONS WITH NITRO

BUILDING HIGH PERFORMANCE INPUT-ADAPTIVE GPU APPLICATIONS WITH NITRO BUILDING HIGH PERFORMANCE INPUT-ADAPTIVE GPU APPLICATIONS WITH NITRO Saurav Muralidharan University of Utah nitro-tuner.github.io Disclaimers This research was funded in part by the U.S. Government. The

More information

A MRF Shape Prior for Facade Parsing with Occlusions Supplementary Material

A MRF Shape Prior for Facade Parsing with Occlusions Supplementary Material A MRF Shape Prior for Facade Parsing with Occlusions Supplementary Material Mateusz Koziński, Raghudeep Gadde, Sergey Zagoruyko, Guillaume Obozinski and Renaud Marlet Université Paris-Est, LIGM (UMR CNRS

More information

Parallel solution for finite element linear systems of. equations on workstation cluster *

Parallel solution for finite element linear systems of. equations on workstation cluster * Aug. 2009, Volume 6, No.8 (Serial No.57) Journal of Communication and Computer, ISSN 1548-7709, USA Parallel solution for finite element linear systems of equations on workstation cluster * FU Chao-jiang

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Application of GPU-Based Computing to Large Scale Finite Element Analysis of Three-Dimensional Structures

Application of GPU-Based Computing to Large Scale Finite Element Analysis of Three-Dimensional Structures Paper 6 Civil-Comp Press, 2012 Proceedings of the Eighth International Conference on Engineering Computational Technology, B.H.V. Topping, (Editor), Civil-Comp Press, Stirlingshire, Scotland Application

More information

An Asynchronous Algorithm on NetSolve Global Computing System

An Asynchronous Algorithm on NetSolve Global Computing System An Asynchronous Algorithm on NetSolve Global Computing System Nahid Emad S. A. Shahzadeh Fazeli Jack Dongarra Abstract The Explicitly Restarted Arnoldi Method (ERAM) allows to find a few eigenpairs of

More information

Fast and reliable linear system solutions on new parallel architectures

Fast and reliable linear system solutions on new parallel architectures Fast and reliable linear system solutions on new parallel architectures Marc Baboulin Université Paris-Sud Chaire Inria Saclay Île-de-France Séminaire Aristote - Ecole Polytechnique 15 mai 2013 Marc Baboulin

More information

Keywords - DWT, Lifting Scheme, DWT Processor.

Keywords - DWT, Lifting Scheme, DWT Processor. Lifting Based 2D DWT Processor for Image Compression A. F. Mulla, Dr.R. S. Patil aieshamulla@yahoo.com Abstract - Digital images play an important role both in daily life applications as well as in areas

More information

An Information Hiding Scheme Based on Pixel- Value-Ordering and Prediction-Error Expansion with Reversibility

An Information Hiding Scheme Based on Pixel- Value-Ordering and Prediction-Error Expansion with Reversibility An Information Hiding Scheme Based on Pixel- Value-Ordering Prediction-Error Expansion with Reversibility Ching-Chiuan Lin Department of Information Management Overseas Chinese University Taichung, Taiwan

More information

Particle Swarm Optimization Based Approach for Location Area Planning in Cellular Networks

Particle Swarm Optimization Based Approach for Location Area Planning in Cellular Networks International Journal of Intelligent Systems and Applications in Engineering Advanced Technology and Science ISSN:2147-67992147-6799 www.atscience.org/ijisae Original Research Paper Particle Swarm Optimization

More information

Parallel Hybrid Monte Carlo Algorithms for Matrix Computations

Parallel Hybrid Monte Carlo Algorithms for Matrix Computations Parallel Hybrid Monte Carlo Algorithms for Matrix Computations V. Alexandrov 1, E. Atanassov 2, I. Dimov 2, S.Branford 1, A. Thandavan 1 and C. Weihrauch 1 1 Department of Computer Science, University

More information

A substructure based parallel dynamic solution of large systems on homogeneous PC clusters

A substructure based parallel dynamic solution of large systems on homogeneous PC clusters CHALLENGE JOURNAL OF STRUCTURAL MECHANICS 1 (4) (2015) 156 160 A substructure based parallel dynamic solution of large systems on homogeneous PC clusters Semih Özmen, Tunç Bahçecioğlu, Özgür Kurç * Department

More information

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Optimizing Data Locality for Iterative Matrix Solvers on CUDA Optimizing Data Locality for Iterative Matrix Solvers on CUDA Raymond Flagg, Jason Monk, Yifeng Zhu PhD., Bruce Segee PhD. Department of Electrical and Computer Engineering, University of Maine, Orono,

More information

Assignment 1 (concept): Solutions

Assignment 1 (concept): Solutions CS10b Data Structures and Algorithms Due: Thursday, January 0th Assignment 1 (concept): Solutions Note, throughout Exercises 1 to 4, n denotes the input size of a problem. 1. (10%) Rank the following functions

More information

GraphBLAS Mathematics - Provisional Release 1.0 -

GraphBLAS Mathematics - Provisional Release 1.0 - GraphBLAS Mathematics - Provisional Release 1.0 - Jeremy Kepner Generated on April 26, 2017 Contents 1 Introduction: Graphs as Matrices........................... 1 1.1 Adjacency Matrix: Undirected Graphs,

More information