Chapter 1

Parallel Sparse Matrix-Vector Multiplication using a Shared Virtual Memory Environment

Francois Bodin, Jocelyne Erhel, Thierry Priol
IRISA - Campus Universitaire de Beaulieu - Rennes Cedex - France

Reprinted from "Proc. 6th SIAM Conference on Parallel Processing for Scientific Computing", Norfolk, Virginia (USA), March 1993.

Abstract

Many iterative schemes in scientific applications require the multiplication of a sparse matrix by a vector. This kernel has mainly been studied on vector processors and shared-memory parallel computers. In this paper, we address the implementation issues that arise when using a shared virtual memory system on a distributed memory parallel computer. We study in detail the impact of loop distribution schemes in order to design an efficient algorithm.

1 Introduction

Many scientific applications require computations on large sparse matrices. Most iterative schemes include the multiplication of a sparse matrix by a vector, which should therefore be efficient on parallel computers. The algorithm depends directly on the storage scheme chosen; various schemes have been devised, as explained in [1]. This paper deals with a parallel version of this kernel designed for a distributed memory parallel computer (DMPC) equipped with a Shared Virtual Memory (SVM).

The basic idea of an SVM is to hide the underlying architecture of DMPCs by providing a virtual address space to the user. This address space is partitioned into pages spread among the local processor memories, and each local memory acts as a large software cache for storing the pages requested by its processor. A DMPC can thus be programmed like a more conventional shared-memory parallel computer. However, accessing data through an SVM can dramatically decrease the efficiency of a parallel algorithm if data locality is not well exploited. The aim of this paper is to demonstrate that using both an adequate data storage and a suitable loop distribution scheme keeps this overhead low. Blocked matrix-vector multiplication is not considered: such an algorithm can decrease the communication cost of vector broadcasting and reduction, but an efficient use of the SVM would then require a new storage of the matrix and so would introduce many changes in Fortran programs. We also do not address the use of the internal processor cache, which is studied in [6].

The paper is organized as follows. Sections 2 and 3 briefly describe the KOAN Shared Virtual Memory and the Fortran-S compiler targeted to KOAN. Section 4 is devoted to the description of the parallel algorithm, mainly to the impact of sharing the resulting vector; a specific loop partition scheme is designed in order to distribute the workload. Section 5 gives and comments on experimental results for various general non-symmetric sparse matrices.

This work is partially supported by Intel SSD under contract no. C.

2 KOAN: a Shared Virtual Memory for the iPSC/2

KOAN is a Shared Virtual Memory designed at IRISA for the Intel hypercube iPSC/2 [4]. It is embedded in the operating system of the iPSC/2, which allows it to use fast, low-level communication primitives as well as the Memory Management Unit (MMU). Pages are managed with the fixed distributed manager algorithm described in [5], and consistency of the shared memory is guaranteed by an invalidation protocol. This algorithm represents a good tradeoff between ease of implementation and efficiency. Let us now summarize some of the functionalities of the KOAN SVM system.

KOAN provides the user with several memory management protocols for handling particular memory access patterns. An important one occurs when several processors have to write into different locations of the same page. This pattern generates many messages, since the page has to move from processor to processor (the ping-pong effect, or false sharing). A weak cache coherence protocol can be used to let the processors modify their own copies of a page concurrently and to merge the local copies into the shared page afterwards.

Parallel algorithms based on a producer/consumer scheme are also inefficient on DMPCs when using an SVM system: typically, a page is first updated by one processor and then accessed by the other processors. KOAN can manage this memory access pattern efficiently by using the broadcasting facility of the underlying topology of the DMPC (hypercube, 2D mesh, etc.): the producer processor broadcasts all the pages updated during the producer phase to the consumer processors.

These two memory management protocols are available to the user through several operating system calls, which are used either to specify a program section where the weak cache coherence protocol has to be used or to delimit a producer phase.

3 Fortran-S: A programming interface for KOAN

A "user-friendly" programming environment for DMPCs has been designed at IRISA by providing high-level parallel constructs. The user's code is written in standard Fortran-77 and contains directives to express parallel loops and shared variables (a shared variable can be read or written by all the processors). Shared variables are used for data structures that can be computed in parallel, so parallel loops are intended to compute the values of the shared variables declared in the program.

Parallel execution is achieved using an SPMD (Single Program Multiple Data) execution model. At the beginning of the program execution, a thread is created on each processor and each processor starts to execute the program. All non-shared variables are duplicated on the processors, and the shared memory space is allocated by the KOAN SVM at the beginning of the execution. To execute a parallel do loop, the compiler distributes chunks of the iteration space to the processors according to the iteration distribution specified in the directives. The Fortran-S compiler is in charge of generating the parallel processes, with the KOAN low-level operating system calls, from the source code.

This approach provides a convenient parallel programming environment for several reasons. It allows an easy and efficient way of programming parallel algorithms; moreover, it facilitates debugging, since the programs can be compiled and executed on a workstation. Although parallelism is based on a shared-memory approach, the programming environment also provides message-based primitives that can be used to handle global operations and synchronizations efficiently.
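To make the SPMD execution model concrete, the sketch below shows roughly what a BLOCK-scheduled parallel loop over 1..n could be turned into on each node. This is only an illustration under our own assumptions, not the actual output of the Fortran-S compiler; we assume the usual iPSC calls mynode(), numnodes() and gsync() for node identification and global synchronization, and we assume n, x and y are declared elsewhere (y shared, x local).

C     SPMD sketch of a BLOCK-scheduled parallel loop
C     (illustration only; not generated by Fortran-S)
      integer me, p, chunk, lo, hi, i
      me = mynode()
      p  = numnodes()
C     each processor owns a contiguous chunk of the iteration space
      chunk = (n + p - 1) / p
      lo = me * chunk + 1
      hi = min(n, (me + 1) * chunk)
      do 10 i = lo, hi
         y(i) = 2.0 * x(i)
   10 continue
C     all processors synchronize at the end of the parallel loop
      call gsync()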
The prototype compiler is implemented using the Sigma-II system developed at Indiana University [3]. The compilation technique is not described further in this paper, for the sake of brevity. In the remainder of this section, we give a brief overview of the main directives.

In Fortran-S, shared variables must be declared explicitly by means of a directive, and the declaration must appear in the main program. All other variables are non-shared: each processor has its own copy of the variable. A shared variable is declared using the following directive:

      REAL V(N,N)
C$ann[Shared(v)]

The iterations of a parallel loop are distributed among the processors, and the processors synchronize at the end of all the iterations. Several static scheduling strategies are provided by the compiler, and the user can also define his own strategy, which may be dynamic. Among these strategies, the compiler can allocate chunks of consecutive iterations to the processors or can distribute the iterations cyclically. These schedulings provide a good load balance if the work in the iterations is almost equally distributed; otherwise, more sophisticated schemes must be used. Therefore the Fortran-S compiler also accepts user-defined iteration partitions, which can be either static or dynamic (i.e. set at run time). A parallel loop is declared using the directive:

C$ann[DoShared("scheduling")]
      do 20 nel = 1, nbnel
         sounds(nel) = sqrt(gama * p(nel) / ro(nel))
   20 continue

where the string "scheduling" indicates the scheduling strategy for the iterations and can take the following values:

1. "BLOCK": chunks of contiguous iterations are assigned to the processors.

2. "CYCLIC": iterations are distributed over the processors according to a modulo scheme; the first iteration is assigned to the first processor, the second to the second processor, and so on.

3. "USER": user-defined partitions. The partition of the iteration space is specified by the user at run time. This feature is important when the load balance depends on the data of the program.

The weak cache coherence protocol described briefly in Section 2 can be associated with a shared variable during the execution of a parallel loop thanks to the following directive:

C$ann[WeakCoherency(y)]

An example of its use is given below:

C$ann[DoShared("BLOCK")]
C$ann[WeakCoherency(y)]
      do 1 i = 1, n
         y(i) = f(...)
    1 continue

In this example y is assumed to be a shared variable written simultaneously by many processors, so there may be false sharing on the pages where the variable y is stored. The weak coherence protocol removes that phenomenon by merging the updates of the shared pages only once, at the end of the loop.
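For illustration (this is our own sketch, not part of the Fortran-S interface), the processor, numbered from 0 to p-1, that executes iteration i of a loop over 1..n under each of the two static schedulings can be written as:

C     Owner of iteration i under the two static schedulings
C     (illustration only; processors are numbered 0..p-1)
      integer function ownblk(i, n, p)
      integer i, n, p
C     BLOCK: contiguous chunks of ceiling(n/p) iterations
      ownblk = (i - 1) / ((n + p - 1) / p)
      end

      integer function owncyc(i, p)
      integer i, p
C     CYCLIC: iterations dealt out round-robin over the processors
      owncyc = mod(i - 1, p)
      end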

4 Algorithm and Data Structures

The multiplication of a sparse matrix by a vector is a CPU-intensive kernel found in most iterative schemes. The algorithm depends on the storage scheme, which may or may not include some zeros. Here we choose a compressed storage by rows, which is commonly used and well suited to parallel multiplication. Let n be the order of the matrix and nz the number of non-zero elements. A real array a of length nz contains all the coefficients, while an integer array ja of the same length nz contains the corresponding column indices. An auxiliary integer array ia of length n + 1 points to the first element of each row.

An intrinsic parallelism derives readily from this storage scheme: since the sparse matrix multiplication is expressed by rows, it is sufficient to partition the matrix by rows and to handle the different blocks of rows in parallel. More precisely, the sequential algorithm is composed of an outer loop on the rows with an inner loop on the elements of each row, as shown in the program below. The outer iterations on the rows are clearly independent, so they can be assigned to parallel tasks. Below are the sequential algorithm and the parallel version.

c     sequential version
      do i = 1, n
         y(i) = 0.
         do k = ia(i), ia(i+1)-1
            y(i) = y(i) + a(k) * x(ja(k))
         end do
      end do

c     parallel version
c     shared variables  : a, ja, y
c     private variables : n, i, k, temp, x, ia
C$ann[DoShared("sched")]
C$ann[WeakCoherency(y)]
      do i = 1, n
         temp = 0.
         do k = ia(i), ia(i+1)-1
            temp = temp + a(k) * x(ja(k))
         end do
         y(i) = temp
      end do
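As a concrete illustration of this storage (a small example of our own; the example matrix printed in the original paper is not reproduced here), consider a matrix of order n = 4 with nz = 7 non-zero entries:

          | 1. 0. 2. 0. |
      A = | 0. 3. 0. 0. |
          | 4. 0. 5. 6. |
          | 0. 0. 0. 7. |

      a(1:7)  = ( 1.  2.  3.  4.  5.  6.  7. )
      ja(1:7) = ( 1   3   2   1   3   4   4 )
      ia(1:5) = ( 1   3   4   7   8 )

With this layout, the inner loop for row i = 3 runs over k = ia(3), ..., ia(4)-1 = 4, 5, 6 and accumulates a(4)*x(1) + a(5)*x(3) + a(6)*x(4).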

The data structures are the operand vector x, the resulting vector y and the sparse matrix, composed of the three vectors a, ia, ja. Only the vector y is written; the others are read-only. The vectors x and ia, of length n and n + 1 respectively, can be copied into the local memories. On the contrary, we assume that the vectors a and ja cannot be duplicated because of memory requirements. In our environment, programming is simplified by the use of shared arrays, which are managed by KOAN and the compiler. Since the scheduling chosen for the algorithm requires a data distribution that differs from the initial distribution, the system has to move pages during the first executions of the multiplication. But after a few iterations the pages are located correctly and stay in the local memories, provided there is enough space, so that the SVM overhead becomes negligible.

Here we do not deal with further uses of the operand vector x in the application. In general, the matrix-vector multiply would require a global broadcast of the vector x, or implicit copies of it into the local memories via the SVM, introducing a meaningful overhead.

As far as the resulting vector y is concerned, two cases must be studied. If the vector is used throughout the application with the same static partition, it may remain local to the processors: thanks to the SPMD programming model, if the vector y is not declared as shared, each processor simply keeps its partial results. On the other hand, if its next use requires a different partition, the vector must be shared. We then use the weak coherence protocol to eliminate false sharing when updating the shared array y.

The rows must be partitioned in order to balance the workload. The simplest strategy, which is provided by automatic parallelizers for instance, is to divide the rows into slices of almost equal size. This has been done, for example, in [7] to implement a sparse linear solver on a KSR computer. However, the load of the algorithm is not measured by the order of the matrix, but rather by its number of non-zeros. Therefore, for some matrices, a better partition consists in computing slices with almost the same number of non-zeros but with unequal numbers of rows. We have experimented with both strategies, by using the directive "DoShared" with the two options "BLOCK" and "USER", where "USER" implements a partition of the iteration space into blocks with roughly the same number of non-zeros. In the latter strategy, the vector ia is used to compute the partition of the iteration space among the processors. Because this operation is done only once (the matrix structure is not modified between two matrix-vector multiplies, so the computed scheduling stays valid) and because the vector ia is a private vector, the overhead is kept low.
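A possible way to compute such a non-zero-balanced partition from the row pointer ia is sketched below. This is our own illustration under assumed conventions (the paper does not give the code it uses): processor q, numbered from 0 to p-1, is assigned rows first(q+1) to first(q+2)-1.

C     Compute a row partition balancing the number of non-zeros
C     (illustration only; not the authors' code)
      subroutine nzpart(n, p, ia, first)
      integer n, p, ia(n+1), first(p+1)
      integer q, row, nz, cut
      nz = ia(n+1) - 1
      first(1) = 1
      row = 1
      do q = 1, p - 1
C        cut = number of non-zeros the first q processors should own
         cut = (q * nz) / p
C        advance row until rows 1..row-1 hold at least cut non-zeros
         do while (row .le. n .and. ia(row) - 1 .lt. cut)
            row = row + 1
         end do
         first(q+1) = row
      end do
      first(p+1) = n + 1
      end

The resulting boundaries would then describe the iteration partition handed to the "USER" scheduling; since ia is private and the computation is linear in n, doing it once per matrix keeps the cost negligible, as noted above.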
5 Numerical Results

The algorithm is implemented using Fortran-S on a hypercube iPSC/2 with at most 32 processors. It should be noted that the peak performance of one processor is only 0.3 Mflops. The code is tested on various matrices (see Table 1) in order to measure the impact of the order and of the number of non-zeros; they are band matrices or come from the Harwell-Boeing collection [2]. The resulting vector y is either distributed in the local memories (y is private, so the results of the matrix-vector multiply are distributed among the local copies of y) or managed by the shared virtual memory (y is declared as shared) using the weak cache coherence protocol. The performance difference between the two experiments measures the cost of the weak coherence protocol used on the vector y. The two scheduling strategies described above (BLOCK and USER) are also tested. Timings in seconds and rates in Mflops are given for one hundred iterations in single precision (see Tables 2, 3, 4 and 5). Results for the large problems are not given on one or two processors because the size of the data exceeds the available memory.

The experimental results show that the size of the problem is well measured by the number of non-zeros in the matrix. Indeed, the results on the band matrices show similar performance, mainly for the version with distributed vectors, because they have almost the same number of non-zeros although they have different orders. Performance degradation comes both from load imbalance and from the merging of the pages where parts of the vector y are stored (this is due to the weak coherence protocol). The version with distributed vectors gives very good speed-ups, even for small problems on 32 processors, since data transfers between processors only occur during the first matrix-vector multiply; we recall that here the operand vector x is not updated.

The overhead induced by the merge into the shared vector is proportional to the maximal number of processors writing to the same page when merging the shared array. In the case of the BLOCK strategy, this quantity can be roughly estimated by max(2, min(p, c/(n/p))), where c is the page size (here 1024 words of 32 bits), p is the number of processors and n is the order of the matrix. Small matrices such as bcspwr09 and band24 lead to a high overhead with 32 processors, because the shared array fits in two pages. But for large n the merge operation becomes relatively cheap, since at most two processors share the same page. Load imbalance is particularly noticeable for the matrix orani678, because its distribution of non-zeros is far from uniform.

The first partition (BLOCK) leads to poor performance, whereas the second partition (USER) gives better performance by distributing the work equally. For band matrices, where the rows have roughly the same number of non-zeros, both strategies yield similar results. We have also measured the SVM overhead due to page moves for each iteration: as expected, the first iteration is quite costly, but in the following ones this overhead can effectively be neglected.

Table 1. Sparse matrices used: order and number of non-zeros of bcspwr09, lns3937, band5, orani678, band20 and bcsstk24.

Table 2. Results with distributed vectors and with row partitioning (BLOCK): time in seconds and rate in Mflops for each matrix of Table 1, as a function of the number of processors.

6 Conclusion

The results presented in this paper demonstrate that the use of a Shared Virtual Memory is an efficient way of programming distributed memory parallel computers. Our environment provides tools, based on directives, which automatically distribute both data and computations. However, users still have to distribute the loop iterations carefully in order to balance the workload and to limit remote accesses through the SVM as much as possible. We have shown that the computations involved in a sparse matrix-vector multiply can easily be distributed by using an adequate loop scheduling strategy. Concerning data locality, the overhead induced by reading the matrix comes from an initial system distribution that is not related to the loop distribution, but these page moves become negligible as soon as the number of iterations calling the sparse matrix-vector multiply becomes sizable. The main limitations to the speed-up are due to the effective sharing of the resulting vector and of the operand vector.

Table 3. Results with one shared vector and with row partitioning (BLOCK): time in seconds and rate in Mflops for each matrix of Table 1, as a function of the number of processors.

Table 4. Results with one shared vector and with non-zeros partitioning (USER): time in seconds and rate in Mflops for each matrix of Table 1, as a function of the number of processors.

Table 5. Results on band matrices (BLOCK): order, number of non-zeros, time and rate in Mflops for band5, band10 and band24, with the shared and with the distributed resulting vector, on 8 and on 32 processors.

We studied here the first effect, on a shared virtual memory, and found that the overhead becomes small for matrices of large order. The second limitation occurs in an application that calls the kernel iteratively and distributes the vector x at each iteration. Since this overhead increases with the number of processors, it may become the main bottleneck of an application [7]. We plan to investigate this problem, and the global data usage, by implementing an iterative sparse linear solver requiring a sparse matrix-vector multiply.

References

[1] I. Duff, A. Erisman, and J. Reid, Direct Methods for Sparse Matrices, Oxford University Press, London, 1986.

[2] I. Duff, R. Grimes, and J. Lewis, Sparse matrix test problems, ACM TOMS, 15 (1989), pp. 1-14.

[3] D. Gannon, J. K. Lee, B. Shei, S. Sarukai, S. Narayana, N. Sundaresan, D. Atapattu, and F. Bodin, Sigma II: a tool kit for building parallelizing compilers and performance analysis systems, Proceedings of the IFIP Edinburgh Workshop on Parallel Programming Environments, April 1992.

[4] Z. Lahjomri and T. Priol, KOAN: a shared virtual memory for the iPSC/2 hypercube, in CONPAR/VAPP 92, September 1992.

[5] K. Li, Shared Virtual Memory on Loosely Coupled Multiprocessors, PhD thesis, Yale University, September 1986.

[6] O. Temam and W. Jalby, Characterizing the behavior of sparse algorithms on caches, Proceedings of Supercomputing '92.

[7] D. Windheiser, E. Boyd, E. Hao, and S. Abraham, KSR1 Multiprocessor: Analysis of Latency Hiding Techniques in a Sparse Solver, Research Report, University of Michigan, November 1992.
