Optimizing Sparse Matrix Operations on GPUs using Merge Path

Size: px

Start display at page:

Download "Optimizing Sparse Matrix Operations on GPUs using Merge Path"

Todd Ray
5 years ago
Views:

21 IEEE 29th International Parallel and Distributed Proessing Symposium Optimizing Sparse Matrix Operations on GPUs using Merge Path Steven Dalton, Luke Olson Department of Computer Siene University

1 21 IEEE 29th International Parallel and Distributed Proessing Symposium Optimizing Sparse Matrix Operations on GPUs using Merge Path Steven Dalton, Luke Olson Department of Computer Siene University of Illinois at Urbana-Champaign Urbana, IL, {dalton6, Sean Baxter, Duane Merrill, Mihael Garland Nvidia Researh Santa Clara, CA, 9 {sbaxter, dmerrill, mgarland}@nvidia.om Abstrat Irregular omputations on large workloads are a neessity in many areas of omputational siene. Mapping these omputations to modern parallel arhitetures, suh as GPUs, is partiularly hallenging beause the performane often depends ritially on the hoie of data-struture and algorithm. In this paper, we develop a parallel proessing sheme, based on Merge Path partitioning, to ompute segmented row-wise operations on sparse matries that exposes parallelism at the granularity of individual nonzeros entries. Our deomposition ahieves ompetitive performane aross many diverse problems while maintaining preditable behavior dependent only on the omputational work and ameliorates the impat of irregularity. We evaluate the performane of three sparse kernels: SpMV, SpAdd and SpGEMM. We show that our proessing sheme for eah kernel yields omparable performane to other shemes in many ases and our performane is highly orrelated, nearly 1, to the omputational work irrespetive of the underlying struture of the matries. Keywords-parallel; gpu; sparse matrix-matrix; sorting; I. INTRODUCTION Operations that are naturally deomposed into a fixed number of equally sized omponents are inherently amenable to parallel proessing. However, proessing operations in whih the workload is highly irregular and data-dependent is substantially more hallenging. For finegrained parallel proessing environments, suh as GPUs, the issues of effetively deomposing and mapping irregular work to omputational units is amplified due to the hierarhial struture of the threads in the arhiteture and the relatively small working set size of eah individual thread. In this work we study the performane of three operations on sparse matries, sparse matrix-vetor multipliation (SpMV), sparse matrix-sparse matrix addition (SpAdd), and sparse matrix-sparse matrix multipliation (SpGEMM) and propose several strategies to ahieve preditable performane, meaning the proessing time is highly orrelated with the total work and independent of the underlying struture of the input matries. The primary hallenge of omputations involving sparse matries originates from workload imbalanes indued by data segmentation. To address irregularity many methods have been introdued that struture memory aesses to improve the performane of sparse matrix operations. Arguably the most studied and optimized, in terms of datastrutures and algorithms, is SpMV. Peak SpMV performane is ahieved by oupling sparsity pattern analysis with speialized, and in some ases exoti, storage shemes tuned for a partiular lass of matries. The drawbaks of this approah are the need to perform analysis of the input matrix and the use of storage formats that are not amenable for use in other operations. For a partiular set of matries a speialized algorithm-storage pair may yield optimal performane on a given arhiteture but a highly tuned pair may be invalid or exhibit unpreditable behavior when applied to general matries. We onsider these approahes as sparsity or segmentation aware sine they inorporate matrix struture into the implementation. In this work we avoid the use of segmentation aware algorithms or speialized data strutures. Our approah plaes load-balaning first and performs segmented rowwise operations progressively by introduing intermediate work. To ahieve this we study balaned deompositions of eah sparse matrix workload and perform proessing at the granularity of nonzero entries during all phases. In the ase of SpAdd and SpGEMM our proessing sheme builds on highly regular merge-based sorting routines to partition work into smaller units of roughly equal size for parallel proessing. In ontrast with traditional metris of interest, suh as giga-floating point operations per seond (GFLOPs/s), our method is not designed to ahieve the best absolute performane ompared to other segmentation aware algorithms. Indeed we show the performane of our approah may be notably lower than other methods in some ases. However, we also show that our method is easily preditable for a wide diversity of inputs. The ontributions of this work are as follows: Balaned deompositions of three sparse matrix kernels: SpMV, SpAdd, and SpGEMM. Extension of merge-path partitioning to perform set unions /1 $ IEEE DOI 1.119/IPDPS

2 II. BACKGROUND Modern GPUs are representative of a lass of proessors that seek to maximize the performane of massively data parallel workloads by employing hundreds of hardwaresheduled threads organized into tens of proessor ores. Suh arhitetures are onsidered throughput-oriented in ontrast to traditional, lateny-oriented, CPU ores whih seek to minimize the time to omplete individual tasks [1]. GPUs organize proessors into a hierarhy of omputational bloks starting from a fixed size group of threads, known as warp, exeuting in a SIMD fashion. Warps are aggregated into large bloks sharing a fast loal memory buffer, shared memory, exeuting within a thread blok or onurrent thread array (CTA). CTAs are further grouped into a larger set, known as a grid, exeuting a single program. This hierarhial exeution model exhibits the highest utilization when omputations are regularly-strutured and evenly distributed among CTAs within a grid. Sorting is a important lass of omputational algorithms and effiient GPU implementations of radix and merge sort have been presented [2], [3]. Though radix sorting an array of N elements, with an average key-length k, may ahieve lower omplexity bounds, O(kN), than omparison based methods, O(N logn), it has limited appliability and fails to exploit approximate sorted-ness of the input sequene. Reently merge path has been introdued as a effiient means of improving the performane of merge based sorting algorithms [4], []. Merge path, illustrated in Figure 1a, operates by partitioning the inputs, A and B, among parallel threads in a way that provides: 1) Equal amounts of work for all threads. 2) Minimal ommuniation and synhronization between parallel threads. SpMV operations are at the ore of many sparse iterative solvers and as suh have been the fous of a large body of researh to optimize performane [6], [7]. Bell et al. give a omprehensive overview of SpMV onsiderations on the GPU and propose a novel hybrid format to balane the tradeoff of performane and generality [8]. Subsequent researh on GPU SpMV have onsidered sparsity aware adaptive proessing and various algorithmi enhanements to redue memory traffi [9] [11]. Sequential SpGEMM algorithms [12], rely on a large amount, O(n), of temporary storage to effiiently store and redue unique entries on any row of the output matrix C R m n. Although tehniques have been developed to expand the salability and generality of suh parallel deompositions [13], the level of parallelism remains too oarse-grained for GPU aeleration. One novel GPU SpGEMM algorithm deomposes the operation into expansion, sorting, and ompression (ESC) [14]. The ESC algorithm was improved by proessing C row-wise aording to the distribution of work, whih is a measure of the number of produts, omputed during analysis of the input matries [1]. Though highly effiient for sparse-matrix pairs generating intermediate produts readily proessed in shared memory their approah did not impat the performane of long intermediate rows proessed in global memory. To address long intermediate rows Liu et al. presented a mergebased proessing sheme that distributed entries on long intermediate rows evenly for parallel proessing in shared memory whih ahieved notable performane improvement for ertain matries [16]. In this work we explore segmentation oblivious methods to proess general redutions on sparse matries in order to redue the impat of irregularity while maintaining performane. For ompliated deompositions involving sparse matrix pairs, SpAdd and SpGEMM, we leverage the merge path strategy to exploit the partial ordering of the input and perfetly balane proessing of the intermediate work aross CTAs. III. SPARSE MATRIX OPERATIONS Given two matries A R m p, B R p n, and a vetor x R p 1 then SpMV, SpAdd, and SpGEMM ompute Ax, A + B, A B, respetively. In ontrast with general (dense) matrix-matrix operations, A and B are assumed to be sparse. Typial storage formats inlude oordinate (COO), where eah nonzero is represented as a (row, olumn, value) tuple, and ompressed sparse formats that sort and ompress the COO format along either the row (CSR) or olumn (CSC) axis and provides offsets to the first entry in eah segment. We onsider the following sparse matries in our algorithmi desriptions for Setions III-A, III-B and III-C, 1 1 A = and B = 2 3 4, with tuple form given by A. SpMV (,, 1) (1, 1, 2) A = (1, 2, 3) (1, 3, 4) (2, 3, ) (3, 1, 6) (,, 1) (1, 1, 2) (1, 3, 3) and B = (2,, 4). (2, 1, ) (3, 1, 6) (3, 3, 7) For SpMV we onsider Ax = y, where A is in CSR format and y R m 1, and the total work is two times the number of nonzeros in A, denoted A. The obvious parallelization, known as salar CSR, is to assign one thread 48

3 to eah row of A and proess eah row independently. Unfortunately this work deomposition may exhibit a large amount of imbalane between threads if the number of nonzeros per rows of A vary signifiantly. One solution is to vetorize the row-wise proessing sheme and enlist a fixed number of threads within a warp or CTA to proess eah row of the A. This sheme is advantageous when: the average number of entries per row does not vary signifiantly, eah row ontains a relatively large number of entries relative to the number of threads, and there are enough rows to fully saturate the devie. However, it may be impossible to globally selet a stati number of threads appropriate for every row of A. We avoid possible irregularities assoiated of row-wise proessing by assigning a fixed number of nonzeros, N prods, per CTA. To implement this strategy we deompose the SpMV kernel into three phases: partition, redution, and update. Although a flat nonzero based deomposition is perfetly balaned row-wise segmentation is required to enat the redution orretly. Proessing the matries in COO format is one alternative but requires the additional storage and movement of one row entry per nonzero. We address this issue, for sparse matries in CSR format, during the partitioning phase whih omputes the row-wise limits of the entries proessed per CTA. For A entries we proess the produts using m = A /N prods CTAs. During the partitioning phase we enlist m threads to ooperatively perform one binary searh per CTA segment to loate the last offset in the row offsets array proessed by eah CTA and storing these offsets in an auxiliary global memory buffer, S. After partitioning the rows the redution phase launhes m CTAs to redue produts within eah segment of A. A given CTA, i, proesses entries in the range [i N prods, min( A, (i +1) N prods )] and row offsets orresponding to S[i, i +1] are loaded into shared memory to generate the expanded row indies orresponding to eah nonzero. To ompute the redution we first load the olumn indies and values of A, in strided order to redue the penalty of global memory operations, into register. Then we dereferene entries from x aording to the olumn indies and ompute the saled produts. The produts are transposed from strided to bloked order using shared memory and a CTA wide segmented san is performed. On enountering disontinuities in the row indies partial sums are stored to y and the CTA remainder, orresponding to the last row, is stored to a global memory buffer, r. Lastly, during the update phase we aumulate the inter- CTA updates by performing a segmented san on r and augmenting the first redution for eah CTA with the arry out value from the previous CTA. This approah exposes parallelism without regard for segment geometry allowing it to sale over a wide range of sparsity patterns. We statially tune the number of entries per thread empirially Algorithm 1: Tuple Ordering parameters: T1=(row 1,ol 1 ), T2=(row 2,ol 2 ) return: MIN(T1,T2) if row 1 < row 2 return T1 if row 2 < row 1 return T2 if ol 1 <= ol 2 return T1 return T2 to maximize throughput. B. Addition For SpAdd we onsider the total work to form C as the sum of the number of nonzero entries in A and B, A + B. Two approahes for omputing SpAdd are rowwise segmented redutions and global sorting. In the roworiented sheme eah row of C is proessed by a thread group. Though the deomposition of work is simple and proessing any output row i requires O( A rowi + B rowi ) work, it is hallenging to hoose the number of ooperative threads per row. Conversely, with respet to the global sorting sheme the total work to generate C is onsidered olletively as a single monolithi unit by ombining nonzero entries from both A and B into a intermediate matrix ˆT. By lexiographially sorting ˆT by (row, olumn) tuples all dupliate entries are adjaent and performing a simple tuple-wise redution generates C. Though not suseptible to penalties assoiated with row-wise irregularities of A or B the omplexity of globally sorting ˆT is O(k( A + B ) whih is k times more expensive than the aumulated ost of the row-wise approah, O( A + B ). Ideally a method to onstrut C should mitigate the impat of load imbalane while attaining near optimal work omplexity. To ahieve this requires a deomposition of work omposed of non-overlapping ranges from A and B suh that eah thread proesses a unique set of entries in C, thus orresponding to O( A + B ) work omplexity. We observe that when both input matries are sorted by row and row-wise internally sorted by olumn then sparse matrix addition may be formulated as a more general operation requiring the union of sets. Considering the natural omparison of tuples outlined in Algorithm 1 and the ordering of the input matries then parallel merging using merge path appears to be a natural fit for SpAdd. We forgo a detailed desription of the merge path algorithm and provide only a short introdution but refer to Green et al. for additional information []. Merge path deomposes the work of merging two sorted list by performing a binary searh along the diagonals formed by orienting the lists on the x and y axes as shown in Figure 1a. The ordering 49

4 a b e A a b d f B t t 1 t 2 t 3 d e (a) Merge path a b 1 2 e A a b t 4 d d e f f 6 B (b) Balaned path Figure 1: Comparison of deision paths for merge path and balaned path given two sorted sets A and B having six elements eah. The paths, and thus the outputs, are evenly partitioned among four threads. For balaned path, the partition boundary between thread t and t1 is starred so that zipped pairs are never split aross threads. t 1 t 2 of entries along diagonals presribe a partition of both lists into non-overlapping regions of uniform size that an be proessed in parallel by individual threads. Though SpAdd appears merge-like, parallel merge path partitioning is inadequate. Eah onseutive range of mathing (row, olumn) tuples must be available to the same thread and therefore always appear on the same side of any diagonal. We therefore introdue balaned path partitioning, illustrated in Figure 1b, to extend the merge path deomposition to partition the inputs aording to this additional onstraint. We desribe the balaned path methodology by onsidering two sorted lists, A and B, onsisting of dupliate keys in plae of sparse matrix tuples. The normal merge path 1 t f Inputs proessed/s (1 6 ) Number of inputs keys-32 keys-64 pairs-32 pairs-64 Figure 2: Performane of union operation on sorted sets. Entries per input array are divided evenly. implementation onsumes all dupliate keys in A before mathing entries in B therefore onstruting a merged result with dupliate entries. In ontrast balaned path partitioning assigns a rank to dupliate keys within eah dupliation range of the input sequenes. This enables the assignment of keys with mathing ranks to the same partition for serial merging within individual threads therefore ensuring the key-rank pairs emitted by eah thread during onstrution of the result is a unique non-dupliated member of the output list. In Figure 1b the stair-step pattern of balaned path snaking its way through keys for two sorted lists highlights the unique differenes between the deompositions where we see a starred diagonal indiating a translation of thread t1 s partition to inlude a mathing key from B. As in merge path the balaned path diagonals begin at fixed intervals. However, the intervals may shift to onsume one additional or fewer elements than the presribed number proessed per thread to ensure both elements of a mathed key-rank pair are found on the same side of a diagonal partition. Diagonals, mapping to entries (i, diag i) in A and B, that need to be extended to inlude an additional element from B, orresponding to a mathed pair, are starred. The starred balaned path diagonal now maps to mathed entries (i, diag i + diag ) in A and B therefore stealing one entry from the right partition and depositing it into the left partition. Note that performing this key-rank deomposition allows the implementation of not only unions but many other set operations, suh as intersetion, differene and symmetri differene [4]. For SpAdd we fous on the union of sets therefore the output list is omprised of only zeroth ranked entries from A and B. In Figure 2 we illustrate the performane of our set oriented union operation on various array sizes and data types. Using our balaned path approah we perfetly deompose the work required to identify and 41

5 redue dupliates. With respet to SpAdd we redue the global memory overhead by omputing the union of the matries in two phases. During the first phase the unique tuples in the union of A and B are ounted and spae for C is alloated. In the seond balaned path invoation the entries from eah matrix are loaded into shared memory and dupliate values are redued within the CTA. Note that although for wellformed sparse matries A and B there are at most two dupliates that must be redued to form eah unique entry of C our implementation is general enough to proess a muh wider range of input sets onsisting of an arbitrary number of dupliates. C. Multipliation SpGEMM is substantially more hallenging than SpMV and SpAdd beause of the added irregularity of the workload to dynamially expand, during formation of the produts, and ontrat, during redution of dupliates. If we onsider the row-wise produts expanded for our simple example matries we see the following intermediate representation: (,, 1) (,, 1) (1, 3, 6) (1,, 12) (1, 1, 4) (1, 1, 4) (,, 1) (1, 1, 1) (1, 1, 1) (1,, 12) (1,, 12) (1, 1, 24) (1, 1, 43) Ĉ = (1, 3, 28) = (1, 3, 6) = (1, 3, 34) (1, 1, 24) (1, 3, 28) (2, 1, 3), (2, 3, 3) (2, 3, 3) (2, 1, 3) (3, 3, 18) (3, 1, 12) (2, 1, 3) (2, 3, 3) (3, 1, 12) (3, 3, 18) (3, 1, 12) (3, 3, 18) whih redues to the sparse matrix 1 C = A B = This depition of SpGEMM deomposes the FLOPs into two phases, expansion and ontration. We onsider the total work as a funtion of the number of produts, N prods,inthe expansion phase sine it is a measure of the total number of global memory operations required to feth data from the input matries and it also desribes the number of insertions/redutions required to form C. By far the identifiation and redution of dupliates during the ontration phase is the most hallenging phase of the omputation and forms the primary performane limiting routine in many SpGEMM implementations. To ombat the irregularity of ontrating the intermediate matrix we propose a balaned approah based on merge path partitioning of the total number produts and split Clok yles (1 4 ) P-Pairs 1P-Pairs 1P-Keys 1P(28-bits) 1P(24-bits) Sorting method 1P(2-bits) 1P(16-bits) 1P(12-bits) Figure 4: Clok yles per CTA radix sorting operations for two-pass (2P) key-value pairs (red), one-pass (1P) keyvalue pairs, and one-pass keys-only routines as the number of sorting bits are dereased (blue). the proessing into separate but performane stable sorting phases, the first phase performed in shared memory and a seond phase performed in global memory. To partition the produts we first speify the number of produts that may be proessed per CTA, N CTA, then we ompute the segmented prefix-sum, S, of the size of eah row of B, B rowi, referened by the olumn indies of A. Following Nprods this preproessing phase N CTA CTAs are launhed to independently expand and proess N CTA produts. Eah CTA loates the range of produts it is assigned by performing a binary searh on S to loate the first orresponding entry nonzero entry from A greater than or equal to its CTA number times N CTA. By deomposing A at the granularity of produts our approah balanes the work perfetly aross CTAs irrespetive of the row-wise redutions ditated by the input matries. The first sorting phase redues dupliates within eah CTA using radix sort. Our key observation is that beause we expand entries in order aording to olumn entries of A then ontiguous entries within a CTA remain ordered by row and dupliates are in adjaent loations following a single radix sort on the olumn indies as illustrated in Figure 3, this is in ontrast to two phase sorting approahes [14]. Figure 4 illustrates the performane improvement ahieved by reduing the total number of radix-sort operations within a CTA. We benhmark the performane using the radixsort routine in the open-soure CUB programming framework [17] for 32-bit data types with 128 threads per CTA and 11 entries per thread. For key-value pairs we observe that performing a single radix-sort pass redues the number lok yles by approximately a fator of two yielding notable performane improvement. In Figure 4 we also ompare the performane of radix- 411

6 (,,χ) (,,χ) (,,χ) (1, 3,χ) (1, 3,χ) (1,,χ) (,,χ) (1,,χ) (,,χ) (,, 1) (1,,χ) (1,, 12) (,, 1) (1,,χ) (1,,χ) (1, 3,χ) (1, 3,χ) (1, 1, 19) (1,, 12) (1, 3,χ) (1, 1, 24) (1, 1, 43) (1, 3,χ) = (1, 3,χ) = (1, 3,χ) = = = (1, 3, 34) = (2, 1, 3) (2, 3,χ) (2, 1,χ) (2, 3,χ) (2, 1,χ) (2, 1,χ) (2, 1,χ) (2, 1, 3) (3, 1, 12) (3, 3,χ) (2, 1,χ) (3, 1,χ) (3, 1,χ) (3, 1,χ) (2, 3, 3) (1, 3, 34) (2, 3,χ) (3, 1, 12) (2, 3, 3) (3, 3,χ) (2, 3,χ) (2, 3,χ) (3, 3, 18) (3, 3,χ) (3, 3, 18) (3, 1,χ) (3, 3,χ) (3, 1,χ) (3, 3,χ) (a) (b) () (d) Figure 3: All of the intermediate indies are formed, (a), and partitioned into subsets of approximately equal size, (b). χ represents unformed produts orresponding to eah intermediate entry. Eah subset is sorted by olumn index independently, (), and dupliate entries are redued, (d). All the subsets are aggregated into a intermediate matrix that ontains dupliates but is not sorted by row or olumn, (e). The redued matrix is sorted, (f), the produts of the redued entries are omputed and the remaining dupliates are redued to form C, (g). (e) (f) (g) sort when the number of sorted bits are redued from 28 to 12 bits. Based on this observation we aggressively optimize our sorting implementation to minimize the proessing time by exploiting the limited range of the olumn indies and sorting the minimum number of bits required by omputing log 2 (n), the number of olumns in B. We further optimize our performane by embedding the permutation bits in the unused upper range of the olumn indies when possible. Using this approah we redue the number of shared memory operations and perform a keys-only CTA radix-sort. The permutation omputed by eah blok is then stored to global memory using 16-bit integers to minimize the memory traffi and storage overhead. The first phase onludes with eah CTA sanning the sorted entries to identify unique entries and asynhronously storing the redued set to global memory. Note that following the first phase no produts have been formed and the redued intermediate entries in global memory onsist of possible dupliates that are ompletely unordered. To ompute the unique dupliates in C we perform a two pass global radix-sort routine similar to the method outline by the ESC algorithm. The notable differene is that the global memory sorting routine omputes only the permutation that sorts the pairs and does not perform any reordering of the, urrently unformed, intermediate produts. In the third phase the redued intermediate produts are formed by performing the expansion phase a seond time. In this version eah CTA loads the permutation omputed during the first expansion and a segmented san is performed to redue permuted produts aording to the preomputed dupliate indiators. To redue the number of global memory load and store operations the redued values are not stored randomly bak to global memory. The output loation that orders the redued produts aording to global dupliation, omputed during the seond phase, is used to store entries in sorted order. The final step onsists of forming the values in C by performing a global memory redue-by-key operation on the ordered produts. A key feature of this two-level deomposition is that both the CTA-wide and global memory sorting operations are oblivious to the irregularity of the underlying input matries. By preproessing many of the dupliates in shared memory prior to the global memory pass the total work required to perform the expensive two-pass radix-sort operation is signifiantly redued in ases where eah blok yields only a small number of unique pairs, i.e., a large number of dupliate entries per CTA. IV. NUMERICAL RESULTS In this setion we evaluate the performane of our parallel proessing strategy and ompare our results with two pakages for proessing sparse matrix operations on GPUs, Cusp, open-soure, and Cusparse, losed soure. Our evaluation is performed on a set of SpMV matries taken from the University of Florida (UFL) sparse matrix olletion outlined in Table II and the onfiguration of our testing environment is desribed in Table I. All of the performane data was olleted using double preision arithmeti with the input matries resident in GPU memory. CPU i7-382 CPU 3.6GHz GPU GTX Titan.88GHz CUDA 6. GCC 4.8 -O3 Table I: System onfiguration settings. On the GPU error heking and orretion (ECC) is disabled. 412

7 Matrix rows olumns nonzeros avg/row std Dense Protein Spheres Cantilever Wind Tunnel Harbor QCD Ship Eonomis Epidemiology Aelerator Ciruit Webbase LP Table II: Unstrutured matries taken from the UFL sparse matrix olletion for performane testing. GFLOPs/s Time (ms) ρ Cusparse =.84 ρ Merge = Nonzeros (1 6 ) Merge Cusparse Figure 6: Comparison of SpMV performane to A. We use the orrelation oeffiient, ρ, as a measure of the performane preditability as a funtion of A. For our Merge sheme ρ is.97 indiating that prediting the SpMV performane of untested matries should be relatively aurate. 1 Dense Protein Spheres Cantilever Wind Harbor QCD Ship Eonomis Epidemiology Aelerator Ciruit Webbase LP Cusp Cusparse Merge Figure : Sparse matrix-vetor performane, measured in GFLOPs/s, omparison for CSR format in double preision for three implementation using Cusp, Cusparse, and Merge. A. SpMV In Figure we ompare the SpMV performane of three different implementations operating on an input matrix stored in CSR format. Although the performane for some matries is substantially higher using speialized storage formats our goal is to ompare the ahieved performane using a general and standard storage format. For eah matrix the performane reported is the average of 1 iterations. The Cusp SpMV implementation uses the vetorized CSR implementation disussed in Setion III-A. In the Merge implementation we detet the presene of empty rows in the input matries and adaptively swith between a faster method that assumes there are no empty rows and a slightly slower method that ompats the CSR row offsets prior to performing eah SpMV operation. From Figure we see that the relative performane between the three implementations varies with respet to eah matrix. We note that the performane of our Merge sheme remains ompetitive with the best ases in all tests exept the Dense matrix. For hallenging matries that exhibit a large degree of irregularity in the number of entries per row, Webbase and LP, we see that our flat deomposition ahieves markedly better performane ompared to the other implementations. To illustrate the onnetion between the required work, proportional to A for SpMV, and the performane we plot the proessing time versus A for eah matrix in Figure 6. From Figure 6 we see that the performane of the Merge approah is highly orrelated, with a orrelation oeffiient of.97, with number the number of nonzeros in eah matrix and exhibiting only small perturbations from the least-squares fit of the data points. B. SpAdd In Figure 7 we ompare SpAdd, A + A, performane for the previously mentioned pakages. The performane illustrated is the speedup with respet to the sequential implementation using CSR format on the CPU. We note that the storage format for these tests are different between the pakages. The Cusp and Merge approahes operate on the input matries stored in expanded COO format to failitate proessing at the granularity of individual nonzero entries while Cusparse operates on the matries in CSR format. The Cusp implementation uses the two-pass global sorting approah desribed in Setion III-B. In ontrast with SpMV we see that both Cusparse and Merge perform signifiantly better than Cusp on all matries in the test suite. Another notable differene is the signifiant performane improvement of Cusparse over the Merge implementation for the Dense, Protein, and Wind matries. However, for the remaining matries in the test suite the performane of the two methods are omparable with the exeption of the Webbase and LP matries. 413

8 Speedup Speedup Dense Protein Spheres Cantilever Wind Harbor QCD Ship Eonomis Epidemiology Aelerator Ciruit Webbase LP Cusp Cusparse Merge Figure 7: SpAdd performane, measured in speedup versus sequential CPU implementation, for Cusp(COO), Cusparse(CSR), and Merge(COO). 2 1 Dense Protein Spheres Cantilever Wind Harbor QCD Ship Eonomis Epidemiology Aelerator Ciruit Cusp Cusparse Merge Webbase LP Figure 9: SpGEMM performane, measured in speedup versus sequential CPU implementation, for double preision for Cusp(COO), Cusparse(CSR), and Merge(CSR). Time (ms) ρ Cusparse =.68 ρ Merge = Nonzeros (1 6 ) Merge Cusparse Figure 8: Comparison of SpAdd performane to the number of nonzeros. For the merge implementation our orrelation oeffiient of 1 implies a nearly perfet math between the work and proessing time irrespetive of the underlying struture of the input matries. In Figure 8 we again plot the performane versus the total work to understand the orrelation between the required work and the total proessing time. Not surprisingly the Merge approah is perfetly orrelated with the number of nonzeros in the input matries. This is expeted sine the approah is based on parallel deompositions that yield perfet balane irrespetive of the segmentation of the underlying data. Note that for two of the largest SpAdd instanes Cusparse ahieves speedup for one and dramati slowdown for the other ompared to the Merge implementation. C. SpGEMM In Figure 9 we ompare SpGEMM, A A, performane for all three implementations. In the speial ase of the nonsquare LP matrix we transpose the matrix and perform A A T. The performane is tabulated in terms of the average speedup for 1 iterations versus the sequential CPU implementation in CSR format. The Cusp implementation uses a global memory ESC algorithm and although the performane is inherently independent of the input matries the absolute time is high beause of global operations involving the total number of produts, speifially radix sort. For many of the matries the Cusparse and Merge approahes outperform Cusp by a signifiant margin. However, the performane of Cusparse degrades for the Eonomis, Ciruit, Webbase, and LP matries. In ontrast the Merge approah sustains performane improvement ompared to Cusp in all instanes. The weakness of sort based SpGEMM methods operating on intermediate produts is also illustrated in Figure 9 where both the Cusp and Merge approahes required more physial memory than the resoure onstrained GPU ould support. For the Dense test ase the dupliates per CTA is lose to zero therefore the global memory pass requires a signifiant amount of temporary storage to order the expanded entries. For matries with a relatively dense number of entries per row segmented proessing presents a notable advantage. Although both the Webbase and LP matries also have rows ontaining a relatively large number of entries the powerlaw distribution of the row lengths in the ase of Webbase and the small number of rows in LP, sine it is omputing A A T, ensures that on average many of the CTAs ontrat a substantially large number of intermediate entries during the first sorting pass. In Figure 1 we plot the performane of the Merge and Cusparse implementations versus the total number of produts required to form C. Note that this analysis ignores the FLOPs performed to redue dupliate entries to form C. From Figure 1 we see that the performane of the two- 414

9 Matrix Label LP Webbase Ciruit Aelerator Epidemiology Eonomis Ship QCD Harbor Wind Cantilever Spheres Protein Setup Perent Time Produt Compute Produt Redue Blok Sort Global Sort Other Figure 11: SpGEMM performane breakdown Total Time (ms) level sorting approah based on merge path deompositions is highly orrelated with the total number of produts. This regularity implies that the performane of the Merge approah is easily preditable based on simple analysis of the input matries prior to exeution of the operation. The observed regularity in the Merge sheme performane is in ontrast with Cusparse whih diverges onsiderably from the antiipated performane in two ases. We do not laim the Cusparse implementation is inherently unpreditable but that the performane appears to be unrelated to our interpretation of the required work, the number of produts. In Figure 11 we deompose the performane of the merge SpGEMM performane into several phases. During the Setup phase the number of entries on eah row of B referened by eah olumn of A is sanned to ompute the number of produts required for eah nonzero entry in A. During Blok Sort the produts are deomposed into a fixed number of entries and proessed using a single radixsort pass within eah CTA and the resulting permutation is stored to global memory. The Global Sort phase organizes the redued number of entries identified by eah CTA using two-radix passes and the Produt Compute phase ombines the loal CTA permutation with the global dupliate permutation to form the ordered redued produts in global memory. In the last phase the final produts are formed by performing a redue-by-key operation on dupliates in global memory. Additional overheads in terms of alloation and misellaneous memory operations are aggregated into Other. From Figure 11 it is lear that the two sorting passes and the omputation of the output produts onstitute the bulk of the proessing time for eah matrix. Although there are signifiant variations in the total relative time of eah operation based on the matrix it is important to onsider these variations with respet to the total proessing time, shown on the right axis. V. CONCLUSION Although the speifi tehniques employed to perform eah sparse matrix operation varied substantially the fous of our study was the performane impliations of using a flat deomposition of the work irrespetive of the underlying segmentation ditated by the input operands. Note that our method does not ahieve the best performane in all ases for any of our tests. Indeed, we expet when omparing any two algorithms proessing suh irregular omputations that subtle data-dependent performane harateristis may have a large impat on the overall proessing time. Our primary fous in this work is ahieving preditable performane aross a vast landsape of input instanes. In future work we plan to address the defiienies of sort based SpGEMM methods by adaptively introduing segmented approahes when neessary. Deteting speifi ases like the Dense matrix is relatively simple but would also require a more detailed model to aurately predit the trade-off between the number of entries proessed in the segmented and unsegmented regions of the algorithm. REFERENCES [1] M. Garland and D. B. Kirk, Understanding throughputoriented arhitetures, Commun. ACM, vol. 3, pp. 8 66, November 21. [Online]. Available: 114/ [2] A. Davidson, D. Tarjan, M. Garland, and J. D. Owens, Effiient parallel merge sort for fixed and variable length keys, in Innovative Parallel Computing, May 212, p

10 Time (ms) Time (ms) ρ Merge = Number of produts (1 6 ) (a) Merge ρ Cusparse = Number of produts (1 6 ) (b) Cusparse Figure 1: Comparison of SpGEMM performane to the number of produts. Flat two-level sorting results in performane that is highly orrelated, ρ =.98, with the work expressed as a funtion of number of produts in the intermediate matrix. [3] D. Merrill and A. Grimshaw, High performane and salable radix sorting: A ase study of implementing dynami parallelism for GPU omputing, Parallel Proessing Letters, vol. 21, no. 2, pp , 211. [Online]. Available: 21/212/S html [4] Mgpu : Design patterns for gpu omputing, github.io/moderngpu/, version 1.1. [] O. Green, R. MColl, and D. A. Bader, Gpu merge path: A gpu merging algorithm, in Proeedings of the 26th ACM International Conferene on Superomputing, ser. ICS 12. New York, NY, USA: ACM, 212, pp [Online]. Available: [6] R. W. Vudu, Automati performane tuning of sparse matrix kernels, Ph.D. dissertation, 23, aai [7] S. Williams, L. Oliker, R. Vudu, J. Shalf, K. Yelik, and J. Demmel, Optimization of sparse matrix-vetor multipliation on emerging multiore platforms, Parallel Computing, vol. 3, no. 3, pp , 29, revolutionary Tehnologies for Aeleration of Emerging Petasale Appliations. [Online]. Available: sienediret.om/siene/artile/pii/s [8] N. Bell and M. Garland, Implementing sparse matrix-vetor multipliation on throughput-oriented proessors, in SC 9: Proeedings of the Conferene on High Performane Computing Networking, Storage and Analysis. New York, NY, USA: ACM, 29, pp [9] J. W. Choi, A. Singh, and R. W. Vudu, Modeldriven autotuning of sparse matrix-vetor multiply on gpus, SIGPLAN Not., vol. 4, no., pp , Jan. 21. [Online]. Available: [1] J. L. Greathouse and M. Daga, Effiient sparse matrixvetor multipliation on gpus using the sr storage format, in Proeedings of the International Conferene for High Performane Computing, Networking, Storage and Analysis, ser. SC 14. Pisataway, NJ, USA: IEEE Press, 214, pp [Online]. Available: [11] A. Monakov, A. Lokhmotov, and A. Avetisyan, Automatially tuning sparse matrix-vetor multipliation for gpu arhitetures, in High Performane Embedded Arhitetures and Compilers, ser. Leture Notes in Computer Siene, Y. Patt, P. Foglia, E. Duesterwald, P. Faraboshi, and X. Martorell, Eds. Springer Berlin Heidelberg, 21, vol. 92, pp [Online]. Available: 1 [12] F. G. Gustavson, Two fast algorithms for sparse matries: Multipliation and permuted transposition, ACM Trans. Math. Softw., vol. 4, pp , September [Online]. Available: [13] A. Buluç and J. R. Gilbert, Parallel sparse matrixmatrix multipliation and indexing: Implementation and experiments, SIAM Journal of Sientifi Computing (SISC), vol. 34, no. 4, pp , 212. [Online]. Available: aydin/spgemm sis12.pdf [14] N. Bell, S. Dalton, and L. Olson, Exposing fine-grained parallelism in algebrai multigrid methods, SIAM Journal on Sientifi Computing, vol. 34, no. 4, pp. C123 C12, 212. [Online]. Available: / [1] S. Dalton, N. Bell, and L. Olson, Optimizing sparse matrixmatrix multipliation for the gpu, Department of Computer Siene, University of Illinois at Urbana-Champaign, Champaign, Illinois, Teh. Rep., February 213. [16] W. Liu and B. Vinter, An effiient gpu general sparse matrix-matrix multipliation for irregular data, in Parallel and Distributed Proessing Symposium, 214 IEEE 28th International, May 214, pp [17] Cub : Reusable software for uda programming, nvlabs.github.io/ub/, version

Pipelined Multipliers for Reconfigurable Hardware

Pipelined Multipliers for Reconfigurable Hardware Pipelined Multipliers for Reonfigurable Hardware Mithell J. Myjak and José G. Delgado-Frias Shool of Eletrial Engineering and Computer Siene, Washington State University Pullman, WA 99164-2752 USA {mmyjak,