FPGA Multi-Processor for Sparse Matrix Applications


FPGA Multi-Processor for Sparse Matrix Applications
João Pinhão, IST

Abstract: In this thesis a novel, computationally efficient multi-core architecture for solving the sparse matrix-vector multiply (SpMV) problem in FPGA is proposed. The efficient implementation of SpMV is challenging, as simple implementations of the kernel typically deliver only a fraction of peak performance. At the center of this problem is the fact that sparse operations are more bandwidth-bound than dense ones. Previous work on the subject suggests that the use of FPGAs for solving SpMV can improve performance when compared to General Purpose Processors (GPPs), thereby improving total execution times when solving problems largely dependent on SpMV. As such, in this work the existing sparse matrix compression/storage formats are analyzed, their suitability for an efficient implementation is assessed, and a multi-processor architecture is proposed, capable of making better use of the available bandwidth in order to achieve higher levels of performance. Through extensive experimentation on a wide dataset, the architecture exhibits the ability to outperform GPPs both in terms of peak and average performance. Given the target platform (Zynq), the performance of the developed architecture was compared to that of the ARM Cortex-A9 GPP present in the platform. The architecture achieves an average of 78.5% of peak performance (624 MFLOPS) and reaches 98.2% of peak in the best case. The ARM processor, on the other hand, attained an average of 142 MFLOPS, equivalent to 21.8% of its peak performance, reaching at most 31.4% of peak in one test. This translates into a performance improvement averaging 4.48 times the performance of the ARM processor tested.

Index Terms: Sparse Matrices, FPGA, Multi-Core Architecture

I. INTRODUCTION

Sparse linear algebra computations, such as the matrix-vector product or the solution of sparse linear systems, are often the bottleneck in many scientific fields, from computational fluid dynamics to structural engineering, electromagnetic analysis or even the study of economic models, to name a few. The task of computation regularly falls to CPUs and, due to the evolution in this field, performance is improving. However, obtaining peak performance from modern cache-based computational systems has proven to be extremely difficult. Several factors contribute to this low performance, such as the underlying machine architecture, memory access behavior, compiler technology and the nature of the input matrix, which may only be known at runtime. In sparse matrices, the fraction of nonzero elements is small when compared to the total number of elements. While it is possible to use generic data structures and routines to perform computations with such matrices, this is inefficient, as most calculations on zero elements are redundant, and sometimes even impractical due to the large dimensions of these matrices. In practice, sparse matrices are stored using specialized data structures that only keep the nonzero values, along with additional information regarding the position of each one in the matrix, so that the total size is proportional to the number of nonzero elements. As an example of applicability in most of the fields named above, solving a partial differential equation with the finite elements method boils down to solving a system of linear equations of the form y = Ax, where y and x are vectors and A is a large matrix that is mostly composed of zero entries.
The nonzero elements of A would be arranged in a regular or irregular pattern depending on the selection of a structured or unstructured mesh for the discretization of the original problem [1, 2]. The efficient implementation of these operations is extremely important; however, it is challenging as well, as simple implementations of the kernels typically give a performance that is only a fraction of peak [3, 4]. At the center of this performance problem is the fact that sparse operations are more bandwidth-bound than dense ones [5, 6]. Consequently, optimizations are needed, but these are highly dependent on architectural variations, even between closely related versions of the same processor. To this end, a number of optimization techniques have been proposed, such as register and cache blocking [7, 8] and column or row reordering [9]. The main contribution of this thesis is a novel approach to the efficient computation of the Sparse Matrix-Vector Multiply (SpMV) problem. To this end, a reconfigurable multiprocessor architecture has been researched and developed, capable of sustaining high peak performance by avoiding the common pitfalls that affect SpMV computation and therefore achieving better computational efficiency. This was only made possible after a thorough analysis of the storage formats and by efficiently exploiting the advantages provided by programmable logic, such as reduced costs for reading and writing to local memory, pipelining of the datapaths, etc. To complement the developed architecture, and in order to further reduce the bandwidth required to perform SpMV, a novel storage format named RCSC is also proposed. Storing the sparse matrix according to this format further reduces the bandwidth required to perform SpMV without increasing the storage space needed, keeping in line with formats known to reduce size to a minimum. The use of this format removes the irregular memory accesses that limit SpMV performance, allowing for a streamlined approach to data transfer. The removal of irregular accesses is the most important performance improvement factor for the SpMV problem, where the possibility of data reuse is very limited. The architecture is capable of receiving data via DMA streaming [10] and of equally partitioning the computation amongst the implemented Processing Elements (PEs). These are capable of producing one result per clock cycle due to their pipelined FMA [11]. Although data retrieval from external memory and computation in the PEs are fully overlapped, the unknown sparsity pattern of the matrix means that the process of writing the output vector to external memory cannot be overlapped.

The architecture was implemented on an evaluation and development board based on the Xilinx Zynq-7020 All Programmable (AP) System-on-Chip (SoC). The Processing System (PS) of the Zynq-7020 device can provide a bandwidth of 1600 MB/s, which allows an architecture composed of 4 PEs to be implemented. In terms of results, an average of 624 MFLOPS and 78.5% efficiency is attained for a dataset of 14 sparse matrices and input vectors. When compared to the ARM Cortex-A9, present in the PS side of the Zynq, execution times are improved, on average, by a factor of 4.48, while the efficiency of the ARM GPP is on average 21.8%, equivalent to 142 MFLOPS.

II. SPARSE MATRIX ALGORITHMS

A. Storing Formats

Enormous effort has been devoted to devising data storage formats with the aim of maximizing performance. This means that to fully optimize SpMV computation, a compression algorithm that takes the structure of sparse matrices into account must be chosen. The focus of this section is to briefly describe the most common compression/storage formats available. The purpose of all format variations is either to improve the architecture's ability to access data and perform computations or to reduce the total space required to store the matrix. The most common format, regardless of the processor used, is CSR, which stores the nonzeros row-wise. Another favorite for implementations in programmable logic fabric is CSC, which stores the nonzeros column-wise. The use of other storage formats depends on the processor used; for instance, ELLPACK and Sliced ELLPACK are most useful in GPU implementations due to their SIMD capabilities.

1) CSR: The Compressed Sparse Row (CSR) format stores an initial M x N sparse matrix A in row form using three one-dimensional arrays. Let nz denote the number of nonzero elements of A. The first array, VAL, is of length nz and holds the values of all the nonzero entries of A in left-to-right then top-to-bottom (row-major) order. The second array, COL_IND, contains the column index (zero-based) of each element of VAL. The third and final array, ROW_PTR, is of length M + 1 (i.e. one entry per row, plus one). ROW_PTR(i) contains the index in VAL of the first nonzero element of row i. Row i of the original matrix A therefore extends from VAL(ROW_PTR(i)) to VAL(ROW_PTR(i + 1) - 1), i.e. from the start of one row to the last index before the start of the next. The last entry in ROW_PTR (at zero-based index M) must be the number of elements in VAL (i.e. the number of nonzero elements of A). The name of this format reflects the fact that the row index information is compressed, which delivers increased efficiency in arithmetic operations and row-slicing partitioning algorithms. A depiction of the sparse matrix A of figure 1 compressed in CSR format is shown in figure 2.

2) CSC: The Compressed Sparse Column (CSC) format is analogous to CSR, except that it follows a column-major order when arranging the nonzero elements in the arrays. Instead of a column index, a row index is stored for each value, as are column pointers. This means the CSC format is composed of the vectors VAL, ROW_IND and COL_PTR, where VAL is an array of the (top-to-bottom then left-to-right) nonzero values of the sparse matrix, ROW_IND holds the row indexes corresponding to the values in VAL, and COL_PTR(j) contains the index in VAL of the first nonzero element of column j.
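As a concrete illustration of the two formats, the following C fragment shows the three CSR arrays and the three CSC arrays describing one and the same sparse matrix. The matrix is an arbitrary 4 x 4 example chosen here for illustration, not the matrix of figure 1, and the array names mirror the text above.

#include <stdio.h>

/* Illustrative 4x4 sparse matrix (not the matrix of figure 1):
 *     | 5 0 0 1 |
 * A = | 0 8 0 2 |
 *     | 0 0 3 0 |
 *     | 4 0 0 9 |
 * nz = 7 nonzero elements.
 */

/* CSR: values in row-major order, one column index per value,
 * and one row pointer per row plus a final sentinel (= nz). */
static const float csr_val[]     = {5, 1, 8, 2, 3, 4, 9};
static const int   csr_col_ind[] = {0, 3, 1, 3, 2, 0, 3};
static const int   csr_row_ptr[] = {0, 2, 4, 5, 7};        /* length M + 1 */

/* CSC: values in column-major order, one row index per value,
 * and one column pointer per column plus a final sentinel (= nz). */
static const float csc_val[]     = {5, 4, 8, 3, 1, 2, 9};
static const int   csc_row_ind[] = {0, 3, 1, 2, 0, 1, 3};
static const int   csc_col_ptr[] = {0, 2, 3, 4, 7};        /* length N + 1 */

int main(void)
{
    /* Print row 0 of A from the CSR structure. */
    for (int i = csr_row_ptr[0]; i < csr_row_ptr[1]; i++)
        printf("A[0][%d] = %g\n", csr_col_ind[i], csr_val[i]);
    /* Print column 3 of A from the CSC structure. */
    for (int i = csc_col_ptr[3]; i < csc_col_ptr[4]; i++)
        printf("A[%d][3] = %g\n", csc_row_ind[i], csc_val[i]);
    return 0;
}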
Due to its efficiency in arithmetic operations, column slicing and matrix-vector products, CSC is the traditional format for specifying a sparse matrix in MATrix LABoratory (MATLAB). A depiction of the sparse matrix A of figure 1 compressed in CSC format is shown in figure 3.

Fig. 1. Sparse matrix A
Fig. 2. Sparse matrix A in CSR format (arrays VAL, COL_IND and ROW_PTR)
Fig. 3. Sparse matrix A in CSC format (arrays VAL, ROW_IND and COL_PTR)

When comparing the two previous formats, and assuming that all vectors related to the compression format as well as vector x are stored in memory, CSC requires fewer accesses to memory than CSR, since each value of vector x is reused in the computation of every nonzero element belonging to the same column. For the CSR format, more values have to be transferred from memory for each computation. First, two values of vector ROW_PTR need to be transferred; computing ROW_PTR[i + 1] - ROW_PTR[i] yields the number of nonzero elements in the row. Then, for each (value, index) pair of nonzero elements transferred, the respective x[index] needs to be retrieved. This not only translates into a different value of vector x for each nonzero in every row, but also into delays in memory transfers and isolated accesses. The delay is caused by the necessary sequencing of the transfers of nonzero indexes and values, while the isolated transfers are caused by the values of x not being necessarily adjacent, which makes burst reading or caching of these values of little use. These problems do not occur when dealing with a sparse matrix stored in CSC format: for every nonzero value and index transferred, the value of x needed is reused across the whole column being processed. This leads to improvements when using caches and burst readings from memory, an important factor in SpMV.
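To make this memory-access argument concrete, the following C sketch contrasts a straightforward CSR kernel, which performs one irregular read of x per nonzero, with a CSC kernel, which reads each x[j] once per column and instead scatters irregular updates into y. The function and variable names are illustrative, not code from the thesis.

/* y = A*x with A stored in CSR: one indirect load x[col_ind[k]] per
 * nonzero, in an order dictated by the sparsity pattern. */
void spmv_csr(int m, const int *row_ptr, const int *col_ind,
              const float *val, const float *x, float *y)
{
    for (int i = 0; i < m; i++) {
        float acc = 0.0f;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            acc += val[k] * x[col_ind[k]];   /* irregular read of x */
        y[i] = acc;
    }
}

/* y = A*x with A stored in CSC: x[j] is loaded once and reused for every
 * nonzero of column j; the irregular accesses move to y, which an FPGA
 * implementation can keep in on-chip BRAM. */
void spmv_csc(int m, int n, const int *col_ptr, const int *row_ind,
              const float *val, const float *x, float *y)
{
    for (int i = 0; i < m; i++)
        y[i] = 0.0f;
    for (int j = 0; j < n; j++) {
        const float xj = x[j];               /* reused for the whole column */
        for (int k = col_ptr[j]; k < col_ptr[j + 1]; k++)
            y[row_ind[k]] += val[k] * xj;    /* irregular update of y */
    }
}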

B. SpMV Algorithms

Most of the previous relevant work in this area has been done on General Purpose Processors (GPPs) and GPUs, where improvements can mostly be made regarding the data stored in cache or in local GPU memory. An analysis of five FPGA-based architectures indicated that performance improvements to the SpMV problem depend on efficiently exploiting the storage formats while minimizing the communication overheads. By approaching the problem in a row-wise fashion, two of the five analyzed works [12, 13] achieved low computational efficiencies due to limitations in scaling the implemented design. Another work [14], based on the CSR format, achieved a higher efficiency at the cost of locally storing a copy of vector x in each PE, thereby eliminating indirect references to this vector. The two remaining works [15, 16] approached the problem column-wise by choosing the CSC and SPAR formats, respectively. By reducing the indirect references to vector x, the use of a column-major storage format provided a higher computational efficiency and improved the scalability of the architecture without increasing local memory requirements.

In order to develop an efficient SpMV architecture, a suitable storage format must be chosen. The choice lies between row-major, column-major and other formats. This last option includes formats that imply variable arguments, i.e. a variable block size, which is only known at runtime; a hardware implementation of an SpMV accelerator based on such formats is therefore likely to perform poorly. By discarding these variable formats, the choice is limited to row-major or column-major formats. The deciding factor is the ability of each format to minimize one of the major problems in efficient SpMV computation: indirect memory references. If a sparse matrix is stored in a row-major format, e.g. CSR, the nonzero column indexes are also stored. These are the cause of the indirect memory references, as x[index] = x_BASE_ADDRESS + index is computed for each nonzero element in the matrix, and the index value is not likely to follow any pattern. Another factor to consider is that 3 load operations precede each nonzero computation: one to retrieve the nonzero element, another for the respective index, and one for the value of x given by the nonzero index. In turn, assuming a sparse matrix stored in a column-major format, e.g. CSC, the indexes represent the row of each nonzero element. In an FPGA-based implementation, by storing the output vector y in local memory consisting of BRAMs, these irregular accesses cost one clock cycle, as opposed to several in external memory. As such, it is reasonable to say that a column-major format is the most suitable for this implementation.

The proposal herein is to modify the CSC format in which the sparse matrix is stored, such that only 2 vectors are used to represent the matrix-vector data, instead of 4. Vector x and vector VAL are merged into a single vector named VAL_X, and vectors ROW_IND and COL_PTR are merged into a single vector named IND_PTR. Given the reasons presented above, a number of improvements, such as the use of FMA units to pipeline computations, RAW hazard detection, and methods of workload division across a configurable number of PEs, are proposed, all aiming to improve the computation of SpMV. An illustrative sketch of one possible realization of the merged RCSC streams is given below.
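Since RCSC is only described at this level of detail, the sketch below shows one plausible interleaving consistent with that description and should not be read as the thesis' exact definition: for each column, VAL_X carries x[j] followed by the column's nonzero values, and IND_PTR carries the column's nonzero count followed by its row indices. All names and layout choices here are assumptions.

/* Hypothetical RCSC-style packing; the real RCSC encoding may differ.
 * VAL_X  : for each column j -> x[j], then the nonzero values of column j
 * IND_PTR: for each column j -> nnz(j), then the row indices of column j
 * Each stream holds nz + n entries, consistent with the nz + n cycles
 * assumed for the Packager in the analytical model (equation 1). */
void pack_rcsc(int n, const int *col_ptr, const int *row_ind,
               const float *val, const float *x,
               float *val_x, int *ind_ptr)
{
    int o = 0;                                     /* write offset in both streams */
    for (int j = 0; j < n; j++) {
        val_x[o]   = x[j];                         /* x entry for this column   */
        ind_ptr[o] = col_ptr[j + 1] - col_ptr[j];  /* nonzero count of column j */
        o++;
        for (int k = col_ptr[j]; k < col_ptr[j + 1]; k++, o++) {
            val_x[o]   = val[k];                   /* nonzero value */
            ind_ptr[o] = row_ind[k];               /* its row index */
        }
    }
}

/* Consuming the two streams reproduces the CSC kernel without any separate
 * accesses to x, COL_PTR or ROW_IND. */
void spmv_rcsc(int m, int n, const float *val_x, const int *ind_ptr, float *y)
{
    for (int i = 0; i < m; i++)
        y[i] = 0.0f;
    int o = 0;
    for (int j = 0; j < n; j++) {
        float xj  = val_x[o];                      /* value of x for column j */
        int   cnt = ind_ptr[o];                    /* nonzeros in this column */
        o++;
        for (int c = 0; c < cnt; c++, o++)
            y[ind_ptr[o]] += val_x[o] * xj;        /* accumulate into local y */
    }
}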
Based on these improvements and the RCSC storage format, the development of a reconfigurable architecture capable of achieving high computational efficiency and reduced execution time is detailed next.

III. ARCHITECTURE OVERVIEW

A. Hardware/Software Interface Evaluation

In order to achieve the best possible utilization of the available bandwidth and ensure an optimized SpMV accelerator, the capabilities of the target device, the Zynq-7000, had to be evaluated. The available interfaces for integrating PL-based accelerators with the PS are the AXI ACP, AXI HP and AXI GP. The AXI High Performance (HP) interface is composed of four high-performance/high-bandwidth master ports capable of 32-bit or 64-bit data transfers, with separate read/write command issuing for concurrent operations, mediated by an asynchronous 1 KB FIFO. Each of these interfaces can provide at most 1200 MB/s when 64 bits wide, just as the AXI ACP, or 600 MB/s when 32 bits wide. When using all four high performance ports, the total available throughput is 4800 MB/s in 64-bit mode or 2400 MB/s in 32-bit mode. All these throughputs assume a clock frequency of 150 MHz. Hence, using the four available High Performance ports to transfer data between an architecture implemented in the PL and external memory (i.e. DDR) provides four 32-bit words per clock cycle and an average throughput of 2538 MB/s.

B. Workload Balancing

No sparsity pattern is assumed for the input matrices; as such, data transfers and computational parallelism need to be chosen and scheduled accordingly. The most common pattern is the diagonal pattern (figure 4(a)), in which all nonzero values are located on the main diagonal and/or scattered alongside it. Matrices with three diagonals (figure 4(b)), one primary and two secondary, are also included in this category. Consequently, assigning contiguous row ranges to the PEs results in a system that is idle the majority of the time. To solve this, single rows are alternately assigned to each processor in a round-robin scheme. This guarantees a good workload balance amongst the existing processors, as evidenced in figure 5, and thus, given a good hardware implementation of each PE, it is expected that the fewest possible PEs are idle for the least amount of time. A simple software analogue of this assignment is sketched below, after figure 4.

Fig. 4. Diagonal sparse matrices cdde1 (a) and ck656 (b) from [17]
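The fragment below counts how many nonzeros each of p PEs receives when row i is assigned to PE i mod p, as in the round-robin scheme above and the balance shown in figure 5. A CSR-style row pointer array is used here purely to count nonzeros per row; all names and the toy matrix in main are illustrative.

#include <stdio.h>

/* Count the nonzeros assigned to each of p PEs when row i of the matrix
 * goes to PE (i % p), i.e. the round-robin assignment. */
void roundrobin_load(int m, const int *row_ptr, int p, long *nnz_per_pe)
{
    for (int q = 0; q < p; q++)
        nnz_per_pe[q] = 0;
    for (int i = 0; i < m; i++)
        nnz_per_pe[i % p] += row_ptr[i + 1] - row_ptr[i];
}

int main(void)
{
    /* Toy banded (diagonal-pattern) matrix: 8 rows, 2 nonzeros per row. */
    int row_ptr[] = {0, 2, 4, 6, 8, 10, 12, 14, 16};
    long load[4];
    roundrobin_load(8, row_ptr, 4, load);
    for (int q = 0; q < 4; q++)
        printf("PE%d: %ld nonzeros (%.1f%% of total)\n",
               q, load[q], 100.0 * load[q] / 16);
    return 0;
}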

Fig. 5. Nonzero assignment of a sample of 7 sparse matrices to a 4-PE system (percentage of the total nonzero elements of the respective matrix)

C. Architecture

Addressing the problems inherent to SpMV, and in conjunction with an exploration of the target device capabilities, an architecture that seeks to solve these problems was developed. This involved scheduling data transfers and computational parallelism accordingly. Through a methodical process of testing, an efficient and scalable architecture was developed, capable of processing any sparse matrix and input vector. The SpMV Accelerator (figure 6) consists of a Packager unit, which processes and distributes the matrix/vector data to a Linear Processing Array composed of p PEs connected linearly. Data is received at the Packager through two DMA units connected to two HP ports.

Fig. 6. RCSC Architecture Block Diagram

Each Processing Element is composed of an input FIFO that stores the instructions sent by the Packager, a delay block, a Fused Multiply-Add (FMA) [11] operator and local memory. Figure 7 shows a block diagram of each PE. The FMA performs single-precision floating-point operations, only rounding and normalizing the result at the final stage, in order to maximize the precision of the multiply-add sequence. The FMA is implemented as an 8-stage pipeline and can produce one result per clock cycle.

Fig. 7. Processing Element

When processing a row of the sparse matrix in a PE, a Read After Write (RAW) hazard may occur when a value is read before it has been updated in local memory. To handle this, the delay block, whose purpose is to delay the address of the local memory location to be updated, is extended to indicate whether a computation entering the FMA creates a RAW hazard. When this occurs, the PE stalls until the address is updated with the correct value. Testing of the architecture has shown that there is a limit to the number of PEs that can be fed with data by the Packager. Given the need to verify a RAW hazard before each computation, each PE requires two clock cycles per computation entering the FMA. Since the Packager issues one computation per clock cycle to the Linear Processing Array, more than two PEs result in idle processors.

The novel format (RCSC) proposed in this work requires a lower bandwidth per accelerator and therefore allows the number of parallel accelerators to be increased for the same fixed bandwidth. Given that the HP interface can provide 128 bits per clock cycle and that the developed SpMV accelerator requires 64 bits, two accelerators can be implemented in the target device. Therefore, the best partitioning scheme between accelerators was also studied. We concluded that partitioning the sparse matrix in row blocks and assigning each block to an accelerator leads to an increase in the number of RAW hazards within the blocks; another limitation is the need to pre-process the matrix to construct these blocks, resulting in both more execution time and more external memory usage. On the other hand, partitioning the matrix in column blocks requires almost no pre-processing, as RCSC is already a column-major format. The processing of each block results in a partial vector per accelerator, with the final output vector y being the accumulation of all partial vectors. The best process of accumulating these vectors was also studied: both options for the reduction circuit, an Adder Tree and a Queues Cascade, as pictured in figure 8, were considered. Given the execution times of both options, the Adder Tree was chosen as the most suitable to sum all partial vectors.
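The column-block scheme and the final reduction can be pictured with the following C sketch, which is illustrative rather than code from the thesis: each of k accelerators processes a contiguous block of columns into a partial vector, and the partial vectors are then summed pairwise in the levels an adder tree would use.

#include <string.h>

/* Each (hypothetical) accelerator processes columns [c0, c1) of a CSC
 * matrix and produces a partial output vector, mirroring the column-block
 * partitioning described above. */
static void spmv_csc_block(int m, int c0, int c1,
                           const int *col_ptr, const int *row_ind,
                           const float *val, const float *x, float *y_part)
{
    memset(y_part, 0, (size_t)m * sizeof(float));
    for (int j = c0; j < c1; j++)
        for (int k = col_ptr[j]; k < col_ptr[j + 1]; k++)
            y_part[row_ind[k]] += val[k] * x[j];
}

/* Pairwise reduction of k partial vectors in ceil(log2(k)) levels, the
 * software analogue of the adder tree of figure 8(a); k need not be a
 * power of two. */
static void adder_tree_reduce(int m, int k, float **y_part, float *y)
{
    for (int stride = 1; stride < k; stride *= 2)
        for (int a = 0; a + stride < k; a += 2 * stride)
            for (int i = 0; i < m; i++)
                y_part[a][i] += y_part[a + stride][i];
    memcpy(y, y_part[0], (size_t)m * sizeof(float));
}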
IV. PERFORMANCE EVALUATION

To verify its functionality, the architecture was implemented in the target device, the Zynq-7020. Figure 9 also includes the NEON MPE and VFPv3 cores of the Application Processing Unit (APU), as well as the timer. The NEON and the Vector Floating-Point (VFP) cores extend the functionality of the ARM processor by providing support for the Advanced SIMD and vector floating-point v3 instruction sets. Among the features of the NEON MPE core are SIMD vector and scalar single-precision floating-point computations. The supported vectors are 128 bits wide, due to the registers used, but execute 64 bits at a time. Since both Advanced SIMD and VFPv3 are implemented, they share a common register bank. The inclusion of these cores in the schematic of figure 9 is tied to the optimizations applied to the code run by the ARM processor. To compare the computational times required to

obtain both hardware and software results, care was taken to ensure that the ARM processor fully utilizes the available resources. This implies using high compiler optimization levels (flags -O3 or -Otime), using NEON and VFP SIMD instructions (flag -mfpu=neon-vfpv3), taking advantage of the L1 and L2 caches, loop unrolling, function inlining, and overall improvements to the algorithm that result in a reduced execution time.

Fig. 8. Methods of reducing vector y in column blocks for a system composed of k SpMV Accelerators: (a) Adder Tree, (b) Queues Cascade

Fig. 9. Implementation of two SpMV Accelerators

When comparing hardware and software results, several values turn out to be slightly different due to the use of a Fused Multiply-Add unit in the PEs. This is to be expected, as the FMA unit performs a single rounding at the end of the fused multiply-add, whereas the NEON unit truncates and then rounds the result. This yields a more accurate result from the accelerator than from the ARM processor, as specified in the IEEE 754 standard for floating-point arithmetic.

A. Experimental Results

In terms of resource utilization and working frequency, two cases are presented in table I: the resource utilization considering only the SpMV Accelerators (composed of two Packagers and four PEs), and the total resource utilization considering the whole system, i.e. the SpMV Accelerators, four DMA cores and the AXI Interconnects for the AXI4-FULL and AXI4-LITE interfaces, which are connected to the AXI HP and AXI GP ports, respectively. For each resource, the percentage of the total available in the device is also presented.

TABLE I
RESOURCE UTILIZATION OF TWO SPMV ACCELERATORS PLUS REDUCTION CIRCUIT AND SYSTEM OVERALL

Resource  | Accelerators + Reduction | System
Registers | 3.64%                    | 9473 (8.90%)
LUTs      | 10.30%                   | 20.12%
BRAMs     | 5.71%                    | 36 (25.71%)
DSP       | 3.64%                    | 8 (3.64%)

The ability to scale the architecture is limited by the available memory bandwidth: each additional accelerator requires an additional 64 bits per clock cycle. The number of clock cycles required to perform the sparse matrix-vector multiply on both the ARM processor (with all possible improvements) and the architecture implemented in programmable logic was measured. This measurement was performed by setting the timer to zero right before the start of the computation and retrieving its value afterwards. As such, the time required to compute a matrix was not measured directly; rather, the number of timer cycles elapsed during the computation was. Dividing this value by the frequency of the timer (355 MHz) yields the execution time. The time required to compute several sparse matrices from the University of Florida Sparse Matrix Collection [17] was measured using the process described above, resulting in the values presented in table II. Several sparse matrices were tested; the purpose of each test was to verify whether the computation performed using the developed accelerator resulted in a lower execution time than the same computation performed in the ARM processor.
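This measurement procedure reduces to a few arithmetic steps. The helper below, which is illustrative rather than code from the thesis, converts a raw timer count into execution time, MFLOPS and efficiency, counting 2 floating-point operations per nonzero and assuming a peak of 2 FLOP per PE per clock cycle; the numbers in main are made up.

#include <stdio.h>

/* Convert a raw timer count into execution time, MFLOPS and efficiency.
 * SpMV performs 2 floating-point operations (one multiply, one add) per
 * nonzero; peak is taken as 2 FLOP per PE per clock cycle. */
void report(long timer_ticks, double f_timer_mhz,
            long nz, int n_pe, double f_logic_mhz)
{
    double t_us       = timer_ticks / f_timer_mhz;   /* microseconds        */
    double mflops     = 2.0 * nz / t_us;             /* MFLOP per second    */
    double peak       = 2.0 * n_pe * f_logic_mhz;    /* peak MFLOPS         */
    double efficiency = 100.0 * mflops / peak;
    printf("%.1f us, %.1f MFLOPS, %.1f%% of peak\n", t_us, mflops, efficiency);
}

int main(void)
{
    /* Made-up example: 1,000,000 nonzeros measured over 1,135,000 ticks
     * of the 355 MHz timer, on 4 PEs clocked at 100 MHz. */
    report(1135000, 355.0, 1000000, 4, 100.0);
    return 0;
}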

To verify the ability of the accelerator to process any sparse matrix, the chosen dataset (table II) includes matrices whose parameters vary greatly: from square (N x N, e.g. bcsstm27) to rectangular matrices (M x N, e.g. Flower_5_4), with sparsity values ranging from as low as 0.02% (e.g. OPF_6000) to 23.08% (e.g. bibd_14_7). Across the matrices the number of hazards also varies greatly, from zero (e.g. bibd_14_7) to 47.46% (e.g. N_pid). Each of these parameters influences the computational time, although only the accelerators are affected by RAW hazards, as the ARM processor follows a distinct execution path. Attention was also given to the patterns present in the matrices and to the field of science each matrix belongs to, so as to include as many types as possible in the dataset. The accelerator achieves an average performance of 624 MFLOPS for a computational efficiency of 78.03%. This can be seen in figure 10, where the performance of both the ARM processor and the architecture is represented in terms of achieved MFLOPS per matrix in the dataset. The respective peak performances are represented as well, clearly showing that the ARM processor achieves a performance level well below its peak.

B. Analytical Model

An analytical model was developed in order to predict the performance of the architecture with larger sparse matrices and with more SpMV Accelerators. Given that the transfer of data from external memory and the computation in the SpMV Accelerator are performed concurrently, the execution time depends mostly on the communication performance. To develop the analytical model, each component of the implemented accelerators was modeled by determining the number of clock cycles it requires to process a nonzero element. For the Packager unit, as one clock cycle is required per nonzero and per structural element of the matrix, the total number of cycles required to process the entire sparse matrix is given by nz + n. As one instruction is sent to the PEs per nonzero element, the number of clock cycles required per nonzero element is given by equation 1:

Packager [cycles/nz] = (nz + n) / nz    (1)

The number of cycles required by the PEs to perform the computation depends on the number and distribution of hazards within the matrix. As no pattern can be assumed for the nonzero elements within a sparse matrix, a simple Bernoulli distribution is assumed for the occurrence of hazards. As such, the number of cycles required by each PE to process the nonzero elements arriving via the Packager is given by equation 2:

PE [cycles/nz] = Worst_stall * p_hazard + No_stall * (1 - p_hazard)    (2)

where p_hazard is the probability of a hazard in the sparse matrix (equal to zero when no hazards occur and equal to one when a hazard exists for every nonzero element), Worst_stall represents the highest number of clock cycles the PE needs to be stalled for the hazard to clear, and No_stall gives the minimum number of clock cycles required for the nonzero element to enter the FMA unit when no hazard is detected. The value of Worst_stall is given by the number of pipeline stages of the FMA; in this architecture this value is always 8. No_stall is 2, as one clock cycle is required to read an address from the input FIFO and another for the corresponding y element to be at the output of the local memory.
As one SpMV Accelerator is composed of one Packager and two PEs, and given that operations in the Packager and in the PEs are overlapped, the total number of clock cycles required to process a nonzero element is determined by the component with the lowest throughput, the Packager or the two PEs, as shown by equation 3:

Accelerator [cycles/nz] = max( Packager [cycles/nz] ; PE [cycles/nz] / 2 )    (3)

The estimate of the total execution time must also consider the transfer of all elements of vector y to external memory. For this, the internal DMA FIFOs, with a fixed length of 512 elements, need to be filled with the maximum burst length (256 elements) before the transfer is started. After the first burst of 256 elements, the process of filling the FIFO and transmitting the elements of vector y is performed simultaneously. With this latency accounted for, the execution time for an implementation of k accelerators can be estimated by equation 4, where m is the number of elements of vector y, f is the frequency at which the SpMV Accelerator works, in MHz, and the reduction circuit contributes a term that depends on the number of pipeline stages of the adder used (Adder_stages) times the number of reduction levels (ceil(log2(k))), as previously depicted in figure 8:

Theo_exec_time [us] = ( (nz / k) * max( Packager [cycles/nz] ; PE [cycles/nz] / 2 ) + m + Adder_stages * ceil(log2(k)) ) / f    (4)

To validate the developed model, estimated execution times were calculated for several sparse matrices using equation 4 and compared to the values measured from execution on the target device. Table III shows both the experimental and the estimated execution times (in microseconds), as well as the error between measurement and estimation.

C. Estimation for k accelerators

Assuming a target device with enough available bandwidth to feed a number k > 2 of accelerators, the analytical model can estimate the execution times for any sparse matrix. Execution times were estimated for all matrices that constitute the dataset although, in order to avoid redundancy, matrices that produced similar results are not shown in table IV.
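Equations 1 to 4 can be collected into a single estimator. The C function below is a transcription under the stated assumptions (Worst_stall = 8, No_stall = 2, two PEs per accelerator); the DMA burst-fill latency discussed above is not modelled, and the example inputs in main are made up.

#include <math.h>
#include <stdio.h>

/* Analytical model of equations 1-4.
 *   nz        : nonzero elements of the matrix
 *   n         : structural (column) elements of the matrix
 *   m         : elements of the output vector y
 *   p_hazard  : probability of a RAW hazard per nonzero (0..1)
 *   k         : number of SpMV Accelerators
 *   f_mhz     : clock frequency of the accelerators in MHz
 *   adder_stg : pipeline stages of the adder in the reduction circuit
 * Returns the estimated execution time in microseconds. */
double theo_exec_time_us(long nz, long n, long m, double p_hazard,
                         int k, double f_mhz, int adder_stg)
{
    const double worst_stall = 8.0, no_stall = 2.0;

    double packager = (double)(nz + n) / nz;                          /* eq. 1 */
    double pe = worst_stall * p_hazard + no_stall * (1.0 - p_hazard); /* eq. 2 */
    double accel = fmax(packager, pe / 2.0);                          /* eq. 3 */

    double reduction = (k > 1) ? adder_stg * ceil(log2((double)k)) : 0.0;
    return (((double)nz / k) * accel + (double)m + reduction) / f_mhz; /* eq. 4 */
}

int main(void)
{
    /* Made-up example: 2,000,000 nonzeros, 50,000 columns and rows,
     * 10% hazard probability, 2 accelerators at 100 MHz, 8-stage adder. */
    printf("estimated time: %.1f us\n",
           theo_exec_time_us(2000000, 50000, 50000, 0.10, 2, 100.0, 8));
    return 0;
}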

TABLE II
EXECUTION TIME REQUIRED TO COMPUTE SPARSE MATRICES ON BOTH THE ARM AND ONE SPMV ACCELERATOR, AND RESPECTIVE SPEEDUP AND ARCHITECTURAL EFFICIENCY
Columns: Matrix | Rows | Columns | Sparsity (%) | Hazards (%) | ARM (us) | Accelerator (us) | Speedup | Efficiency (%)
Matrices: MK9-B, MK10-B, Maragal, Flower, bibd, LP_Pilot, bcsstm, qc, big, adder_dcop, p2p_gnutella, N_pid, OPF, SiNa

Fig. 10. ARM and hardware performance, in MFLOPS, for each matrix in the dataset

TABLE III
EXPERIMENTAL RESULTS VS. MODEL PREDICTIONS
Columns: Matrix | HW exec (us) | Theo exec (us) | Error (%)
Matrices: same dataset as table II

The number of accelerators was varied between three and seven and, for comparison's sake, results from implementations with one and two accelerators are also included. Although the length of the reduction circuit increases with the number of implemented accelerators (the reduction time grows with k), the number of elements passing through this block is relatively small (n << nz), so the time spent adding all the partial results of vector y from each accelerator in the reduction circuit is a small fraction of the total computational time. As evidenced by table IV, speedup gains increase almost linearly with the number of implemented accelerators.

V. COMPARISON TO EXISTING SPMV ARCHITECTURES

In table V, the performance of the implemented architecture is compared to previous architectures. Naturally, the sparsity of the tested matrices influences the efficiency of the developed accelerator, as it does in the remaining works. Sparse matrix parameters that adversely affect the performance of the implemented accelerator include the number of rows, as it increases the size of the output vector that has to be written to external memory, the ratio of matrix columns to nonzero elements, as it influences the performance of the Packager unit, and the number of hazards within the matrix, as each one requires the PE to stall the FMA core until the hazard is resolved. All results were compared to a software-only execution on the ARM, optimized to take full advantage of the existing vector floating-point unit (128-bit SIMD) and operating at a frequency of 650 MHz, more than six times the frequency at which the programmable logic operates (100 MHz).

VI. CONCLUSION

The results show that the architecture proposed in this work is able to achieve an average of 624 MFLOPS (single-precision

floating-point). This corresponds to a performance efficiency of 79%. These figures are better than those obtained by the ARM processor, where 142 MFLOPS were measured, corresponding to 22% of peak performance. This translates into a performance improvement of 4.39 times on average. Given the predictions of the analytical model, the architecture is expected to scale with the available bandwidth. Correspondingly larger speedups over execution in a general purpose processor are to be expected, as long as enough bandwidth is available, along with the resources to implement the architecture in programmable logic.

TABLE IV
HARDWARE AND SOFTWARE EXECUTION TIMES, IN MICROSECONDS (US), AND RESPECTIVE SPEEDUP OBTAINED FOR A SYSTEM WITH k SPMV ACCELERATORS USING COLUMN BLOCKS
Columns: Number of Accelerators (k) | mk10-b4: HW, ARM, Speedup | Maragal_2: HW, ARM, Speedup | Flower_5_4: HW, ARM, Speedup | bibd_14_7: HW, ARM, Speedup

TABLE V
COMPARISON OF FPGA SPMV ARCHITECTURES
Work:          [12] | [13] | [18] | [14] | [16] | [15] | This work
FPGA:          Virtex-5 LX155T | Virtex-II Pro 70 | Stratix-III EP3E260 | Virtex-II Pro 100 | Virtex-II 6000 | Virtex-5 SX95T | Zynq XC7Z020
Matrix Format: CVBV | CSR | COO | CSR | SPAR | CSC | RCSC
Other rows: Frequency [MHz], Memory Bw. [GB/s], Number of PEs, Peak Perf. [GFLOPS], Sparsity Min-Max/Average [%], Efficiency Min-Max/Average [%]

REFERENCES

[1] C. Gfrerer (2012). Optimization Methods for Sparse Matrix-Vector Multiplication. In Seminar for Computer Science.
[2] Y. Saad (2003). Iterative Methods for Sparse Linear Systems. SIAM.
[3] R. W. Vuduc and H.-J. Moon (2005). Fast sparse matrix-vector multiplication by exploiting variable block structure. In Proceedings of the High-Performance Computing and Communications Conference (HPCC 2005).
[4] A. Buttari, V. Eijkhout, J. Langou and S. Filippone (2005). Performance Optimization and Modeling of Blocked Sparse Kernels. Technical Report ICL-UT.
[5] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick and J. Demmel (2007). Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. In Supercomputing.
[6] J. D. Davis and E. S. Chung (2012). SpMV: A Memory-Bound Application on the GPU Stuck Between a Rock and a Hard Place. Microsoft Technical Report.
[7] E.-J. Im and K. Yelick (1999). Optimizing Sparse Matrix Vector Multiplication on SMPs. In Ninth SIAM Conference on Parallel Processing for Scientific Computing (SIAM 1999).
[8] E.-J. Im and K. Yelick (1997). Optimizing sparse matrix computations for register reuse in SPARSITY. In Proceedings of the International Conference on Computational Sciences - Part I.
[9] A. Pinar and M. T. Heath (1999). Improving performance of sparse matrix-vector multiplication. In Supercomputing.
[10] Wilson M. José, Ana Rita Silva, Mário P. Véstias and Horácio C. Neto (2014). Algorithm-oriented design of efficient many-core architectures applied to dense matrix multiplication. In Analog Integrated Circuits and Signal Processing.
[11] Mário P. Véstias (2014). DReaMaCHine - Design of a Reconfigurable Many-Core Architecture for High Performance Computing. Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento em Lisboa (INESC-ID/INESC/IST/ULisboa).

[12] S. Kestur, J. D. Davis and E. S. Chung (2012). Towards a Universal FPGA Matrix-Vector Multiplication Architecture. In International Symposium on Field-Programmable Custom Computing Machines (FCCM 2012). IEEE.
[13] L. Zhuo and V. K. Prasanna (2005). Sparse Matrix-Vector Multiplication on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2005).
[14] Yan Zhang, Y. H. Shalabi, R. Jain, K. K. Nagar and J. D. Bakos (2009). FPGA vs. GPU for sparse matrix vector multiply. In International Conference on Field-Programmable Technology (FPT 2009).
[15] R. Dorrance, F. Ren and D. Markovic (2014). A Scalable Sparse Matrix-Vector Multiplication Kernel for Energy-Efficient Sparse-BLAS on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2014).
[16] D. Gregg, C. McSweeney, C. McElroy, F. Connor, S. McGettrick, D. Moloney and D. Geraghty (2007). FPGA Based Sparse Matrix Vector Multiplication using Commodity DRAM Memory. In International Conference on Field Programmable Logic and Applications (FPL 2007).
[17] T. Davis. The University of Florida Sparse Matrix Collection. Available: http://www.cise.ufl.edu/research/sparse/matrices.
[18] S. Sun, M. Monga, P. H. Jones and J. Zambreno (2012). An I/O Bandwidth-Sensitive Sparse Matrix-Vector Multiplication Engine on FPGAs. In IEEE Transactions on Circuits and Systems I, vol. 59, no. 1. IEEE.


More information

ECE 5775 Student-Led Discussions (10/16)

ECE 5775 Student-Led Discussions (10/16) ECE 5775 Student-Led Discussions (10/16) Talks: 18-min talk + 2-min Q&A Adam Macioszek, Julia Currie, Nick Sarkis Sparse Matrix Vector Multiplication Nick Comly, Felipe Fortuna, Mark Li, Serena Krech Matrix

More information

Zynq-7000 All Programmable SoC Product Overview

Zynq-7000 All Programmable SoC Product Overview Zynq-7000 All Programmable SoC Product Overview The SW, HW and IO Programmable Platform August 2012 Copyright 2012 2009 Xilinx Introducing the Zynq -7000 All Programmable SoC Breakthrough Processing Platform

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA

Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA Da Tong, Viktor Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Email: {datong, prasanna}@usc.edu

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Adaptive Scientific Software Libraries

Adaptive Scientific Software Libraries Adaptive Scientific Software Libraries Lennart Johnsson Advanced Computing Research Laboratory Department of Computer Science University of Houston Challenges Diversity of execution environments Growing

More information

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Parallel FIR Filters. Chapter 5

Parallel FIR Filters. Chapter 5 Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. ecause the Virtex-4 architecture

More information

Optimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015.

Optimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015. Optimizing HW/SW Partition of a Complex Embedded Systems Simon George November 2015 Zynq-7000 All Programmable SoC HP ACP GP Page 2 Zynq UltraScale+ MPSoC Page 3 HW/SW Optimization Challenges application()

More information

Improving Real-Time Performance on Multicore Platforms Using MemGuard

Improving Real-Time Performance on Multicore Platforms Using MemGuard Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study

More information

Architectures of Flynn s taxonomy -- A Comparison of Methods

Architectures of Flynn s taxonomy -- A Comparison of Methods Architectures of Flynn s taxonomy -- A Comparison of Methods Neha K. Shinde Student, Department of Electronic Engineering, J D College of Engineering and Management, RTM Nagpur University, Maharashtra,

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Abstract Deep Convolutional Neural Networks (DCNN) have proven to be very effective in many pattern recognition applications, such

More information

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH Key words: Digital Signal Processing, FIR filters, SIMD processors, AltiVec. Grzegorz KRASZEWSKI Białystok Technical University Department of Electrical Engineering Wiejska

More information

Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform

Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform J Supercomput (2013) 63:710 721 DOI 10.1007/s11227-011-0626-0 Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform Shiming Xu Wei Xue Hai Xiang Lin Published

More information

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT JOSEPH L. GREATHOUSE, MAYANK DAGA AMD RESEARCH 11/20/2014 THIS TALK IN ONE SLIDE Demonstrate how to save space and time

More information

Composite Metrics for System Throughput in HPC

Composite Metrics for System Throughput in HPC Composite Metrics for System Throughput in HPC John D. McCalpin, Ph.D. IBM Corporation Austin, TX SuperComputing 2003 Phoenix, AZ November 18, 2003 Overview The HPC Challenge Benchmark was announced last

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video

A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video Amrita Mazumdar Armin Alaghi Jonathan T. Barron David Gallup Luis Ceze Mark Oskin Steven M. Seitz University of Washington Google

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Bandwidth Avoiding Stencil Computations

Bandwidth Avoiding Stencil Computations Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu

More information

Mark Sandstrom ThroughPuter, Inc.

Mark Sandstrom ThroughPuter, Inc. Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom

More information

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography 1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography

More information

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 59, NO. 1, JANUARY

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 59, NO. 1, JANUARY IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 59, NO. 1, JANUARY 2012 113 An I/O Bandwidth-Sensitive Sparse Matrix-Vector Multiplication Engine on FPGAs Song Sun, Madhu Monga, Phillip

More information

Efficient Self-Reconfigurable Implementations Using On-Chip Memory

Efficient Self-Reconfigurable Implementations Using On-Chip Memory 10th International Conference on Field Programmable Logic and Applications, August 2000. Efficient Self-Reconfigurable Implementations Using On-Chip Memory Sameer Wadhwa and Andreas Dandalis University

More information

FPGA Provides Speedy Data Compression for Hyperspectral Imagery

FPGA Provides Speedy Data Compression for Hyperspectral Imagery FPGA Provides Speedy Data Compression for Hyperspectral Imagery Engineers implement the Fast Lossless compression algorithm on a Virtex-5 FPGA; this implementation provides the ability to keep up with

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory

Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory Jongsoo Park, Parallel Computing Lab, Intel Corporation with contributions from MKL team 1 Algorithm/

More information

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently

More information

Near Memory Key/Value Lookup Acceleration MemSys 2017

Near Memory Key/Value Lookup Acceleration MemSys 2017 Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy

More information

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,

More information

Sparse Linear Solver for Power System Analyis using FPGA

Sparse Linear Solver for Power System Analyis using FPGA Sparse Linear Solver for Power System Analyis using FPGA J. R. Johnson P. Nagvajara C. Nwankpa 1 Extended Abstract Load flow computation and contingency analysis is the foundation of power system analysis.

More information

Unstructured Finite Element Computations on. Configurable Computers

Unstructured Finite Element Computations on. Configurable Computers Unstructured Finite Element Computations on Configurable Computers by Karthik Ramachandran Thesis submitted to the Faculty of Virginia Polytechnic Institute and State University in partial fulfillment

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Techniques for Optimizing FEM/MoM Codes

Techniques for Optimizing FEM/MoM Codes Techniques for Optimizing FEM/MoM Codes Y. Ji, T. H. Hubing, and H. Wang Electromagnetic Compatibility Laboratory Department of Electrical & Computer Engineering University of Missouri-Rolla Rolla, MO

More information