FPGA Multi-Processor for Sparse Matrix Applications


FPGA Multi-Processor for Sparse Matrix Applications
João Pinhão, IST

Abstract: In this thesis a novel, computationally efficient multi-core architecture for solving the sparse matrix-vector multiply (SpMV) problem in FPGA is proposed. The efficient implementation of SpMV is challenging, as simple implementations of the kernel typically deliver only a fraction of peak performance. At the center of this problem is the fact that sparse operations are more bandwidth-bound than dense ones. Previous work on the subject suggests that the use of FPGAs for solving SpMV can improve performance when compared to General Purpose Processors (GPPs), thereby improving total execution times when solving problems largely dependent on SpMV. As such, in this work the existing sparse matrix compression/storage formats are analyzed, their suitability for an efficient implementation is assessed, and a multi-processor architecture is proposed, capable of making better use of the available bandwidth in order to achieve higher levels of performance. Through extensive experimentation on a wide dataset, the architecture exhibits the ability to outperform GPPs both in terms of peak and average performance. Given the target platform (Zynq), the performance of the developed architecture was compared to that of the ARM Cortex-A9 GPP present in the platform. The architecture achieves an average of 78.5% of peak performance (624 MFLOPS) and reaches 98.2% of peak in the best case. The ARM processor, on the other hand, attained an average of 142 MFLOPS, equivalent to 21.8% of its peak performance, reaching at most 31.4% of peak in one test. This translates into a performance improvement averaging 4.48 times the performance of the ARM processor tested.

Index Terms: Sparse Matrices, FPGA, Multi-Core Architecture

I. INTRODUCTION

Sparse linear algebra computations, such as the matrix-vector product or the solution of sparse linear systems, are often the bottleneck in many scientific fields, from computational fluid dynamics to structural engineering, electromagnetic analysis or even the study of economic models, to name a few. The task of computation regularly falls to CPUs and, due to the evolution in this field, performance is improving. However, obtaining peak performance from modern cache-based computational systems has proven to be extremely difficult. Several factors contribute to this low performance, such as the underlying machine architecture, memory access behavior, compiler technology and the nature of the input matrix, which may only be known at runtime. In sparse matrices, the fraction of nonzero elements is small when compared to the total number of elements. While it is possible to use generic data structures and routines to perform computations with such matrices, this is inefficient, as most calculations on zero elements are redundant, and sometimes even impractical due to the large dimensions of these matrices. In practice, sparse matrices are stored using specialized data structures that only keep the nonzero values, along with additional information regarding the position of each one in the matrix, so that the total size is proportional to the number of nonzero elements. As an example of applicability in most of the fields named above, solving a partial differential equation with the finite elements method boils down to solving a system of linear equations of the form y = Ax, where y and x are vectors and A is a large matrix that is mostly composed of zero entries.
The nonzero elements of A would be arranged in a regular or irregular pattern depending on the selection of a structured or unstructured mesh for the discretization of the original problem [1, 2]. The efficient implementation of these operations is extremely important; however, it is challenging as well, as simple implementations of the kernels typically give a performance that is only a fraction of peak [3, 4]. At the center of this performance problem is the fact that sparse operations are more bandwidth-bound than dense ones [5, 6]. Consequently, optimizations are needed, but these are highly dependent on architectural variations, even between closely related versions of the same processor. To this end, a number of optimization techniques have been proposed, such as register and cache blocking [7, 8] and column or row reordering [9]. The main contribution of this thesis is a novel approach to the efficient computation of the Sparse Matrix-Vector Multiply (SpMV) problem. To this end, a reconfigurable multiprocessor architecture has been researched and developed, capable of sustaining high peak performance by avoiding the common pitfalls that affect SpMV computation and therefore achieving better computational efficiency. This was only made possible after a thorough analysis of the storage formats and by efficiently exploiting the advantages provided by programmable logic, such as reduced costs for reading and writing to local memory, pipelining of the datapaths, etc. To complement the developed architecture, and in order to further reduce the bandwidth required to perform SpMV, a novel storage format named RCSC is also proposed. Storing the sparse matrix according to this format further reduces the bandwidth required to perform SpMV without increasing the storage space needed, keeping in line with formats known to reduce size to a minimum. The use of this format removes the irregular memory accesses that limit SpMV performance, allowing for a streamlined approach to data transfer. The removal of irregular accesses is the most important performance improvement factor for the SpMV problem, where the possibility of data reuse is very limited. The architecture is capable of receiving data via DMA streaming [10] and of equally partitioning the computation amongst the implemented Processing Elements (PEs). These are capable of producing one result per clock cycle due to their pipelined FMA [11]. Although data retrieval from external memory and computation in the PEs are fully overlapped, the unknown sparsity pattern of the matrix means that the process of writing the output vector to external memory cannot be overlapped.

The architecture was implemented on an evaluation and development board based on the Xilinx Zynq-7020 All Programmable (AP) System-on-Chip (SoC). The Processing System (PS) of the Zynq-7020 device can provide a bandwidth of 1600 MB/s, which allows an architecture composed of 4 PEs to be implemented. In terms of results, an average of 624 MFLOPS and 78.5% efficiency is attained for a dataset of 14 sparse matrices and input vectors. When compared to the ARM Cortex-A9, present in the PS side of the Zynq, execution times are improved, on average, by a factor of 4.48, while the efficiency of the ARM GPP is on average 21.8%, equivalent to 142 MFLOPS.

II. SPARSE MATRIX ALGORITHMS

A. Storing Formats

Enormous effort has been devoted to devising data storage formats with the aim of maximizing performance. This means that to fully optimize SpMV computation, a compression algorithm that takes the structure of sparse matrices into account must be chosen. The focus of this section is to briefly describe the most common compression/storage formats available. The purpose of all format variations is either to improve the architecture's ability to access data and perform computations or to reduce the total space required to store the matrix. The most common format, regardless of the processor used, is CSR, which stores the nonzeros row-wise. Another favorite for implementations in programmable logic fabric is CSC, which stores the nonzeros column-wise. The use of other storage formats depends on the processor used; for instance, ELLPACK and Sliced ELLPACK are most useful in GPU implementations due to their SIMD capabilities.

1) CSR: The Compressed Sparse Row (CSR) format stores an initial M x N sparse matrix A in row form using three one-dimensional arrays. Let nz denote the number of nonzero elements of A. The first array, VAL, is of length nz and holds the values of all the nonzero entries of A in left-to-right then top-to-bottom (row-major) order. The second array, COL_IND, contains the column index (zero-based) of each element of VAL. The third and final array, ROW_PTR, is of length M + 1 (i.e. one entry per row, plus one). ROW_PTR(i) contains the index in VAL of the first nonzero element of row i. Row i of the original matrix A therefore extends from VAL(ROW_PTR(i)) to VAL(ROW_PTR(i + 1) - 1), i.e. from the start of one row to the last index before the start of the next. The last entry in ROW_PTR (at zero-based index M) must be the number of elements in VAL (i.e. the number of nonzero elements of A). The name of this format reflects the fact that the row index information is compressed, which delivers increased efficiency in arithmetic operations and row-slicing partitioning algorithms. A depiction of the sparse matrix A of figure 1 compressed in CSR format is shown in figure 2.

2) CSC: The Compressed Sparse Column (CSC) format is analogous to CSR, except that it follows a column-major order when arranging the nonzero elements in the arrays. Instead of a column index, a row index is stored for each value, as are column pointers. This means the CSC format is composed of the vectors VAL, ROW_IND and COL_PTR, where VAL is an array of the (top-to-bottom then left-to-right) nonzero values of the sparse matrix, ROW_IND holds the row indexes corresponding to the values in VAL, and COL_PTR(j) contains the index in VAL of the first nonzero element of column j.
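As a concrete illustration of the two formats, the following C fragment shows the three CSR arrays and the three CSC arrays describing one and the same sparse matrix. The matrix is an arbitrary 4 x 4 example chosen here for illustration, not the matrix of figure 1, and the array names mirror the text above.

#include <stdio.h>

/* Illustrative 4x4 sparse matrix (not the matrix of figure 1):
 *     | 5 0 0 1 |
 * A = | 0 8 0 2 |
 *     | 0 0 3 0 |
 *     | 4 0 0 9 |
 * nz = 7 nonzero elements.
 */

/* CSR: values in row-major order, one column index per value,
 * and one row pointer per row plus a final sentinel (= nz). */
static const float csr_val[]     = {5, 1, 8, 2, 3, 4, 9};
static const int   csr_col_ind[] = {0, 3, 1, 3, 2, 0, 3};
static const int   csr_row_ptr[] = {0, 2, 4, 5, 7};        /* length M + 1 */

/* CSC: values in column-major order, one row index per value,
 * and one column pointer per column plus a final sentinel (= nz). */
static const float csc_val[]     = {5, 4, 8, 3, 1, 2, 9};
static const int   csc_row_ind[] = {0, 3, 1, 2, 0, 1, 3};
static const int   csc_col_ptr[] = {0, 2, 3, 4, 7};        /* length N + 1 */

int main(void)
{
    /* Print row 0 of A from the CSR structure. */
    for (int i = csr_row_ptr[0]; i < csr_row_ptr[1]; i++)
        printf("A[0][%d] = %g\n", csr_col_ind[i], csr_val[i]);
    /* Print column 3 of A from the CSC structure. */
    for (int i = csc_col_ptr[3]; i < csc_col_ptr[4]; i++)
        printf("A[%d][3] = %g\n", csc_row_ind[i], csc_val[i]);
    return 0;
}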
Due to its efficiency in arithmetic operations, column slicing and matrix-vector products, CSC is the traditional format for specifying a sparse matrix in MATrix LABoratory (MATLAB). A depiction of the sparse matrix A of figure 1 compressed in CSC format is shown in figure 3.

Fig. 1. Sparse matrix A
Fig. 2. Sparse matrix A in CSR format (arrays VAL, COL_IND and ROW_PTR)
Fig. 3. Sparse matrix A in CSC format (arrays VAL, ROW_IND and COL_PTR)

When comparing the two previous formats, and assuming that all vectors related to the compression format as well as vector x are stored in memory, CSC requires fewer accesses to memory than CSR, since each value of vector x is reused in the computation of every nonzero element belonging to the same column. For the CSR format, more values have to be transferred from memory for each computation. First, two values of vector ROW_PTR need to be transferred; computing ROW_PTR[i + 1] - ROW_PTR[i] yields the number of nonzero elements in the row. Then, for each (value, index) pair of nonzero elements transferred, the respective x[index] needs to be retrieved. This not only translates into a different value of vector x for each nonzero in every row, but also into delays in memory transfers and isolated accesses. The delay is caused by the necessary sequencing of the transfers of nonzero indexes and values, while the isolated transfers are caused by the values of x not being necessarily adjacent, which makes burst reading or caching of these values of little use. These problems do not occur when dealing with a sparse matrix stored in CSC format: for every nonzero value and index transferred, the value of x needed is reused across the whole column being processed. This leads to improvements when using caches and burst readings from memory, an important factor in SpMV.
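To make this memory-access argument concrete, the following C sketch contrasts a straightforward CSR kernel, which performs one irregular read of x per nonzero, with a CSC kernel, which reads each x[j] once per column and instead scatters irregular updates into y. The function and variable names are illustrative, not code from the thesis.

/* y = A*x with A stored in CSR: one indirect load x[col_ind[k]] per
 * nonzero, in an order dictated by the sparsity pattern. */
void spmv_csr(int m, const int *row_ptr, const int *col_ind,
              const float *val, const float *x, float *y)
{
    for (int i = 0; i < m; i++) {
        float acc = 0.0f;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            acc += val[k] * x[col_ind[k]];   /* irregular read of x */
        y[i] = acc;
    }
}

/* y = A*x with A stored in CSC: x[j] is loaded once and reused for every
 * nonzero of column j; the irregular accesses move to y, which an FPGA
 * implementation can keep in on-chip BRAM. */
void spmv_csc(int m, int n, const int *col_ptr, const int *row_ind,
              const float *val, const float *x, float *y)
{
    for (int i = 0; i < m; i++)
        y[i] = 0.0f;
    for (int j = 0; j < n; j++) {
        const float xj = x[j];               /* reused for the whole column */
        for (int k = col_ptr[j]; k < col_ptr[j + 1]; k++)
            y[row_ind[k]] += val[k] * xj;    /* irregular update of y */
    }
}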

B. SpMV Algorithms

Most of the previous relevant work in this area has been done on General Purpose Processors (GPPs) and GPUs, where improvements can mostly be made regarding the data stored in cache or in local GPU memory. An analysis of five FPGA-based architectures indicated that performance improvements to the SpMV problem depend on efficiently exploiting the storage formats while minimizing the communication overheads. By approaching the problem in a row-wise fashion, two of the five analyzed works [12, 13] achieved low computational efficiencies due to limitations in scaling the implemented design. Another work [14], based on the CSR format, achieved a higher efficiency at the cost of locally storing a copy of vector x in each PE, thereby eliminating indirect references to this vector. The two remaining works [15, 16] approached the problem column-wise by choosing the CSC and SPAR formats, respectively. By reducing the indirect references to vector x, the use of a column-major storage format provided a higher computational efficiency and improved the scalability of the architecture without increasing local memory requirements.

In order to develop an efficient SpMV architecture, a suitable storage format must be chosen. The choice lies between row-major, column-major and other formats. This last option includes formats that imply variable arguments, i.e. a variable block size, which is only known at runtime; a hardware implementation of an SpMV accelerator based on such formats is therefore likely to perform poorly. By discarding these variable formats, the choice is limited to row-major or column-major formats. The deciding factor is the ability of each format to minimize one of the major problems in efficient SpMV computation: indirect memory references. If a sparse matrix is stored in a row-major format, e.g. CSR, the nonzero column indexes are also stored. These are the cause of the indirect memory references, as x[index] = x_BASE_ADDRESS + index is computed for each nonzero element in the matrix, and the index value is not likely to follow any pattern. Another factor to consider is that 3 load operations precede each nonzero computation: one to retrieve the nonzero element, another for the respective index, and one for the value of x given by the nonzero index. In turn, assuming a sparse matrix stored in a column-major format, e.g. CSC, the indexes represent the row of each nonzero element. In an FPGA-based implementation, by storing the output vector y in local memory consisting of BRAMs, these irregular accesses cost one clock cycle, as opposed to several in external memory. As such, it is reasonable to say that a column-major format is the most suitable for this implementation.

The proposal herein is to modify the CSC format in which the sparse matrix is stored, such that only 2 vectors are used to represent the matrix-vector data, instead of 4. Vector x and vector VAL are merged into a single vector named VAL_X, and vectors ROW_IND and COL_PTR are merged into a single vector named IND_PTR. Given the reasons presented above, a number of improvements, such as the use of FMA units to pipeline computations, RAW hazard detection, and methods of workload division across a configurable number of PEs, are proposed, all aiming to improve the computation of SpMV. An illustrative sketch of one possible realization of the merged RCSC streams is given below.
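Since RCSC is only described at this level of detail, the sketch below shows one plausible interleaving consistent with that description and should not be read as the thesis' exact definition: for each column, VAL_X carries x[j] followed by the column's nonzero values, and IND_PTR carries the column's nonzero count followed by its row indices. All names and layout choices here are assumptions.

/* Hypothetical RCSC-style packing; the real RCSC encoding may differ.
 * VAL_X  : for each column j -> x[j], then the nonzero values of column j
 * IND_PTR: for each column j -> nnz(j), then the row indices of column j
 * Each stream holds nz + n entries, consistent with the nz + n cycles
 * assumed for the Packager in the analytical model (equation 1). */
void pack_rcsc(int n, const int *col_ptr, const int *row_ind,
               const float *val, const float *x,
               float *val_x, int *ind_ptr)
{
    int o = 0;                                     /* write offset in both streams */
    for (int j = 0; j < n; j++) {
        val_x[o]   = x[j];                         /* x entry for this column   */
        ind_ptr[o] = col_ptr[j + 1] - col_ptr[j];  /* nonzero count of column j */
        o++;
        for (int k = col_ptr[j]; k < col_ptr[j + 1]; k++, o++) {
            val_x[o]   = val[k];                   /* nonzero value */
            ind_ptr[o] = row_ind[k];               /* its row index */
        }
    }
}

/* Consuming the two streams reproduces the CSC kernel without any separate
 * accesses to x, COL_PTR or ROW_IND. */
void spmv_rcsc(int m, int n, const float *val_x, const int *ind_ptr, float *y)
{
    for (int i = 0; i < m; i++)
        y[i] = 0.0f;
    int o = 0;
    for (int j = 0; j < n; j++) {
        float xj  = val_x[o];                      /* value of x for column j */
        int   cnt = ind_ptr[o];                    /* nonzeros in this column */
        o++;
        for (int c = 0; c < cnt; c++, o++)
            y[ind_ptr[o]] += val_x[o] * xj;        /* accumulate into local y */
    }
}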
Based on these improvements and the RCSC storage format, the development of a reconfigurable architecture capable of achieving high computational efficiency and reduced execution time is detailed next.

III. ARCHITECTURE OVERVIEW

A. Hardware/Software Interface Evaluation

In order to achieve the best possible utilization of the available bandwidth and ensure an optimized SpMV accelerator, the capabilities of the target device, the Zynq-7000, had to be evaluated. The available interfaces for integrating PL-based accelerators with the PS are the AXI ACP, AXI HP and AXI GP. The AXI High Performance (HP) interface is composed of four high-performance/high-bandwidth master ports capable of 32-bit or 64-bit data transfers, with separate read/write command issuing for concurrent operations, mediated by an asynchronous 1 KB FIFO. Each of these interfaces can provide at most 1200 MB/s when 64 bits wide, just as the AXI ACP, or 600 MB/s when 32 bits wide. When using all four high performance ports, the total available throughput is 4800 MB/s in 64-bit mode or 2400 MB/s in 32-bit mode. All these throughputs assume a clock frequency of 150 MHz. Hence, using the four available High Performance ports to transfer data between an architecture implemented in the PL and external memory (i.e. DDR) provides four 32-bit words per clock cycle and an average throughput of 2538 MB/s.

B. Workload Balancing

No sparsity pattern is assumed for the input matrices; as such, data transfers and computational parallelism need to be chosen and scheduled accordingly. The most common pattern is the diagonal pattern (figure 4(a)), in which all nonzero values are located on the main diagonal and/or scattered alongside it. Matrices with three diagonals (figure 4(b)), one primary and two secondary, are also included in this category. Consequently, assigning contiguous row ranges to the PEs results in a system that is idle the majority of the time. To solve this, single rows are alternately assigned to each processor in a round-robin scheme. This guarantees a good workload balance amongst the existing processors, as evidenced in figure 5, and thus, given a good hardware implementation of each PE, it is expected that the fewest possible PEs are idle for the least amount of time. A simple software analogue of this assignment is sketched below, after figure 4.

Fig. 4. Diagonal sparse matrices cdde1 (a) and ck656 (b) from [17]
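The fragment below counts how many nonzeros each of p PEs receives when row i is assigned to PE i mod p, as in the round-robin scheme above and the balance shown in figure 5. A CSR-style row pointer array is used here purely to count nonzeros per row; all names and the toy matrix in main are illustrative.

#include <stdio.h>

/* Count the nonzeros assigned to each of p PEs when row i of the matrix
 * goes to PE (i % p), i.e. the round-robin assignment. */
void roundrobin_load(int m, const int *row_ptr, int p, long *nnz_per_pe)
{
    for (int q = 0; q < p; q++)
        nnz_per_pe[q] = 0;
    for (int i = 0; i < m; i++)
        nnz_per_pe[i % p] += row_ptr[i + 1] - row_ptr[i];
}

int main(void)
{
    /* Toy banded (diagonal-pattern) matrix: 8 rows, 2 nonzeros per row. */
    int row_ptr[] = {0, 2, 4, 6, 8, 10, 12, 14, 16};
    long load[4];
    roundrobin_load(8, row_ptr, 4, load);
    for (int q = 0; q < 4; q++)
        printf("PE%d: %ld nonzeros (%.1f%% of total)\n",
               q, load[q], 100.0 * load[q] / 16);
    return 0;
}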

Fig. 5. Nonzero assignment of a sample of 7 sparse matrices to a 4-PE system (percentage of the total nonzero elements of the respective matrix)

C. Architecture

Addressing the problems inherent to SpMV, and in conjunction with an exploration of the target device capabilities, an architecture that seeks to solve these problems was developed. This involved scheduling data transfers and computational parallelism accordingly. Through a methodical process of testing, an efficient and scalable architecture was developed, capable of processing any sparse matrix and input vector. The SpMV Accelerator (figure 6) consists of a Packager unit, which processes and distributes the matrix/vector data to a Linear Processing Array composed of p PEs connected linearly. Data is received at the Packager through two DMA units connected to two HP ports.

Fig. 6. RCSC Architecture Block Diagram

Each Processing Element is composed of an input FIFO that stores the instructions sent by the Packager, a delay block, a Fused Multiply-Add (FMA) [11] operator and local memory. Figure 7 shows a block diagram of each PE. The FMA performs single-precision floating-point operations, only rounding and normalizing the result at the final stage, in order to maximize the precision of the multiply-add sequence. The FMA is implemented as an 8-stage pipeline and can produce one result per clock cycle.

Fig. 7. Processing Element

When processing a row of the sparse matrix in a PE, a Read After Write (RAW) hazard may occur when a value is read before it has been updated in local memory. To handle this, the delay block, whose purpose is to delay the address of the local memory location to be updated, is extended to indicate whether a computation entering the FMA creates a RAW hazard. When this occurs, the PE stalls until the address is updated with the correct value. Testing of the architecture has shown that there is a limit to the number of PEs that can be fed with data by the Packager. Given the need to verify a RAW hazard before each computation, each PE requires two clock cycles per computation entering the FMA. Since the Packager issues one computation per clock cycle to the Linear Processing Array, more than two PEs result in idle processors.

The novel format (RCSC) proposed in this work requires a lower bandwidth per accelerator and therefore allows the number of parallel accelerators to be increased for the same fixed bandwidth. Given that the HP interface can provide 128 bits per clock cycle and that the developed SpMV accelerator requires 64 bits, two accelerators can be implemented in the target device. Therefore, the best partitioning scheme between accelerators was also studied. We concluded that partitioning the sparse matrix in row blocks and assigning each block to an accelerator leads to an increase in the number of RAW hazards within the blocks; another limitation is the need to pre-process the matrix to construct these blocks, resulting in both more execution time and more external memory usage. On the other hand, partitioning the matrix in column blocks requires almost no pre-processing, as RCSC is already a column-major format. The processing of each block results in a partial vector per accelerator, with the final output vector y being the accumulation of all partial vectors. The best process of accumulating these vectors was also studied: both options for the reduction circuit, an Adder Tree and a Queues Cascade, as pictured in figure 8, were considered. Given the execution times of both options, the Adder Tree was chosen as the most suitable to sum all partial vectors.
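The column-block scheme and the final reduction can be pictured with the following C sketch, which is illustrative rather than code from the thesis: each of k accelerators processes a contiguous block of columns into a partial vector, and the partial vectors are then summed pairwise in the levels an adder tree would use.

#include <string.h>

/* Each (hypothetical) accelerator processes columns [c0, c1) of a CSC
 * matrix and produces a partial output vector, mirroring the column-block
 * partitioning described above. */
static void spmv_csc_block(int m, int c0, int c1,
                           const int *col_ptr, const int *row_ind,
                           const float *val, const float *x, float *y_part)
{
    memset(y_part, 0, (size_t)m * sizeof(float));
    for (int j = c0; j < c1; j++)
        for (int k = col_ptr[j]; k < col_ptr[j + 1]; k++)
            y_part[row_ind[k]] += val[k] * x[j];
}

/* Pairwise reduction of k partial vectors in ceil(log2(k)) levels, the
 * software analogue of the adder tree of figure 8(a); k need not be a
 * power of two. */
static void adder_tree_reduce(int m, int k, float **y_part, float *y)
{
    for (int stride = 1; stride < k; stride *= 2)
        for (int a = 0; a + stride < k; a += 2 * stride)
            for (int i = 0; i < m; i++)
                y_part[a][i] += y_part[a + stride][i];
    memcpy(y, y_part[0], (size_t)m * sizeof(float));
}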
IV. PERFORMANCE EVALUATION

To verify its functionality, the architecture was implemented in the target device, the Zynq-7020. Figure 9 also includes the NEON MPE and VFPv3 cores of the Application Processing Unit (APU), as well as the timer. The NEON and the Vector Floating-Point (VFP) cores extend the functionality of the ARM processor by providing support for the Advanced SIMD and vector floating-point v3 instruction sets. Among the features of the NEON MPE core are SIMD vector and scalar single-precision floating-point computations. The supported vectors are 128 bits wide, due to the registers used, but execute 64 bits at a time. Since both Advanced SIMD and VFPv3 are implemented, they share a common register bank. The inclusion of these cores in the schematic of figure 9 is tied to the optimizations applied to the code run by the ARM processor. To compare the computational times required to

obtain both hardware and software results, care was taken to ensure that the ARM processor fully utilizes the available resources. This implies using high compiler optimization levels (flags -O3 or -Otime), using NEON and VFP SIMD instructions (flag -mfpu=neon-vfpv3), taking advantage of the L1 and L2 caches, loop unrolling, function inlining, and overall improvements to the algorithm that result in a reduced execution time.

Fig. 8. Methods of reducing vector y in column blocks for a system composed of k SpMV Accelerators: (a) Adder Tree, (b) Queues Cascade

Fig. 9. Implementation of two SpMV Accelerators

When comparing hardware and software results, several values turn out to be slightly different due to the use of a Fused Multiply-Add unit in the PEs. This is to be expected, as the FMA unit performs a single rounding at the end of the fused multiply-add, whereas the NEON unit truncates and then rounds the result. This yields a more accurate result from the accelerator than from the ARM processor, as specified in the IEEE 754 standard for floating-point arithmetic.

A. Experimental Results

In terms of resource utilization and working frequency, two cases are presented in table I: the resource utilization considering only the SpMV Accelerators (composed of two Packagers and four PEs), and the total resource utilization considering the whole system, i.e. the SpMV Accelerators, four DMA cores and the AXI Interconnects for the AXI4-FULL and AXI4-LITE interfaces, which are connected to the AXI HP and AXI GP ports, respectively. For each resource, the percentage of the total available in the device is also presented.

TABLE I
RESOURCE UTILIZATION OF TWO SPMV ACCELERATORS PLUS REDUCTION CIRCUIT AND SYSTEM OVERALL

Resource  | Accelerators + Reduction | System
Registers | 3.64%                    | 9473 (8.90%)
LUTs      | 10.30%                   | 20.12%
BRAMs     | 5.71%                    | 36 (25.71%)
DSP       | 3.64%                    | 8 (3.64%)

The ability to scale the architecture is limited by the available memory bandwidth: each additional accelerator requires an additional 64 bits per clock cycle. The number of clock cycles required to perform the sparse matrix-vector multiply on both the ARM processor (with all possible improvements) and the architecture implemented in programmable logic was measured. This measurement was performed by setting the timer to zero right before the start of the computation and retrieving its value afterwards. As such, the time required to compute a matrix was not measured directly; rather, the number of timer cycles elapsed during the computation was. Dividing this value by the frequency of the timer (355 MHz) yields the execution time. The time required to compute several sparse matrices from the University of Florida Sparse Matrix Collection [17] was measured using the process described above, resulting in the values presented in table II. Several sparse matrices were tested; the purpose of each test was to verify whether the computation performed using the developed accelerator resulted in a lower execution time than the same computation performed in the ARM processor.
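This measurement procedure reduces to a few arithmetic steps. The helper below, which is illustrative rather than code from the thesis, converts a raw timer count into execution time, MFLOPS and efficiency, counting 2 floating-point operations per nonzero and assuming a peak of 2 FLOP per PE per clock cycle; the numbers in main are made up.

#include <stdio.h>

/* Convert a raw timer count into execution time, MFLOPS and efficiency.
 * SpMV performs 2 floating-point operations (one multiply, one add) per
 * nonzero; peak is taken as 2 FLOP per PE per clock cycle. */
void report(long timer_ticks, double f_timer_mhz,
            long nz, int n_pe, double f_logic_mhz)
{
    double t_us       = timer_ticks / f_timer_mhz;   /* microseconds        */
    double mflops     = 2.0 * nz / t_us;             /* MFLOP per second    */
    double peak       = 2.0 * n_pe * f_logic_mhz;    /* peak MFLOPS         */
    double efficiency = 100.0 * mflops / peak;
    printf("%.1f us, %.1f MFLOPS, %.1f%% of peak\n", t_us, mflops, efficiency);
}

int main(void)
{
    /* Made-up example: 1,000,000 nonzeros measured over 1,135,000 ticks
     * of the 355 MHz timer, on 4 PEs clocked at 100 MHz. */
    report(1135000, 355.0, 1000000, 4, 100.0);
    return 0;
}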

To verify the ability of the accelerator to process any sparse matrix, the chosen dataset (table II) includes matrices whose parameters vary greatly: from square (N x N, e.g. bcsstm27) to rectangular matrices (M x N, e.g. Flower_5_4), with sparsity values ranging from as low as 0.02% (e.g. OPF_6000) to 23.08% (e.g. bibd_14_7). Across the matrices the number of hazards also varies greatly, from zero (e.g. bibd_14_7) to 47.46% (e.g. N_pid). Each of these parameters influences the computational time, although only the accelerators are affected by RAW hazards, as the ARM processor follows a distinct execution path. Attention was also given to the patterns present in the matrices and to the field of science each matrix belongs to, so as to include as many types as possible in the dataset. The accelerator achieves an average performance of 624 MFLOPS for a computational efficiency of 78.03%. This can be seen in figure 10, where the performance of both the ARM processor and the architecture is represented in terms of achieved MFLOPS per matrix in the dataset. The respective peak performances are represented as well, clearly showing that the ARM processor achieves a performance level well below its peak.

B. Analytical Model

An analytical model was developed in order to predict the performance of the architecture with larger sparse matrices and with more SpMV Accelerators. Given that the transfer of data from external memory and the computation in the SpMV Accelerator are performed concurrently, the execution time depends mostly on the communication performance. To develop the analytical model, each component of the implemented accelerators was modeled by determining the number of clock cycles it requires to process a nonzero element. For the Packager unit, as one clock cycle is required per nonzero and per structural element of the matrix, the total number of cycles required to process the entire sparse matrix is given by nz + n. As one instruction is sent to the PEs per nonzero element, the number of clock cycles required per nonzero element is given by equation 1:

Packager [cycles/nz] = (nz + n) / nz    (1)

The number of cycles required by the PEs to perform the computation depends on the number and distribution of hazards within the matrix. As no pattern can be assumed for the nonzero elements within a sparse matrix, a simple Bernoulli distribution is assumed for the occurrence of hazards. As such, the number of cycles required by each PE to process the nonzero elements arriving via the Packager is given by equation 2:

PE [cycles/nz] = Worst_stall * p_hazard + No_stall * (1 - p_hazard)    (2)

where p_hazard is the probability of a hazard in the sparse matrix (equal to zero when no hazards occur and equal to one when a hazard exists for every nonzero element), Worst_stall represents the highest number of clock cycles the PE needs to be stalled for the hazard to clear, and No_stall gives the minimum number of clock cycles required for the nonzero element to enter the FMA unit when no hazard is detected. The value of Worst_stall is given by the number of pipeline stages of the FMA; in this architecture this value is always 8. No_stall is 2, as one clock cycle is required to read an address from the input FIFO and another for the corresponding y element to be at the output of the local memory.
As one SpMV Accelerator is composed of one Packager and two PEs, and given that operations in the Packager and in the PEs are overlapped, the total number of clock cycles required to process a nonzero element is determined by the component with the lowest throughput, the Packager or the two PEs, as shown by equation 3:

Accelerator [cycles/nz] = max( Packager [cycles/nz] ; PE [cycles/nz] / 2 )    (3)

The estimate of the total execution time must also consider the transfer of all elements of vector y to external memory. For this, the internal DMA FIFOs, with a fixed length of 512 elements, need to be filled with the maximum burst length (256 elements) before the transfer is started. After the first burst of 256 elements, the process of filling the FIFO and transmitting the elements of vector y is performed simultaneously. With this latency accounted for, the execution time for an implementation of k accelerators can be estimated by equation 4, where m is the number of elements of vector y, f is the frequency at which the SpMV Accelerator works, in MHz, and the reduction circuit contributes a term that depends on the number of pipeline stages of the adder used (Adder_stages) times the number of reduction levels (ceil(log2(k))), as previously depicted in figure 8:

Theo_exec_time [us] = ( (nz / k) * max( Packager [cycles/nz] ; PE [cycles/nz] / 2 ) + m + Adder_stages * ceil(log2(k)) ) / f    (4)

To validate the developed model, estimated execution times were calculated for several sparse matrices using equation 4 and compared to the values measured from execution on the target device. Table III shows both the experimental and the estimated execution times (in microseconds), as well as the error between measurement and estimation.

C. Estimation for k accelerators

Assuming a target device with enough available bandwidth to feed a number k > 2 of accelerators, the analytical model can estimate the execution times for any sparse matrix. Execution times were estimated for all matrices that constitute the dataset although, in order to avoid redundancy, matrices that produced similar results are not shown in table IV.
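Equations 1 to 4 can be collected into a single estimator. The C function below is a transcription under the stated assumptions (Worst_stall = 8, No_stall = 2, two PEs per accelerator); the DMA burst-fill latency discussed above is not modelled, and the example inputs in main are made up.

#include <math.h>
#include <stdio.h>

/* Analytical model of equations 1-4.
 *   nz        : nonzero elements of the matrix
 *   n         : structural (column) elements of the matrix
 *   m         : elements of the output vector y
 *   p_hazard  : probability of a RAW hazard per nonzero (0..1)
 *   k         : number of SpMV Accelerators
 *   f_mhz     : clock frequency of the accelerators in MHz
 *   adder_stg : pipeline stages of the adder in the reduction circuit
 * Returns the estimated execution time in microseconds. */
double theo_exec_time_us(long nz, long n, long m, double p_hazard,
                         int k, double f_mhz, int adder_stg)
{
    const double worst_stall = 8.0, no_stall = 2.0;

    double packager = (double)(nz + n) / nz;                          /* eq. 1 */
    double pe = worst_stall * p_hazard + no_stall * (1.0 - p_hazard); /* eq. 2 */
    double accel = fmax(packager, pe / 2.0);                          /* eq. 3 */

    double reduction = (k > 1) ? adder_stg * ceil(log2((double)k)) : 0.0;
    return (((double)nz / k) * accel + (double)m + reduction) / f_mhz; /* eq. 4 */
}

int main(void)
{
    /* Made-up example: 2,000,000 nonzeros, 50,000 columns and rows,
     * 10% hazard probability, 2 accelerators at 100 MHz, 8-stage adder. */
    printf("estimated time: %.1f us\n",
           theo_exec_time_us(2000000, 50000, 50000, 0.10, 2, 100.0, 8));
    return 0;
}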

TABLE II
EXECUTION TIME REQUIRED TO COMPUTE SPARSE MATRICES ON BOTH THE ARM AND ONE SPMV ACCELERATOR, AND RESPECTIVE SPEEDUP AND ARCHITECTURAL EFFICIENCY
Columns: Matrix | Rows | Columns | Sparsity (%) | Hazards (%) | ARM (us) | Accelerator (us) | Speedup | Efficiency (%)
Matrices: MK9-B, MK10-B, Maragal, Flower, bibd, LP_Pilot, bcsstm, qc, big, adder_dcop, p2p_gnutella, N_pid, OPF, SiNa

Fig. 10. ARM and hardware performance, in MFLOPS, for each matrix in the dataset

TABLE III
EXPERIMENTAL RESULTS VS. MODEL PREDICTIONS
Columns: Matrix | HW exec (us) | Theo exec (us) | Error (%)
Matrices: same dataset as table II

The number of accelerators was varied between three and seven and, for comparison's sake, results from implementations with one and two accelerators are also included. Although the length of the reduction circuit increases with the number of implemented accelerators (the reduction time grows with k), the number of elements passing through this block is relatively small (n << nz), so the time spent adding all the partial results of vector y from each accelerator in the reduction circuit is a small fraction of the total computational time. As evidenced by table IV, speedup gains increase almost linearly with the number of implemented accelerators.

V. COMPARISON TO EXISTING SPMV ARCHITECTURES

In table V, the performance of the implemented architecture is compared to previous architectures. Naturally, the sparsity of the tested matrices influences the efficiency of the developed accelerator, as it does in the remaining works. Sparse matrix parameters that adversely affect the performance of the implemented accelerator include the number of rows, as it increases the size of the output vector that has to be written to external memory, the ratio of matrix columns to nonzero elements, as it influences the performance of the Packager unit, and the number of hazards within the matrix, as each one requires the PE to stall the FMA core until the hazard is resolved. All results were compared to a software-only execution on the ARM, optimized to take full advantage of the existing vector floating-point unit (128-bit SIMD) and operating at a frequency of 650 MHz, more than six times the frequency at which the programmable logic operates (100 MHz).

VI. CONCLUSION

The results show that the architecture proposed in this work is able to achieve an average of 624 MFLOPS (single-precision

floating-point). This corresponds to a performance efficiency of 79%. These figures are better than those obtained by the ARM processor, where 142 MFLOPS were measured, corresponding to 22% of peak performance. This translates into a performance improvement of 4.39 times on average. Given the predictions of the analytical model, the architecture is expected to scale with the available bandwidth. Correspondingly larger speedups over execution in a general purpose processor are to be expected, as long as enough bandwidth is available, along with the resources to implement the architecture in programmable logic.

TABLE IV
HARDWARE AND SOFTWARE EXECUTION TIMES, IN MICROSECONDS (US), AND RESPECTIVE SPEEDUP OBTAINED FOR A SYSTEM WITH k SPMV ACCELERATORS USING COLUMN BLOCKS
Columns: Number of Accelerators (k) | mk10-b4: HW, ARM, Speedup | Maragal_2: HW, ARM, Speedup | Flower_5_4: HW, ARM, Speedup | bibd_14_7: HW, ARM, Speedup

TABLE V
COMPARISON OF FPGA SPMV ARCHITECTURES
Work:          [12] | [13] | [18] | [14] | [16] | [15] | This work
FPGA:          Virtex-5 LX155T | Virtex-II Pro 70 | Stratix-III EP3E260 | Virtex-II Pro 100 | Virtex-II 6000 | Virtex-5 SX95T | Zynq XC7Z020
Matrix Format: CVBV | CSR | COO | CSR | SPAR | CSC | RCSC
Other rows: Frequency [MHz], Memory Bw. [GB/s], Number of PEs, Peak Perf. [GFLOPS], Sparsity Min-Max/Average [%], Efficiency Min-Max/Average [%]

REFERENCES

[1] C. Gfrerer (2012). Optimization Methods for Sparse Matrix-Vector Multiplication. In Seminar for Computer Science.
[2] Y. Saad (2003). Iterative Methods for Sparse Linear Systems. SIAM.
[3] R. W. Vuduc and H.-J. Moon (2005). Fast sparse matrix-vector multiplication by exploiting variable block structure. In Proceedings of the High-Performance Computing and Communications Conference (HPCC 2005).
[4] A. Buttari, V. Eijkhout, J. Langou and S. Filippone (2005). Performance Optimization and Modeling of Blocked Sparse Kernels. Technical Report ICL-UT.
[5] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick and J. Demmel (2007). Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. In Supercomputing.
[6] J. D. Davis and E. S. Chung (2012). SpMV: A Memory-Bound Application on the GPU Stuck Between a Rock and a Hard Place. Microsoft Technical Report.
[7] E.-J. Im and K. Yelick (1999). Optimizing Sparse Matrix Vector Multiplication on SMPs. In Ninth SIAM Conference on Parallel Processing for Scientific Computing (SIAM 1999).
[8] E.-J. Im and K. Yelick (1997). Optimizing sparse matrix computations for register reuse in SPARSITY. In Proceedings of the International Conference on Computational Sciences - Part I.
[9] A. Pinar and M. T. Heath (1999). Improving performance of sparse matrix-vector multiplication. In Supercomputing.
[10] Wilson M. José, Ana Rita Silva, Mário P. Véstias and Horácio C. Neto (2014). Algorithm-oriented design of efficient many-core architectures applied to dense matrix multiplication. In Analog Integrated Circuits and Signal Processing.
[11] Mário P. Véstias (2014). DReaMaCHine - Design of a Reconfigurable Many-Core Architecture for High Performance Computing. Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento em Lisboa (INESC-ID/INESC/IST/ULisboa).

[12] S. Kestur, J. D. Davis and E. S. Chung (2012). Towards a Universal FPGA Matrix-Vector Multiplication Architecture. In International Symposium on Field-Programmable Custom Computing Machines (FCCM 2012). IEEE.
[13] L. Zhuo and V. K. Prasanna (2005). Sparse Matrix-Vector Multiplication on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2005).
[14] Yan Zhang, Y. H. Shalabi, R. Jain, K. K. Nagar and J. D. Bakos (2009). FPGA vs. GPU for sparse matrix vector multiply. In International Conference on Field-Programmable Technology (FPT 2009).
[15] R. Dorrance, F. Ren and D. Markovic (2014). A Scalable Sparse Matrix-Vector Multiplication Kernel for Energy-Efficient Sparse-BLAS on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2014).
[16] D. Gregg, C. McSweeney, C. McElroy, F. Connor, S. McGettrick, D. Moloney and D. Geraghty (2007). FPGA Based Sparse Matrix Vector Multiplication using Commodity DRAM Memory. In International Conference on Field Programmable Logic and Applications (FPL 2007).
[17] T. Davis. The University of Florida Sparse Matrix Collection. Available: http://www.cise.ufl.edu/research/sparse/matrices.
[18] S. Sun, M. Monga, P. H. Jones and J. Zambreno (2012). An I/O Bandwidth-Sensitive Sparse Matrix-Vector Multiplication Engine on FPGAs. In IEEE Transactions on Circuits and Systems I, vol. 59, no. 1. IEEE.


More information

ECE 5775 Student-Led Discussions (10/16)

ECE 5775 Student-Led Discussions (10/16) ECE 5775 Student-Led Discussions (10/16) Talks: 18-min talk + 2-min Q&A Adam Macioszek, Julia Currie, Nick Sarkis Sparse Matrix Vector Multiplication Nick Comly, Felipe Fortuna, Mark Li, Serena Krech Matrix

More information

Zynq-7000 All Programmable SoC Product Overview

Zynq-7000 All Programmable SoC Product Overview Zynq-7000 All Programmable SoC Product Overview The SW, HW and IO Programmable Platform August 2012 Copyright 2012 2009 Xilinx Introducing the Zynq -7000 All Programmable SoC Breakthrough Processing Platform

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA

Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA Da Tong, Viktor Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Email: {datong, prasanna}@usc.edu

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Adaptive Scientific Software Libraries

Adaptive Scientific Software Libraries Adaptive Scientific Software Libraries Lennart Johnsson Advanced Computing Research Laboratory Department of Computer Science University of Houston Challenges Diversity of execution environments Growing

More information

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Parallel FIR Filters. Chapter 5

Parallel FIR Filters. Chapter 5 Chapter 5 Parallel FIR Filters This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. ecause the Virtex-4 architecture

More information

Optimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015.

Optimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015. Optimizing HW/SW Partition of a Complex Embedded Systems Simon George November 2015 Zynq-7000 All Programmable SoC HP ACP GP Page 2 Zynq UltraScale+ MPSoC Page 3 HW/SW Optimization Challenges application()

More information

Improving Real-Time Performance on Multicore Platforms Using MemGuard

Improving Real-Time Performance on Multicore Platforms Using MemGuard Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study

More information

Architectures of Flynn s taxonomy -- A Comparison of Methods

Architectures of Flynn s taxonomy -- A Comparison of Methods Architectures of Flynn s taxonomy -- A Comparison of Methods Neha K. Shinde Student, Department of Electronic Engineering, J D College of Engineering and Management, RTM Nagpur University, Maharashtra,

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks Abstract Deep Convolutional Neural Networks (DCNN) have proven to be very effective in many pattern recognition applications, such

More information

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH Key words: Digital Signal Processing, FIR filters, SIMD processors, AltiVec. Grzegorz KRASZEWSKI Białystok Technical University Department of Electrical Engineering Wiejska

More information

Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform

Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform J Supercomput (2013) 63:710 721 DOI 10.1007/s11227-011-0626-0 Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform Shiming Xu Wei Xue Hai Xiang Lin Published

More information

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT JOSEPH L. GREATHOUSE, MAYANK DAGA AMD RESEARCH 11/20/2014 THIS TALK IN ONE SLIDE Demonstrate how to save space and time

More information

Composite Metrics for System Throughput in HPC

Composite Metrics for System Throughput in HPC Composite Metrics for System Throughput in HPC John D. McCalpin, Ph.D. IBM Corporation Austin, TX SuperComputing 2003 Phoenix, AZ November 18, 2003 Overview The HPC Challenge Benchmark was announced last

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video

A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video Amrita Mazumdar Armin Alaghi Jonathan T. Barron David Gallup Luis Ceze Mark Oskin Steven M. Seitz University of Washington Google

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Bandwidth Avoiding Stencil Computations

Bandwidth Avoiding Stencil Computations Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu

More information

Mark Sandstrom ThroughPuter, Inc.

Mark Sandstrom ThroughPuter, Inc. Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom

More information

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography 1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography

More information

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 59, NO. 1, JANUARY

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 59, NO. 1, JANUARY IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 59, NO. 1, JANUARY 2012 113 An I/O Bandwidth-Sensitive Sparse Matrix-Vector Multiplication Engine on FPGAs Song Sun, Madhu Monga, Phillip

More information

Efficient Self-Reconfigurable Implementations Using On-Chip Memory

Efficient Self-Reconfigurable Implementations Using On-Chip Memory 10th International Conference on Field Programmable Logic and Applications, August 2000. Efficient Self-Reconfigurable Implementations Using On-Chip Memory Sameer Wadhwa and Andreas Dandalis University

More information

FPGA Provides Speedy Data Compression for Hyperspectral Imagery

FPGA Provides Speedy Data Compression for Hyperspectral Imagery FPGA Provides Speedy Data Compression for Hyperspectral Imagery Engineers implement the Fast Lossless compression algorithm on a Virtex-5 FPGA; this implementation provides the ability to keep up with

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory

Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory Jongsoo Park, Parallel Computing Lab, Intel Corporation with contributions from MKL team 1 Algorithm/

More information

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently

More information

Near Memory Key/Value Lookup Acceleration MemSys 2017

Near Memory Key/Value Lookup Acceleration MemSys 2017 Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy

More information

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI

EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,

More information

Sparse Linear Solver for Power System Analyis using FPGA

Sparse Linear Solver for Power System Analyis using FPGA Sparse Linear Solver for Power System Analyis using FPGA J. R. Johnson P. Nagvajara C. Nwankpa 1 Extended Abstract Load flow computation and contingency analysis is the foundation of power system analysis.

More information

Unstructured Finite Element Computations on. Configurable Computers

Unstructured Finite Element Computations on. Configurable Computers Unstructured Finite Element Computations on Configurable Computers by Karthik Ramachandran Thesis submitted to the Faculty of Virginia Polytechnic Institute and State University in partial fulfillment

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Techniques for Optimizing FEM/MoM Codes

Techniques for Optimizing FEM/MoM Codes Techniques for Optimizing FEM/MoM Codes Y. Ji, T. H. Hubing, and H. Wang Electromagnetic Compatibility Laboratory Department of Electrical & Computer Engineering University of Missouri-Rolla Rolla, MO

More information