Matrix Multiplication based on Scalable Macro-Pipelined FPGA Accelerator Architecture

2009 International Conference on Reconfigurable Computing and FPGAs

Matrix Multiplication based on Scalable Macro-Pipelined FPGA Accelerator Architecture

Jiang Jiang 1, Vincent Mirian 2, Kam Pui Tang 2, Paul Chow 2, Zuocheng Xing 1
1 School of Computer, National University of Defense Technology, Changsha, Hunan, P.R. China
2 Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada
jiangjiangdr@hotmail.com, {mirianvi, tangkamp, pc}@eecg.utoronto.ca, zcxing@nudt.edu.cn

Abstract
In this paper, we introduce a scalable macro-pipelined architecture to perform floating point matrix multiplication, which aims to exploit temporal parallelism and architectural scalability. We demonstrate the functionality of the hardware design with 16 processing elements (PEs) on a Xilinx ML507 development board containing a Virtex-5 XC5VFX70T. A 32-PE design for matrix sizes ranging from 32*32 to 1024*1024 is also simulated. Our experiments show that we achieve 12.18 GFLOPS with 32 PEs, or about 1.90 GFLOPS per PE per GHz, which corresponds to over 95% PE usage. Moreover, the proposed SMPA has the capability to scale up to tens or hundreds of GFLOPS using multiple FPGA devices and a high speed interconnect.

Keywords: matrix multiplication; temporal parallelism; macro-pipeline; FPGA accelerator

I. INTRODUCTION

The fundamental way to improve the performance of a computer system is to exploit parallelism. There are two kinds of parallelism: spatial parallelism and temporal parallelism. Spatial parallelism uses duplicated function units (FUs), multiple cores or even multiprocessors to operate on different data sets at the same time. Current research mainly focuses on this form, for example superscalar processors, multicores, graphics processing units (GPUs) and multiprocessor systems. Temporal parallelism uses a multi-stage pipeline or macro-pipeline to partition an application into multiple phases and data sets that run simultaneously. Clearly, temporal parallelism and spatial parallelism can yield the same potential speedup.

Matrix multiplication is a typical routine/kernel for scientific applications. The LINPACK Benchmark has been used for many years to evaluate computer systems. The Basic Linear Algebra Subprograms (BLAS) are high quality building block routines performing basic vector and matrix operations. The popular Level 3 BLAS mostly target matrix-matrix operations of order O(n^3) [1].

There have been many attempts to improve the efficiency of the matrix multiplication algorithm. Coprocessors and accelerators have been implemented using GPUs, field programmable gate arrays (FPGAs), digital signal processors (DSPs) and application specific integrated circuits (ASICs). Most implementations use a spatial parallelism approach. Efficient algorithms have also been proposed, and optimized libraries for specific computer architectures have been designed.

The proposed Scalable Macro-Pipelined Architecture (SMPA) exploits temporal parallelism, not found in [1]. The implementation uses DSP blocks to function as multiplier-accumulator (MAC) PEs, as described in [2] and [3], and uses distributed memory as the memory architecture, as discussed in [3]. The algorithm uses many dimensions of parallelism, along the computation axis of each PE and the communication axis amongst multiple PEs. SMPA is a generic platform, not optimized for any particular software library as in [4] and [5]. The algorithm does not use multiple FPGAs as in [6]; however, it is scalable to multiple FPGAs.
The utilization of registers is minimal in order to reduce energy consumption, as argued in [7], [8] and [9]. Unlike [2] and [11], a ring topology is used by SMPA for connecting the PEs, and unlike [10], a front-end control unit is designed to control PE operations.

The remainder of this paper is organized as follows. In Section II, an overview of the proposed SMPA is presented. The matrix multiplication algorithm based on SMPA is described in Section III and its performance is analyzed in Section IV. Section V describes the proposed SMPA FPGA hardware implementation, and Section VI presents experimental results. Section VII concludes our discussion and introduces future work.

II. SCALABLE MACRO-PIPELINED ARCHITECTURE

Figure 1. SMPA Accelerator

As illustrated in Fig. 1, the accelerator system is composed of two parts: the host and the FPGA accelerator. The host offloads the multiplication task to the accelerator. Its responsibilities include data reordering of the multiplicand matrices, sending the reordered data to the accelerator, receiving the product data from the accelerator and data reordering of the product. The host is also responsible for writing and reading control registers in the accelerator to configure execution modes and to obtain performance statistics, respectively.

The FPGA accelerator is used for the matrix operations. It consists of five components: the PE ring (PER), the local memory ring (LMR), the control unit (CU), the host-accelerator interface (INF) and the memory controller (MC).

The PER includes multiple PEs, which are connected in a ring structure and operate in a pipelined fashion. Each stage of the pipeline (or PE) can be a multiplier-accumulator (MAC) or a powerful microprocessor. Intermediate results are passed from one stage to the next. There are two reasons to use a ring connection. First, it can support different matrix sizes: when the number of columns of matrix A (or the number of rows of matrix B) is larger than the number of PEs, we need multiple iterations. Second, it is scalable: the ring topology is easy to scale up or down.

Each PE has two local memories (LMs). One of them stores data from matrix A, and the other stores data from matrix B. The LMs are also organized in a ring structure. All of the data pumped into the LMR is consumed by the LMs.

The INF is the communication interface between the host and the accelerator. It is meant to operate at high speed so as not to create a bottleneck for the accelerator.

The CU is used to control the source and result matrix data flow. If the communication bandwidth of the INF is not fast enough to meet the bandwidth requirement of the LMs, the source data from the INF is diverted to off-chip memory. The CU then reads the data out of the off-chip memory and feeds it to the LMR. Similarly, result data is written back to the off-chip memory before being sent back to the host. Of course, if the bandwidth of the INF is fast enough, there is no need to use off-chip memory.

The advantages of the proposed SMPA are:
- It effectively overlaps communication and computation; computation starts as early as possible.
- No extra memory is needed to store intermediate results; all of them are kept in the PEs' internal pipeline stages.
- The on-chip local memory requirement is small. The data set for each PE is regular and can be reused.
- The interconnect topology is simple. A unidirectional ring topology is used to support various problem sizes.
- It adopts a data-flow style, so the control logic is very simple.
- Once the pipeline is filled up, it can yield performance comparable to spatial parallelism.
- It is scalable: a pipeline/ring is easy to scale up and down.

III. MATRIX MULTIPLICATION BASED ON SMPA

In general, the standard matrix multiplication C = A*B is defined as follows:

    C(i,j) = sum_{k=1..n} A(i,k) * B(k,j)                                  (1)

Matrix multiplication has an intrinsic pipeline property. An efficient way to use a pipelined architecture for matrix multiplication is for each stage (PE) of the macro-pipeline to calculate the product of two elements, one from matrix A and one from matrix B, and then accumulate it with the intermediate result from the previous stage. The matrix multiplication algorithm based on the SMPA architecture can be described with the pseudo-code shown in Fig. 2.

    for (c = 0; c <= nBy/mB - 1; c++)              // column block # of B
      for (r = 0; r <= nAx/mA - 1; r++)            // row block # of A
        for (j = 0; j <= n/p - 1; j++)             // round #
          for (i = 0; i <= p - 1; i++)             // PE #
            for (dA = 0; dA <= mA - 1; dA++)       // entry # of LMA
              for (dB = 0; dB <= mB - 1; dB++)     // entry # of LMB
                Cstg(mA*r + dA, mB*c + dB)(p*j + i) =
                    A(mA*r + dA, p*j + i)          // LMA, entry dA
                  * B(p*j + i, mB*c + dB)          // LMB, entry dB
                  + Cstg(mA*r + dA, mB*c + dB)(p*j + i - 1)   // from PE i-1

Figure 2. Matrix multiplication algorithm based on SMPA
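To make the loop order in Fig. 2 concrete, the following is a minimal, single-threaded C model of the same computation. It is a sketch only, not the hardware implementation: the row-major array layout, the small sizes in main and the plain accumulation into C in place of the pipelined partial sums Cstg are assumptions made for illustration, and the result is checked against the definition in (1).

#include <stdio.h>

/* Functional C model of the SMPA loop order in Fig. 2 (software sketch only).
 * A is nAx x n, B is n x nBy, with p PEs and local-memory depths mA and mB;
 * n is assumed to be a multiple of p, nAx of mA, and nBy of mB, as in the text. */
static void smpa_order_matmul(const float *A, const float *B, float *C,
                              int nAx, int n, int nBy, int p, int mA, int mB)
{
    for (int i = 0; i < nAx * nBy; i++) C[i] = 0.0f;
    for (int c = 0; c <= nBy / mB - 1; c++)            /* column block # of B */
      for (int r = 0; r <= nAx / mA - 1; r++)          /* row block # of A    */
        for (int j = 0; j <= n / p - 1; j++)           /* round #             */
          for (int i = 0; i <= p - 1; i++)             /* PE #                */
            for (int dA = 0; dA <= mA - 1; dA++)       /* entry # of LMA      */
              for (int dB = 0; dB <= mB - 1; dB++) {   /* entry # of LMB      */
                  int row = mA * r + dA, col = mB * c + dB, k = p * j + i;
                  /* accumulate A(row,k) * B(k,col); in hardware the partial
                     sum travels through the PE ring instead of sitting in C */
                  C[row * nBy + col] += A[row * n + k] * B[k * nBy + col];
              }
}

int main(void)
{
    enum { NAX = 4, N = 8, NBY = 6, P = 4, MA = 2, MB = 3 };   /* illustrative sizes only */
    float A[NAX * N], B[N * NBY], C[NAX * NBY];
    for (int i = 0; i < NAX * N; i++) A[i] = (float)(i % 7);
    for (int i = 0; i < N * NBY; i++) B[i] = (float)(i % 5);
    smpa_order_matmul(A, B, C, NAX, N, NBY, P, MA, MB);
    /* check against the definition in (1) */
    for (int r = 0; r < NAX; r++)
        for (int c = 0; c < NBY; c++) {
            float ref = 0.0f;
            for (int k = 0; k < N; k++) ref += A[r * N + k] * B[k * NBY + c];
            if (ref != C[r * NBY + c]) { printf("mismatch\n"); return 1; }
        }
    printf("SMPA loop order matches the definition in (1)\n");
    return 0;
}

The check in main confirms that the SMPA loop order visits every (row, k, column) product exactly once, which is the equivalence argument made for Fig. 2 and (1) below.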
Let us assume that the number of PEs in the PER is p and the number of pipeline stages in a PE is s, so the total number of pipeline stages in the whole PER is p*s. The size of matrix A is n_Ax*n, and the size of matrix B is n*n_By. The depth of LMA is m_A and that of LMB is m_B; both are expressed in number of words. The byte size of a word is z: it is 8 for double precision FP and 4 for single precision FP.

Essentially, this algorithm is a block matrix multiplication, as illustrated in Fig. 3. For simplicity, we consider the matrix dimensions to be multiples of the sub-matrix dimensions. We can prove that the algorithm in Fig. 2 is equal to (1). The outermost two loops (c, r) are loops over sub-matrices. The loop variable c refers to the sub-matrix index in matrix B, and the loop variable r refers to the sub-matrix index in matrix A. The size of a sub-matrix of A is m_A*n, while the size of a sub-matrix of B is n*m_B. The size of the result sub-matrix of C is thus m_A*m_B, as shown in Fig. 3. If n is larger than p, the ring needs to loop back n/p times. There are only p valid MAC operations in each round, although there are p*s stages in the whole PER. To fill up the PER and to reduce the to-LMR bandwidth, we reuse the data from matrix B m_A times. In this way, our approach reduces the bandwidth of matrix B reading by a factor of m_A. In Fig. 2, the inner two loop variables (d_A, d_B) refer to the entry numbers of LMA and LMB. The p*s cycles are defined as the unit round time, and we get:

    p * s = m_A * m_B                                                      (2)

In order to effectively use the data from matrices A and B, the host needs to reorder the source data into the order shown in Fig. 3.

IV. PERFORMANCE ANALYSIS

In Fig. 2, loop variable i refers to the PE number. In each round, each PE calculates p*s MAC operations and consumes all of the data in LMA and LMB. In the next round, all of the data in LMA and LMB is replaced with new data. Therefore, the round time is the main constraint.
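As a numerical illustration (the values are chosen only for this example, matching the 32-PE, s = 16 configuration used later): one round lasts p*s = 32*16 = 512 cycles, so (2) can be satisfied by, for instance, m_A = 16 and m_B = 32; each PE then performs 512 MAC operations per round and completely consumes and refills its 16-entry LMA and 32-entry LMB every 512 cycles.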

3 Figure. SMPA Matrix Multiplication For simplicity of discussion and without loss of generality, we consider square matrix multiplication of n*n. A. Local Memory Capacity From (), m A = p*s/m B. So, the total LM capacity is: LM = z p *( m + p * s / m ), [byte] () total * B B B. Bandwidth As shown in Fig. 1, there are four bandwidth requirements for the proposed SMPA architecture: BW to-per : aggregate bandwidth from LMs to PEs BW to-lmr : aggregate bandwidth from off-chip memory or directly from INF to LMs BW MEM : bandwidth of off-chip memory BW INF : bandwidth between host and accelerator For the BW to-per, we have: BW to PER = * z * p, [byte/cycle] (4) BW to-per is very high comparing to BW to-lmr. In order to maintain the bandwidth, we exploit the data reuse in LM during each round. After each round (p*s cycles), all data in LMA and LMB should be renewed. So we get: BW to LMR = LM total /( p * s) = z *( mb / s + p / mb ), [byte/cycle] (5) If BW INF >= BW to-lmr, there is no need to use off-chip memory. If not, BW MEM should be equal or greater than BW to-lmr. Otherwise, PER will partially stall when any PE cannot get source data from LMA or LMB. C. Turnaround Time and Peak Performance From () and (5), in order to minimize BW to-lmr and LM total, we can get: m = m = p s (6) A B * The turnaround time T is defined as the time between the host sending source matrices data to the accelerator and the host receiving the result matrix data. If the implementation uses off-chip memory, we get: * z * n n * z * n z * n * n/ ma T = + max(( ),( + )), BWINF p BWto _ LMR BWto _ LMR [cycle] (7) If the source matrices data and result matrix data are sent directly from INF to the LMs, we get: n * z * n z * n * n/ ma T = max(( ),( + )), [cycle] (8) p BW BW to _ LMR to _ LMR In our case, the PEs are fast enough, and will consume all of the data that LMs feeds to PEs. Therefore, BW to_lmr or BW INF becomes the bottleneck of the accelerator. Consider (6), from (7) & (8), we get: + n/ p * s T = z * n *( + ), [cycle] (9) BW INF BW to _ LMR z * n n T = *( + ), [cycle] (10) BW p* s to _ LMR Assume that, for peak performance, BW to-lmr should be fast enough so that it is not the bottleneck. The computation time of the PEs thus becomes the dominating factor. From (7) and (8), the minimum turnaround time T min is: 50

Assume that, for peak performance, BW_to-LMR is fast enough that it is not the bottleneck. The computation time of the PEs then becomes the dominating factor. From (7) and (8), the minimum turnaround time T_min is:

    T_min = 3*z*n^2 / BW_INF + n^3/p   [cycle]                             (11)

    T_min = n^3 / p   [cycle]                                              (12)

So, the peak performance is given by (13). For example, for a system of 32 PEs running at 200 MHz, the peak performance of the system is 12.8 GFLOPS.

    PERF_peak = lim_{n->inf} 2*n^3 / T_min = 2*p   [fp op/cycle]           (13)

D. Scalability
From (3) and (5), if s = 16 and z = 8, we get LM_total = 8*p*(m_B + 16*p/m_B) bytes and BW_to-LMR = 8*(m_B/16 + p/m_B) bytes/cycle. For a given number of PEs (p) and pipeline stages (s), we can adjust m_B (which also fixes m_A through (2)) to find the smallest BW_to-LMR and LM_total. As shown in Table I, the optimal choices appear along the diagonal. Table I indicates that SMPA has attractive scalability. Even in the case of p = 512 and z = 8 (double precision FP), LM_total is only 768 KB and BW_to-LMR is 96 B/cycle. For this case, multiple FPGAs, a high bandwidth memory controller or a multi-lane PCIe interface can be used to meet the computation resource and bandwidth requirements. Furthermore, SMPA uses a ring topology, so it is very easy to scale up and down.

TABLE I. BW_TO-LMR AND LM CAPACITY (s=16, z=8)
m_B    p=16             p=32            p=64            p=128            p=256             p=512
16     16 B/c, 4 KB     24 B/c, 12 KB   40 B/c, 40 KB   72 B/c, 144 KB   136 B/c, 544 KB   264 B/c, 2112 KB
32     20 B/c, 5 KB     24 B/c, 12 KB   32 B/c, 32 KB   48 B/c, 96 KB    80 B/c, 320 KB    144 B/c, 1152 KB
64     34 B/c, 8.5 KB   36 B/c, 18 KB   40 B/c, 40 KB   48 B/c, 96 KB    64 B/c, 256 KB    96 B/c, 768 KB
128    65 B/c, 16.5 KB  66 B/c, 33 KB   68 B/c, 68 KB   72 B/c, 144 KB   80 B/c, 320 KB    96 B/c, 768 KB

V. FPGA IMPLEMENTATION

To verify our proposed architecture, we implemented a 32-bit single precision floating point matrix multiplication FPGA accelerator using SMPA.

A. Host and Accelerator Interface
The host code is implemented in C and compiled with no optimization. The host machine contains an Intel Core 2 Duo processor P8600 (2.4 GHz) with 3 MB L2 cache and 2 GB DDR2 SDRAM, running Ubuntu. A Gigabit Ethernet communication medium is used between the host and the accelerator. There are two threads running on the machine. One thread is responsible for data reordering of the multiplicand matrices and sending the reordered data to the accelerator. The other thread is responsible for receiving the product data from the accelerator and data reordering of the product.

Figure 4. Control Unit data path

B. Control Unit
The datapath of the CU is illustrated in Fig. 4. Currently, the CU supports two modes:
- Source data are sent directly from the INF to the LMR, and after processing the result data are sent back to the host directly from the PER through the INF.
- Source data are first sent from the INF to the MC and then read by the LMR. After processing, the result data are temporarily written back to the MC, then read by the INF and sent to the host.

C. PE Ring and LM Ring

Figure 5. PE and LM rings

The architecture of the PER is shown in Fig. 5. Each PE includes a multiplier and an accumulator. We use Virtex-5 DSP48E slices to implement these functional units. The PEs execute in a data-flow fashion. As depicted in Fig. 1, when the off-chip memory or the host cannot provide enough data for the PER, the source data from LMA_x/LMB_x to PE_x, or the intermediate result from the previous stage, may not be ready (data_valid). In this case, PE_x is stalled. Since there are many PEs, it is very difficult to stall the whole PER in the same cycle. To solve this problem, we insert a stall buffer between PEs. When a stall occurs in any given PE, the previous PE is not stalled immediately but continues to send its intermediate results to the stall buffer, until the stall buffer is full (ring_stall).
In this way, we isolate the stall domain to a single PE, which improves the achievable frequency and performance.
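As an illustration of this decoupling, the following C fragment is a software sketch (not the paper's RTL) that models the stall buffer between two adjacent PEs as a small FIFO; the buffer depth is an assumed value. The downstream PE's stall propagates upstream only once the buffer fills, which is when ring_stall would be asserted.

#include <stdio.h>
#include <stdbool.h>

#define STALL_DEPTH 4                 /* assumed buffer depth; not specified in the paper */

/* Minimal model of the stall buffer between PE(i-1) and PE(i). */
typedef struct {
    float entries[STALL_DEPTH];
    int   count;                      /* number of buffered intermediate results */
} stall_buffer_t;

/* PE(i-1) tries to push an intermediate result. Returns false (ring_stall)
 * only when the buffer is already full, i.e. the stall has propagated. */
static bool push_result(stall_buffer_t *b, float value)
{
    if (b->count == STALL_DEPTH) return false;       /* upstream PE must now stall */
    b->entries[b->count++] = value;
    return true;
}

/* PE(i) pops a result when it is not stalled and data is available (data_valid). */
static bool pop_result(stall_buffer_t *b, bool downstream_stalled, float *value)
{
    if (downstream_stalled || b->count == 0) return false;
    *value = b->entries[0];
    for (int k = 1; k < b->count; k++) b->entries[k - 1] = b->entries[k];
    b->count--;
    return true;
}

int main(void)
{
    stall_buffer_t buf = { {0}, 0 };
    /* Downstream PE stalls for the first 5 cycles: the upstream PE keeps
     * producing into the buffer and only stalls once the buffer is full. */
    for (int cycle = 0; cycle < 8; cycle++) {
        bool downstream_stalled = (cycle < 5);
        bool accepted = push_result(&buf, (float)cycle);
        float v;
        bool consumed = pop_result(&buf, downstream_stalled, &v);
        printf("cycle %d: ring_stall=%d consumed=%d occupancy=%d\n",
               cycle, accepted ? 0 : 1, consumed ? 1 : 0, buf.count);
    }
    return 0;
}

The depth only determines how long a downstream stall can be absorbed before ring_stall propagates; the isolation of the stall domain described above does not depend on the particular depth chosen here.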

The data width of the LMR equals the data width of the DDR2 SDRAM. Each LM is implemented using dual-port Block RAM (BRAM) so that it can read source data from the SDRAM and feed data to its PE simultaneously.

D. Memory Controller
The Xilinx DDR2 memory controller is used to transfer the multiplicand data (A and B) and the result matrix (C) to and from the external DDR2 off-chip memory module. The total size of the memory on the ML507 is 256 MB. To improve memory performance, bursts of 32-bit words spanning multiple cycles are used. The PHY module of the memory controller is responsible for timing calibration, to make sure the data and data strobes are aligned for both writes and reads at a clock rate of 200 MHz. Fig. 6 depicts the major functional blocks and interface of the DDR2 SDRAM controller.

Figure 6. DDR2 SDRAM memory controller interface [12]

VI. EXPERIMENTAL RESULTS

A 16-PE 32-bit accelerator design operating at 125 MHz is implemented on the Xilinx ML507 development board to verify the functionality of the SMPA. Two test cases (32*32 and 64*64) are generated by the host and fed through the gigabit Ethernet MAC (EMAC) interface into the FPGA hardware. The results of the matrix multiplication from the hardware are verified. The total turnaround time of the 16-PE SMPA for the 32*32 and 64*64 matrix multiplications is 7.9 ms and 7.68 ms, respectively. The reported turnaround time includes communication overheads of the EMAC interface and the host program, such as data setup and retrieval times from the EMAC core to SDRAM and vice versa, and Linux kernel and operating system function calls.

TABLE II. SIMULATION TIME (IN NUMBER OF CYCLES)
Matrix size: 32*32, 64*64, 128*128, 256*256, 512*512, 1024*1024
inf_mem_writing: 1,610 54,656 18, ,65,50,86 14,011,54
cu_core_executing: 1,795 10,048 7, ,097 4,445,45 5,65,941
inf_mem_reading: 4,06 7,99 10, ,4 1,988,574 7,965,447
total # cycles: 19,441 9,10 41,19 1,96,091 9,96,85 57,4,91

TABLE III. MATRIX CALCULATION TIME (IN MS)
Matrix size: 32*32, 64*64, 128*128, 256*256, 512*512, 1024*1024
MATLAB BLAS
MATLAB MKL
smpa @ 200 MHz
smpa exec @ 200 MHz

TABLE IV. GFLOPS AND PE USAGE
Matrix size: 32*32, 64*64, 128*128, 256*256, 512*512, 1024*1024
MAC OP
GFLOPS
PE usage: 57.0% 81.6% 89.1% 92.6% 94.4% 95.2%

Due to the FPGA resource limitation of the ML507, a larger design with 32 PEs is not implemented but is instead simulated, to test floating point matrix multiplication for matrix sizes ranging from 32*32 to 1024*1024. The test frequency is set at 200 MHz for the MC, CU, PER and LMR. The gigabit EMAC core, however, operates at 125 MHz, which is generated internally by a digital clock management (DCM) macro from a 100 MHz source clock. The simulation cycles are broken into three segments:
1. inf_mem_writing: source matrices (A and B) written from the EMAC PHY, through the INF to the MC, then to SDRAM.
2. cu_core_executing: source matrices read from SDRAM through the MC to the LMR, then to the PER; data processing (pipelined PEs); result matrix (C) written from the PER through the MC to SDRAM.
3. inf_mem_reading: result matrix read from SDRAM through the MC to the INF, then to the EMAC PHY.

Figure 7. Simulation time (cycles) in logarithmic scale

Figure 8. Simulation time in logarithmic scale (SMPA vs. MATLAB)
These simulation times for matrix sizes of 32*32 through 1024*1024 are shown in Table II. Fig. 7 shows the execution times in logarithmic scale.
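The GFLOPS and PE usage values in Table IV follow directly from the cu_core_executing cycle counts in Table II and the GFLOPS formula given below. The following minimal C sketch performs that conversion; it assumes the 32-PE, 200 MHz configuration of Section VI and takes n and the cycle count as command-line arguments.

#include <stdio.h>
#include <stdlib.h>

/* Convert a cu_core_executing cycle count for an n*n multiplication into
 * GFLOPS and PE usage, following GFLOPS = 2*n^3 / cycles * f (f in GHz),
 * with peak throughput 2*p*f from (13). Sketch only; p and f below are the
 * 32-PE, 200 MHz configuration used for the simulation. */
int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s n cu_core_executing_cycles\n", argv[0]);
        return 1;
    }
    double n = atof(argv[1]);       /* matrix size */
    double cycles = atof(argv[2]);  /* cu_core_executing count from Table II */
    double f = 0.2;                 /* clock frequency in GHz */
    int p = 32;                     /* number of PEs */
    double gflops = 2.0 * n * n * n / cycles * f;
    printf("GFLOPS = %.2f, PE usage = %.1f%% of the %.1f GFLOPS peak\n",
           gflops, 100.0 * gflops / (2.0 * p * f), 2.0 * p * f);
    return 0;
}

Invoking it with n = 512 and the corresponding cu_core_executing count from Table II reproduces the 12.08 GFLOPS and 94.4% usage reported for that size.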

The total execution time, which includes the data transfers (inf_mem_writing, cu_core_executing and inf_mem_reading), and the PE execution time only (cu_core_executing) are both compared to the MATLAB simulation with BLAS and MKL in Table III. The MATLAB simulation runs on an Intel Core 2 Duo processor P8600 at 2.4 GHz. The following formula is used to calculate the GFLOPS of the SMPA architecture for matrix multiplication of size n*n:

    GFLOPS = [n^3 / (# cycles of processing)] * (# FLOPs per MAC) * [(# cycles per second) * 10^-9]

For example, a 32-PE SMPA matrix multiplication of 512x512 matrices running at 200 MHz produces 2 * (512^3) / 4,445,45 * 0.2 = 12.08 GFLOPS.

Table III indicates that the SMPA computation time is comparable to the time reported from the MATLAB simulation. It should be mentioned that the MATLAB simulation runs at 2.4 GHz with a peak performance of 19.2 GFLOPS, while the proposed SMPA design runs on a 32-PE FPGA device at 200 MHz with a peak performance of 12.8 GFLOPS (13). Moreover, as shown by Table I, the proposed SMPA architecture has very good scalability, which means it is easy to scale up to reach higher performance using a larger FPGA device and a faster host-accelerator communication medium. Fig. 8 shows the processing time in logarithmic scale. Table IV shows the performance in GFLOPS and the PE usage for matrix sizes from 32*32 to 1024*1024. As the matrix size increases, the performance approaches the peak value of 12.8 GFLOPS, or 2 GFLOPS per PE per GHz.

VII. CONCLUSIONS AND FUTURE WORK

A. Conclusions
In this paper, we analyzed and proposed a scalable macro-pipelined accelerator architecture for matrix multiplication. The basic idea is to exploit temporal parallelism and achieve better scalability. We also designed and demonstrated the functionality of both a 16-PE and a 32-PE accelerator architecture, through hardware implementation and behavioral simulation respectively. For the hardware implementation, we selected the Xilinx ML507 development board, which contains a Virtex-5 XC5VFX70T FPGA device, an EMAC core and a DDR2 SDRAM controller. With the proposed architecture, we achieved 12.18 GFLOPS with 32 PEs on 1024x1024 matrix multiplication, or about 1.90 GFLOPS per PE per GHz. The peak performance of the 32-PE design is 12.8 GFLOPS. This represents over 95% PE usage and good performance improvement with the scalable architecture. We also observed that, for larger matrices, the computation task performed by the PEs is the dominating factor over the communication tasks from the INF to the MC and from the MC to the INF. For smaller matrix sizes, for example 512x512, communication is the dominating factor. Our experiments have demonstrated that the proposed SMPA has the capability of scaling up to tens and hundreds of GFLOPS by using multiple FPGA devices and a high speed interconnect such as PCIe.

B. Future Work
In the future, we will implement an accelerator with 64 or more PEs on a larger FPGA device and run larger-scale matrix multiplication. The next step is also to exploit the use of PCIe or other high speed interconnects to reduce the communication overheads. Furthermore, SMPA is suitable for a wide range of applications. We plan to map the following key kernels of scientific applications to SMPA:
- Solution of linear equations
- Discrete Fourier transform
- Graph algorithms

ACKNOWLEDGMENT
We would like to acknowledge NSERC for funding and Xilinx for providing development boards. This work is also supported by NSFC under contract # , and the State High-Tech Development Plan of China under contract #2009AA01Z10.
We also thank a number of graduate students and researchers in the Computer Engineering Group at the University of Toronto for design discussions.

REFERENCES
[1] Y. Dou, S. Vassiliadis, G. K. Kuzmanov and G. N. Gaydadjiev, "64-bit Floating-Point FPGA Matrix Multiplication," ACM/SIGDA 13th International Symposium on FPGAs, Feb. 2005.
[2] R. M. Piedra, "A Parallel Approach for Matrix Multiplication on the TMS320C4x DSP," SPRA107, pp. 1-4, February 1994.
[3] R. El-Atfy, M. A. Dessouky and H. El-Ghitani, "Accelerating Matrix Multiplication on FPGAs," International Design and Test Workshop, December 2007.
[4] K. Bovay, B. Henderson, H.-Y. Lin and K. Wadleigh, High Performance Computing Division, Hewlett-Packard Company, pp. 1-6, 2007.
[5] J. L. Gustafson and B. S. Greer, "ClearSpeed Advance: A Hardware Accelerator for the Intel Math Kernel Library," ClearSpeed Technology, 2006.
[6] S. Rousseaux, D. Hubaux, P. Guisset and J.-P. Legat, "A High Performance FPGA-Based Accelerator for BLAS Library Implementation," RSSI, pp. 1-10, July 2007.
[7] P. Picari, P. Corsonello, S. Perri and G. Cocorullo, "A Matrix Product Accelerator for Field Programmable Systems on Chip," Microprocessors and Microsystems, 2008.
[8] J. Jang, S. B. Choi and V. K. Prasanna, "Energy- and Time-Efficient Matrix Multiplication on FPGAs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, November 2005.
[9] J. Jang, S. B. Choi and V. K. Prasanna, "Area and Time Efficient Implementations of Matrix Multiplication on FPGAs," FPT, December 2002.
[10] S. J. Campbell and S. P. Khatri, "Resource and Delay Efficient Matrix Multiplication Using Newer FPGA Devices," Proceedings of the 16th ACM Great Lakes Symposium on VLSI, Philadelphia, PA, USA, 2006.
[11] X. Xu, S. G. Ziavras and T.-G. Chang, "An FPGA-Based Parallel Accelerator for Matrix Multiplications in the Newton-Raphson Method," EUC 2005, Japan.
[12] Xilinx, "DDR2 SDRAM Interface in Virtex-5 Devices," Application Note XAPP858, 2008.
