Matrix Multiplication based on Scalable Macro-Pipelined FPGA Accelerator Architecture

2009 International Conference on Reconfigurable Computing and FPGAs

Matrix Multiplication based on Scalable Macro-Pipelined FPGA Accelerator Architecture

Jiang Jiang 1, Vincent Mirian 2, Kam Pui Tang 2, Paul Chow 2, Zuocheng Xing 1
1 School of Computer, National University of Defense Technology, Changsha, Hunan, P.R. China
2 Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada
jiangjiangdr@hotmail.com, {mirianvi, tangkamp, pc}@eecg.utoronto.ca, zcxing@nudt.edu.cn

Abstract
In this paper, we introduce a scalable macro-pipelined architecture to perform floating point matrix multiplication, which aims to exploit temporal parallelism and architectural scalability. We demonstrate the functionality of the hardware design with 16 processing elements (PEs) on a Xilinx ML507 development board containing a Virtex-5 XC5VFX70T. A 32-PE design for matrix sizes ranging from 32*32 to 1024*1024 is also simulated. Our experiments show that we achieve 12.18 GFLOPS with 32 PEs, or about 1.90 GFLOPS per PE per GHz, which corresponds to over 95% PE usage. Moreover, the proposed SMPA has the capability to scale up to tens or hundreds of GFLOPS using multiple FPGA devices and a high speed interconnect.

Keywords: matrix multiplication; temporal parallelism; macro-pipeline; FPGA accelerator

I. INTRODUCTION

The fundamental way to improve the performance of a computer system is to exploit parallelism. There are two kinds of parallelism: spatial parallelism and temporal parallelism. Spatial parallelism uses duplicated function units (FUs), multiple cores or even multiprocessors to operate on different data sets at the same time. Current research mainly focuses on this form, for example superscalar processors, multicores, graphics processing units (GPUs) and multiprocessor systems. Temporal parallelism uses a multi-stage pipeline or macro-pipeline to partition an application into multiple phases and data sets that run simultaneously. Clearly, temporal parallelism and spatial parallelism can yield the same potential speedup.

Matrix multiplication is a typical routine/kernel for scientific applications. The LINPACK Benchmark has been used for many years to evaluate computer systems. The Basic Linear Algebra Subprograms (BLAS) are high quality building block routines performing basic vector and matrix operations. The popular Level 3 BLAS mostly target matrix-matrix operations of order O(n^3) [1].

There have been many attempts to improve the efficiency of the matrix multiplication algorithm. Coprocessors and accelerators have been implemented using GPUs, field programmable gate arrays (FPGAs), digital signal processors (DSPs) and application specific integrated circuits (ASICs). Most implementations use a spatial parallelism approach. Efficient algorithms have also been proposed, and optimized libraries for specific computer architectures have been designed.

The proposed Scalable Macro-Pipelined Architecture (SMPA) exploits temporal parallelism, not found in [1]. The implementation uses DSP blocks to function as multiplier-accumulator (MAC) PEs, as described in [2] and [3], and uses distributed memory as the memory architecture, as discussed in [3]. The algorithm uses many dimensions of parallelism, along the computation axis of each PE and the communication axis amongst multiple PEs. SMPA is a generic platform, not optimized for any particular software library as in [4] and [5]. The algorithm does not use multiple FPGAs as in [6]; however, it is scalable to multiple FPGAs.
The utilization of registers is minimal in order to reduce energy consumption, as argued in [7], [8] and [9]. Unlike [2] and [11], a ring topology is used by SMPA for connecting the PEs, and unlike [10], a front-end control unit is designed to control PE operations.

The remainder of this paper is organized as follows. In Section II, an overview of the proposed SMPA is presented. The matrix multiplication algorithm based on SMPA is described in Section III and its performance is analyzed in Section IV. Section V describes the proposed SMPA FPGA hardware implementation, and Section VI presents experimental results. Section VII concludes our discussion and introduces future work.

II. SCALABLE MACRO-PIPELINED ARCHITECTURE

Figure 1. SMPA Accelerator

As illustrated in Fig. 1, the accelerator system is composed of two parts: the host and the FPGA accelerator. The host offloads the multiplication task to the accelerator. Its responsibilities include data reordering of the multiplicand matrices, sending the reordered data to the accelerator, receiving the product data from the accelerator and data reordering of the product. The host is also responsible for writing and reading control registers in the accelerator to configure execution modes and to obtain performance statistics, respectively.

The FPGA accelerator is used for the matrix operations. It consists of five components: the PE ring (PER), the local memory ring (LMR), the control unit (CU), the host-accelerator interface (INF) and the memory controller (MC).

The PER includes multiple PEs, which are connected in a ring structure and operate in a pipelined fashion. Each stage of the pipeline (or PE) can be a multiplier-accumulator (MAC) or a powerful microprocessor. Intermediate results are passed from one stage to the next. There are two reasons to use a ring connection. First, it can support different matrix sizes: when the number of columns of matrix A (or the number of rows of matrix B) is larger than the number of PEs, we need multiple iterations. Second, it is scalable: the ring topology is easy to scale up or down.

Each PE has two local memories (LMs). One of them stores data from matrix A, and the other stores data from matrix B. The LMs are also organized in a ring structure. All of the data pumped into the LMR is consumed by the LMs.

The INF is the communication interface between the host and the accelerator. It is meant to operate at high speed so as not to create a bottleneck for the accelerator.

The CU is used to control the source and result matrix data flow. If the communication bandwidth of the INF is not fast enough to meet the bandwidth requirement of the LMs, the source data from the INF is diverted to off-chip memory. The CU then reads the data out of the off-chip memory and feeds it to the LMR. Similarly, result data is written back to the off-chip memory before being sent back to the host. Of course, if the bandwidth of the INF is fast enough, there is no need to use off-chip memory.

The advantages of the proposed SMPA are:
- It effectively overlaps communication and computation; computation starts as early as possible.
- No extra memory is needed to store intermediate results; all of them are kept in the PEs' internal pipeline stages.
- The on-chip local memory requirement is small. The data set for each PE is regular and can be reused.
- The interconnect topology is simple. A unidirectional ring topology is used to support various problem sizes.
- It adopts a data-flow style, so the control logic is very simple.
- Once the pipeline is filled up, it can yield performance comparable to spatial parallelism.
- It is scalable: a pipeline/ring is easy to scale up and down.

III. MATRIX MULTIPLICATION BASED ON SMPA

In general, the standard matrix multiplication C = A*B is defined as follows:

    C(i,j) = sum_{k=1..n} A(i,k) * B(k,j)                                  (1)

Matrix multiplication has an intrinsic pipeline property. An efficient way to use a pipelined architecture for matrix multiplication is for each stage (PE) of the macro-pipeline to calculate the product of two elements, one from matrix A and one from matrix B, and then accumulate it with the intermediate result from the previous stage. The matrix multiplication algorithm based on the SMPA architecture can be described with the pseudo-code shown in Fig. 2.

    for (c = 0; c <= nBy/mB - 1; c++)              // column block # of B
      for (r = 0; r <= nAx/mA - 1; r++)            // row block # of A
        for (j = 0; j <= n/p - 1; j++)             // round #
          for (i = 0; i <= p - 1; i++)             // PE #
            for (dA = 0; dA <= mA - 1; dA++)       // entry # of LMA
              for (dB = 0; dB <= mB - 1; dB++)     // entry # of LMB
                Cstg(mA*r + dA, mB*c + dB)(p*j + i) =
                    A(mA*r + dA, p*j + i)          // LMA, entry dA
                  * B(p*j + i, mB*c + dB)          // LMB, entry dB
                  + Cstg(mA*r + dA, mB*c + dB)(p*j + i - 1)   // from PE i-1

Figure 2. Matrix multiplication algorithm based on SMPA
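To make the loop order in Fig. 2 concrete, the following is a minimal, single-threaded C model of the same computation. It is a sketch only, not the hardware implementation: the row-major array layout, the small sizes in main and the plain accumulation into C in place of the pipelined partial sums Cstg are assumptions made for illustration, and the result is checked against the definition in (1).

#include <stdio.h>

/* Functional C model of the SMPA loop order in Fig. 2 (software sketch only).
 * A is nAx x n, B is n x nBy, with p PEs and local-memory depths mA and mB;
 * n is assumed to be a multiple of p, nAx of mA, and nBy of mB, as in the text. */
static void smpa_order_matmul(const float *A, const float *B, float *C,
                              int nAx, int n, int nBy, int p, int mA, int mB)
{
    for (int i = 0; i < nAx * nBy; i++) C[i] = 0.0f;
    for (int c = 0; c <= nBy / mB - 1; c++)            /* column block # of B */
      for (int r = 0; r <= nAx / mA - 1; r++)          /* row block # of A    */
        for (int j = 0; j <= n / p - 1; j++)           /* round #             */
          for (int i = 0; i <= p - 1; i++)             /* PE #                */
            for (int dA = 0; dA <= mA - 1; dA++)       /* entry # of LMA      */
              for (int dB = 0; dB <= mB - 1; dB++) {   /* entry # of LMB      */
                  int row = mA * r + dA, col = mB * c + dB, k = p * j + i;
                  /* accumulate A(row,k) * B(k,col); in hardware the partial
                     sum travels through the PE ring instead of sitting in C */
                  C[row * nBy + col] += A[row * n + k] * B[k * nBy + col];
              }
}

int main(void)
{
    enum { NAX = 4, N = 8, NBY = 6, P = 4, MA = 2, MB = 3 };   /* illustrative sizes only */
    float A[NAX * N], B[N * NBY], C[NAX * NBY];
    for (int i = 0; i < NAX * N; i++) A[i] = (float)(i % 7);
    for (int i = 0; i < N * NBY; i++) B[i] = (float)(i % 5);
    smpa_order_matmul(A, B, C, NAX, N, NBY, P, MA, MB);
    /* check against the definition in (1) */
    for (int r = 0; r < NAX; r++)
        for (int c = 0; c < NBY; c++) {
            float ref = 0.0f;
            for (int k = 0; k < N; k++) ref += A[r * N + k] * B[k * NBY + c];
            if (ref != C[r * NBY + c]) { printf("mismatch\n"); return 1; }
        }
    printf("SMPA loop order matches the definition in (1)\n");
    return 0;
}

The check in main confirms that the SMPA loop order visits every (row, k, column) product exactly once, which is the equivalence argument made for Fig. 2 and (1) below.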
Let us assume that the number of PEs in the PER is p and the number of pipeline stages in a PE is s, so the total number of pipeline stages in the whole PER is p*s. The size of matrix A is n_Ax*n, and the size of matrix B is n*n_By. The depth of LMA is m_A and that of LMB is m_B; both are expressed in number of words. The byte size of a word is z: it is 8 for double precision FP and 4 for single precision FP.

Essentially, this algorithm is a block matrix multiplication, as illustrated in Fig. 3. For simplicity, we consider the matrix dimensions to be multiples of the sub-matrix dimensions. We can prove that the algorithm in Fig. 2 is equal to (1). The outermost two loops (c, r) are loops over sub-matrices. The loop variable c refers to the sub-matrix index in matrix B, and the loop variable r refers to the sub-matrix index in matrix A. The size of a sub-matrix of A is m_A*n, while the size of a sub-matrix of B is n*m_B. The size of the result sub-matrix of C is thus m_A*m_B, as shown in Fig. 3. If n is larger than p, the ring needs to loop back n/p times. There are only p valid MAC operations in each round, although there are p*s stages in the whole PER. To fill up the PER and to reduce the to-LMR bandwidth, we reuse the data from matrix B m_A times. In this way, our approach reduces the bandwidth of matrix B reading by a factor of m_A. In Fig. 2, the inner two loop variables (d_A, d_B) refer to the entry numbers of LMA and LMB. The p*s cycles are defined as the unit round time, and we get:

    p * s = m_A * m_B                                                      (2)

In order to effectively use the data from matrices A and B, the host needs to reorder the source data into the order shown in Fig. 3.

IV. PERFORMANCE ANALYSIS

In Fig. 2, loop variable i refers to the PE number. In each round, each PE calculates p*s MAC operations and consumes all of the data in LMA and LMB. In the next round, all of the data in LMA and LMB is replaced with new data. Therefore, the round time is the main constraint.
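As a numerical illustration (the values are chosen only for this example, matching the 32-PE, s = 16 configuration used later): one round lasts p*s = 32*16 = 512 cycles, so (2) can be satisfied by, for instance, m_A = 16 and m_B = 32; each PE then performs 512 MAC operations per round and completely consumes and refills its 16-entry LMA and 32-entry LMB every 512 cycles.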

3 Figure. SMPA Matrix Multiplication For simplicity of discussion and without loss of generality, we consider square matrix multiplication of n*n. A. Local Memory Capacity From (), m A = p*s/m B. So, the total LM capacity is: LM = z p *( m + p * s / m ), [byte] () total * B B B. Bandwidth As shown in Fig. 1, there are four bandwidth requirements for the proposed SMPA architecture: BW to-per : aggregate bandwidth from LMs to PEs BW to-lmr : aggregate bandwidth from off-chip memory or directly from INF to LMs BW MEM : bandwidth of off-chip memory BW INF : bandwidth between host and accelerator For the BW to-per, we have: BW to PER = * z * p, [byte/cycle] (4) BW to-per is very high comparing to BW to-lmr. In order to maintain the bandwidth, we exploit the data reuse in LM during each round. After each round (p*s cycles), all data in LMA and LMB should be renewed. So we get: BW to LMR = LM total /( p * s) = z *( mb / s + p / mb ), [byte/cycle] (5) If BW INF >= BW to-lmr, there is no need to use off-chip memory. If not, BW MEM should be equal or greater than BW to-lmr. Otherwise, PER will partially stall when any PE cannot get source data from LMA or LMB. C. Turnaround Time and Peak Performance From () and (5), in order to minimize BW to-lmr and LM total, we can get: m = m = p s (6) A B * The turnaround time T is defined as the time between the host sending source matrices data to the accelerator and the host receiving the result matrix data. If the implementation uses off-chip memory, we get: * z * n n * z * n z * n * n/ ma T = + max(( ),( + )), BWINF p BWto _ LMR BWto _ LMR [cycle] (7) If the source matrices data and result matrix data are sent directly from INF to the LMs, we get: n * z * n z * n * n/ ma T = max(( ),( + )), [cycle] (8) p BW BW to _ LMR to _ LMR In our case, the PEs are fast enough, and will consume all of the data that LMs feeds to PEs. Therefore, BW to_lmr or BW INF becomes the bottleneck of the accelerator. Consider (6), from (7) & (8), we get: + n/ p * s T = z * n *( + ), [cycle] (9) BW INF BW to _ LMR z * n n T = *( + ), [cycle] (10) BW p* s to _ LMR Assume that, for peak performance, BW to-lmr should be fast enough so that it is not the bottleneck. The computation time of the PEs thus becomes the dominating factor. From (7) and (8), the minimum turnaround time T min is: 50

Assume that, for peak performance, BW_to-LMR is fast enough that it is not the bottleneck. The computation time of the PEs then becomes the dominating factor. From (7) and (8), the minimum turnaround time T_min is:

    T_min = 3*z*n^2 / BW_INF + n^3/p   [cycle]                             (11)

    T_min = n^3 / p   [cycle]                                              (12)

So, the peak performance is given by (13). For example, for a system of 32 PEs running at 200 MHz, the peak performance of the system is 12.8 GFLOPS.

    PERF_peak = lim_{n->inf} 2*n^3 / T_min = 2*p   [fp op/cycle]           (13)

D. Scalability
From (3) and (5), if s = 16 and z = 8, we get LM_total = 8*p*(m_B + 16*p/m_B) bytes and BW_to-LMR = 8*(m_B/16 + p/m_B) bytes/cycle. For a given number of PEs (p) and pipeline stages (s), we can adjust m_B (which also fixes m_A through (2)) to find the smallest BW_to-LMR and LM_total. As shown in Table I, the optimal choices appear along the diagonal. Table I indicates that SMPA has attractive scalability. Even in the case of p = 512 and z = 8 (double precision FP), LM_total is only 768 KB and BW_to-LMR is 96 B/cycle. For this case, multiple FPGAs, a high bandwidth memory controller or a multi-lane PCIe interface can be used to meet the computation resource and bandwidth requirements. Furthermore, SMPA uses a ring topology, so it is very easy to scale up and down.

TABLE I. BW_TO-LMR AND LM CAPACITY (s=16, z=8)
m_B    p=16             p=32            p=64            p=128            p=256             p=512
16     16 B/c, 4 KB     24 B/c, 12 KB   40 B/c, 40 KB   72 B/c, 144 KB   136 B/c, 544 KB   264 B/c, 2112 KB
32     20 B/c, 5 KB     24 B/c, 12 KB   32 B/c, 32 KB   48 B/c, 96 KB    80 B/c, 320 KB    144 B/c, 1152 KB
64     34 B/c, 8.5 KB   36 B/c, 18 KB   40 B/c, 40 KB   48 B/c, 96 KB    64 B/c, 256 KB    96 B/c, 768 KB
128    65 B/c, 16.5 KB  66 B/c, 33 KB   68 B/c, 68 KB   72 B/c, 144 KB   80 B/c, 320 KB    96 B/c, 768 KB

V. FPGA IMPLEMENTATION

To verify our proposed architecture, we implemented a 32-bit single precision floating point matrix multiplication FPGA accelerator using SMPA.

A. Host and Accelerator Interface
The host code is implemented in C and compiled with no optimization. The host machine contains an Intel Core 2 Duo processor P8600 (2.4 GHz) with 3 MB L2 cache and 2 GB DDR2 SDRAM, running Ubuntu. A Gigabit Ethernet communication medium is used between the host and the accelerator. There are two threads running on the machine. One thread is responsible for data reordering of the multiplicand matrices and sending the reordered data to the accelerator. The other thread is responsible for receiving the product data from the accelerator and data reordering of the product.

Figure 4. Control Unit data path

B. Control Unit
The datapath of the CU is illustrated in Fig. 4. Currently, the CU supports two modes:
- Source data are sent directly from the INF to the LMR, and after processing the result data are sent back to the host directly from the PER through the INF.
- Source data are first sent from the INF to the MC and then read by the LMR. After processing, the result data are temporarily written back to the MC, then read by the INF and sent to the host.

C. PE Ring and LM Ring

Figure 5. PE and LM rings

The architecture of the PER is shown in Fig. 5. Each PE includes a multiplier and an accumulator. We use Virtex-5 DSP48E slices to implement these functional units. The PEs execute in a data-flow fashion. As depicted in Fig. 1, when the off-chip memory or the host cannot provide enough data for the PER, the source data from LMA_x/LMB_x to PE_x, or the intermediate result from the previous stage, may not be ready (data_valid). In this case, PE_x is stalled. Since there are many PEs, it is very difficult to stall the whole PER in the same cycle. To solve this problem, we insert a stall buffer between PEs. When a stall occurs in any given PE, the previous PE is not stalled immediately but continues to send its intermediate results to the stall buffer, until the stall buffer is full (ring_stall).
In this way, we isolate the stall domain to a single PE, which improves the achievable frequency and performance.
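As an illustration of this decoupling, the following C fragment is a software sketch (not the paper's RTL) that models the stall buffer between two adjacent PEs as a small FIFO; the buffer depth is an assumed value. The downstream PE's stall propagates upstream only once the buffer fills, which is when ring_stall would be asserted.

#include <stdio.h>
#include <stdbool.h>

#define STALL_DEPTH 4                 /* assumed buffer depth; not specified in the paper */

/* Minimal model of the stall buffer between PE(i-1) and PE(i). */
typedef struct {
    float entries[STALL_DEPTH];
    int   count;                      /* number of buffered intermediate results */
} stall_buffer_t;

/* PE(i-1) tries to push an intermediate result. Returns false (ring_stall)
 * only when the buffer is already full, i.e. the stall has propagated. */
static bool push_result(stall_buffer_t *b, float value)
{
    if (b->count == STALL_DEPTH) return false;       /* upstream PE must now stall */
    b->entries[b->count++] = value;
    return true;
}

/* PE(i) pops a result when it is not stalled and data is available (data_valid). */
static bool pop_result(stall_buffer_t *b, bool downstream_stalled, float *value)
{
    if (downstream_stalled || b->count == 0) return false;
    *value = b->entries[0];
    for (int k = 1; k < b->count; k++) b->entries[k - 1] = b->entries[k];
    b->count--;
    return true;
}

int main(void)
{
    stall_buffer_t buf = { {0}, 0 };
    /* Downstream PE stalls for the first 5 cycles: the upstream PE keeps
     * producing into the buffer and only stalls once the buffer is full. */
    for (int cycle = 0; cycle < 8; cycle++) {
        bool downstream_stalled = (cycle < 5);
        bool accepted = push_result(&buf, (float)cycle);
        float v;
        bool consumed = pop_result(&buf, downstream_stalled, &v);
        printf("cycle %d: ring_stall=%d consumed=%d occupancy=%d\n",
               cycle, accepted ? 0 : 1, consumed ? 1 : 0, buf.count);
    }
    return 0;
}

The depth only determines how long a downstream stall can be absorbed before ring_stall propagates; the isolation of the stall domain described above does not depend on the particular depth chosen here.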

The data width of the LMR equals the data width of the DDR2 SDRAM. Each LM is implemented using dual-port Block RAM (BRAM) so that it can read source data from the SDRAM and feed data to its PE simultaneously.

D. Memory Controller
The Xilinx DDR2 memory controller is used to transfer the multiplicand data (A and B) and the result matrix (C) to and from the external DDR2 off-chip memory module. The total size of the memory on the ML507 is 256 MB. To improve memory performance, bursts of 32-bit words spanning multiple cycles are used. The PHY module of the memory controller is responsible for timing calibration, to make sure the data and data strobes are aligned for both writes and reads at a clock rate of 200 MHz. Fig. 6 depicts the major functional blocks and interface of the DDR2 SDRAM controller.

Figure 6. DDR2 SDRAM memory controller interface [12]

VI. EXPERIMENTAL RESULTS

A 16-PE 32-bit accelerator design operating at 125 MHz is implemented on the Xilinx ML507 development board to verify the functionality of the SMPA. Two test cases (32*32 and 64*64) are generated by the host and fed through the gigabit Ethernet MAC (EMAC) interface into the FPGA hardware. The results of the matrix multiplication from the hardware are verified. The total turnaround time of the 16-PE SMPA for the 32*32 and 64*64 matrix multiplications is 7.9 ms and 7.68 ms, respectively. The reported turnaround time includes communication overheads of the EMAC interface and the host program, such as data setup and retrieval times from the EMAC core to SDRAM and vice versa, and Linux kernel and operating system function calls.

TABLE II. SIMULATION TIME (IN NUMBER OF CYCLES)
Matrix size: 32*32, 64*64, 128*128, 256*256, 512*512, 1024*1024
inf_mem_writing: 1,610 54,656 18, ,65,50,86 14,011,54
cu_core_executing: 1,795 10,048 7, ,097 4,445,45 5,65,941
inf_mem_reading: 4,06 7,99 10, ,4 1,988,574 7,965,447
total # cycles: 19,441 9,10 41,19 1,96,091 9,96,85 57,4,91

TABLE III. MATRIX CALCULATION TIME (IN MS)
Matrix size: 32*32, 64*64, 128*128, 256*256, 512*512, 1024*1024
MATLAB BLAS
MATLAB MKL
smpa @ 200 MHz
smpa exec @ 200 MHz

TABLE IV. GFLOPS AND PE USAGE
Matrix size: 32*32, 64*64, 128*128, 256*256, 512*512, 1024*1024
MAC OP
GFLOPS
PE usage: 57.0% 81.6% 89.1% 92.6% 94.4% 95.2%

Due to the FPGA resource limitation of the ML507, a larger design with 32 PEs is not implemented but is instead simulated, to test floating point matrix multiplication for matrix sizes ranging from 32*32 to 1024*1024. The test frequency is set at 200 MHz for the MC, CU, PER and LMR. The gigabit EMAC core, however, operates at 125 MHz, which is generated internally by a digital clock management (DCM) macro from a 100 MHz source clock. The simulation cycles are broken into three segments:
1. inf_mem_writing: source matrices (A and B) written from the EMAC PHY, through the INF to the MC, then to SDRAM.
2. cu_core_executing: source matrices read from SDRAM through the MC to the LMR, then to the PER; data processing (pipelined PEs); result matrix (C) written from the PER through the MC to SDRAM.
3. inf_mem_reading: result matrix read from SDRAM through the MC to the INF, then to the EMAC PHY.

Figure 7. Simulation time (cycles) in logarithmic scale

Figure 8. Simulation time in logarithmic scale (SMPA vs. MATLAB)
These simulation times for matrix sizes of 32*32 through 1024*1024 are shown in Table II. Fig. 7 shows the execution times in logarithmic scale.
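The GFLOPS and PE usage values in Table IV follow directly from the cu_core_executing cycle counts in Table II and the GFLOPS formula given below. The following minimal C sketch performs that conversion; it assumes the 32-PE, 200 MHz configuration of Section VI and takes n and the cycle count as command-line arguments.

#include <stdio.h>
#include <stdlib.h>

/* Convert a cu_core_executing cycle count for an n*n multiplication into
 * GFLOPS and PE usage, following GFLOPS = 2*n^3 / cycles * f (f in GHz),
 * with peak throughput 2*p*f from (13). Sketch only; p and f below are the
 * 32-PE, 200 MHz configuration used for the simulation. */
int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s n cu_core_executing_cycles\n", argv[0]);
        return 1;
    }
    double n = atof(argv[1]);       /* matrix size */
    double cycles = atof(argv[2]);  /* cu_core_executing count from Table II */
    double f = 0.2;                 /* clock frequency in GHz */
    int p = 32;                     /* number of PEs */
    double gflops = 2.0 * n * n * n / cycles * f;
    printf("GFLOPS = %.2f, PE usage = %.1f%% of the %.1f GFLOPS peak\n",
           gflops, 100.0 * gflops / (2.0 * p * f), 2.0 * p * f);
    return 0;
}

Invoking it with n = 512 and the corresponding cu_core_executing count from Table II reproduces the 12.08 GFLOPS and 94.4% usage reported for that size.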

The total execution time, which includes the data transfers (inf_mem_writing, cu_core_executing and inf_mem_reading), and the PE execution time only (cu_core_executing) are both compared to the MATLAB simulation with BLAS and MKL in Table III. The MATLAB simulation runs on an Intel Core 2 Duo processor P8600 at 2.4 GHz. The following formula is used to calculate the GFLOPS of the SMPA architecture for matrix multiplication of size n*n:

    GFLOPS = [n^3 / (# cycles of processing)] * (# FLOPs per MAC) * [(# cycles per second) * 10^-9]

For example, a 32-PE SMPA matrix multiplication of 512x512 matrices running at 200 MHz produces 2 * (512^3) / 4,445,45 * 0.2 = 12.08 GFLOPS.

Table III indicates that the SMPA computation time is comparable to the time reported from the MATLAB simulation. It should be mentioned that the MATLAB simulation runs at 2.4 GHz with a peak performance of 19.2 GFLOPS, while the proposed SMPA design runs on a 32-PE FPGA device at 200 MHz with a peak performance of 12.8 GFLOPS (13). Moreover, as shown by Table I, the proposed SMPA architecture has very good scalability, which means it is easy to scale up to reach higher performance using a larger FPGA device and a faster host-accelerator communication medium. Fig. 8 shows the processing time in logarithmic scale. Table IV shows the performance in GFLOPS and the PE usage for matrix sizes from 32*32 to 1024*1024. As the matrix size increases, the performance approaches the peak value of 12.8 GFLOPS, or 2 GFLOPS per PE per GHz.

VII. CONCLUSIONS AND FUTURE WORK

A. Conclusions
In this paper, we analyzed and proposed a scalable macro-pipelined accelerator architecture for matrix multiplication. The basic idea is to exploit temporal parallelism and achieve better scalability. We also designed and demonstrated the functionality of both a 16-PE and a 32-PE accelerator architecture, through hardware implementation and behavioral simulation respectively. For the hardware implementation, we selected the Xilinx ML507 development board, which contains a Virtex-5 XC5VFX70T FPGA device, an EMAC core and a DDR2 SDRAM controller. With the proposed architecture, we achieved 12.18 GFLOPS with 32 PEs on 1024x1024 matrix multiplication, or about 1.90 GFLOPS per PE per GHz. The peak performance of the 32-PE design is 12.8 GFLOPS. This represents over 95% PE usage and good performance improvement with the scalable architecture. We also observed that, for larger matrices, the computation task performed by the PEs is the dominating factor over the communication tasks from the INF to the MC and from the MC to the INF. For smaller matrix sizes, for example 512x512, communication is the dominating factor. Our experiments have demonstrated that the proposed SMPA has the capability of scaling up to tens and hundreds of GFLOPS by using multiple FPGA devices and a high speed interconnect such as PCIe.

B. Future Work
In the future, we will implement an accelerator with 64 or more PEs on a larger FPGA device and run larger-scale matrix multiplication. The next step is also to exploit the use of PCIe or other high speed interconnects to reduce the communication overheads. Furthermore, SMPA is suitable for a wide range of applications. We plan to map the following key kernels of scientific applications to SMPA:
- Solution of linear equations
- Discrete Fourier transform
- Graph algorithms

ACKNOWLEDGMENT
We would like to acknowledge NSERC for funding and Xilinx for providing development boards. This work is also supported by NSFC under contract # , and the State High-Tech Development Plan of China under contract #2009AA01Z10.
We also thank a number of graduate students and researchers in the Computer Engineering Group at the University of Toronto for design discussions.

REFERENCES
[1] Y. Dou, S. Vassiliadis, G. K. Kuzmanov and G. N. Gaydadjiev, "64-bit Floating-Point FPGA Matrix Multiplication," ACM/SIGDA 13th International Symposium on FPGAs, Feb. 2005.
[2] R. M. Piedra, "A Parallel Approach for Matrix Multiplication on the TMS320C4x DSP," SPRA107, pp. 1-4, February 1994.
[3] R. El-Atfy, M. A. Dessouky and H. El-Ghitani, "Accelerating Matrix Multiplication on FPGAs," International Design and Test Workshop, December 2007.
[4] K. Bovay, B. Henderson, H.-Y. Lin and K. Wadleigh, High Performance Computing Division, Hewlett-Packard Company, pp. 1-6, 2007.
[5] J. L. Gustafson and B. S. Greer, "ClearSpeed Advance: A Hardware Accelerator for the Intel Math Kernel Library," ClearSpeed Technology, 2006.
[6] S. Rousseaux, D. Hubaux, P. Guisset and J.-P. Legat, "A High Performance FPGA-Based Accelerator for BLAS Library Implementation," RSSI, pp. 1-10, July 2007.
[7] P. Picari, P. Corsonello, S. Perri and G. Cocorullo, "A Matrix Product Accelerator for Field Programmable Systems on Chip," Microprocessors and Microsystems, 2008.
[8] J. Jang, S. B. Choi and V. K. Prasanna, "Energy- and Time-Efficient Matrix Multiplication on FPGAs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, November 2005.
[9] J. Jang, S. B. Choi and V. K. Prasanna, "Area and Time Efficient Implementations of Matrix Multiplication on FPGAs," FPT, December 2002.
[10] S. J. Campbell and S. P. Khatri, "Resource and Delay Efficient Matrix Multiplication Using Newer FPGA Devices," Proceedings of the 16th ACM Great Lakes Symposium on VLSI, Philadelphia, PA, USA, 2006.
[11] X. Xu, S. G. Ziavras and T.-G. Chang, "An FPGA-Based Parallel Accelerator for Matrix Multiplications in the Newton-Raphson Method," EUC 2005, Japan.
[12] Xilinx, "DDR2 SDRAM Interface in Virtex-5 Devices," Application Note XAPP858, 2008.
