Perspectives for the Use of Field Programmable Gate Arrays for Finite Element Computations
Gerhard Lienhart¹, Daniel Gembris¹, Reinhard Männer¹
¹ Universität Mannheim, {lienhart,gembris,maenner}@ti.uni-mannheim.de

Abstract. We have studied how the solution of partial differential equations by means of finite element methods (FEM) could be accelerated using Field Programmable Gate Arrays (FPGAs). First, we discuss in general the capabilities of current FPGA technology for floating-point implementations of number crunching. Based on practical results for basic floating-point operators, performance limits are outlined. Then the perspectives for the implementation of LU decomposition with a state-of-the-art FPGA chip are addressed. It is estimated that, compared with a modern CPU, a speedup by a factor of 15-20 can be expected using a single off-the-shelf FPGA.

1 Introduction

In recent years there have been strong activities in the field of reconfigurable computing. In the focus of this emerging branch of computing are FPGAs: highly integrated semiconductor chips whose function can be freely specified at the gate level, in contrast to conventional CPUs. Given a suitable computational problem, the design flexibility of these chips allows massively parallel computations, which can by far overcompensate the intrinsic disadvantage of lower clock speeds, being O(1/10) compared to state-of-the-art CPUs (Compton & Hauck, 2002). One class of problems that could clearly benefit from the acceleration achievable with FPGAs is real-time FEM, a technique that is of relevance for virtual reality applications, like medical surgery simulation or interactive engineering (Rhomberg et al. 1998, Bro-Nielsen 1998, Margetts et al. 2005), or simulations in the context of robot control. In general, virtual reality applications require that any action of the user is followed by an immediate response of the simulation system, i.e. within less than about 20 ms.
An example would be that the user applies a force to a virtual object, which should result in direct feedback mediating an impression of the evoked object deformation (mainly in terms of vision and haptics, provided by visual and tactile/haptic displays). While the demands on computation speed are high, the demands on accuracy are only moderate, since in a virtual reality application a qualitative agreement between the simulation and the corresponding real physical process is mostly sufficient. These characteristics make the use of FPGAs very attractive for solving the underlying FEM problems, due to the high degree of parallelism and the reduction in bit width of the floating-point numbers that become possible. Figure 1 shows a schematic representation of an FPGA. The chip essentially consists of an array of configurable logic blocks interconnected by a programmable network. The logic blocks provide basic combinatorial logic realized by small look-up tables (LUTs) and flip-flops, and are usually enhanced with additional features for efficient data buffering and arithmetic functionality. Modern FPGAs contain many thousands of programmable logic blocks. For example, the largest of the new Virtex-4 FPGAs, the XC4VFX140 manufactured by XILINX, has 63,168 slices, where a slice contains two LUTs and two flip-flops. Modern FPGAs also comprise hard-wired circuits like multipliers and memories. For the above-mentioned FPGA there are 552 18-kbit memory elements and 192 so-called DSP48 cells, which contain an 18-bit multiplier and additional logic for efficiently composing more complex integer operators, e.g.
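The trade-off between mantissa width and accuracy mentioned above can be explored in ordinary software before committing to an FPGA design. The following sketch is our own illustration (not the authors' operator library): it emulates a narrower floating-point format by rounding IEEE-754 doubles to a reduced mantissa width.

```python
import struct

def reduce_mantissa(x: float, bits: int) -> float:
    """Emulate a float with `bits` mantissa bits by rounding a 64-bit double.
    IEEE-754 doubles carry 52 explicit mantissa bits; the low-order bits are
    rounded away, while sign and exponent are kept at full range."""
    (i,) = struct.unpack("<Q", struct.pack("<d", x))
    drop = 52 - bits
    i = (i + (1 << (drop - 1))) & ~((1 << drop) - 1)  # round to nearest
    return struct.unpack("<d", struct.pack("<Q", i))[0]

# Dot product of a decaying vector: full double vs. emulated 16-bit mantissa
a = [1.0 / (k + 1) for k in range(100)]
full = sum(x * x for x in a)
narrow = sum(reduce_mantissa(x, 16) ** 2 for x in a)
print(full, narrow)
```

With a 16-bit mantissa the relative error of the dot product stays in the 1e-5 range, which would already satisfy the moderate accuracy demands of a virtual reality simulation.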
multiply/add operators with a higher bit width. With modern FPGAs, complex system-on-a-chip (SoC) architectures can be built. These systems may consist of dedicated processing units as well as programmable units like soft-processors (e.g. the MicroBlaze processor core, XILINX 2005). With the advent of modern FPGA technology it became possible to design custom computing machines without the need to construct application-specific integrated circuits (ASICs). A common approach to making FPGA technology available for applications is to build a coprocessor board which can be installed in a host computer. The right side of Fig. 1 shows such a board. It has been developed at the University of Mannheim as a prototype system for different applications, ranging from image processing to high-performance computing for astrophysical simulations.

Fig. 1. Schematic representation of an FPGA (left) and a typical FPGA-based PCI board (right).

2 Basics for FEM processing with FPGAs

While for many areas where FPGAs are already established (e.g. signal processing) fixed-point arithmetic is sufficient, number crunching applications usually require floating-point computations. Only a few years ago did FPGAs become available which are suitable for implementing complex calculation units for floating-point arithmetic. Related implementation issues have already been studied to some extent (Belanovic & Leeser 2002, Roesler & Nelson 2002, Underwood 2004). Currently there are steadily increasing activities in the scientific and engineering community to apply reconfigurable architectures to problems of high-performance computing. For floating-point based high-performance computing see e.g. Lienhart et al. (2002) and Hamada & Nakasato (2005). Figure 2 and Tab. 1 demonstrate the relation between precision and FPGA logic utilization for different operators. Limiting the precision requirements allows for highly parallel calculation units.
The above-mentioned FPGA XC4VFX140 may, for example, contain up to 150 adders and 50 multipliers with single precision, or 60 adders and 20 multipliers with double precision. When only using multiply-add units we can estimate a peak performance of 20 GFLOPs for single precision (50 units) or 8 GFLOPs for double precision (20 units).

Fig. 2. Scaling of the logic consumption of adders and multipliers from the University of Mannheim libraries for different mantissa widths (single precision corresponds to a 24-bit mantissa, double precision to 53 bits).
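The peak figures quoted above follow from a simple unit count. A small sketch of the arithmetic; the 200 MHz clock is our assumption for illustration, chosen so that the numbers reproduce the text:

```python
def peak_gflops(mul_add_units: int, clock_mhz: float) -> float:
    """Peak rate when every multiply-add unit retires 2 FLOPs per cycle."""
    return mul_add_units * 2 * clock_mhz / 1000.0

sp = peak_gflops(50, 200)  # 50 single-precision multiply-add units
dp = peak_gflops(20, 200)  # 20 double-precision multiply-add units
print(sp, dp)              # 20 and 8 GFLOPs, as estimated in the text
```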
Tab. 1. Logic consumption, speed and latency (clock cycles) of different single-precision (SP) and double-precision (DP) floating-point operators for Virtex-4 (-10 speed grade) FPGAs, with libraries from the University of Mannheim and from XILINX Inc. (information for the XILINX operators according to the data sheet, XILINX 2005). For each operator the table lists the Virtex-4 slices, DSP48 cells, speed (MHz) and latency; the operators covered are SP/DP adders, multipliers and dividers as well as an SP accumulator.

2.1 Target algorithms

Generally the finite element method leads to a set of coupled partial differential equations, which can be approximated by a system of linear equations A x = b. Usually the matrix A is sparse and, depending on the type of the underlying problem, exhibits additional properties like symmetry or diagonal dominance. For small problem sizes, the direct solution via LU decomposition is generally applicable. A standard approach in these cases is the use of multifrontal methods like those implemented in UMFPACK (Davis & Duff 1997), explained below. The LU decomposition method solves the linear system A x = (L U) x = b by forward substitution with the lower (L) and backward substitution with the upper (U) triangular matrix:

  L y = b,   U x = y.

A simple algorithm for performing the LU decomposition of an n x n matrix is Crout's algorithm, which successively performs the following calculations for j = 1 ... n:

  u_ij = A_ij - sum_{k=1..i-1} l_ik u_kj,   for i = 1 ... j
  l_ij = (A_ij - sum_{k=1..j-1} l_ik u_kj) / u_jj,   for i = j+1 ... n

In general, LU decomposition is an algorithm of complexity O(n^3), and the forward and backward substitutions have complexity O(n^2). But the number of non-zero coefficients in sparse matrices from FEM usually scales with O(n), instead of O(n^2) in the general case, which may reduce the effort dramatically. For optimizing the memory requirements and access patterns, the multifrontal method was developed (for a review see Liu 1992).
The method organizes the numerical factorization into a series of steps which involve partial factorization of smaller dense matrices (frontal matrices). The steps are processed according to an elimination tree structure, which provides a natural framework for parallel computations. The method allows for better data locality and efficient out-of-core schemes, and avoids indirect addressing within the factorization steps. LU decomposition is widely applicable and is the default method in FEMLAB for 1D and 2D problems. Even when iterative solvers are more appropriate, (incomplete) LU factorization can be used as a preconditioner (although algebraic multigrid methods are often preferred).
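Crout's recurrences translate almost literally into code. A minimal dense reference sketch in Python, without the pivoting and sparsity handling that the multifrontal method adds:

```python
def crout_lu(A):
    """Dense Crout LU: returns (L, U) with A = L*U and a unit diagonal on L.
    Follows the recurrences above; O(n^3), no pivoting."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j + 1):       # u_ij for i = 1 ... j
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        L[j][j] = 1.0
        for i in range(j + 1, n):    # l_ij for i = j+1 ... n
            L[i][j] = (A[i][j] - sum(L[i][k] * U[k][j] for k in range(j))) / U[j][j]
    return L, U

def lu_solve(L, U, b):
    """Forward substitution L y = b, then backward substitution U x = y."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

L, U = crout_lu([[4.0, 3.0], [6.0, 3.0]])
x = lu_solve(L, U, [10.0, 12.0])   # solves 4a + 3b = 10, 6a + 3b = 12
print(x)
```

The inner sums are exactly the vector dot products that the architecture proposed below maps onto multiply-accumulate units.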
2.2 Related work

Previous work on FPGA implementations of linear solvers mainly focuses on direct LU decomposition with algorithms that are simple to handle. Efficient algorithms which can be applied more generally, like multifrontal methods with pivoting and complex submatrix handling, are still a subject of research. It has already been studied how basic linear algebra processing, like matrix-vector multiplication, can be realized with FPGAs (for recent work see e.g. Underwood & Hemmert 2004, Zhuo & Prasanna 2005 and DeLorimier & DeHon 2005). DeLorimier, for example, presents a dedicated processing unit for matrix-vector multiplication with double precision which achieves a sustained performance of 1.5 GFLOPs on a single FPGA (XC2V6000-4). Underwood provides trends and performance comparisons for different BLAS library operations on FPGAs and CPUs. For example, a performance of 4 GFLOPs on a single FPGA (XC2VP100-6) for matrix multiplication at double precision was reported. Concerning LU decomposition, there are FPGA implementations of different block-based algorithms (Choi & Prasanna 2003, Daga et al. 2004, Wang & Ziavras 2004). To our knowledge no pivoting is applied in this previous work, which limits the applicability to special problems. The design of Wang and Ziavras was developed for equation systems which can be represented by bordered-diagonal-block sparse matrices. A multiprocessor system on a single FPGA has been implemented, but unfortunately the performance was not compared to standard CPUs. Daga et al. present an LU decomposition for large matrices (n = 1000) which is not optimized for sparse systems. For their test case they claim a speedup of 23x in total computing time with double-precision calculation on a very large FPGA (Virtex-2 Pro XC2VP125) compared to a 1.4 GHz Pentium M CPU.
3 Proposed architecture

We propose an architecture with multiple equal processing elements (PEs), each consisting of a dedicated computing element for linear algebra processing with local memory, controlled by a soft-processor. The design of the computing element is optimized for the basic linear algebra processing tasks, like the vector dot product, which are required for LU decomposition. The basic computational step for these tasks is a multiply-accumulate operation, to which the primary building blocks are hence dedicated. The left side of Fig. 3 shows the structure of a PE, which can be used in a flexible manner due to the multiplexers between the operators. This architecture allows independent multiplication and addition as well as vectorized multiply-accumulate operations. In order to provide a sustained performance close to peak performance, a dedicated accumulator is used in addition to the adder. Our floating-point accumulator accumulates positive and negative values separately. Therefore a final add must be performed at the end of an accumulation, which is why a loop-back from the accumulator output to the inputs of the adder is routed. Basic linear algebra operations are performed automatically, controlled by the Linear Algebra Task Control (LATC) unit. The data flow is controlled by the Operand Fetch and Result Dispatch units. The operations are performed on local memory (internal FPGA dual-port block RAM). The attached processor core takes care of data management and processing control by assembling data in the memory and triggering operations for the LATC unit. Using a processor for this purpose provides the flexibility to apply enhanced factorization algorithms like the multifrontal method. We propose to use the XILINX MicroBlaze soft-processor. With an on-chip memory interface, multiple special point-to-point communication channels to attached circuits, and an on-chip peripheral bus to connect to the environment, this core provides all necessary interfaces to assemble the PEs as drawn.
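The split accumulation scheme described above can be modeled behaviorally in a few lines. This is our own sketch of the idea (positive and negative partial sums kept apart, combined by one final add); the hardware pipelining is omitted:

```python
def split_accumulate(values):
    """Accumulate positive and negative summands separately, as the PE's
    dedicated accumulator does, and combine them with a single final add
    (in hardware this final add is routed through the PE's adder via the
    loop-back from the accumulator output)."""
    pos = 0.0
    neg = 0.0
    for v in values:
        if v >= 0.0:
            pos += v
        else:
            neg += v
    return pos + neg  # the single final add

print(split_accumulate([1.5, -0.25, 2.0, -1.0]))
```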
On the right side of Fig. 3 the overall architecture of the proposed FPGA design is shown. Several PEs share a single divider, as this operation is rarely needed. The PEs are connected with each other, with the off-chip memory, and with the host interface via the on-chip peripheral bus of the processor cores. The number m of PEs which share a divider is a parameter and depends on the target application; m = 8 would be a reasonable choice. The detailed design of the interconnection between the PEs and the host and memory interfaces would depend on the respective reconfigurable platform.

Fig. 3. Proposed processing element (left) and overall architecture (right).

4 Performance analysis

With the design outlined above we can now estimate the performance achievable in FEM processing. A single calculation unit consisting of a multiplier, an adder and an accumulator consumes about 1000 slices and 4 DSP48 cells. The MicroBlaze soft-processor adds about 500 slices, and we can charge another 200 slices for control and data management. Altogether a PE will require about 1700 slices. Therefore the mentioned FPGA XC4VFX140 may house 32 processing elements including 8 single-precision dividers. The on-chip memory allows local dual-port buffers of 38 KBytes per PE. Taking into account that a significant fraction of this memory is used as instruction memory for the processor core, this amount of memory limits a PE to handling submatrices with a size of up to about 70x70. Calculations with larger matrices need to be divided among more than one PE. Assuming a conservative clock frequency of 160 MHz and two concurrent floating-point operations per PE, we get a peak performance of 10 GFLOPs on a single FPGA.
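The budget above is straightforward to re-derive; a sketch using only the figures quoted in the text:

```python
# Per-PE slice budget for single precision, as quoted above
slices_per_pe = 1000 + 500 + 200      # calc unit + MicroBlaze + control

pes = 32                              # PEs housed by the XC4VFX140,
                                      # including the 8 shared dividers
peak_gflops = pes * 2 * 160 / 1000.0  # 2 concurrent FP ops per PE at 160 MHz
print(slices_per_pe, peak_gflops)     # 1700 slices per PE, roughly 10 GFLOPs
```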
Estimating an efficiency of more than 50%, which is reasonable for custom computing architectures, we get a sustained performance of about 5 GFLOPs. Assuming 300 MFLOPs for a general-purpose CPU, which is realistic as we usually see an efficiency of less than 10% for the solution of sparse systems, we can expect a speedup of 15-20. When going to double precision, a single processing element would consume about 3300 slices and 9 DSP48 cells. Therefore only 16 PEs will fit on the same FPGA. Estimating the performance in the same way as above, a peak performance of 5 GFLOPs and an estimated speedup of 5-10 result. About 25% of the logic resources could be saved by also using the adder in the PE for accumulation. Because of the latency of the adder, however, performing the multiply-accumulate operations efficiently would then become much more complex and would require an additional feedback FIFO between the output of the adder and one of its inputs. It would also be more difficult to get close to peak performance, i.e. the algorithm would have to allow a much higher degree of parallelism in order to hide the adder latency. But such a design would enable a 48-fold parallelization with a peak performance of more than 15 GFLOPs.
5 Conclusions

We demonstrated the potential of reconfigurable computing for solving the linear equations arising from FEM problems. A dedicated FPGA design consisting of multiple equal processing elements was outlined, and a resulting performance of 10 GFLOPs for single precision was estimated. Because of the integrated processor cores, the architecture is capable of dealing with different algorithms for LU decomposition. The design should also provide good performance for other solvers, like those based on the method of conjugate gradients (iterative) or algebraic multigrid solvers. The efficiency, and therefore the overall speedup, depends on how efficiently the underlying algorithms can be parallelized to perform independent basic linear algebra operations on local memory. For FPGA computing machines this is a major topic of research, and the prospect of computation power for number crunching strongly motivates further work in this direction.

References

Belanovic, P.; Leeser, M. (2002) A Library of Parameterized Floating-Point Modules and Their Use. Proc. FPL'02.
Bro-Nielsen, M. (1998) Finite element modeling in surgery simulation. Proc. of the IEEE 86(3).
Choi, S.; Prasanna, V.K. (2003) Time and Energy Efficient Matrix Factorization using FPGAs. Proc. FPL'03.
Compton, K.; Hauck, S. (2002) Reconfigurable computing: A survey of systems and software. ACM Computing Surveys 34(2).
Davis, T.A.; Duff, I. (1997) An Unsymmetric-pattern Multifrontal Method for Sparse LU Factorization. SIAM J. Matrix Anal. Appl. 18(1).
DeLorimier, M.; DeHon, A. (2005) Floating-Point Sparse Matrix-Vector Multiply for FPGAs. Proc. FPGA'05.
Hamada, T.; Nakasato, N. (2005) PGR: A Software Package for Reconfigurable Super-Computing. Proc. FPL'05.
Margetts, L.; Smethurst, C.; Ford, R. (2005) Interactive Finite Element Analysis. NAFEMS World Congress, Malta, May 2005.
Johnson, J.R.; Nagvajara, P.; Nwankpa, C. (2004) Sparse Linear Solver for Power System Analysis using FPGA. Proc. HPEC'04.
Heath, M.T.
(1997) Parallel direct methods for sparse linear systems. In: Parallel Numerical Algorithms, Kluwer, Boston.
Lienhart, G.; Kugel, A.; Männer, R. (2002) Using Floating Point Arithmetic on FPGAs for Accelerating Scientific N-Body Simulations. Proc. FCCM'02.
Liu, J. (1992) The Multifrontal Method for Sparse Matrix Solution: Theory and Practice. SIAM Review 34.
Rhomberg, A.; Enzler, R.; Thaler, M.; Tröster, G. (1998) Design of a FEM Computation Engine for Real-Time Laparoscopic Surgery Simulation. Proc. of the IPPS/SPDP.
Roesler, E.; Nelson, B.E. (2002) Novel Optimizations for Hardware Floating-Point Units in a Modern FPGA Architecture. Proc. FPL'02.
Underwood, K. (2004) FPGAs vs. CPUs: Trends in Peak Floating-Point Performance. Proc. FPGA'04.
Underwood, K.; Hemmert, K.S. (2004) Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance. Proc. FCCM'04.
Wang, X.; Ziavras, S.G. (2004) Parallel LU factorization of sparse matrices on FPGA-based configurable computing engines. Concurrency Computat.: Pract. Exper. 16.
XILINX Inc. (2005) Floating-Point Operator v1.0 Product Specification. DS335.
XILINX Inc. (2005) MicroBlaze Processor Reference Guide.
Zhuo, L.; Prasanna, V.K. (2005) Sparse Matrix-Vector Multiplication on FPGAs. Proc. FPGA'05.
Xu, X.; Ziavras, S.G. (2003) Iterative Methods for Solving Linear Systems of Equations on FPGA-Based Machines. Proc. Computers and their Applications.
More informationRUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch
RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,
More informationInternational Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering
An Efficient Implementation of Double Precision Floating Point Multiplier Using Booth Algorithm Pallavi Ramteke 1, Dr. N. N. Mhala 2, Prof. P. R. Lakhe M.Tech [IV Sem], Dept. of Comm. Engg., S.D.C.E, [Selukate],
More informationReconfigurable Hardware Implementation of Mesh Routing in the Number Field Sieve Factorization
Reconfigurable Hardware Implementation of Mesh Routing in the Number Field Sieve Factorization Sashisu Bajracharya, Deapesh Misra, Kris Gaj George Mason University Tarek El-Ghazawi The George Washington
More informationMatrix Multiplication Implementation in the MOLEN Polymorphic Processor
Matrix Multiplication Implementation in the MOLEN Polymorphic Processor Wouter M. van Oijen Georgi K. Kuzmanov Computer Engineering, EEMCS, TU Delft, The Netherlands, http://ce.et.tudelft.nl Email: {w.m.vanoijen,
More informationDeveloping a Data Driven System for Computational Neuroscience
Developing a Data Driven System for Computational Neuroscience Ross Snider and Yongming Zhu Montana State University, Bozeman MT 59717, USA Abstract. A data driven system implies the need to integrate
More informationHigh Throughput Iterative VLSI Architecture for Cholesky Factorization based Matrix Inversion
High Throughput Iterative VLSI Architecture for Cholesky Factorization based Matrix Inversion D. N. Sonawane 1 and M. S. Sutaone 2 1 Department of Instrumentation & Control 2 Department of Electronics
More informationHigh Throughput Energy Efficient Parallel FFT Architecture on FPGAs
High Throughput Energy Efficient Parallel FFT Architecture on FPGAs Ren Chen Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, USA 989 Email: renchen@usc.edu
More informationFPGA Accelerated Parallel Sparse Matrix Factorization for Circuit Simulations*
FPGA Accelerated Parallel Sparse Matrix Factorization for Circuit Simulations* Wei Wu, Yi Shan, Xiaoming Chen, Yu Wang, and Huazhong Yang Department of Electronic Engineering, Tsinghua National Laboratory
More informationIterative Refinement on FPGAs
Iterative Refinement on FPGAs Tennessee Advanced Computing Laboratory University of Tennessee JunKyu Lee July 19 th 2011 This work was partially supported by the National Science Foundation, grant NSF
More informationSegment 1A. Introduction to Microcomputer and Microprocessor
Segment 1A Introduction to Microcomputer and Microprocessor 1.1 General Architecture of a Microcomputer System: The term microcomputer is generally synonymous with personal computer, or a computer that
More informationCREATED BY M BILAL & Arslan Ahmad Shaad Visit:
CREATED BY M BILAL & Arslan Ahmad Shaad Visit: www.techo786.wordpress.com Q1: Define microprocessor? Short Questions Chapter No 01 Fundamental Concepts Microprocessor is a program-controlled and semiconductor
More informationFPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011
FPGA for Complex System Implementation National Chiao Tung University Chun-Jen Tsai 04/14/2011 About FPGA FPGA was invented by Ross Freeman in 1989 SRAM-based FPGA properties Standard parts Allowing multi-level
More informationFPGA architecture and implementation of sparse matrix vector multiplication for the finite element method
Computer Physics Communications 178 (2008) 558 570 www.elsevier.com/locate/cpc FPGA architecture and implementation of sparse matrix vector multiplication for the finite element method Yousef Elkurdi,
More informationIntroduction to Field Programmable Gate Arrays
Introduction to Field Programmable Gate Arrays Lecture 1/3 CERN Accelerator School on Digital Signal Processing Sigtuna, Sweden, 31 May 9 June 2007 Javier Serrano, CERN AB-CO-HT Outline Historical introduction.
More informationEfficient Multi-GPU CUDA Linear Solvers for OpenFOAM
Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,
More informationCHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP
133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located
More informationQUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection
QUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection Sunil Shukla 1,2, Neil W. Bergmann 1, Jürgen Becker 2 1 ITEE, University of Queensland, Brisbane, QLD 4072, Australia {sunil, n.bergmann}@itee.uq.edu.au
More informationNew Computational Modeling for Solving Higher Order ODE based on FPGA
New Computational Modeling for Solving Higher Order ODE based on FPGA Alireza Fasih 1, Tuan Do Trong 2, Jean Chamberlain Chedjou 3, Kyandoghere Kyamakya 4 1, 3, 4 Alpen-Adria University of Klagenfurt Austria
More informationVirtex-II Architecture. Virtex II technical, Design Solutions. Active Interconnect Technology (continued)
Virtex-II Architecture SONET / SDH Virtex II technical, Design Solutions PCI-X PCI DCM Distri RAM 18Kb BRAM Multiplier LVDS FIFO Shift Registers BLVDS SDRAM QDR SRAM Backplane Rev 4 March 4th. 2002 J-L
More informationReconfigurable Computing. Introduction
Reconfigurable Computing Tony Givargis and Nikil Dutt Introduction! Reconfigurable computing, a new paradigm for system design Post fabrication software personalization for hardware computation Traditionally
More informationSoC Basics Avnet Silica & Enclustra Seminar Getting started with Xilinx Zynq SoC Fribourg, April 26, 2017
1 2 3 4 Introduction - Cool new Stuff Everybody knows, that new technologies are usually driven by application requirements. A nice example for this is, that we developed portable super-computers with
More informationImproving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints
Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent
More informationInternational Journal of Advance Engineering and Research Development
Scientific Journal of Impact Factor (SJIF): 4.14 International Journal of Advance Engineering and Research Development Volume 3, Issue 11, November -2016 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Review
More informationImplementation of a FIR Filter on a Partial Reconfigurable Platform
Implementation of a FIR Filter on a Partial Reconfigurable Platform Hanho Lee and Chang-Seok Choi School of Information and Communication Engineering Inha University, Incheon, 402-751, Korea hhlee@inha.ac.kr
More informationHonorary Professor Supercomputer Education and Research Centre Indian Institute of Science, Bangalore
COMPUTER ORGANIZATION AND ARCHITECTURE V. Rajaraman Honorary Professor Supercomputer Education and Research Centre Indian Institute of Science, Bangalore T. Radhakrishnan Professor of Computer Science
More informationGeneral Purpose Signal Processors
General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:
More informationCNP: An FPGA-based Processor for Convolutional Networks
Clément Farabet clement.farabet@gmail.com Computational & Biological Learning Laboratory Courant Institute, NYU Joint work with: Yann LeCun, Cyril Poulet, Jefferson Y. Han Now collaborating with Eugenio
More informationPipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications
, Vol 7(4S), 34 39, April 204 ISSN (Print): 0974-6846 ISSN (Online) : 0974-5645 Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications B. Vignesh *, K. P. Sridhar
More informationAltera FLEX 8000 Block Diagram
Altera FLEX 8000 Block Diagram Figure from Altera technical literature FLEX 8000 chip contains 26 162 LABs Each LAB contains 8 Logic Elements (LEs), so a chip contains 208 1296 LEs, totaling 2,500 16,000
More information5. ReAl Systems on Silicon
THE REAL COMPUTER ARCHITECTURE PRELIMINARY DESCRIPTION 69 5. ReAl Systems on Silicon Programmable and application-specific integrated circuits This chapter illustrates how resource arrays can be incorporated
More informationXilinx DSP. High Performance Signal Processing. January 1998
DSP High Performance Signal Processing January 1998 New High Performance DSP Alternative New advantages in FPGA technology and tools: DSP offers a new alternative to ASICs, fixed function DSP devices,
More informationVendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs
Vendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs Xin Fang and Miriam Leeser Dept of Electrical and Computer Eng Northeastern University Boston, Massachusetts 02115
More informationEE 3170 Microcontroller Applications
EE 3170 Microcontroller Applications Lecture 4 : Processors, Computers, and Controllers - 1.2 (reading assignment), 1.3-1.5 Based on slides for ECE3170 by Profs. Kieckhafer, Davis, Tan, and Cischke Outline
More informationStreaming Reduction Circuit for Sparse Matrix Vector Multiplication in FPGAs
Computer Science Faculty of EEMCS Streaming Reduction Circuit for Sparse Matrix Vector Multiplication in FPGAs Master thesis August 15, 2008 Supervisor: dr.ir. A.B.J. Kokkeler Committee: dr.ir. A.B.J.
More information[Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개
[Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개 정승혁과장 Senior Application Engineer MathWorks Korea 2015 The MathWorks, Inc. 1 Outline When FPGA, ASIC, or System-on-Chip (SoC) hardware is needed Hardware
More informationA Library of Parameterized Floating-point Modules and Their Use
A Library of Parameterized Floating-point Modules and Their Use Pavle Belanović and Miriam Leeser Department of Electrical and Computer Engineering Northeastern University Boston, MA, 02115, USA {pbelanov,mel}@ece.neu.edu
More informationModeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors
Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors Siew-Kei Lam Centre for High Performance Embedded Systems, Nanyang Technological University, Singapore (assklam@ntu.edu.sg)
More informationA VARIETY OF ICS ARE POSSIBLE DESIGNING FPGAS & ASICS. APPLICATIONS MAY USE STANDARD ICs or FPGAs/ASICs FAB FOUNDRIES COST BILLIONS
architecture behavior of control is if left_paddle then n_state
More informationAn FPGA Implementation of the Powering Function with Single Precision Floating-Point Arithmetic
An FPGA Implementation of the Powering Function with Single Precision Floating-Point Arithmetic Pedro Echeverría, Marisa López-Vallejo Department of Electronic Engineering, Universidad Politécnica de Madrid
More informationEfficient Self-Reconfigurable Implementations Using On-Chip Memory
10th International Conference on Field Programmable Logic and Applications, August 2000. Efficient Self-Reconfigurable Implementations Using On-Chip Memory Sameer Wadhwa and Andreas Dandalis University
More informationPerformance and accuracy of hardware-oriented native-, solvers in FEM simulations
Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations Dominik Göddeke Angewandte Mathematik und Numerik, Universität Dortmund Acknowledgments Joint
More informationMassively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain
Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,
More informationKeck-Voon LING School of Electrical and Electronic Engineering Nanyang Technological University (NTU), Singapore
MPC on a Chip Keck-Voon LING (ekvling@ntu.edu.sg) School of Electrical and Electronic Engineering Nanyang Technological University (NTU), Singapore EPSRC Project Kick-off Meeting, Imperial College, London,
More informationController Synthesis for Hardware Accelerator Design
ler Synthesis for Hardware Accelerator Design Jiang, Hongtu; Öwall, Viktor 2002 Link to publication Citation for published version (APA): Jiang, H., & Öwall, V. (2002). ler Synthesis for Hardware Accelerator
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationOutline of Presentation Field Programmable Gate Arrays (FPGAs(
FPGA Architectures and Operation for Tolerating SEUs Chuck Stroud Electrical and Computer Engineering Auburn University Outline of Presentation Field Programmable Gate Arrays (FPGAs( FPGAs) How Programmable
More informationTopics/Assignments. Class 10: Big Picture. What s Coming Next? Perspectives. So Far Mostly Programmer Perspective. Where are We? Where are We Going?
Fall 2006 CS333: Computer Architecture University of Virginia Computer Science Michele Co Topics/Assignments Class 10: Big Picture Survey Homework 1 Read Compilers and Computer Architecture Principles/factors
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationVertex Shader Design I
The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only
More information