Perspectives for the Use of Field Programmable Gate Arrays for Finite Element Computations


Perspectives for the Use of Field Programmable Gate Arrays for Finite Element Computations

Gerhard Lienhart 1, Daniel Gembris 1, Reinhard Männer 1
1 Universität Mannheim, {lienhart,gembris,maenner}@ti.uni-mannheim.de

Abstract

We have studied how the solution of partial differential equations by means of finite element methods (FEM) could be accelerated using Field Programmable Gate Arrays (FPGAs). First, we discuss in general the capabilities of current FPGA technology for floating-point implementations of number crunching. Based on practical results for basic floating-point operators, performance limits are outlined. Then the perspectives for the implementation of LU decomposition with a state-of-the-art FPGA chip are addressed. It is estimated that, compared with a modern CPU, a significant speedup can be expected using a single off-the-shelf FPGA.

1 Introduction

In recent years there have been strong activities in the field of reconfigurable computing. At the focus of this emerging branch of computing are FPGAs: highly integrated semiconductor chips whose function can be freely specified at gate level, in contrast to conventional CPUs. Given a suitable computational problem, the design flexibility of these chips allows massively parallel computations, which can more than compensate for the intrinsic disadvantage of lower clock speeds, on the order of a tenth of those of state-of-the-art CPUs (Compton & Hauck 2002). One class of problems that could clearly benefit from the acceleration achievable with FPGAs is real-time FEM, a technique that is relevant for virtual reality applications like medical surgery simulation or interactive engineering (Rhomberg et al. 1998, Bro-Nielsen 1998, Margetts et al. 2005), or for simulations in the context of robot control. In general, virtual reality applications require that any action of the user is followed by an immediate response of the simulation system, i.e. within less than about 20 ms.
An example would be that the user applies a force to a virtual object, which should result in direct feedback mediating an impression of the evoked object deformation (mainly in terms of vision and haptics, provided by visual and tactile/haptic displays). While the demands on computation speed are high, the demands on accuracy are only moderate, since in a virtual reality application a qualitative agreement between the simulation and the corresponding real physical process is mostly sufficient. These characteristics make FPGAs very attractive for solving the underlying FEM problems, due to the high degree of parallelism and the possible reduction in bit width of the floating-point numbers. Figure 1 shows a schematic representation of an FPGA. The chip essentially consists of an array of configurable logic blocks interconnected by a programmable network. The logic blocks provide basic combinatorial logic realized by small look-up tables (LUTs) and flip-flops, and are usually enhanced with additional features for efficient data buffering and arithmetic functionality. Modern FPGAs contain many thousands of programmable logic blocks. For example, the largest of the new Virtex-4 FPGAs, the XC4VFX140 manufactured by Xilinx, has tens of thousands of slices, where a slice contains two LUTs and two flip-flops. Modern FPGAs also comprise hard-wired circuits like multipliers and memories. The above-mentioned FPGA provides hundreds of 18-kbit memory elements and 192 so-called DSP48 cells, which contain an 18-bit multiplier and additional logic for efficiently composing more complex integer operators, e.g.

multiply/add operators with a higher bit width. With modern FPGAs, complex system-on-a-chip (SoC) architectures can be built. These systems may consist of dedicated processing units as well as programmable units like soft processors (e.g. the MicroBlaze processor core, Xilinx 2005). With the advent of modern FPGA technology it became possible to design custom computing machines without the need to construct application-specific integrated circuits (ASICs). A common approach to making FPGA technology available for applications is to build a coprocessor board which can be installed in a host computer. The right side of Fig. 1 shows such a board. It has been developed at the University of Mannheim as a prototype system for different applications ranging from image processing to high-performance computing for astrophysical simulations.

Fig. 1. Schematic representation of an FPGA (left), consisting of configurable logic blocks, input/output blocks, and a configurable interconnect, and a typical FPGA-based PCI board (right).

2 Basics for FEM processing with FPGAs

While for many areas where FPGAs are already established (e.g. signal processing) fixed-point arithmetic is sufficient, number-crunching applications usually require floating-point computations. Just a few years ago, FPGAs became available that are suitable for implementing complex calculation units for floating-point arithmetic. Related implementation issues have already been studied to some extent (Belanovic & Leeser 2002, Roesler & Nelson 2002, Underwood 2004). Currently there are steadily increasing activities in the scientific and engineering community to apply reconfigurable architectures to problems of high-performance computing. For floating-point based high-performance computing see e.g. Lienhart et al. 2002 and Hamada & Nakasato 2005. Figure 2 and Tab. 1 demonstrate the relation between precision and FPGA logic utilization for different operators. Limiting the precision requirements allows for highly parallel calculation units.
The above-mentioned FPGA XC4VFX140 may, for example, contain up to 150 adders and 50 multipliers at single precision, or 60 adders and 20 multipliers at double precision. When only using multiply-add units we can estimate a peak performance of 20 GFLOPs for single precision (50 units) or 8 GFLOPs for double precision (20 units).

Fig. 2. Scaling of the logic consumption (slices and Mult18 blocks) of the University of Mannheim adders and multipliers for different mantissa widths (single precision corresponds to a 24-bit mantissa, double precision to 53 bits).
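The accuracy side of this precision/area trade-off can be sketched in software. The following Python helper is our illustration only (not part of the paper's toolchain, and hardware rounding may differ in detail): it rounds a double to an arbitrary mantissa width, emulating the reduced-precision formats whose logic costs Fig. 2 reports.

```python
import math

def round_to_mantissa(x: float, bits: int) -> float:
    """Round x to the nearest value representable with a `bits`-bit
    mantissa (counting the implicit leading one); the exponent range
    is left unchanged."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)        # x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << bits           # keep `bits` significant bits of m
    return math.ldexp(round(m * scale) / scale, e)

# The relative rounding error stays below 2**-bits:
for bits in (12, 18, 24):       # 24 bits corresponds to IEEE single precision
    y = round_to_mantissa(math.pi, bits)
    print(bits, y, abs(y - math.pi) / math.pi)
```

Halving the mantissa width roughly halves the adder slice count in Fig. 2 while the relative error bound grows from about 6e-8 (24 bits) to about 2e-4 (12 bits), which is the kind of trade a qualitative virtual reality simulation can tolerate.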

Tab. 1. Logic consumption, speed, and latency (clock cycles) of different single-precision (SP) and double-precision (DP) floating-point operators for XC4VFX-10 FPGAs, with libraries from the University of Mannheim and Xilinx Inc. (figures for the Xilinx operators according to the data sheet, Xilinx 2005). For each operator (SP adder, SP multiplier, SP divider, SP accumulator; DP adder, DP multiplier, DP divider) the table lists Virtex-4 slices, DSP48 cells, speed (MHz), and latency. (The numeric entries are not preserved in this copy.)

2.1 Target algorithms

Generally the finite element method leads to a set of coupled partial differential equations which can be approximated by a system of linear equations A x = b. Usually the matrix A is sparse and, depending on the type of underlying problem, exhibits additional properties like symmetry or diagonal dominance. For small problem sizes, the direct solution via LU decomposition is generally applicable. A standard approach in these cases is the use of multifrontal methods like those implemented in UMFPACK (Davis & Duff 1997), explained below. The LU decomposition method solves a linear system by forward substitution with a lower (L) and then backward substitution with an upper (U) triangular matrix:

A x = (L U) x = b,   L y = b,   U x = y.

A simple algorithm for performing the LU decomposition of an n x n matrix is Crout's algorithm, which successively performs the following calculations for j = 1 ... n:

u_ij = A_ij - sum_{k=1..i-1} l_ik u_kj,   for i = 1 ... j
l_ij = (A_ij - sum_{k=1..j-1} l_ik u_kj) / u_jj,   for i = j+1 ... n

In general, LU decomposition is an algorithm of complexity O(n^3) and the forward and backward substitutions have complexity O(n^2). But the number of non-zero coefficients in sparse matrices for FEM usually scales with O(n) instead of O(n^2) as in the general case, which may reduce the effort dramatically. For optimizing memory requirements and access patterns, the multifrontal method was developed (for a review see Liu 1992).
The method organizes the numerical factorization into a series of steps which involve partial factorization of smaller dense matrices (frontal matrices). The steps are processed according to an elimination tree structure, which provides a natural framework for parallel computation. The method allows for better data locality and efficient out-of-core schemes, and avoids indirect addressing within the factorization steps. LU decomposition is widely applicable and is the default method in FEMLAB for 1D and 2D problems. Even when iterative solvers are more appropriate, an (incomplete) LU factorization can be used as a preconditioner (though algebraic multigrid methods are often preferred).
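The Crout recurrences and the two substitution passes described above can be transcribed directly. The following Python sketch is our illustration only (no pivoting, dense storage; an FPGA implementation would of course be structured very differently):

```python
def crout_lu(A):
    """LU decomposition without pivoting (unit-diagonal L), following
    the recurrences in the text. Dense storage; illustration only."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j + 1):       # u_ij for rows i = 1..j
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for i in range(j + 1, n):    # l_ij for rows i = j+1..n
            L[i][j] = (A[i][j] - sum(L[i][k] * U[k][j] for k in range(j))) / U[j][j]
        L[j][j] = 1.0
    return L, U

def solve_lu(L, U, b):
    """Solve A x = b by forward substitution (L y = b),
    then backward substitution (U x = y)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

# Small check: A = [[4, 3], [6, 3]] with b = [10, 12] has solution x = [1, 2].
Lm, Um = crout_lu([[4.0, 3.0], [6.0, 3.0]])
print(solve_lu(Lm, Um, [10.0, 12.0]))   # → [1.0, 2.0]
```

The inner sums are exactly the vector dot products that dominate the work, which is why the architecture proposed below is built around multiply-accumulate units.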

2.2 Related work

Previous work on FPGA implementations of linear solvers mainly focuses on direct LU decomposition with algorithms that are simple to handle. Efficient algorithms which may be applied more generally, like multifrontal methods with pivoting and complex submatrix handling, are still subject to research. It has already been studied how basic linear algebra processing, like matrix-vector multiplication, can be realized with FPGAs (for recent work see e.g. Underwood & Hemmert 2004, Zhuo & Prasanna 2005, and DeLorimier & DeHon 2005). DeLorimier, for example, presents a dedicated processing unit for double-precision matrix-vector multiplication which achieves a sustained performance of 1.5 GFLOPs on a single FPGA (XC2V6000-4). Underwood provides trends and performance comparisons for different BLAS library operations on FPGAs and CPUs. For example, a performance of 4 GFLOPs on a single FPGA (XC2VP100-6) for double-precision matrix multiplication was reported. Concerning LU decomposition, there are FPGA implementations of different block-based algorithms (Choi & Prasanna 2003, Daga et al. 2004, Wang & Ziavras 2004). To our knowledge no pivoting is applied in this previous work, which limits its applicability to special problems. The design of Wang and Ziavras was developed for equation systems which can be represented by bordered-diagonal-block sparse matrices. A multiprocessor system on a single FPGA has been implemented, but unfortunately the performance was not compared to standard CPUs. Daga et al. present an LU decomposition for large matrices (n = 1000) which is not optimized for sparse systems. For their test case they claim a speedup of 23x in total computing time with double-precision calculation on a very large FPGA (Virtex-2 Pro XC2VP125) compared to a 1.4 GHz Pentium M CPU.
3 Proposed architecture

We propose an architecture with multiple equal processing elements (PEs), each consisting of a dedicated computing element for linear algebra processing with local memory, controlled by a soft processor. The design of the computing element is optimized for the basic linear algebra processing tasks, like the vector dot product, which are required for LU decomposition. The basic computational step for these tasks is a multiply-accumulate operation, to which the primary building blocks are hence dedicated. The left side of Fig. 3 shows the structure of a PE, which can be used in a flexible manner due to the multiplexers between the operators. This architecture allows independent multiplication and addition as well as vectorized multiply-accumulate operations. In order to provide a sustained performance close to peak performance, a dedicated accumulator is used in addition to the adder. Our floating-point accumulator accumulates positive and negative values separately. Therefore a final add must be performed at the end of an accumulation, which is why a loop-back from the accumulator output to the inputs of the adder is provided. Basic linear algebra operations are executed automatically, controlled by the Linear Algebra Task Control (LATC) unit. The data flow is controlled by the Operand Fetch and Result Dispatch units. The operations are performed on local memory (internal FPGA dual-port block RAM). The attached processor core takes care of data management and processing control by assembling data in the memory and triggering operations of the LATC unit. Using a processor for this purpose provides the flexibility to apply enhanced factorization algorithms like the multifrontal method. We propose to use the Xilinx MicroBlaze soft processor. With an on-chip memory interface, multiple dedicated point-to-point communication channels to attached circuits, and an on-chip peripheral bus to connect to the environment, this core provides all the interfaces necessary to assemble the PEs as drawn.
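The sign-splitting accumulation scheme can be modeled behaviorally. The class below is our sketch (the name and structure are ours, and the hardware rationale we suggest in the comment, that same-sign additions simplify the pipelined datapath, is our reading, not spelled out in the text):

```python
class SignSplitAccumulator:
    """Behavioral model of an accumulator that keeps separate running
    sums for positive and negative inputs, so every running addition
    combines operands of equal sign; the final subtraction corresponds
    to the concluding add routed back through the PE's adder."""
    def __init__(self) -> None:
        self.pos = 0.0   # sum of positive inputs
        self.neg = 0.0   # summed magnitude of negative inputs

    def accumulate(self, value: float) -> None:
        if value >= 0.0:
            self.pos += value
        else:
            self.neg += -value

    def result(self) -> float:
        return self.pos - self.neg   # the final add

acc = SignSplitAccumulator()
for v in [1.5, -0.25, 2.0, -1.25]:
    acc.accumulate(v)
print(acc.result())   # → 2.0
```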

At the right side of Fig. 3 the overall architecture of the proposed FPGA design is shown. Several PEs share a single divider, as this operation is rarely needed. The PEs are connected with each other, with the off-chip memory, and with the host interface via the on-chip peripheral bus of the processor cores. The number m of PEs which share a divider is a parameter and depends on the target application; m = 8 would be a reasonable choice. The detailed design of the interconnection between the PEs and the host and memory interfaces would depend on the respective reconfigurable platform.

Fig. 3. Proposed processing element with I/O, dual-port memory, processor core, Operand Fetch, multiplier, adder, accumulator, Result Dispatch, and Linear Algebra Task Control (LATC) unit (left), and overall architecture with n PEs, shared dividers, off-chip memory interface, and host interface (right).

4 Performance analysis

With the design outlined above we can now estimate the performance achievable in FEM processing. A single calculation unit consisting of a multiplier, an adder, and an accumulator consumes about 1000 slices and 4 DSP48 cells. The MicroBlaze soft processor adds about 500 slices, and we can charge another 200 slices for control and data management. Altogether a PE will require about 1700 slices. Therefore the mentioned FPGA XC4VFX140 may house 32 processing elements including 8 single-precision dividers. The on-chip memory allows local dual-port buffers of 38 KBytes per PE. Taking into account that a significant fraction of this memory is used as instruction memory for the processor core, this amount of memory limits a PE to handling submatrices with a size of up to about 70x70. Calculations with larger matrices need to be divided among more than one PE. Assuming a conservative clock frequency of 160 MHz and two concurrent floating-point operations per PE, we get a peak performance of 10 GFLOPs on a single FPGA.
Estimating an efficiency of more than 50 %, which is reasonable for custom computing architectures, we get a sustained performance of about 5 GFLOPs. Assuming 300 MFLOPs for a general-purpose CPU, which is realistic as we usually see an efficiency of less than 10 % for the solution of sparse systems, we can expect a speedup of more than an order of magnitude. When going to double precision, a single processing element would consume about 3300 slices and 9 DSP48 cells. Therefore only 16 PEs will fit on the same FPGA. Estimating the performance in the same way as above, a peak performance of 5 GFLOPs and an estimated speedup of 5-10 result. About 25 % of the logic resources could be saved by using the adder in the PE also for accumulation. Because of the latency of the adder, performing the multiply-accumulate operations efficiently would then become much more complex and would require an additional feedback FIFO between the output of the adder and one of its inputs. It would also be more difficult to get close to peak performance, i.e. the algorithm would have to allow much higher parallelism in order to hide the adder latency. But such a design would enable a 48-fold parallelization with a peak performance of more than 15 GFLOPs.
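The estimates of this section follow directly from the stated parameters; a few lines of Python reproduce the arithmetic (the 50 % efficiency and the 300 MFLOPs CPU baseline are the assumptions made in the text):

```python
# Reproducing the performance estimates from the stated parameters.

def peak_gflops(num_pes: int, ops_per_cycle: int, clock_mhz: float) -> float:
    """Peak rate = processing elements x concurrent FLOPs/cycle x clock."""
    return num_pes * ops_per_cycle * clock_mhz / 1000.0

sp_peak = peak_gflops(32, 2, 160.0)   # single precision: about 10 GFLOPs
dp_peak = peak_gflops(16, 2, 160.0)   # double precision: about 5 GFLOPs

sp_sustained = 0.5 * sp_peak          # ">50 % efficiency" assumption
cpu_gflops = 0.3                      # assumed 300 MFLOPs sustained CPU rate
speedup = sp_sustained / cpu_gflops

print(sp_peak, dp_peak, speedup)
```

With these inputs the single-precision estimate comes to 10.24 GFLOPs peak, about 5 GFLOPs sustained, and a speedup of roughly 17x over the assumed CPU baseline, consistent with the rounded figures in the text.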

5 Conclusions

We have demonstrated the potential of reconfigurable computing for solving the linear equations arising from FEM problems. A dedicated FPGA design consisting of multiple equal processing elements was outlined, and the resulting performance of 10 GFLOPs for single precision was estimated. Because of the integrated processor cores, the architecture is capable of dealing with different algorithms for LU decomposition. The design should also provide good performance for other solvers, like those based on the conjugate gradient method (iterative) or algebraic multigrid solvers. The efficiency, and therefore the overall speedup, depends on how efficiently the underlying algorithms can be parallelized to perform independent basic linear algebra operations on local memory. For FPGA computing machines this is a major topic of research, and the prospect of computational power for number crunching strongly motivates further work in this direction.

References

Belanovic, P.; Leeser, M. (2002) A Library of Parameterized Floating-Point Modules and Their Use. Proc. FPL'02.
Bro-Nielsen, M. (1998) Finite element modeling in surgery simulation. Proc. of the IEEE 86(3).
Choi, S.; Prasanna, V.K. (2003) Time and Energy Efficient Matrix Factorization using FPGAs. Proc. FPL'03.
Compton, K.; Hauck, S. (2002) Reconfigurable computing: A survey of systems and software. ACM Computing Surveys 34(2).
Davis, T.A.; Duff, I. (1997) An Unsymmetric-Pattern Multifrontal Method for Sparse LU Factorization. SIAM J. Matrix Anal. Appl. 18(1).
DeLorimier, M.; DeHon, A. (2005) Floating-Point Sparse Matrix-Vector Multiply for FPGAs. Proc. FPGA'05.
Hamada, T.; Nakasato, N. (2005) PGR: A Software Package for Reconfigurable Super-Computing. Proc. FPL'05.
Margetts, L.; Smethurst, C.; Ford, R. (2005) Interactive Finite Element Analysis. NAFEMS World Congress, Malta, May 2005.
Johnson, J.R.; Nagvajara, P.; Nwankpa, C. (2004) Sparse Linear Solver for Power System Analysis using FPGA. Proc. HPEC'04.
Heath, M.T.
(1997) Parallel direct methods for sparse linear systems. Parallel Numerical Algorithms, Kluwer, Boston.
Lienhart, G.; Kugel, A.; Männer, R. (2002) Using Floating Point Arithmetic on FPGAs for Accelerating Scientific N-Body Simulations. Proc. FCCM'02.
Liu, J. (1992) The Multifrontal Method for Sparse Matrix Solution: Theory and Practice. SIAM Review 34.
Rhomberg, A.; Enzler, R.; Thaler, M.; Tröster, G. (1998) Design of a FEM Computation Engine for Real-Time Laparoscopic Surgery Simulation. Proc. of the IPPS/SPDP.
Roesler, E.; Nelson, B.E. (2002) Novel Optimizations for Hardware Floating-Point Units in a Modern FPGA Architecture. Proc. FPL'02.
Underwood, K. (2004) FPGAs vs. CPUs: Trends in Peak Floating-Point Performance. Proc. FPGA'04.
Underwood, K.; Hemmert, K.S. (2004) Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance. Proc. FCCM'04.
Wang, X.; Ziavras, S.G. (2004) Parallel LU factorization of sparse matrices on FPGA-based configurable computing engines. Concurrency Computat.: Pract. Exper. 16.
Xilinx Inc. (2005) Floating-Point Operator v1.0 Product Specification. DS335.
Xilinx Inc. (2005) MicroBlaze Processor Reference Guide.
Zhuo, L.; Prasanna, V.K. (2005) Sparse Matrix-Vector Multiplication on FPGAs. Proc. FPGA'05.
Xu, X.; Ziavras, S.G. (2003) Iterative Methods for Solving Linear Systems of Equations on FPGA-Based Machines. Proc. Computers and their Applications.


More information

Multi MicroBlaze System for Parallel Computing

Multi MicroBlaze System for Parallel Computing Multi MicroBlaze System for Parallel Computing P.HUERTA, J.CASTILLO, J.I.MÁRTINEZ, V.LÓPEZ HW/SW Codesign Group Universidad Rey Juan Carlos 28933 Móstoles, Madrid SPAIN Abstract: - Embedded systems need

More information

FPGA ACCELERATION OF THE LINPACK BENCHMARK USING HANDEL-C AND THE CELOXICA FLOATING POINT LIBRARY

FPGA ACCELERATION OF THE LINPACK BENCHMARK USING HANDEL-C AND THE CELOXICA FLOATING POINT LIBRARY FPGA ACCELERATION OF THE LINPACK BENCHMARK USING HANDEL-C AND THE CELOXICA FLOATING POINT LIBRARY Kieron Turkington, Konstantinos Masselos, George A. Constantinides Department of Electrical and Electronic

More information

Embedded Systems: Hardware Components (part I) Todor Stefanov

Embedded Systems: Hardware Components (part I) Todor Stefanov Embedded Systems: Hardware Components (part I) Todor Stefanov Leiden Embedded Research Center Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded System

More information

FPGA architecture and design technology

FPGA architecture and design technology CE 435 Embedded Systems Spring 2017 FPGA architecture and design technology Nikos Bellas Computer and Communications Engineering Department University of Thessaly 1 FPGA fabric A generic island-style FPGA

More information

FPGA: What? Why? Marco D. Santambrogio

FPGA: What? Why? Marco D. Santambrogio FPGA: What? Why? Marco D. Santambrogio marco.santambrogio@polimi.it 2 Reconfigurable Hardware Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much

More information

The DSP Primer 8. FPGA Technology. DSPprimer Home. DSPprimer Notes. August 2005, University of Strathclyde, Scotland, UK

The DSP Primer 8. FPGA Technology. DSPprimer Home. DSPprimer Notes. August 2005, University of Strathclyde, Scotland, UK The DSP Primer 8 FPGA Technology Return DSPprimer Home Return DSPprimer Notes August 2005, University of Strathclyde, Scotland, UK For Academic Use Only THIS SLIDE IS BLANK August 2005, For Academic Use

More information

A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation

A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation Antonio Roldao Lopes and George A Constantinides Electrical & Electronic Engineering, Imperial College London, London SW7 2BT,

More information

A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms

A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms Jingzhao Ou and Viktor K. Prasanna Department of Electrical Engineering, University of Southern California Los Angeles, California,

More information

Outline. Field Programmable Gate Arrays. Programming Technologies Architectures. Programming Interfaces. Historical perspective

Outline. Field Programmable Gate Arrays. Programming Technologies Architectures. Programming Interfaces. Historical perspective Outline Field Programmable Gate Arrays Historical perspective Programming Technologies Architectures PALs, PLDs,, and CPLDs FPGAs Programmable logic Interconnect network I/O buffers Specialized cores Programming

More information

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,

More information

International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering

International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering An Efficient Implementation of Double Precision Floating Point Multiplier Using Booth Algorithm Pallavi Ramteke 1, Dr. N. N. Mhala 2, Prof. P. R. Lakhe M.Tech [IV Sem], Dept. of Comm. Engg., S.D.C.E, [Selukate],

More information

Reconfigurable Hardware Implementation of Mesh Routing in the Number Field Sieve Factorization

Reconfigurable Hardware Implementation of Mesh Routing in the Number Field Sieve Factorization Reconfigurable Hardware Implementation of Mesh Routing in the Number Field Sieve Factorization Sashisu Bajracharya, Deapesh Misra, Kris Gaj George Mason University Tarek El-Ghazawi The George Washington

More information

Matrix Multiplication Implementation in the MOLEN Polymorphic Processor

Matrix Multiplication Implementation in the MOLEN Polymorphic Processor Matrix Multiplication Implementation in the MOLEN Polymorphic Processor Wouter M. van Oijen Georgi K. Kuzmanov Computer Engineering, EEMCS, TU Delft, The Netherlands, http://ce.et.tudelft.nl Email: {w.m.vanoijen,

More information

Developing a Data Driven System for Computational Neuroscience

Developing a Data Driven System for Computational Neuroscience Developing a Data Driven System for Computational Neuroscience Ross Snider and Yongming Zhu Montana State University, Bozeman MT 59717, USA Abstract. A data driven system implies the need to integrate

More information

High Throughput Iterative VLSI Architecture for Cholesky Factorization based Matrix Inversion

High Throughput Iterative VLSI Architecture for Cholesky Factorization based Matrix Inversion High Throughput Iterative VLSI Architecture for Cholesky Factorization based Matrix Inversion D. N. Sonawane 1 and M. S. Sutaone 2 1 Department of Instrumentation & Control 2 Department of Electronics

More information

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs High Throughput Energy Efficient Parallel FFT Architecture on FPGAs Ren Chen Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, USA 989 Email: renchen@usc.edu

More information

FPGA Accelerated Parallel Sparse Matrix Factorization for Circuit Simulations*

FPGA Accelerated Parallel Sparse Matrix Factorization for Circuit Simulations* FPGA Accelerated Parallel Sparse Matrix Factorization for Circuit Simulations* Wei Wu, Yi Shan, Xiaoming Chen, Yu Wang, and Huazhong Yang Department of Electronic Engineering, Tsinghua National Laboratory

More information

Iterative Refinement on FPGAs

Iterative Refinement on FPGAs Iterative Refinement on FPGAs Tennessee Advanced Computing Laboratory University of Tennessee JunKyu Lee July 19 th 2011 This work was partially supported by the National Science Foundation, grant NSF

More information

Segment 1A. Introduction to Microcomputer and Microprocessor

Segment 1A. Introduction to Microcomputer and Microprocessor Segment 1A Introduction to Microcomputer and Microprocessor 1.1 General Architecture of a Microcomputer System: The term microcomputer is generally synonymous with personal computer, or a computer that

More information

CREATED BY M BILAL & Arslan Ahmad Shaad Visit:

CREATED BY M BILAL & Arslan Ahmad Shaad Visit: CREATED BY M BILAL & Arslan Ahmad Shaad Visit: www.techo786.wordpress.com Q1: Define microprocessor? Short Questions Chapter No 01 Fundamental Concepts Microprocessor is a program-controlled and semiconductor

More information

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011 FPGA for Complex System Implementation National Chiao Tung University Chun-Jen Tsai 04/14/2011 About FPGA FPGA was invented by Ross Freeman in 1989 SRAM-based FPGA properties Standard parts Allowing multi-level

More information

FPGA architecture and implementation of sparse matrix vector multiplication for the finite element method

FPGA architecture and implementation of sparse matrix vector multiplication for the finite element method Computer Physics Communications 178 (2008) 558 570 www.elsevier.com/locate/cpc FPGA architecture and implementation of sparse matrix vector multiplication for the finite element method Yousef Elkurdi,

More information

Introduction to Field Programmable Gate Arrays

Introduction to Field Programmable Gate Arrays Introduction to Field Programmable Gate Arrays Lecture 1/3 CERN Accelerator School on Digital Signal Processing Sigtuna, Sweden, 31 May 9 June 2007 Javier Serrano, CERN AB-CO-HT Outline Historical introduction.

More information

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

QUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection

QUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection QUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection Sunil Shukla 1,2, Neil W. Bergmann 1, Jürgen Becker 2 1 ITEE, University of Queensland, Brisbane, QLD 4072, Australia {sunil, n.bergmann}@itee.uq.edu.au

More information

New Computational Modeling for Solving Higher Order ODE based on FPGA

New Computational Modeling for Solving Higher Order ODE based on FPGA New Computational Modeling for Solving Higher Order ODE based on FPGA Alireza Fasih 1, Tuan Do Trong 2, Jean Chamberlain Chedjou 3, Kyandoghere Kyamakya 4 1, 3, 4 Alpen-Adria University of Klagenfurt Austria

More information

Virtex-II Architecture. Virtex II technical, Design Solutions. Active Interconnect Technology (continued)

Virtex-II Architecture. Virtex II technical, Design Solutions. Active Interconnect Technology (continued) Virtex-II Architecture SONET / SDH Virtex II technical, Design Solutions PCI-X PCI DCM Distri RAM 18Kb BRAM Multiplier LVDS FIFO Shift Registers BLVDS SDRAM QDR SRAM Backplane Rev 4 March 4th. 2002 J-L

More information

Reconfigurable Computing. Introduction

Reconfigurable Computing. Introduction Reconfigurable Computing Tony Givargis and Nikil Dutt Introduction! Reconfigurable computing, a new paradigm for system design Post fabrication software personalization for hardware computation Traditionally

More information

SoC Basics Avnet Silica & Enclustra Seminar Getting started with Xilinx Zynq SoC Fribourg, April 26, 2017

SoC Basics Avnet Silica & Enclustra Seminar Getting started with Xilinx Zynq SoC Fribourg, April 26, 2017 1 2 3 4 Introduction - Cool new Stuff Everybody knows, that new technologies are usually driven by application requirements. A nice example for this is, that we developed portable super-computers with

More information

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent

More information

International Journal of Advance Engineering and Research Development

International Journal of Advance Engineering and Research Development Scientific Journal of Impact Factor (SJIF): 4.14 International Journal of Advance Engineering and Research Development Volume 3, Issue 11, November -2016 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Review

More information

Implementation of a FIR Filter on a Partial Reconfigurable Platform

Implementation of a FIR Filter on a Partial Reconfigurable Platform Implementation of a FIR Filter on a Partial Reconfigurable Platform Hanho Lee and Chang-Seok Choi School of Information and Communication Engineering Inha University, Incheon, 402-751, Korea hhlee@inha.ac.kr

More information

Honorary Professor Supercomputer Education and Research Centre Indian Institute of Science, Bangalore

Honorary Professor Supercomputer Education and Research Centre Indian Institute of Science, Bangalore COMPUTER ORGANIZATION AND ARCHITECTURE V. Rajaraman Honorary Professor Supercomputer Education and Research Centre Indian Institute of Science, Bangalore T. Radhakrishnan Professor of Computer Science

More information

General Purpose Signal Processors

General Purpose Signal Processors General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:

More information

CNP: An FPGA-based Processor for Convolutional Networks

CNP: An FPGA-based Processor for Convolutional Networks Clément Farabet clement.farabet@gmail.com Computational & Biological Learning Laboratory Courant Institute, NYU Joint work with: Yann LeCun, Cyril Poulet, Jefferson Y. Han Now collaborating with Eugenio

More information

Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications

Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications , Vol 7(4S), 34 39, April 204 ISSN (Print): 0974-6846 ISSN (Online) : 0974-5645 Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications B. Vignesh *, K. P. Sridhar

More information

Altera FLEX 8000 Block Diagram

Altera FLEX 8000 Block Diagram Altera FLEX 8000 Block Diagram Figure from Altera technical literature FLEX 8000 chip contains 26 162 LABs Each LAB contains 8 Logic Elements (LEs), so a chip contains 208 1296 LEs, totaling 2,500 16,000

More information

5. ReAl Systems on Silicon

5. ReAl Systems on Silicon THE REAL COMPUTER ARCHITECTURE PRELIMINARY DESCRIPTION 69 5. ReAl Systems on Silicon Programmable and application-specific integrated circuits This chapter illustrates how resource arrays can be incorporated

More information

Xilinx DSP. High Performance Signal Processing. January 1998

Xilinx DSP. High Performance Signal Processing. January 1998 DSP High Performance Signal Processing January 1998 New High Performance DSP Alternative New advantages in FPGA technology and tools: DSP offers a new alternative to ASICs, fixed function DSP devices,

More information

Vendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs

Vendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs Vendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs Xin Fang and Miriam Leeser Dept of Electrical and Computer Eng Northeastern University Boston, Massachusetts 02115

More information

EE 3170 Microcontroller Applications

EE 3170 Microcontroller Applications EE 3170 Microcontroller Applications Lecture 4 : Processors, Computers, and Controllers - 1.2 (reading assignment), 1.3-1.5 Based on slides for ECE3170 by Profs. Kieckhafer, Davis, Tan, and Cischke Outline

More information

Streaming Reduction Circuit for Sparse Matrix Vector Multiplication in FPGAs

Streaming Reduction Circuit for Sparse Matrix Vector Multiplication in FPGAs Computer Science Faculty of EEMCS Streaming Reduction Circuit for Sparse Matrix Vector Multiplication in FPGAs Master thesis August 15, 2008 Supervisor: dr.ir. A.B.J. Kokkeler Committee: dr.ir. A.B.J.

More information

[Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개

[Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개 [Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개 정승혁과장 Senior Application Engineer MathWorks Korea 2015 The MathWorks, Inc. 1 Outline When FPGA, ASIC, or System-on-Chip (SoC) hardware is needed Hardware

More information

A Library of Parameterized Floating-point Modules and Their Use

A Library of Parameterized Floating-point Modules and Their Use A Library of Parameterized Floating-point Modules and Their Use Pavle Belanović and Miriam Leeser Department of Electrical and Computer Engineering Northeastern University Boston, MA, 02115, USA {pbelanov,mel}@ece.neu.edu

More information

Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors

Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors Siew-Kei Lam Centre for High Performance Embedded Systems, Nanyang Technological University, Singapore (assklam@ntu.edu.sg)

More information

An FPGA Implementation of the Powering Function with Single Precision Floating-Point Arithmetic

An FPGA Implementation of the Powering Function with Single Precision Floating-Point Arithmetic An FPGA Implementation of the Powering Function with Single Precision Floating-Point Arithmetic Pedro Echeverría, Marisa López-Vallejo Department of Electronic Engineering, Universidad Politécnica de Madrid

More information

Efficient Self-Reconfigurable Implementations Using On-Chip Memory

Efficient Self-Reconfigurable Implementations Using On-Chip Memory 10th International Conference on Field Programmable Logic and Applications, August 2000. Efficient Self-Reconfigurable Implementations Using On-Chip Memory Sameer Wadhwa and Andreas Dandalis University

More information

Performance and accuracy of hardware-oriented native-, solvers in FEM simulations

Performance and accuracy of hardware-oriented native-, solvers in FEM simulations Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations Dominik Göddeke Angewandte Mathematik und Numerik, Universität Dortmund Acknowledgments Joint

More information

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,

More information

Keck-Voon LING School of Electrical and Electronic Engineering Nanyang Technological University (NTU), Singapore

Keck-Voon LING School of Electrical and Electronic Engineering Nanyang Technological University (NTU), Singapore MPC on a Chip Keck-Voon LING (ekvling@ntu.edu.sg) School of Electrical and Electronic Engineering Nanyang Technological University (NTU), Singapore EPSRC Project Kick-off Meeting, Imperial College, London,

More information

Controller Synthesis for Hardware Accelerator Design

Controller Synthesis for Hardware Accelerator Design ler Synthesis for Hardware Accelerator Design Jiang, Hongtu; Öwall, Viktor 2002 Link to publication Citation for published version (APA): Jiang, H., & Öwall, V. (2002). ler Synthesis for Hardware Accelerator

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Outline of Presentation Field Programmable Gate Arrays (FPGAs(

Outline of Presentation Field Programmable Gate Arrays (FPGAs( FPGA Architectures and Operation for Tolerating SEUs Chuck Stroud Electrical and Computer Engineering Auburn University Outline of Presentation Field Programmable Gate Arrays (FPGAs( FPGAs) How Programmable

More information

Topics/Assignments. Class 10: Big Picture. What s Coming Next? Perspectives. So Far Mostly Programmer Perspective. Where are We? Where are We Going?

Topics/Assignments. Class 10: Big Picture. What s Coming Next? Perspectives. So Far Mostly Programmer Perspective. Where are We? Where are We Going? Fall 2006 CS333: Computer Architecture University of Virginia Computer Science Michele Co Topics/Assignments Class 10: Big Picture Survey Homework 1 Read Compilers and Computer Architecture Principles/factors

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

Vertex Shader Design I

Vertex Shader Design I The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only

More information