Perspectives for the Use of Field Programmable Gate Arrays for Finite Element Computations
Gerhard Lienhart¹, Daniel Gembris¹, Reinhard Männer¹
¹ Universität Mannheim, {lienhart,gembris,maenner}@ti.uni-mannheim.de

Abstract. We have studied how the solution of partial differential equations by means of finite element methods (FEM) could be accelerated using Field Programmable Gate Arrays (FPGAs). First, we discuss in general the capabilities of current FPGA technology for floating-point implementations of number crunching. Based on practical results for basic floating-point operators, performance limits are outlined. Then the perspectives for the implementation of LU decomposition with a state-of-the-art FPGA chip are addressed. It is estimated that, compared with a modern CPU, a speedup by a factor of 15-20 can be expected using a single off-the-shelf FPGA.

1 Introduction

In recent years there have been strong activities in the field of reconfigurable computing. In the focus of this emerging branch of computing are FPGAs: highly integrated semiconductor chips whose function can be freely specified at the gate level, in contrast to conventional CPUs. Given a suitable computational problem, the design flexibility of these chips allows massively parallel computations, which can by far overcompensate the intrinsic disadvantage of lower clock speeds, being O(1/10) compared to state-of-the-art CPUs (Compton & Hauck, 2002). One class of problems that could clearly benefit from the acceleration achievable with FPGAs is real-time FEM, a technique that is of relevance for virtual reality applications, like medical surgery simulation or interactive engineering (Rhomberg et al. 1998, Bro-Nielsen 1998, Margetts et al. 2005), or simulations in the context of robot control. In general, virtual reality applications require that any action of the user is followed by an immediate response of the simulation system, i.e. within less than about 20 ms.
An example would be that the user applies a force to a virtual object, which should result in direct feedback mediating an impression of the evoked object deformation (mainly in terms of vision and haptics, provided by visual and tactile/haptic displays). While the demands on computation speed are high, the demands on accuracy are only moderate, since in a virtual reality application a qualitative agreement between the simulation and the corresponding real physical process is mostly sufficient. These characteristics make the use of FPGAs very attractive for solving the underlying FEM problems, due to the high degree of parallelism and the reduction in bit width of the floating-point numbers that become possible. Figure 1 shows a schematic representation of an FPGA. The chip essentially consists of an array of configurable logic blocks interconnected by a programmable network. The logic blocks provide basic combinatorial logic realized by small look-up tables (LUTs) and flip-flops, and are usually enhanced with additional features for efficient data buffering and arithmetic functionality. Modern FPGAs contain many thousands of programmable logic blocks. For example, the largest of the new Virtex-4 FPGAs, the XC4VFX140 manufactured by XILINX, has 63,168 slices, where a slice contains two LUTs and two flip-flops. Modern FPGAs also comprise hard-wired circuits like multipliers and memories. For the above-mentioned FPGA there are 552 18-kbit memory elements and 192 so-called DSP48 cells, which contain an 18-bit multiplier and additional logic for efficiently composing more complex integer operators, e.g.
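The trade-off between mantissa width and accuracy mentioned above can be explored in ordinary software before committing to an FPGA design. The following sketch is our own illustration (not the authors' operator library): it emulates a narrower floating-point format by rounding IEEE-754 doubles to a reduced mantissa width.

```python
import struct

def reduce_mantissa(x: float, bits: int) -> float:
    """Emulate a float with `bits` mantissa bits by rounding a 64-bit double.
    IEEE-754 doubles carry 52 explicit mantissa bits; the low-order bits are
    rounded away, while sign and exponent are kept at full range."""
    (i,) = struct.unpack("<Q", struct.pack("<d", x))
    drop = 52 - bits
    i = (i + (1 << (drop - 1))) & ~((1 << drop) - 1)  # round to nearest
    return struct.unpack("<d", struct.pack("<Q", i))[0]

# Dot product of a decaying vector: full double vs. emulated 16-bit mantissa
a = [1.0 / (k + 1) for k in range(100)]
full = sum(x * x for x in a)
narrow = sum(reduce_mantissa(x, 16) ** 2 for x in a)
print(full, narrow)
```

With a 16-bit mantissa the relative error of the dot product stays in the 1e-5 range, which would already satisfy the moderate accuracy demands of a virtual reality simulation.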
multiply/add operators with a higher bit width. With modern FPGAs, complex system-on-a-chip (SoC) architectures can be built. These systems may consist of dedicated processing units as well as programmable units like soft-processors (e.g. the MicroBlaze processor core, XILINX 2005). With the advent of modern FPGA technology it became possible to design custom computing machines without the need to construct application-specific integrated circuits (ASICs). A common approach to making FPGA technology available for applications is to build a coprocessor board which can be installed in a host computer. The right side of Fig. 1 shows such a board. It has been developed at the University of Mannheim as a prototype system for different applications, ranging from image processing to high-performance computing for astrophysical simulations.

Fig. 1. Schematic representation of an FPGA (left) and a typical FPGA-based PCI board (right).

2 Basics for FEM processing with FPGAs

While for many areas where FPGAs are already established (e.g. signal processing) fixed-point arithmetic is sufficient, number crunching applications usually require floating-point computations. Only a few years ago did FPGAs become available which are suitable for implementing complex calculation units for floating-point arithmetic. Related implementation issues have already been studied to some extent (Belanovic & Leeser 2002, Roesler & Nelson 2002, Underwood 2004). Currently there are steadily increasing activities in the scientific and engineering community to apply reconfigurable architectures to problems of high-performance computing. For floating-point based high-performance computing see e.g. Lienhart et al. (2002) and Hamada & Nakasato (2005). Figure 2 and Tab. 1 demonstrate the relation between precision and FPGA logic utilization for different operators. Limiting the precision requirements allows for highly parallel calculation units.
The above-mentioned FPGA XC4VFX140 may, for example, contain up to 150 adders and 50 multipliers with single precision, or 60 adders and 20 multipliers with double precision. When only using multiply-add units we can estimate a peak performance of 20 GFLOPs for single precision (50 units) or 8 GFLOPs for double precision (20 units).

Fig. 2. Scaling of the logic consumption of adders and multipliers from the University of Mannheim libraries for different mantissa widths (single precision corresponds to a 24-bit mantissa, double precision to 53 bits).
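The peak figures quoted above follow from a simple unit count. A small sketch of the arithmetic; the 200 MHz clock is our assumption for illustration, chosen so that the numbers reproduce the text:

```python
def peak_gflops(mul_add_units: int, clock_mhz: float) -> float:
    """Peak rate when every multiply-add unit retires 2 FLOPs per cycle."""
    return mul_add_units * 2 * clock_mhz / 1000.0

sp = peak_gflops(50, 200)  # 50 single-precision multiply-add units
dp = peak_gflops(20, 200)  # 20 double-precision multiply-add units
print(sp, dp)              # 20 and 8 GFLOPs, as estimated in the text
```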
Tab. 1. Logic consumption, speed and latency (clock cycles) of different single-precision (SP) and double-precision (DP) floating-point operators for Virtex-4 (-10 speed grade) FPGAs, with libraries from the University of Mannheim and from XILINX Inc. (information for the XILINX operators according to the data sheet, XILINX 2005). For each operator the table lists the Virtex-4 slices, DSP48 cells, speed (MHz) and latency; the operators covered are SP/DP adders, multipliers and dividers as well as an SP accumulator.

2.1 Target algorithms

Generally the finite element method leads to a set of coupled partial differential equations, which can be approximated by a system of linear equations A x = b. Usually the matrix A is sparse and, depending on the type of the underlying problem, exhibits additional properties like symmetry or diagonal dominance. For small problem sizes, the direct solution via LU decomposition is generally applicable. A standard approach in these cases is the use of multifrontal methods like those implemented in UMFPACK (Davis & Duff 1997), explained below. The LU decomposition method solves the linear system A x = (L U) x = b by forward substitution with the lower (L) and backward substitution with the upper (U) triangular matrix:

  L y = b,   U x = y.

A simple algorithm for performing the LU decomposition of an n x n matrix is Crout's algorithm, which successively performs the following calculations for j = 1 ... n:

  u_ij = A_ij - sum_{k=1..i-1} l_ik u_kj,   for i = 1 ... j
  l_ij = (A_ij - sum_{k=1..j-1} l_ik u_kj) / u_jj,   for i = j+1 ... n

In general, LU decomposition is an algorithm of complexity O(n^3), and the forward and backward substitutions have complexity O(n^2). But the number of non-zero coefficients in sparse matrices from FEM usually scales with O(n), instead of O(n^2) in the general case, which may reduce the effort dramatically. For optimizing the memory requirements and access patterns, the multifrontal method was developed (for a review see Liu 1992).
The method organizes the numerical factorization into a series of steps which involve partial factorization of smaller dense matrices (frontal matrices). The steps are processed according to an elimination tree structure, which provides a natural framework for parallel computations. The method allows for better data locality and efficient out-of-core schemes, and avoids indirect addressing within the factorization steps. LU decomposition is widely applicable and is the default method in FEMLAB for 1D and 2D problems. Even when iterative solvers are more appropriate, (incomplete) LU factorization can be used as a preconditioner (although algebraic multigrid methods are often preferred).
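Crout's recurrences translate almost literally into code. A minimal dense reference sketch in Python, without the pivoting and sparsity handling that the multifrontal method adds:

```python
def crout_lu(A):
    """Dense Crout LU: returns (L, U) with A = L*U and a unit diagonal on L.
    Follows the recurrences above; O(n^3), no pivoting."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j + 1):       # u_ij for i = 1 ... j
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        L[j][j] = 1.0
        for i in range(j + 1, n):    # l_ij for i = j+1 ... n
            L[i][j] = (A[i][j] - sum(L[i][k] * U[k][j] for k in range(j))) / U[j][j]
    return L, U

def lu_solve(L, U, b):
    """Forward substitution L y = b, then backward substitution U x = y."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

L, U = crout_lu([[4.0, 3.0], [6.0, 3.0]])
x = lu_solve(L, U, [10.0, 12.0])   # solves 4a + 3b = 10, 6a + 3b = 12
print(x)
```

The inner sums are exactly the vector dot products that the architecture proposed below maps onto multiply-accumulate units.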
2.2 Related work

Previous work on FPGA implementations of linear solvers mainly focuses on direct LU decomposition with algorithms that are simple to handle. Efficient algorithms which can be applied more generally, like multifrontal methods with pivoting and complex submatrix handling, are still a subject of research. It has already been studied how basic linear algebra processing, like matrix-vector multiplication, can be realized with FPGAs (for recent work see e.g. Underwood & Hemmert 2004, Zhuo & Prasanna 2005 and DeLorimier & DeHon 2005). DeLorimier, for example, presents a dedicated processing unit for matrix-vector multiplication with double precision which achieves a sustained performance of 1.5 GFLOPs on a single FPGA (XC2V6000-4). Underwood provides trends and performance comparisons for different BLAS library operations on FPGAs and CPUs. For example, a performance of 4 GFLOPs on a single FPGA (XC2VP100-6) for matrix multiplication at double precision was reported. Concerning LU decomposition, there are FPGA implementations of different block-based algorithms (Choi & Prasanna 2003, Daga et al. 2004, Wang & Ziavras 2004). To our knowledge no pivoting is applied in this previous work, which limits the applicability to special problems. The design of Wang and Ziavras was developed for equation systems which can be represented by bordered-diagonal-block sparse matrices. A multiprocessor system on a single FPGA has been implemented, but unfortunately the performance was not compared to standard CPUs. Daga et al. present an LU decomposition for large matrices (n = 1000) which is not optimized for sparse systems. For their test case they claim a speedup of 23x in total computing time with double-precision calculation on a very large FPGA (Virtex-2 Pro XC2VP125) compared to a 1.4 GHz Pentium M CPU.
3 Proposed architecture

We propose an architecture with multiple equal processing elements (PEs), each consisting of a dedicated computing element for linear algebra processing with local memory, controlled by a soft-processor. The design of the computing element is optimized for the basic linear algebra processing tasks, like the vector dot product, which are required for LU decomposition. The basic computational step for these tasks is a multiply-accumulate operation, to which the primary building blocks are hence dedicated. The left side of Fig. 3 shows the structure of a PE, which can be used in a flexible manner due to the multiplexers between the operators. This architecture allows independent multiplication and addition as well as vectorized multiply-accumulate operations. In order to provide a sustained performance close to peak performance, a dedicated accumulator is used in addition to the adder. Our floating-point accumulator accumulates positive and negative values separately. Therefore a final add must be performed at the end of an accumulation, which is why a loop-back from the accumulator output to the inputs of the adder is routed. Basic linear algebra operations are performed automatically, controlled by the Linear Algebra Task Control (LATC) unit. The data flow is controlled by the Operand Fetch and Result Dispatch units. The operations are performed on local memory (internal FPGA dual-port block RAM). The attached processor core takes care of data management and processing control by assembling data in the memory and triggering operations for the LATC unit. Using a processor for this purpose provides the flexibility to apply enhanced factorization algorithms like the multifrontal method. We propose to use the XILINX MicroBlaze soft-processor. With an on-chip memory interface, multiple special point-to-point communication channels to attached circuits, and an on-chip peripheral bus to connect to the environment, this core provides all necessary interfaces to assemble the PEs as drawn.
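The split accumulation scheme described above can be modeled behaviorally in a few lines. This is our own sketch of the idea (positive and negative partial sums kept apart, combined by one final add); the hardware pipelining is omitted:

```python
def split_accumulate(values):
    """Accumulate positive and negative summands separately, as the PE's
    dedicated accumulator does, and combine them with a single final add
    (in hardware this final add is routed through the PE's adder via the
    loop-back from the accumulator output)."""
    pos = 0.0
    neg = 0.0
    for v in values:
        if v >= 0.0:
            pos += v
        else:
            neg += v
    return pos + neg  # the single final add

print(split_accumulate([1.5, -0.25, 2.0, -1.0]))
```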
On the right side of Fig. 3 the overall architecture of the proposed FPGA design is shown. Several PEs share a single divider, as this operation is rarely needed. The PEs are connected with each other, with the off-chip memory, and with the host interface via the on-chip peripheral bus of the processor cores. The number m of PEs which share a divider is a parameter and depends on the target application; m = 8 would be a reasonable choice. The detailed design of the interconnection between the PEs and the host and memory interfaces would depend on the respective reconfigurable platform.

Fig. 3. Proposed processing element (left) and overall architecture (right).

4 Performance analysis

With the design outlined above we can now estimate the performance achievable in FEM processing. A single calculation unit consisting of a multiplier, an adder and an accumulator consumes about 1000 slices and 4 DSP48 cells. The MicroBlaze soft-processor adds about 500 slices, and we can charge another 200 slices for control and data management. Altogether a PE will require about 1700 slices. Therefore the mentioned FPGA XC4VFX140 may house 32 processing elements including 8 single-precision dividers. The on-chip memory allows local dual-port buffers of 38 KBytes per PE. Taking into account that a significant fraction of this memory is used as instruction memory for the processor core, this amount of memory limits a PE to handling submatrices with a size of up to about 70x70. Calculations with larger matrices need to be divided among more than one PE. Assuming a conservative clock frequency of 160 MHz and two concurrent floating-point operations per PE, we get a peak performance of 10 GFLOPs on a single FPGA.
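The budget above is straightforward to re-derive; a sketch using only the figures quoted in the text:

```python
# Per-PE slice budget for single precision, as quoted above
slices_per_pe = 1000 + 500 + 200      # calc unit + MicroBlaze + control

pes = 32                              # PEs housed by the XC4VFX140,
                                      # including the 8 shared dividers
peak_gflops = pes * 2 * 160 / 1000.0  # 2 concurrent FP ops per PE at 160 MHz
print(slices_per_pe, peak_gflops)     # 1700 slices per PE, roughly 10 GFLOPs
```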
Estimating an efficiency of more than 50%, which is reasonable for custom computing architectures, we get a sustained performance of about 5 GFLOPs. Assuming 300 MFLOPs for a general-purpose CPU, which is realistic as we usually see an efficiency of less than 10% for the solution of sparse systems, we can expect a speedup of 15-20. When going to double precision, a single processing element would consume about 3300 slices and 9 DSP48 cells. Therefore only 16 PEs will fit on the same FPGA. Estimating the performance in the same way as above, a peak performance of 5 GFLOPs and an estimated speedup of 5-10 result. About 25% of the logic resources could be saved by also using the adder in the PE for accumulation. Because of the latency of the adder, however, performing the multiply-accumulate operations efficiently would then become much more complex and would require an additional feedback FIFO between the output of the adder and one of its inputs. It would also be more difficult to get close to peak performance, i.e. the algorithm would have to allow a much higher degree of parallelism in order to hide the adder latency. But such a design would enable a 48-fold parallelization with a peak performance of more than 15 GFLOPs.
5 Conclusions

We demonstrated the potential of reconfigurable computing for solving the linear equations arising from FEM problems. A dedicated FPGA design consisting of multiple equal processing elements was outlined, and a resulting performance of 10 GFLOPs for single precision was estimated. Because of the integrated processor cores, the architecture is capable of dealing with different algorithms for LU decomposition. The design should also provide good performance for other solvers, like those based on the method of conjugate gradients (iterative) or algebraic multigrid solvers. The efficiency, and therefore the overall speedup, depends on how efficiently the underlying algorithms can be parallelized to perform independent basic linear algebra operations on local memory. For FPGA computing machines this is a major topic of research, and the prospect of computation power for number crunching strongly motivates further work in this direction.

References

Belanovic, P.; Leeser, M. (2002) A Library of Parameterized Floating-Point Modules and Their Use. Proc. FPL'02.
Bro-Nielsen, M. (1998) Finite element modeling in surgery simulation. Proc. of the IEEE 86(3).
Choi, S.; Prasanna, V.K. (2003) Time and Energy Efficient Matrix Factorization using FPGAs. Proc. FPL'03.
Compton, K.; Hauck, S. (2002) Reconfigurable computing: A survey of systems and software. ACM Computing Surveys 34(2).
Davis, T.A.; Duff, I. (1997) An Unsymmetric-pattern Multifrontal Method for Sparse LU Factorization. SIAM J. Matrix Anal. Appl. 18(1).
DeLorimier, M.; DeHon, A. (2005) Floating-Point Sparse Matrix-Vector Multiply for FPGAs. Proc. FPGA'05.
Hamada, T.; Nakasato, N. (2005) PGR: A Software Package for Reconfigurable Super-Computing. Proc. FPL'05.
Margetts, L.; Smethurst, C.; Ford, R. (2005) Interactive Finite Element Analysis. NAFEMS World Congress, Malta, May 2005.
Johnson, J.R.; Nagvajara, P.; Nwankpa, C. (2004) Sparse Linear Solver for Power System Analysis using FPGA. Proc. HPEC'04.
Heath, M.T.
(1997) Parallel direct methods for sparse linear systems. In: Parallel Numerical Algorithms, Kluwer, Boston.
Lienhart, G.; Kugel, A.; Männer, R. (2002) Using Floating Point Arithmetic on FPGAs for Accelerating Scientific N-Body Simulations. Proc. FCCM'02.
Liu, J. (1992) The Multifrontal Method for Sparse Matrix Solution: Theory and Practice. SIAM Review 34.
Rhomberg, A.; Enzler, R.; Thaler, M.; Tröster, G. (1998) Design of a FEM Computation Engine for Real-Time Laparoscopic Surgery Simulation. Proc. of the IPPS/SPDP.
Roesler, E.; Nelson, B.E. (2002) Novel Optimizations for Hardware Floating-Point Units in a Modern FPGA Architecture. Proc. FPL'02.
Underwood, K. (2004) FPGAs vs. CPUs: Trends in Peak Floating-Point Performance. Proc. FPGA'04.
Underwood, K.; Hemmert, K.S. (2004) Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance. Proc. FCCM'04.
Wang, X.; Ziavras, S.G. (2004) Parallel LU factorization of sparse matrices on FPGA-based configurable computing engines. Concurrency Computat.: Pract. Exper. 16.
XILINX Inc. (2005) Floating-Point Operator v1.0 Product Specification. DS335.
XILINX Inc. (2005) MicroBlaze Processor Reference Guide.
Zhuo, L.; Prasanna, V.K. (2005) Sparse Matrix-Vector Multiplication on FPGAs. Proc. FPGA'05.
Xu, X.; Ziavras, S.G. (2003) Iterative Methods for Solving Linear Systems of Equations on FPGA-Based Machines. Proc. Computers and their Applications.
More informationRUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch
RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,
More informationInternational Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering
An Efficient Implementation of Double Precision Floating Point Multiplier Using Booth Algorithm Pallavi Ramteke 1, Dr. N. N. Mhala 2, Prof. P. R. Lakhe M.Tech [IV Sem], Dept. of Comm. Engg., S.D.C.E, [Selukate],
More informationReconfigurable Hardware Implementation of Mesh Routing in the Number Field Sieve Factorization
Reconfigurable Hardware Implementation of Mesh Routing in the Number Field Sieve Factorization Sashisu Bajracharya, Deapesh Misra, Kris Gaj George Mason University Tarek El-Ghazawi The George Washington
More informationMatrix Multiplication Implementation in the MOLEN Polymorphic Processor
Matrix Multiplication Implementation in the MOLEN Polymorphic Processor Wouter M. van Oijen Georgi K. Kuzmanov Computer Engineering, EEMCS, TU Delft, The Netherlands, http://ce.et.tudelft.nl Email: {w.m.vanoijen,
More informationDeveloping a Data Driven System for Computational Neuroscience
Developing a Data Driven System for Computational Neuroscience Ross Snider and Yongming Zhu Montana State University, Bozeman MT 59717, USA Abstract. A data driven system implies the need to integrate
More informationHigh Throughput Iterative VLSI Architecture for Cholesky Factorization based Matrix Inversion
High Throughput Iterative VLSI Architecture for Cholesky Factorization based Matrix Inversion D. N. Sonawane 1 and M. S. Sutaone 2 1 Department of Instrumentation & Control 2 Department of Electronics
More informationHigh Throughput Energy Efficient Parallel FFT Architecture on FPGAs
High Throughput Energy Efficient Parallel FFT Architecture on FPGAs Ren Chen Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, USA 989 Email: renchen@usc.edu
More informationFPGA Accelerated Parallel Sparse Matrix Factorization for Circuit Simulations*
FPGA Accelerated Parallel Sparse Matrix Factorization for Circuit Simulations* Wei Wu, Yi Shan, Xiaoming Chen, Yu Wang, and Huazhong Yang Department of Electronic Engineering, Tsinghua National Laboratory
More informationIterative Refinement on FPGAs
Iterative Refinement on FPGAs Tennessee Advanced Computing Laboratory University of Tennessee JunKyu Lee July 19 th 2011 This work was partially supported by the National Science Foundation, grant NSF
More informationSegment 1A. Introduction to Microcomputer and Microprocessor
Segment 1A Introduction to Microcomputer and Microprocessor 1.1 General Architecture of a Microcomputer System: The term microcomputer is generally synonymous with personal computer, or a computer that
More informationCREATED BY M BILAL & Arslan Ahmad Shaad Visit:
CREATED BY M BILAL & Arslan Ahmad Shaad Visit: www.techo786.wordpress.com Q1: Define microprocessor? Short Questions Chapter No 01 Fundamental Concepts Microprocessor is a program-controlled and semiconductor
More informationFPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011
FPGA for Complex System Implementation National Chiao Tung University Chun-Jen Tsai 04/14/2011 About FPGA FPGA was invented by Ross Freeman in 1989 SRAM-based FPGA properties Standard parts Allowing multi-level
More informationFPGA architecture and implementation of sparse matrix vector multiplication for the finite element method
Computer Physics Communications 178 (2008) 558 570 www.elsevier.com/locate/cpc FPGA architecture and implementation of sparse matrix vector multiplication for the finite element method Yousef Elkurdi,
More informationIntroduction to Field Programmable Gate Arrays
Introduction to Field Programmable Gate Arrays Lecture 1/3 CERN Accelerator School on Digital Signal Processing Sigtuna, Sweden, 31 May 9 June 2007 Javier Serrano, CERN AB-CO-HT Outline Historical introduction.
More informationEfficient Multi-GPU CUDA Linear Solvers for OpenFOAM
Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,
More informationCHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP
133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located
More informationQUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection
QUKU: A Fast Run Time Reconfigurable Platform for Image Edge Detection Sunil Shukla 1,2, Neil W. Bergmann 1, Jürgen Becker 2 1 ITEE, University of Queensland, Brisbane, QLD 4072, Australia {sunil, n.bergmann}@itee.uq.edu.au
More informationNew Computational Modeling for Solving Higher Order ODE based on FPGA
New Computational Modeling for Solving Higher Order ODE based on FPGA Alireza Fasih 1, Tuan Do Trong 2, Jean Chamberlain Chedjou 3, Kyandoghere Kyamakya 4 1, 3, 4 Alpen-Adria University of Klagenfurt Austria
More informationVirtex-II Architecture. Virtex II technical, Design Solutions. Active Interconnect Technology (continued)
Virtex-II Architecture SONET / SDH Virtex II technical, Design Solutions PCI-X PCI DCM Distri RAM 18Kb BRAM Multiplier LVDS FIFO Shift Registers BLVDS SDRAM QDR SRAM Backplane Rev 4 March 4th. 2002 J-L
More informationReconfigurable Computing. Introduction
Reconfigurable Computing Tony Givargis and Nikil Dutt Introduction! Reconfigurable computing, a new paradigm for system design Post fabrication software personalization for hardware computation Traditionally
More informationSoC Basics Avnet Silica & Enclustra Seminar Getting started with Xilinx Zynq SoC Fribourg, April 26, 2017
1 2 3 4 Introduction - Cool new Stuff Everybody knows, that new technologies are usually driven by application requirements. A nice example for this is, that we developed portable super-computers with
More informationImproving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints
Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent
More informationInternational Journal of Advance Engineering and Research Development
Scientific Journal of Impact Factor (SJIF): 4.14 International Journal of Advance Engineering and Research Development Volume 3, Issue 11, November -2016 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Review
More informationImplementation of a FIR Filter on a Partial Reconfigurable Platform
Implementation of a FIR Filter on a Partial Reconfigurable Platform Hanho Lee and Chang-Seok Choi School of Information and Communication Engineering Inha University, Incheon, 402-751, Korea hhlee@inha.ac.kr
More informationHonorary Professor Supercomputer Education and Research Centre Indian Institute of Science, Bangalore
COMPUTER ORGANIZATION AND ARCHITECTURE V. Rajaraman Honorary Professor Supercomputer Education and Research Centre Indian Institute of Science, Bangalore T. Radhakrishnan Professor of Computer Science
More informationGeneral Purpose Signal Processors
General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:
More informationCNP: An FPGA-based Processor for Convolutional Networks
Clément Farabet clement.farabet@gmail.com Computational & Biological Learning Laboratory Courant Institute, NYU Joint work with: Yann LeCun, Cyril Poulet, Jefferson Y. Han Now collaborating with Eugenio
More informationPipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications
, Vol 7(4S), 34 39, April 204 ISSN (Print): 0974-6846 ISSN (Online) : 0974-5645 Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications B. Vignesh *, K. P. Sridhar
More informationAltera FLEX 8000 Block Diagram
Altera FLEX 8000 Block Diagram Figure from Altera technical literature FLEX 8000 chip contains 26 162 LABs Each LAB contains 8 Logic Elements (LEs), so a chip contains 208 1296 LEs, totaling 2,500 16,000
More information5. ReAl Systems on Silicon
THE REAL COMPUTER ARCHITECTURE PRELIMINARY DESCRIPTION 69 5. ReAl Systems on Silicon Programmable and application-specific integrated circuits This chapter illustrates how resource arrays can be incorporated
More informationXilinx DSP. High Performance Signal Processing. January 1998
DSP High Performance Signal Processing January 1998 New High Performance DSP Alternative New advantages in FPGA technology and tools: DSP offers a new alternative to ASICs, fixed function DSP devices,
More informationVendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs
Vendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs Xin Fang and Miriam Leeser Dept of Electrical and Computer Eng Northeastern University Boston, Massachusetts 02115
More informationEE 3170 Microcontroller Applications
EE 3170 Microcontroller Applications Lecture 4 : Processors, Computers, and Controllers - 1.2 (reading assignment), 1.3-1.5 Based on slides for ECE3170 by Profs. Kieckhafer, Davis, Tan, and Cischke Outline
More informationStreaming Reduction Circuit for Sparse Matrix Vector Multiplication in FPGAs
Computer Science Faculty of EEMCS Streaming Reduction Circuit for Sparse Matrix Vector Multiplication in FPGAs Master thesis August 15, 2008 Supervisor: dr.ir. A.B.J. Kokkeler Committee: dr.ir. A.B.J.
More information[Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개
[Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개 정승혁과장 Senior Application Engineer MathWorks Korea 2015 The MathWorks, Inc. 1 Outline When FPGA, ASIC, or System-on-Chip (SoC) hardware is needed Hardware
More informationA Library of Parameterized Floating-point Modules and Their Use
A Library of Parameterized Floating-point Modules and Their Use Pavle Belanović and Miriam Leeser Department of Electrical and Computer Engineering Northeastern University Boston, MA, 02115, USA {pbelanov,mel}@ece.neu.edu
More informationModeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors
Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors Siew-Kei Lam Centre for High Performance Embedded Systems, Nanyang Technological University, Singapore (assklam@ntu.edu.sg)
More informationA VARIETY OF ICS ARE POSSIBLE DESIGNING FPGAS & ASICS. APPLICATIONS MAY USE STANDARD ICs or FPGAs/ASICs FAB FOUNDRIES COST BILLIONS
architecture behavior of control is if left_paddle then n_state
More informationAn FPGA Implementation of the Powering Function with Single Precision Floating-Point Arithmetic
An FPGA Implementation of the Powering Function with Single Precision Floating-Point Arithmetic Pedro Echeverría, Marisa López-Vallejo Department of Electronic Engineering, Universidad Politécnica de Madrid
More informationEfficient Self-Reconfigurable Implementations Using On-Chip Memory
10th International Conference on Field Programmable Logic and Applications, August 2000. Efficient Self-Reconfigurable Implementations Using On-Chip Memory Sameer Wadhwa and Andreas Dandalis University
More informationPerformance and accuracy of hardware-oriented native-, solvers in FEM simulations
Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations Dominik Göddeke Angewandte Mathematik und Numerik, Universität Dortmund Acknowledgments Joint
More informationMassively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain
Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,
More informationKeck-Voon LING School of Electrical and Electronic Engineering Nanyang Technological University (NTU), Singapore
MPC on a Chip Keck-Voon LING (ekvling@ntu.edu.sg) School of Electrical and Electronic Engineering Nanyang Technological University (NTU), Singapore EPSRC Project Kick-off Meeting, Imperial College, London,
More informationController Synthesis for Hardware Accelerator Design
ler Synthesis for Hardware Accelerator Design Jiang, Hongtu; Öwall, Viktor 2002 Link to publication Citation for published version (APA): Jiang, H., & Öwall, V. (2002). ler Synthesis for Hardware Accelerator
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationOutline of Presentation Field Programmable Gate Arrays (FPGAs(
FPGA Architectures and Operation for Tolerating SEUs Chuck Stroud Electrical and Computer Engineering Auburn University Outline of Presentation Field Programmable Gate Arrays (FPGAs( FPGAs) How Programmable
More informationTopics/Assignments. Class 10: Big Picture. What s Coming Next? Perspectives. So Far Mostly Programmer Perspective. Where are We? Where are We Going?
Fall 2006 CS333: Computer Architecture University of Virginia Computer Science Michele Co Topics/Assignments Class 10: Big Picture Survey Homework 1 Read Compilers and Computer Architecture Principles/factors
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationVertex Shader Design I
The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only
More information