High-Performance Linear Algebra Processor using FPGA


J. R. Johnson, P. Nagvajara, C. Nwankpa

Extended Abstract

With recent advances in FPGA (Field Programmable Gate Array) technology it is now feasible to use these devices to build special-purpose processors for floating-point intensive applications that arise in scientific computing. FPGAs provide programmable hardware that can be used to design custom hardware without the high cost of traditional hardware design. In this talk we discuss two multiprocessor designs using FPGAs for basic linear algebra computations such as matrix multiplication and LU factorization. The first design is a purely hardware solution for dense matrix computations, and the second is a hardware/software solution for sparse matrix computations. The hardware solution uses the regular structure available in dense linear algebra computations to design custom processors with hard-wired communication patterns. The hardware/software solution uses embedded processors with the flexibility to program the irregular communication patterns required by sparse matrix computations.

The dense matrix processor uses a distributed memory architecture connected in a ring topology, with hardwired control for communication. Each processing element consists of pipelined multiply-accumulate hardware and local memory that stores part of the input and output matrices.

This work was partially supported by DOE grant #ER63384, PowerGrid - A Computation Engine for Large-Scale Electric Networks.

J. R. Johnson: Department of Computer Science, Drexel University, Philadelphia, PA 19104. Email: jjohnson@cs.drexel.edu
P. Nagvajara: Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104. Email: nagvajara@ece.drexel.edu
C. Nwankpa: Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104. Email: chika@nwankpa.ece.drexel.edu


Extra buffers are available to overlap communication and computation, so that while a computation is being performed the next inputs can be downloaded from the host computer. This allows the FPGA to be used as a hardware accelerator as part of a larger matrix computation using various block algorithms. In fact, the matrix processor supports BLAS routines such as GEMM (http://www.netlib.org/blas/).

The sparse matrix processor also uses a distributed memory architecture; however, it uses embedded processor cores which execute parallel programs written in C. A ring interconnect was designed, and special instructions were added to the processor cores to support interprocessor communication. In addition, hardware support for floating-point computations was added and is accessible to the processors through extra instructions. A parallel algorithm for LU factorization was implemented. The resulting processor for sparse LU factorization is being designed for applications in power computations, where very large sparse systems arising from load-flow computation are solved as part of the Newton-Raphson method for solving a system of non-linear equations. These systems have problem-specific structure which can be incorporated into the processor design. While the current investigation centers around this application, many other problems could benefit from a high-performance processor geared towards both sparse and dense matrix computations.

In this work parameterized designs, allowing a scalable number of processors, were created using the hardware description language VHDL. The resulting design was synthesized and tested using the Cyclone development board from Altera (http://www.altera.com/). Embedded Nios processors were used for the sparse LU factorization processor. While this board allowed verification of the design, it does not have sufficient hardware resources to adequately demonstrate the potential performance. Current FPGA devices have built-in hardware multipliers and large amounts of on-chip memory, in addition to a large number of programmable logic blocks. These resources are crucial to obtaining high performance, especially for floating-point computations. For example, the Xilinx Virtex 2 (XC2V8000) (see www.xilinx.com) has 168 pipelined 18-bit multipliers which can be used to implement 42 single-precision multiply-accumulate units. Consequently a high-level performance model was created to estimate the performance attainable with the high-end FPGA devices that are currently available and those that will be available in the next few years. The predicted performance was compared to efficient software solutions on high-speed workstations and clusters of workstations.

Using clock speeds and the number of processors that can be implemented on the Virtex 2, a 400 x 400 matrix can be multiplied in 11 ms, which corresponds to 11,570 MFLOPS. This compares to 10,000 MFLOPS obtained using a four-processor 800 MHz Itanium system with the Intel Math Kernel Library (see http://www.intel.com/software/products/mkl/mkl52/specs.htm).

A similar performance model was used for the sparse linear solver. Assuming eight processors running at 400 MHz with an 80 MHz floating-point unit, which is currently possible using Xilinx FPGAs with a floating-point core sold by Digital Core Design (http://www.dcd.com.pl/), the linear system needed to solve the 7917-bus power system could be solved in 0.084 seconds, compared to 0.666 seconds using the WSMP sparse linear solver (http://www.research.ibm.com/math/opresearch/wsmp.html) on a 1.4 GHz Pentium IV. While the times reported here are estimates based on a performance model, and hence should not be taken at face value, they show enough promise that the use of high-end FPGA devices, on a board with multiple FPGAs, to build a special-purpose linear algebra processor should be seriously investigated.

PowerGrid Hardware
Linear Algebra Processor using FPGA
Jeremy Johnson, Prawat Nagvajara, Chika Nwankpa
Drexel University
[Diagram: power system model and tie-line flows feeding the FPGA linear algebra processor; engineering analysis, decision making, and economic analysis stations connected by an ultra-fast bus.]

Goal
- Design an embedded FPGA-based multiprocessor system to perform high-speed power flow analysis.
- Provide a single desktop environment to solve the entire power flow problem ("multiprocessors on the desktop").
- Provide a scalable solution to load flow computation.
- Deliverable: prototype and feasibility analysis.

Approach
- Utilize parallel algorithms for the matrix operations needed in load flow.
- Utilize sparsity structure across contingencies.
- Use multiple embedded processors with problem-specific instructions and interconnect.
- Scalable, parameterized design.
- Pipelined solution for contingency analysis.

Dense Matrix Multiplier
- Distributed memory implementation
- Hardwired control for communication
- Parameterized number of processing elements (multiply-and-accumulate hardware)
- Overlapped computation and communication
- Block matrix algorithms with calls to the FPGA for large problems
- Processor i stores A_i, the ith block of rows of A, and B_j, the jth block of columns of B
- Compute C_ij = A_i * B_j
- Rotate: send B_j to processor (j+1) mod (number of processors); the full schedule is sketched below
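A minimal software model of this schedule is sketched below, assuming the matrix dimension divides evenly among the processing elements. P, N, and the helper names are illustrative only, and the sequential loop over processing elements here runs concurrently in the hardware.

/* Hedged sketch: software model of the ring schedule used by the dense
 * matrix multiplier.  The real design implements this in hardware with
 * hardwired communication and pipelined multiply-accumulate units. */
#include <string.h>

#define P  4           /* number of processing elements (parameterized) */
#define N  8           /* matrix dimension, assumed divisible by P      */
#define NB (N / P)     /* block size                                    */

static float A[P][NB][N];      /* A[i] = ith block of rows of A                 */
static float Bblk[P][N][NB];   /* Bblk[i] = column block currently held by PE i */
static float C[P][NB][P][NB];  /* C[i][..][j][..] = block C_ij                  */

static void multiply_accumulate(int i, int j)
{
    /* C_ij = A_i * B_j : what one PE's pipelined MAC units compute per step */
    for (int r = 0; r < NB; r++)
        for (int c = 0; c < NB; c++) {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)
                acc += A[i][r][k] * Bblk[i][k][c];
            C[i][r][j][c] = acc;
        }
}

void ring_matmul(void)
{
    int held[P];                          /* which B block each PE currently holds */
    for (int i = 0; i < P; i++) held[i] = i;

    for (int step = 0; step < P; step++) {
        for (int i = 0; i < P; i++)       /* all PEs work in parallel in hardware  */
            multiply_accumulate(i, held[i]);

        /* Rotate: send the held B block to the next processor on the ring. */
        float tmp[N][NB];
        int   last = held[P - 1];
        memcpy(tmp, Bblk[P - 1], sizeof tmp);
        for (int i = P - 1; i > 0; i--) {
            memcpy(Bblk[i], Bblk[i - 1], sizeof Bblk[i]);
            held[i] = held[i - 1];
        }
        memcpy(Bblk[0], tmp, sizeof Bblk[0]);
        held[0] = last;
    }
}

After P rotation steps every processing element has produced its full block of rows of C.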

Processor Architecture
[Block diagram: host PC connected over PCI to the FPGA board; multiply/add units paired with embedded SRAM units on an interconnection network (bus), backed by off-chip RAM units.]
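The abstract notes that extra buffers let the next input blocks be downloaded over PCI while the current product is computed, so the board can serve as a GEMM accelerator inside larger block algorithms. The sketch below shows one way a host-side driver could exploit that overlap; fpga_load_blocks, fpga_start, and fpga_wait_result are hypothetical placeholders, not the project's actual PCI interface.

/* Hedged sketch of a host-side blocked GEMM driver: while the FPGA multiplies
 * the block pair loaded last, the host downloads the next pair into the spare
 * buffers.  All fpga_* calls are hypothetical placeholders for the board's
 * PCI interface. */
void fpga_load_blocks(const float *A, const float *B,
                      int bi, int bj, int n, int nb);            /* download A_bi, B_bj */
void fpga_start(void);                                           /* multiply loaded pair */
void fpga_wait_result(float *C, int bi, int bj, int n, int nb);  /* read back C_bi,bj    */

void blocked_gemm(const float *A, const float *B, float *C, int n, int nb)
{
    int blocks = n / nb;                  /* n assumed divisible by nb */

    fpga_load_blocks(A, B, 0, 0, n, nb);  /* preload the first pair    */

    for (int i = 0; i < blocks; i++)
        for (int j = 0; j < blocks; j++) {
            fpga_start();                 /* compute C_ij = A_i * B_j  */

            /* overlap: download the next block pair while C_ij is computed */
            int ni = (j + 1 < blocks) ? i : i + 1;
            int nj = (j + 1 < blocks) ? j + 1 : 0;
            if (ni < blocks)
                fpga_load_blocks(A, B, ni, nj, n, nb);

            fpga_wait_result(C, i, j, n, nb);  /* retrieve the finished block */
        }
}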

Performance Estimate: Xilinx Virtex 2 (XC2V8000)
- 168 built-in multipliers and on-chip memories (SRAM) support 42 single-precision processing elements
- 4.11 ns pipelined 18 x 18 bit multiplier
- 2.39 ns memory access time
- 7.26 ns per multiply-accumulate
- Time for an n x n matrix multiply with p processors: 7.26 n^3 / p ns
- 11 ms for n = 400, p = 42, i.e. 11,570 MFLOPS
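As a cross-check, these figures follow directly from the per-MAC time; the short sketch below reproduces them using only the numbers on the slide.

/* Hedged sketch: reproduces the back-of-the-envelope estimate above.
 * The 7.26 ns multiply-accumulate time and p = 42 processing elements
 * come from the slide; the rest is arithmetic. */
#include <stdio.h>

int main(void)
{
    const double t_mac_ns = 7.26;    /* ns per multiply-accumulate       */
    const int    n = 400, p = 42;    /* matrix size, processing elements */

    double time_s = t_mac_ns * 1e-9 * (double)n * n * n / p;  /* 7.26 n^3 / p   */
    double mflops = 2.0 * n * n * n / time_s / 1e6;           /* 2 n^3 flops    */

    printf("time = %.1f ms\n", time_s * 1e3);    /* about 11 ms        */
    printf("rate = %.0f MFLOPS\n", mflops);      /* about 11,570 MFLOPS */
    return 0;
}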

APPLICATION, ALGORITHM & SCHEDULE
Application:
- Linear solver in the power-flow solution for large systems
- Newton-Raphson loops for convergence; Jacobian matrix
Algorithm & Schedule:
- Pre-permute columns
- Sparse LU factorization, forward and backward substitution
- Schedule: round-robin distribution of the rows of the Jacobian matrix according to the pattern of the column permutation
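One plausible reading of that round-robin schedule is sketched below: the row eliminated k-th under the precomputed column permutation is assigned to processing element k mod p. The names perm and owner are illustrative, not from the original design.

/* Hedged sketch: round-robin assignment of Jacobian rows to processors,
 * following the order given by the precomputed column permutation. */
void assign_rows(const int *perm, int n_rows, int n_procs, int *owner)
{
    for (int k = 0; k < n_rows; k++)
        owner[perm[k]] = k % n_procs;   /* k-th pivot row -> PE (k mod p) */
}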

Breakdown of Power-Flow Solution: Implementation in Hardware
[Flow diagram]
- Problem formulation (serial, on the host): data input, extract data, Ybus, Jacobian matrix
- Problem solution (parallel, in hardware): LU factorization, forward substitution, backward substitution
- Convergence test: if mismatch < accuracy, continue to post-processing; otherwise update the Jacobian matrix and repeat
- Post-processing (serial, on the host)
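A host-side view of that loop is sketched below. The function names (build_jacobian, fpga_lu_solve, and so on) are illustrative placeholders rather than the project's actual API; the FPGA performs the LU factorization and both substitutions inside fpga_lu_solve.

/* Hedged sketch of the Newton-Raphson power-flow loop implied by the flow
 * above.  All functions below are hypothetical placeholders. */
#include <stdlib.h>

void   compute_mismatch(const double *state, double *mismatch, int n);  /* serial, host */
void   build_jacobian(const double *state, int n);                      /* serial, host */
void   fpga_lu_solve(const double *mismatch, double *dx, int n);        /* parallel, FPGA */
double norm_inf(const double *v, int n);

void power_flow(double *state, int n, double tol)
{
    double *mismatch = malloc(n * sizeof *mismatch);
    double *dx       = malloc(n * sizeof *dx);

    compute_mismatch(state, mismatch, n);
    while (norm_inf(mismatch, n) >= tol) {
        build_jacobian(state, n);              /* serial, host                    */
        fpga_lu_solve(mismatch, dx, n);        /* LU + forward/backward on FPGA   */
        for (int i = 0; i < n; i++)            /* Newton update                   */
            state[i] += dx[i];
        compute_mismatch(state, mismatch, n);  /* serial, host                    */
    }
    /* post-processing (serial, host) follows */
    free(mismatch);
    free(dx);
}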

Minimum-Degree Ordering Algorithm (MMD)
- Reorder columns to reduce fill-in while performing LU factorization
- Reduces floating-point operations and storage
- Compute the column permutation pattern once; apply it throughout the power-flow analysis for that set of input bus data

Operation counts without and with MMD:

                 Without MMD                      With MMD
               Div      Mult      Sub        Div     Mult      Sub
IEEE 30-bus    887      25806     25806      270     6535      6535
IEEE 118-bus   4509     340095    340095     719     55280     55280
IEEE 300-bus   31596    5429438   5429438    3058    640415    640415

RING TOPOLOGY
[Block diagram: Nios embedded processors 0 through n-1 inside the FPGA, each with local RAM; memory controllers connect to off-chip SDRAM and SRAM; on-chip buffers 0 through n-1 link neighboring processors for interprocessor communication.]

Hardware Model using Nios Processors
- Communication: DMA used for passing messages
- Buffers are on-chip memories
[Block diagram: two Nios processors, each with a DMA engine; each DMA writes through an arbitrator (drawn as a trapezoid) into the neighbor's input buffer; SDRAM and SRAM sit behind the processors.]
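To make the message-passing mechanism concrete, the following is a hedged sketch of how one processor might push a message into its neighbor's on-chip buffer through DMA. dma_copy, the buffer layout, and the ready-flag convention are all hypothetical, standing in for the Nios DMA peripheral and the arbitrated buffers in the diagram.

/* Hedged sketch of DMA message passing between neighboring processors. */
#include <stdint.h>
#include <stddef.h>

#define BUF_WORDS 256

volatile uint32_t *buf_out;   /* this PE's outgoing buffer (placeholder address) */
volatile uint32_t *buf_in;    /* incoming buffer written by the left neighbor    */

/* Hypothetical DMA primitive: copy 'words' words and block until done. */
void dma_copy(volatile uint32_t *dst, const uint32_t *src, size_t words);

/* Send one message to the right-hand neighbor over the shared buffer. */
void ring_send(const uint32_t *msg, size_t words)
{
    dma_copy(buf_out, msg, words);      /* DMA engine moves the payload          */
    buf_out[BUF_WORDS - 1] = words;     /* last word doubles as a "ready" flag   */
}

/* Receive one message deposited by the left-hand neighbor. */
size_t ring_recv(uint32_t *msg, size_t max_words)
{
    while (buf_in[BUF_WORDS - 1] == 0)  /* poll the "ready" flag                 */
        ;
    size_t words = buf_in[BUF_WORDS - 1];
    if (words > max_words) words = max_words;
    for (size_t i = 0; i < words; i++)  /* small copy; DMA could also be used    */
        msg[i] = buf_in[i];
    buf_in[BUF_WORDS - 1] = 0;          /* clear the flag for the next message   */
    return words;
}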

Stratix FPGA (1S25F780C6) Resources

LEs                               25,660
M512 RAM blocks (32 x 18 bits)    224
M4K RAM blocks (128 x 36 bits)    138
M-RAM blocks (4K x 144 bits)      2
Total RAM bits                    1,944,576
DSP blocks                        10
Embedded multipliers              80
PLLs                              6

Floating Point Unit
- FADD pipeline: near path (predict and add, then leading-zero count and shift) in parallel with far path (swap and shift, then add); select and round
- FMUL pipeline: pre-normalize; multiply; post-normalize; round
- FDIV pipeline: pre-normalize; Newton-Raphson reciprocal (ROM lookup, iteration, multiply); round and post-normalize
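The FDIV unit computes 1/d by Newton-Raphson iteration seeded from a ROM lookup. A software model of that idea is sketched below; the crude algebraic seed stands in for the ROM table, and d is assumed positive and pre-normalized to [1, 2), as the significand would be after the pre-normalize stage.

/* Hedged sketch of the Newton-Raphson reciprocal behind FDIV.  The hardware
 * seeds the iteration from a ROM lookup; the algebraic seed here is only a
 * stand-in (the hardware obviously does not divide to obtain its seed). */
float nr_divide(float a, float d)          /* d assumed in [1, 2) */
{
    float x = 2.0f / (1.0f + d);           /* seed: stands in for the ROM lookup */

    /* x_{k+1} = x_k * (2 - d * x_k); the error is squared every iteration */
    for (int k = 0; k < 4; k++)
        x = x * (2.0f - d * x);

    return a * x;                          /* a / d = a * (1/d); rounding omitted */
}

Because the error roughly squares each iteration, a table seed accurate to about eight bits would need only two iterations to reach single precision, which keeps the hardware multiply chain short.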

IEEE-754 Support
- Single-precision format, round-to-nearest rounding
- Denormalized numbers, special numbers

Unit         LEs    MULs   Fmax (MHz)   Latency (clk cycles)   Pipeline rate (clk cycles)
FADD         1088   0      91           3                      1
FMUL         926    8      107          4                      1
FDIV         1166   8      65           9                      N/A
Nios + FPU   6134          63.88

PERFORMANCE ANALYSIS
Why?
- The prototype is not intended for industrial performance
- Show potential performance as a function of available hardware resources
Analysis performed for a high-performance multiprocessor system:
- Estimate of the number of clock cycles and timing: communication latency, arithmetic latency
- Model in MATLAB
- 80 MHz pipelined floating-point unit
Variables:
- Number of processors (2, 4, 6, 8)
- Size of input data (power-flow IEEE 1648-bus and IEEE 7917-bus cases)
- System constraints: memory access, size of FPGA chip, system speed

TIMING OF PERFORMANCE MODEL
[Plots: running time (ms) vs. number of processors for the 2982 x 2982 matrix from the IEEE 1648-bus system and the 14508 x 14508 matrix from the IEEE 7917-bus system, for Nios and PowerPC processors.]

Nios embedded processors on an Altera Stratix FPGA, running at 80 MHz (8 processors):
- 1648-bus: 64.179 ms
- 7917-bus: 1106.676 ms

400 MHz PowerPC on a Xilinx Virtex 2 FPGA with an 80 MHz FPU (8 processors):
- 1648-bus: 18.992 ms
- 7917-bus: 256.382 ms

WSMP (1.4 GHz, 256 KB cache, 256 MB SDRAM, Linux OS):
- 1648-bus: 146.435 ms
- 7917-bus: 666.487 ms