Sparse Linear Solver for Power System Analysis using FPGA


Sparse Linear Solver for Power System Analysis using FPGA

J. R. Johnson (Department of Computer Science, jjohnson@cs.drexel.edu), P. Nagvajara (Department of Electrical and Computer Engineering, nagvajara@ece.drexel.edu), C. Nwankpa (Department of Electrical and Computer Engineering, chika@nwankpa.ece.drexel.edu), Drexel University, Philadelphia, PA 19104.

This work was partially supported by DOE grant #ER63384, PowerGrid - A Computation Engine for Large-Scale Electric Networks.

Extended Abstract

Load flow computation and contingency analysis are the foundation of power system analysis. Numerical solutions to the load flow equations are typically computed using Newton-Raphson iteration, and the most time-consuming part of the computation is the solution of the sparse linear system needed for the update at each iteration. When an appropriate elimination ordering is used, direct solvers are more effective than iterative solvers. In practice these systems involve a large number of variables (50,000 or more); however, when the sparsity is exploited effectively, they can be solved in a modest amount of time (seconds). Despite the modest computation time for the linear solver, the number of systems that must be solved is large, and current computing platforms and approaches do not yield the desired performance. Because of the relatively small granularity of the linear solver, a coarse-grained parallel solver does not provide an effective means of improving performance. In this talk, it is argued that a hardware solution, implemented in FPGA using fine-grained parallelism, provides a cost-effective means of achieving the desired performance. Previous work [1, 2, 3] has shown that FPGAs can be used effectively for floating-point-intensive scientific computation. It was shown that high MFLOP rates could be achieved by utilizing multiple floating-point units,

Report Documentation Page (Standard Form 298, Form Approved OMB No. 0704-0188). Report date: 01 FEB 2005. Title and subtitle: Sparse Linear Solver for Power System Analysis using FPGA. Performing organization: Drexel University. Distribution/availability statement: approved for public release, distribution unlimited. Supplementary notes: see also ADM00001742, HPEC-7 Volume 1, Proceedings of the Eighth Annual High Performance Embedded Computing (HPEC) Workshops, 28-30 September 2004; the original document contains color images. Security classification: unclassified. Number of pages: 21.

and that FPGAs could outperform PCs and workstations, which run at much higher clock rates, on dense matrix computations. The current work argues that a similar benefit can be obtained for the sparse matrix computations arising in power system analysis. These conclusions are based on operation counts and system analysis for a collection of benchmark systems arising in practice. Benchmark data indicate that between 1 and 3 percent of peak floating-point performance was obtained using a state-of-the-art sparse solver (UMFPACK) running on a 2.60 GHz Pentium 4. The solve time for the largest system (50,092 unknowns and 361,530 non-zero entries) was 1.39 seconds. A pipelined floating-point core was designed for the Altera Stratix family of FPGAs. An instantiation of the core on an Altera Stratix with speed rating -5 operates at 200 MHz for addition and multiplication and 70 MHz for division, and there is sufficient room for 10 units. Assuming 100% utilization of eight FPUs, the projected solve time for the FPGA implementation is 0.069 seconds, a 20-fold improvement. While it is optimistic to assume perfect efficiency, hard-wired control should provide substantially better efficiency than is available on a standard processor. Moreover, analysis of the LU factorization shows that the average number of updates per row throughout the factorization is 20.3, which provides sufficient parallelism to benefit from eight FPUs. An implementation, and a more detailed model, is being carried out to determine the attainable efficiency.

References

[1] J. Johnson, P. Nagvajara, and C. Nwankpa. High-Performance Linear Algebra Processor using FPGA. In Proc. High Performance Embedded Computing (HPEC 2003), Sept. 22-24, 2003.

[2] K. Underwood. FPGAs vs. CPUs: Trends in Peak Floating-Point Performance. In Proc. International Symposium on Field-Programmable Gate Arrays (FPGA 2004), Feb. 22-24, 2004.

[3] L. Zhuo and V. K. Prasanna. Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs. In Proc. International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004.
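As a sanity check on the projection above, the claimed 20-fold improvement is simply the ratio of the measured UMFPACK solve time to the projected FPGA solve time (arithmetic on the figures quoted in the abstract; the 0.069 s figure assumes 100% utilization of eight FPUs, as stated):

```python
# Projected FPGA speedup for the largest benchmark system,
# using the figures quoted in the abstract.
software_time = 1.39   # seconds: UMFPACK on a 2.60 GHz Pentium 4
fpga_time = 0.069      # seconds: projected, eight fully utilized FPUs

speedup = software_time / fpga_time
print(f"Projected speedup: {speedup:.1f}x")  # roughly a 20-fold improvement
```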

Sparse Linear Solver for Power System Analysis using FPGA. Jeremy Johnson, Prawat Nagvajara, Chika Nwankpa. Drexel University. (Title-slide diagram with labels: PowerGrid Hardware, Power System Model, Tie-Line Flows, Engineering Analysis Station, Decision Making Station, Economic Analysis Station, Ultra-Fast Bus.)

Goal & Approach: To design an embedded FPGA-based multiprocessor system to perform high-speed Power Flow Analysis. To provide a single desktop environment that solves the entire Power Flow Problem (Multiprocessors on the Desktop). Solve the Power Flow equations using Newton-Raphson, with hardware support for sparse LU. Tailor the hardware design to the systems arising in Power Flow analysis.

Results: Software solutions (the sparse LU needed for Power Flow) on high-end PCs/workstations do not achieve efficient floating-point performance and leave substantial room for improvement. Coarse-grained parallelism will not significantly improve performance due to the granularity of the computation. An FPGA, with a much slower clock, can outperform PCs/workstations by devoting space to hardwired control and additional FP units, and by utilizing fine-grained parallelism. Benchmarking studies show that a significant performance gain is possible: a 10x speedup is achievable using existing FPGA technology.


Algorithm and HW/SW Partition (flow chart): HOST: Data Input → Extract Data → Ybus → Jacobian Matrix. Solver: LU Factorization → Forward Substitution → Backward Substitution. If mismatch ≥ accuracy, update the Jacobian matrix and repeat; once mismatch < accuracy, Post Processing on the HOST.
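The factor/solve steps in this loop can be illustrated with a minimal dense LU decomposition with partial pivoting in plain Python. This is a toy sketch for clarity only: the actual design operates on sparse matrices and overlaps the pivot, divide, and update phases in hardware.

```python
# Toy dense LU factorization with partial pivoting, followed by
# forward and backward substitution -- the same pivot/divide/update
# phases the hardware implements for the sparse Jacobian.

def lu_solve(A, b):
    n = len(A)
    A = [row[:] for row in A]  # work on a copy (L and U stored in place)
    b = b[:]
    for k in range(n):
        # Pivot: choose the row with the largest |entry| in column k.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            # Divide: scale the column below the pivot.
            A[i][k] /= A[k][k]
            # Update: eliminate row i using the pivot row.
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    # Forward substitution (L has an implicit unit diagonal).
    y = b[:]
    for i in range(n):
        for j in range(i):
            y[i] -= A[i][j] * y[j]
    # Backward substitution.
    x = y[:]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= A[i][j] * x[j]
        x[i] /= A[i][i]
    return x

A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
print(lu_solve(A, [3.0, 2.0, 3.0]))  # solution is approximately [1, 1, 1]
```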


Benchmark: data obtained from power systems of interest.

Source  # Bus   Branches/Bus  Order   NNZ
PSSE     1,648  1.58           2,982   21,682
PSSE     7,917  1.64          14,508  108,024
PJM     10,278  1.42          19,285  137,031
PJM     26,829  1.43          50,092  361,530
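These Jacobians are extremely sparse, which is why sparse data structures and elimination ordering matter so much; a quick calculation of each matrix's density (non-zeros as a fraction of a dense matrix of the same order), using the table above:

```python
# Density of each benchmark Jacobian: NNZ relative to a dense
# matrix of the same order, from the benchmark table above.
benchmarks = [
    ("PSSE 1,648-bus", 2_982, 21_682),
    ("PSSE 7,917-bus", 14_508, 108_024),
    ("PJM 10,278-bus", 19_285, 137_031),
    ("PJM 26,829-bus", 50_092, 361_530),
]
for name, order, nnz in benchmarks:
    density = nnz / order**2
    print(f"{name}: {100 * density:.3f}% dense")
```

Every benchmark is under 0.25% dense, and density falls as the systems grow.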

System Profile

# Bus   # Iter  #DIV     #MUL        #ADD        NNZ L+U
 1,648    6       43,876   1,908,082   1,824,380    108,210
 7,917    9      259,388  18,839,382  18,324,787    571,378
10,279   12      238,343  14,057,766  13,604,494    576,007
26,829   --      770,514  90,556,643  89,003,926  1,746,673

System Profile: more than 80% of rows/columns have size < 30. (Histograms of ROW_SIZE, with Std. Dev = 21.15, Mean = 20.3, N = 2981, and COL_SIZE, with Std. Dev = 15.68, Mean = 16.6, N = 2981.)

Software Performance

Software platform: UMFPACK on a Pentium 4 (2.6 GHz, 8 KB L1 data cache), Mandrake 9.2, gcc v3.3.1.

# Bus   Time      FP Eff
 1,648  0.07 sec  1.05%
 7,917  0.37 sec  1.33%
10,278  0.47 sec  0.96%
26,829  1.39 sec  3.45%

Hardware Model & Requirements
- Store row and column indices for the non-zero entries.
- Use column indices to search for the pivot.
- Overlap the pivot search, and the division by the pivot element, with row reads.
- Use multiple FPUs to perform simultaneous updates (the average column size provides enough parallelism for 8-32 units).
- Use a cache to store updated rows from iteration to iteration (70% overlap; 400 KB of memory for the largest system). The cache can also be used for prefetching.
- Total memory required: 22 MB (largest system).
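The 22 MB figure is consistent with a back-of-envelope estimate: one double-precision value plus an index per non-zero of L+U for the largest system. The 8-byte value and 4-byte index sizes here are assumptions; the slides do not give the exact on-board layout.

```python
# Rough memory footprint for the factored matrix of the largest
# benchmark (26,829 buses): one 8-byte double plus an assumed
# 4-byte packed index per non-zero of L+U.
nnz_lu = 1_746_673       # non-zeros in L+U, from the system profile
bytes_per_entry = 8 + 4  # double-precision value + index (assumed sizes)
total_bytes = nnz_lu * bytes_per_entry
print(f"~{total_bytes / 1e6:.0f} MB")  # on the order of the 22 MB quoted
```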

Architecture (block diagram): the FPGA contains the processing logic, a cache controller attached to an SRAM cache, and an SDRAM controller attached to SDRAM memory.

Pivot Hardware (block diagram with labels: physical index, pivot logic, read colmap, translate to virtual index, reject, memory read, FP compare, pivot index, pivot value, virtual index, column value, pivot column, pivot).

Parallel FPUs

Update Hardware (block diagram with labels: column word, update row, FMUL, FADD, memory read row, merge logic, select colmap, update write logic, submatrix row).
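A software analogue of one update unit's job clarifies what the FMUL/FADD and merge logic do: multiply the pivot row by the row's multiplier and merge it into the target sparse row, creating fill-in for columns not yet present. The dict-of-columns representation and function name here are illustrative only.

```python
# Software analogue of one update unit: merge a scaled pivot row
# into a target row, both stored sparsely as {column: value}.
def update_row(target, pivot_row, multiplier):
    out = dict(target)
    for col, val in pivot_row.items():
        # FMUL then FADD; the merge creates fill-in for new columns.
        out[col] = out.get(col, 0.0) - multiplier * val
    return out

row = {0: 2.0, 3: -1.0}
pivot = {0: 4.0, 1: 1.0, 3: 2.0}
print(update_row(row, pivot, 0.5))  # column 1 is fill-in
```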

Performance Model: a C program that simulates the computation (data transfer and arithmetic operations) and estimates the architecture's performance (clock cycles and seconds). Model assumptions:
- Sufficient internal buffers
- 100% cache write hits
- Simple static memory allocation
- No penalty on cache write-back to SDRAM
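A drastically simplified stand-in for such a model treats pivot and divide work as serial and fixed while update work splits across the update units. This is a sketch only, using the cycle counts reported in the GEPP breakdown; the real simulator also models data transfer and buffering, and the measured 4- and 16-unit update counts are somewhat higher than this ideal scaling predicts.

```python
# Minimal analytic stand-in for the cycle-accurate simulator:
# pivot and divide cycles are serial, update cycles split evenly
# across update units. Counts are from the GEPP breakdown table.
PIVOT_CYCLES = 259_401
DIVIDE_CYCLES = 96_439
UPDATE_CYCLES_1UNIT = 2_409_312  # update cycles with a single unit

def total_cycles(update_units):
    # Idealized: perfect scaling of the update phase, no overheads.
    return PIVOT_CYCLES + DIVIDE_CYCLES + UPDATE_CYCLES_1UNIT / update_units

for n in (1, 4, 16):
    print(f"{n:2d} update unit(s): {total_cycles(n):,.0f} cycles")
```

Even in this idealized form, the serial pivot and divide phases bound the achievable speedup, which is why the measured curves in the next slide flatten beyond a few update units.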

Performance (chart): speedup relative to the Pentium 4 vs. number of update units (2, 4, 8, 16), plotted per matrix size (Jac2k, Jac7k, Jac10k, Jac27k); the speedup axis spans 0.00-6.00.

GEPP Breakdown: relative portion of LU solve time by phase, and cycle counts, as a function of the number of update units.

Phase    1 Update Unit     4 Update Units   16 Update Units
Pivot      259,401 (10%)     259,401 (26%)    259,401 (43%)
Divide      96,439  (4%)      96,439 (10%)     96,439 (16%)
Update   2,409,312 (86%)     642,662 (64%)    248,295 (41%)
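The pie-chart percentages can be recovered from the cycle counts in the table; the recomputed shares match the reported figures to within a point of rounding, and they show the fixed pivot/divide work dominating as update units are added.

```python
# Recompute each phase's share of LU solve time from the cycle
# counts in the GEPP breakdown table above.
counts = {
    1:  {"Pivot": 259_401, "Divide": 96_439, "Update": 2_409_312},
    4:  {"Pivot": 259_401, "Divide": 96_439, "Update": 642_662},
    16: {"Pivot": 259_401, "Divide": 96_439, "Update": 248_295},
}
for units, phases in counts.items():
    total = sum(phases.values())
    shares = {p: round(100 * c / total) for p, c in phases.items()}
    print(f"{units:2d} update unit(s): {shares}")
```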