Adaptive Computing Systems (ACS) Domain for Implementing DSP Algorithms in Reconfigurable Hardware


John Zaino, Eric Pauer, Ken Smith, Paul Fiore, Jairam Ramanathan, Cory Myers
{john.c.zaino, ken.smith, paul.d.fiore, cory.s.myers}@lmco.com, pauer@bit-net.com, jramanat@alum.mit.

Fourth Biennial Ptolemy Miniconference, March 2001 (slides dated 03/07/01)

Objective/Approach/Process

Reconfigurable computing technology offers significant performance gains, e.g. 10X ops per watt and/or ops per cubic inch, over general-purpose programmable solutions without the need to develop custom hardware. Today, however, developing a working implementation requires hardware design expertise, and producing a good implementation requires many slow iterations between an algorithm designer and a hardware developer.

- Objective: reduce the design time for an initial implementation to hours and for an optimized implementation to days, for a range of signal processing applications.
- Approach: provide the algorithm developer with tools to help analyze algorithms, understand their implications for hardware, and rapidly implement their chosen solutions.
- In the process, isolate the algorithm developer from the hardware designer through a set of library elements that provide well-defined interfaces to both communities.
- Direct mapping of algorithm to adaptive computing system implementation.
- Automatic implementation.

Technical Attributes

- Development of the Adaptive Computing Systems domain under Ptolemy Classic
- Allows alternative implementations from the same dataflow graph
- Provides floating point simulation, fixed point simulation, C code generation and VHDL code generation
- Released first three versions of the ACS domain in Ptolemy Classic
- End-to-end capability to map a signal processing dataflow graph to a working reconfigurable computing implementation
- Design space exploration automated
- Bit width optimization theory (Markovian modeling) developed for algorithm analysis
- Bit width optimization tool implemented to trade signal-to-noise ratio versus hardware complexity
- Pipeline alignment and scheduling algorithms implemented
- Automatically generate algorithm-specific sequencer and memory control logic
- Uni-rate and multi-rate signal processing
- Single and multi-FPGA implementations
- Smart Generators: parameterizable algorithmic blocks

Analysis and Mapping in ACS Environment

[Figure: tool flow from a dataflow graph through a common database in Ptolemy to an adaptive computing resource. Labels by stage:]
- Algorithm Analysis: bit width analysis, noise distribution analysis, precision analysis, floating point simulation, fixed point simulation, algorithm rearrangement, alternative implementations, SNR analysis, functional approximations
- Algorithm Mapping: automatic scheduling, performance metrics, performance modeling, partitioning and mapping, timing and sizing estimation, scheduling, partitioning across multiple FPGAs, allocated functions
- Smart Generators: generator selection, device program, interface program, device programming, VHDL interface libraries

Design Approach

[Figure: design-time and run-time flow. A signal processing algorithm is represented in dataflow in the Ptolemy environment for analysis and simulation. Design-time labels: Algorithm Analysis (bit width analysis, noise distribution analysis, precision analysis, floating point simulation, fixed point simulation, algorithm rearrangement, alternative implementations), Algorithm Mapping (automatic scheduling, performance metrics, performance modeling, partitioning and mapping, allocated functions), Smart Generators (generator selection, VHDL interface libraries, logic generation, floorplanner, routing, device program), signal flow graph, common database in Ptolemy. Run-time labels: hardware configuration library, application interface generation, run-time manager, operating system, device driver, reconfigurable hardware, application software, compute libraries, host processor. Legend: enhanced or new capability; existing tool or hardware.]

Program Progress

Algorithm analysis
- Representation for alternative implementations was incorporated as part of Ptolemy integration
- Side-by-side simulation capability incorporated as part of Ptolemy integration
- Developed bit width optimization theory for algorithm analysis and extended it to include multiple devices and constraints
- Implemented wordlength optimization tool

Algorithm mapping
- Cost analysis included as part of wordlength analysis
- Implemented uni-rate and multi-rate pipeline alignment and scheduling algorithm for signal processing dataflow graphs
- One-to-one and one-to-many mapping of functions to blocks supported

Program Progress (continued)

Smart generators
- Implemented portable logic synthesis methodology with VHDL as first target
- Integrated Xilinx Core 4,000-series generators capability within VHDL code generation
- Implemented smart generators for state machine sequencer and memory control (address generator)

Ptolemy integration
- Released initial version of the Adaptive Computing Systems domain in Ptolemy in June 1998. Second release April 1999. Third release in August 2000.
- ACS domain supports alternative implementations from a common interface: floating point simulation, fixed point simulation, C code generation, and VHDL code generation.

Demonstration
- Selected Annapolis Micro Systems Wildforce™ board for demonstrations
- Established ACS demonstration environment for Solaris
- Integrated Wildforce™ board under Ptolemy
- Demonstrated Winograd-based FSK receiver and FFT-based signal detector
- Procured and installed Annapolis Micro Systems Wildstar™ board under Solaris
- SHARP/HRR (High Range Resolution Radar ATR) algorithm modeled; hardware development and testing nearing completion

ACS Domain

- Determined that extending old domains could not be justified
- New paradigm for Ptolemy, e.g. multiple implementations of a single star

ACS Domain

- New ACS domain to facilitate movement among simulation and code/design generation
- Corona contains the interface specification
- Core contains an implementation
- ACS stars are composed of one corona and multiple cores
- Core selection via targeting defines the implementation

[Figure: a corona attached to one core per target: Floating_Point Simulation Core, Fixed_Point Simulation Core, C Code Generation Core, FPGA Design Generation Core.]

Selecting Among Alternative Implementations

- Alternative implementations are represented as targets with cores for each star/functional block
- Targets can have parameters
- Floating point simulation, fixed point simulation, C code generation, and FPGA design generation are available.
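The corona/core split described above can be sketched in a few lines. This is only an illustration of the idea (one interface, several implementations, selection by target); the class names, target strings, and the `Gain` star are hypothetical and are not the Ptolemy Classic C++ API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Corona:
    """Interface specification shared by every implementation of a star."""
    name: str
    inputs: List[str]
    outputs: List[str]


@dataclass
class Star:
    """One corona plus one core (implementation) per target. Hypothetical model."""
    corona: Corona
    cores: Dict[str, Callable[..., object]] = field(default_factory=dict)

    def add_core(self, target: str, core: Callable[..., object]) -> None:
        self.cores[target] = core

    def fire(self, target: str, *args):
        # Selecting the target selects the core, and with it the implementation.
        if target not in self.cores:
            raise KeyError(f"star {self.corona.name!r} has no core for {target!r}")
        return self.cores[target](*args)


def quantize(x: float, frac_bits: int) -> float:
    """Round x onto a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def fixed_gain(x: float, k: float, frac_bits: int = 8) -> float:
    """Fixed-point model of a gain: quantize the operands, then the product."""
    return quantize(quantize(x, frac_bits) * quantize(k, frac_bits), frac_bits)


# A hypothetical Gain star with three of the four kinds of core named on the slide.
gain = Star(Corona("Gain", inputs=["in"], outputs=["out"]))
gain.add_core("floating_point_sim", lambda x, k: x * k)
gain.add_core("fixed_point_sim", fixed_gain)
gain.add_core("c_codegen", lambda in_name, k: f"out = {in_name} * {k};")

print(gain.fire("floating_point_sim", 0.5, 0.3))
print(gain.fire("fixed_point_sim", 0.5, 0.3))
print(gain.fire("c_codegen", "x", 0.3))
```

The same graph of stars can then be run under different targets without changing its connections, which is the property the slides attribute to the ACS domain.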

Implementation Trades

[Figure: multiple representations of an algorithm and the trade-offs among them. Labels: Algorithm Analysis (I, Q, Angle, Loop Filter, Quantize, Freq. Est.) drawn once with N delays, N mults and N adds, and once with N/3 and 2N/3 delays, scaling and 7 adds; Basic FIR, Systolic, Bit Serial (FA cells, Coeffs, Data), Low Power (Acc1, Acc2, Acc3); Reduce Taplength, Reduce Wordlength, Multirate.]

- Multiple representations
- Perform trade-offs: precision (float vs. fixed, wordlengths), speed, size/area, latency

FIR equations shown in the figure:

$Y_n = a_0 X_n + a_1 X_{n-1} + a_2 X_{n-2}$
$Y_{n+1} = a_0 X_{n+1} + a_1 X_n + a_2 X_{n-1}$
$Y_{n+2} = a_0 X_{n+2} + a_1 X_{n+1} + a_2 X_n$

Wordlength Optimization

[Figure: analysis of dynamic range, quantization noise (SNR), and hardware cost leading to optimal design choices.]
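To make the wordlength trade concrete, here is a minimal sketch that sweeps the number of fractional bits of a 3-tap FIR and reports SNR against a floating-point reference. It is a brute-force illustration, not the Markovian bit-width theory the slides refer to. The cost proxy (taps times wordlength squared, on the assumption that multiplier area grows with the product of operand widths), the coefficients, and the test signal are all assumptions made for the example.

```python
import math
import random


def quantize(x, frac_bits):
    """Round x onto a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def fir(coeffs, samples):
    """Direct-form FIR: y[n] = sum_k a[k] * x[n-k], with x[n] = 0 for n < 0."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, a in enumerate(coeffs):
            if n - k >= 0:
                acc += a * samples[n - k]
        out.append(acc)
    return out


def snr_db(reference, test):
    """Signal-to-noise ratio of test relative to reference, in dB."""
    signal = sum(r * r for r in reference)
    noise = sum((r - t) ** 2 for r, t in zip(reference, test))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)


def sweep(coeffs, samples, widths):
    """Return (wordlength, SNR in dB, cost) for each candidate wordlength."""
    ref = fir(coeffs, samples)
    rows = []
    for b in widths:
        qc = [quantize(a, b) for a in coeffs]
        qx = [quantize(x, b) for x in samples]
        qy = [quantize(y, b) for y in fir(qc, qx)]
        cost = len(coeffs) * b * b  # assumed hardware cost proxy
        rows.append((b, snr_db(ref, qy), cost))
    return rows


def cheapest(rows, min_snr_db):
    """Lowest-cost row that meets the SNR requirement, or None."""
    ok = [r for r in rows if r[1] >= min_snr_db]
    return min(ok, key=lambda r: r[2]) if ok else None


random.seed(0)
coeffs = [0.25, 0.5, 0.25]
samples = [random.uniform(-1, 1) for _ in range(2000)]
rows = sweep(coeffs, samples, range(4, 17, 2))
for b, snr, cost in rows:
    print(f"{b:2d} bits  SNR {snr:6.1f} dB  cost {cost}")
print("cheapest meeting 40 dB:", cheapest(rows, 40.0))
```

A real tool would assign a separate wordlength to each node rather than one global value, which is what makes the search space large enough to need automation.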

Algorithm Mapping

Objectives
- Performance modeling: provide feedback on utilization, throughput, efficiency, etc. Feedback should be used by algorithm analysis capabilities.
- Partitioning and mapping: break large dataflow graphs into groups and map those groups across multiple devices and across time
- Automatic scheduling: automatically determine firing sequence, optimal mappings and sequence of configurations

Progress
- Cost analysis included as part of wordlength analysis
- Implemented uni-rate and multi-rate pipeline alignment and scheduling algorithms
- Memory allocation support

[Figure labels: Signal Flow Graph, Performance Modeling, Common Database in Ptolemy, Automatic Scheduling, Performance Metrics, Partitioning and Mapping, Allocated Functions.]

Automatic Scheduling

Pipeline alignment and schedule determination are required for logic synthesis.

[Figure: four panels. "The algorithm dataflow graph": input and output ports, nodes N1 to N8, and instances I2 to I6, each annotated with its pipeline delay P. "Modified algorithm dataflow graph": the same graph with memories, load enables (LDEN), a 2-MUX with SEL, a delay, and instance I7 added to the netlist by the sequencer generator. "FPGA datapath and variable locations": A, B, C in RAM bank 1 and D, E in RAM bank 2. "Final algorithm schedule": node activation sequence over N1 to N9, SEL, LD1 to LD3, PORT1, PORT2. Legend: I = instance, N = node, P = pipeline delays.]
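Pipeline alignment on a dataflow graph can be illustrated with a longest-path pass: each block starts when its slowest input arrives, and faster inputs receive extra delay registers. The sketch below shows that standard technique under the assumption of a uni-rate, acyclic graph; it is not taken from the ACS implementation, and the graph, node names, and delays are made up for the example.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def align_pipelines(latency, edges):
    """Balance path delays in an acyclic, uni-rate dataflow graph.

    latency: dict mapping node -> pipeline delay in clock cycles
    edges:   list of (src, dst) connections
    Returns (start, padding): the clock cycle at which each node receives
    aligned inputs, and the delay registers to insert on each edge.
    """
    preds = {node: set() for node in latency}
    for src, dst in edges:
        preds[dst].add(src)

    start = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        start[node] = max(
            (start[p] + latency[p] for p in preds[node]), default=0
        )

    # An edge whose data arrives early is padded up to the consumer's start.
    padding = {
        (src, dst): start[dst] - (start[src] + latency[src])
        for src, dst in edges
    }
    return start, padding


# Hypothetical graph: A has a 2-cycle pipeline, the other blocks 1 cycle.
latency = {"A": 2, "B": 1, "C": 1, "D": 1}
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("A", "D")]
start, padding = align_pipelines(latency, edges)
print(start)
print(padding)
print("total latency:", start["D"] + latency["D"])
```

In this example the B-to-C and A-to-D connections each need one extra register, and the start times double as a firing schedule for a sequencer.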

Processing Model

- Well-matched to the Ptolemy Synchronous Dataflow (SDF) domain
- Unit or block token produce and consume amounts
- Netlist structure determines execution order constraints
- Pipeline delay information required to determine absolute timing
- Delays are set to align pipelines for maximum throughput
- Delay can be automatically determined from block parameters
- Combination of fully synchronous model and tagged synchronous models
- No handshaking or tags, but data is not always valid
- Data validity is implicit in timing of latch signals
- Memory access fits the same model
- Data from common memory demuxed into separate streams running at a lower rate
- Data to common memory multiplexed to a single port
- Multiple FPGAs introduce additional pipeline delays
- Multi-rate parameterized execution

Smart Generators

Objectives
- Parameterized libraries: generate node implementations for specified bit widths and parameter values
- Hierarchical representations: provide generators that can recursively call other generators
- Interface generation: automatically generate software to move data between the general-purpose processor and the reconfigurable platform and to manage sequences of configurations
- General synthesis: provide a device-independent representation of the implementation

Progress
- Implemented portable logic synthesis methodology with VHDL as first target
- Integrated Xilinx Core Generators (4000 Series) capability within VHDL code generation
- Implemented smart generators for state machine and memory control
- Hierarchical generation

[Figure labels: Allocated Functions, Common Database in Ptolemy, Generator Selection, VHDL Interface Libraries, Device Programming, Adaptive Computing Resource.]
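As a rough picture of a parameterized generator, the sketch below emits VHDL text for a registered adder of a requested bit width and reports the pipeline delay a scheduler would need. The function name, the `GeneratedBlock` record, and the emitted entity are hypothetical; the actual ACS smart generators and their output format are not shown in the slides.

```python
from dataclasses import dataclass


@dataclass
class GeneratedBlock:
    """What a generator hands back: HDL text plus data for the scheduler."""
    name: str
    vhdl: str
    out_width: int
    pipeline_delay: int  # clock cycles from input to registered output


def generate_adder(width: int) -> GeneratedBlock:
    """Emit a registered signed adder for the given operand width (hypothetical)."""
    if width < 1:
        raise ValueError("width must be at least 1 bit")
    name = f"add{width}"
    out_width = width + 1  # one growth bit so the sum cannot overflow
    vhdl = f"""library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity {name} is
  port (
    clk : in  std_logic;
    a   : in  signed({width - 1} downto 0);
    b   : in  signed({width - 1} downto 0);
    sum : out signed({out_width - 1} downto 0)
  );
end entity {name};

architecture rtl of {name} is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      sum <= resize(a, {out_width}) + resize(b, {out_width});
    end if;
  end process;
end architecture rtl;
"""
    return GeneratedBlock(name, vhdl, out_width, pipeline_delay=1)


block = generate_adder(8)
print(block.vhdl)
print("pipeline delay:", block.pipeline_delay)
```

Returning the pipeline delay alongside the HDL matches the slide's point that delay can be determined automatically from block parameters.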

Multi-FPGA Capability

Design generation for single or multiple FPGAs.

[Figure: a single-FPGA implementation (FPGA logic) beside a multi-FPGA implementation (logic and routing views for FPGA 1 through FPGA 4).]

Winograd DFT-Based FSK Communications Receiver

[Figure: FPGA implementation.]

Results from FPGA-target / Back-end Tools

[Figure: generated VHDL, generated schedule, and FPGA design.]

Hardware-in-the-Loop

[Figure: an SDF galaxy containing an SDF Wildforce™ star.]

The SDF Wildforce™ star executes the complete FPGA design in hardware on the Annapolis Wildforce FPGA board.

Processing Results

[Figure: processing results.]

SHARP*/HRR Algorithm

* System-oriented High Range Resolution (HRR) Automatic Recognition Program

[Figure: a test vector and a template vector (one per target, per azimuth, per elevation) pass through non-linearity, shift, and least squares fit stages to give a modeling error.]

The least squares fit can be done with correlation if the templates are suitably pre-processed.

Algorithm
- Given a test vector
- For each template
- For each shift
- Compute the least squares error
- Select the template with minimum error

Complexity
- 70 data points per vector
- Number of shifts = (in range)
- Number of templates = 3,600/class
- 86 sec/class for shifts on a Sun Ultra 5 (360 MHz) workstation
- Expect 30x improvement
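The template search listed above can be sketched as a nested loop over templates and shifts. The sketch assumes the least squares fit is over a single scalar gain and that shifts are circular; neither detail is stated in the slides, and the non-linearity stage is omitted. Under the gain assumption the error reduces to an expression in the correlation between test vector and template, which is one way to read the remark that the fit can be done with correlation.

```python
def shifted(vec, s):
    """Circularly shift a list right by s samples (circular shift is an assumption)."""
    s %= len(vec)
    return vec[-s:] + vec[:-s] if s else list(vec)


def ls_error(test, template):
    """Residual energy after fitting test ~ g * template with the best scalar g.

    Minimizing sum((x - g*t)^2) over g gives g = <x,t>/<t,t>, leaving
    error = <x,x> - <x,t>^2 / <t,t>, so only a correlation is needed.
    """
    tt = sum(t * t for t in template)
    xt = sum(x * t for x, t in zip(test, template))
    xx = sum(x * x for x in test)
    return xx - (xt * xt) / tt if tt else xx


def classify(test, templates, shifts):
    """Return (error, label, shift) of the best template over all shifts."""
    best = None
    for label, template in templates.items():
        for s in shifts:
            err = ls_error(test, shifted(template, s))
            if best is None or err < best[0]:
                best = (err, label, s)
    return best


# Made-up 8-point profiles; real vectors and template sets are far larger.
templates = {
    "a": [0, 1, 3, 1, 0, 0, 0, 0],
    "b": [0, 2, 2, 2, 2, 0, 0, 0],
}
test = [2 * v for v in shifted(templates["a"], 2)]  # scaled, shifted copy of "a"
err, label, shift = classify(test, templates, range(-3, 4))
print(label, shift, err)
```

If each template is pre-scaled to unit energy, the division disappears and the inner loop becomes a plain multiply-accumulate, which maps naturally onto an FPGA datapath.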

SHARP/HRR Algorithm

[Figure: schedule and FPGA design for the NORMALIZATION block and for the CORRELATION block; normalization results ( vs. SW); correlation results across range shifts (typical expected).]

ACS Tools - Facts and Figures

- 23 functional blocks (ACS stars) developed
- ~100 lines of code needed for a new block/star
- Two ACS architectures supported: Wildforce™ (4062XL) and Wildstar™ (XCV1000) (in progress)
- ~6,000 lines of C++ code developed
- ~5 min to generate VHDL for a five-FPGA design
- Explore ~1000 bit-width combinations in 1 minute
- Ptolemy Classic runs under Solaris and Linux

Algorithm Analysis and Mapping Environment for Adaptive Computing Systems. Statement of the Problem

Algorithm Analysis and Mapping Environment for Adaptive Computing Systems Eric Pauer, Cory Myers, Ken Smith, and Paul Fiore {pauer,cory,jmsmith,pfiore}@sanders.com Sanders, a Lockheed Martin Company Nashua,