High Performance DoD DSP Applications

Size: px

Start display at page:

Download "High Performance DoD DSP Applications"

Nora Wiggins
5 years ago
Views:

1 High Performance DoD DSP Applications Robert Bond Embedded Digital Systems Group 23 August 2003 Slide-1

2 Outline DoD High-Performance DSP Applications Middleware (with some streaming constructs) Future Directions Summary Slide-2

3 Embedded Signal Processing Requirements TFLOPS REQUIREMENTS INCREASING BY AN ORDER OF MAGNITUDE EVERY 5 YEARS Airborne Radar Shipboard Surveillance UAV Missile Seeker 0.01 SBR YEAR Small Unit Operations SIGINT EMBEDDED PROCESSING REQUIREMENTS WILL EMBEDDED PROCESSING REQUIREMENTS WILL REQUIRE TFOPS IN THE TIME FRAME REQUIRE TFOPS IN THE TIME FRAME hpec_01-1 Slide-1 SC2002 KT Tutorial

Embedded Processing Spectrum Applications vs.

Reconfigurable Computing With FPGAs ASIC -

Custom VLSI Multi-Chip Modules GOPS / cu. ft.

Reconfigurable Programmable 1 10 100 1,000

Airborne Shipboard Small Unit Operations

Processors (PSP) address a wide range of of

4 Embedded Processing Spectrum Applications vs. Digital Technology Programmable Processors Reconfigurable Computing With FPGAs ASIC - Standard Cell (Conventional Packaging) Full Custom VLSI Multi-Chip Modules GOPS / cu. ft. 100,000 10,000 1, Slide-3 Custom Reconfigurable Programmable ,000 10, ,000 MOPS / Watt Space Seeker UAV Airborne Shipboard Small Unit Operations SIGINT High performance Programmable Signal Processors (PSP) address a wide range of of applications Efficient Stream Processing is is the challenge!

5 Radar Stream Signal Processing Algorithms GMTI (STAP) Algorithm GMTI Data Flow Corner Turn Corner Turn Corner Turn SAR Algorithm SAR Data Flow Corner Turn Corner Turn GMTI Functional Pipeline SAR Functional Pipeline Data Input Beamformer Filtering (PC & DF) Detection ( STAP & CFAR ) Data Output Data Input Motion Compensation, Pulse Compression Polar Resampling 2D FFT Data Output Raw Input Data Scheduler RADAR Control + Timing Platform State Information Target Detections Sorted by Range & Doppler Raw Input Data Scheduler RADAR Control + Timing Platform State Information SAR Image(s) Synthetic Aperture Radar (SAR): ~ GOPS ~ s latency Ground Moving Target Indicator (GMTI) ~ GOPS ~ 0.1-1s latency Space-Time Adaptive Processing (STAP) ~ GOPS ~ s latency Slide-4

6 Missile Seeker Stream Video Processing 256x256 Focal Plane Array Integrator 60 FPS Guidance Background Estimation Adaptive Nonuniformity Correction Aimpoint Selection (PH) Dither Processing lmu Multi-Frame Processing Bulk Filter (PH) From other sensors CFAR Detection Latency ~ 1mS Target Selection (Discrimination) Data Fusion Feature Extraction Tracking Subpixel Processing Slide-5... Target update from surface based radar Stream Video Processing Back-end Processing Processing Block Opcount (MFLOPS) NUC Dither Subtraction Dempster-Shafer Plausibility Computation Total

7 Outline DoD High-Performance DSP Applications Middleware (with some streaming constructs) Some Key Features Future Directions Summary Slide-6

8 Mappable Stream Processing Decomposable into s (computations) and Conduits (communications) Beamform X OUT = w *X IN Conduit Filter X OUT = FIR(X IN ) Conduit Mappable to different sets of hardware Detect X OUT = X IN >c Measurable resource usage of each mapping Slide-7 Each compute stage can be mapped to different sets of hardware and timed

Parallel Vector Library Layered Architecture

Vector Library stream Grid F(s) Map Conduit

Communication Primitives Hardware Intel Cluster

9 Parallel Vector Library Layered Architecture Application Input Stream Processing Output Parallel Vector Library stream Grid F(s) Map Conduit Distribution Library API Library ISA Math Primitives Communication Primitives Hardware Intel Cluster Workstation PowerPC Cluster Embedded Board Embedded Multi-computer Slide-8

10 Parallel Vector Library Components Class Description Parallelism Stream Processing & Control Stream Vector/Matrix Function Conduit Streams organized as matrices or vectors that span multiple processors Performs signal/image processing functions on matrices/vectors (e.g. FFT, FIR, QR) Supports algorithm task decomposition (the nodes in a signal flow diagram) Hierarchical Supports stream movement between tasks (interconnection in stream flow graph) Data Data & & Pipeline & Pipeline Mapping Map Grid Slide-9 Specifies how s, Streams, and functions are distributed over processors Organizes processors into a 2D layout (virtual machine model) - Hierarchical Data, & Pipeline

11 Some Key Features The application stream flow is defined by tasks interconnected by conduits. Defines an application macro-architecture Presents opportunities for optimization at construction time s can contain other tasks and conduits Hierarchy helps to handle complex applications Conduits can reorganize data and handle buffering Corner turns, multi-rates, different granularities Streams flow through conduits and are transformed in tasks by functions. Streams are resizable with fundamental epochs defined by vector or matrix dimensions. They can be produced and consumed piecemeal. Function implementations specialize based on stream arguments (and maps.) All objects are mappable and hence implicitly parallel. Functionality is map invariant Scalable and portable (virtual machines) Re-mapping gives support for fault recovery and for automated performance tuning. Slide-10

12 Outline DoD High-Performance DSP Applications Middleware (stream language prototyping) Future Directions Summary Slide-11

13 Future Directions Standardization of parallel signal/image processing middleware: HPEC-Software Initiative (VSIPL++) Generative programming techniques Efficient implementation of multi-expression program blocks Breaking the abstraction barrier without using streams Extensions of middleware to heterogeneous platforms FPGAs, DSPs, Microprocessors Run-time optimal mapping Managing latency, throughput, power, Slide-12

14 Generative Programming Approach: Expression Templates + Comma Operator Problem: Abstraction Boundary Related expressions: Vector/Matrix expressions sharing common variables OUT = IN > THRESH THRESH = 0.9 * THRESH * IN Efficient implementation requires that data used in more than one expression remain in register/cache How to do this and maintain high-level API? Hand-coded for loop works, but is too low level Expression templates (PETE) can build a compilable expression for a single statement, but cannot span multiple statements Solution: PETE combined with Comma Operator Approach: Use comma operator to build expression list PETE Expression list destructor triggers evaluation when semicolon is reached out = in > thresh, thresh &= 0.9 * thresh * in; Slide-13

15 Optimal Mapping of Stream Algorithms Input Low Pass Filter Beamform Matched Filter X IN IN X IN IN FIR1 FIR2 X OUT X IN IN mult X OUT X IN IN FFT IFFT X OUT W 1 W 2 W 3 W 4 1 PE 2 PE 3 PE 4 PE frames/sec (1.6 MHz BW) 66 frames/sec (3.2 MHz BW) Slide-14 Example - Find: Min(latency #CPU) Max(throughput #CPU Example - Use: Genetic Algorithm Decision Directed Learning

16 Predicted and Achieved Latency and Throughput 1.5 Problem Size Small (48x4K) Large (48x128K) Latency (seconds) Throughput (frames/sec) #CPU predicted achieved predicted achieved predicted achieved predicted achieved #CPU Graph Solution selects correct optimal mapping Good agreement between predicted and achieved latencies and throughputs Slide-15

17 Summary High-performance stream processing for DoD applications requires 100 s GOPS in stringent form factors. Programmable solutions are parallel processors Computational efficiency is paramount Middleware approaches emerging today have attributes of stream languages except: Lack rigorous formal definitions (and are incomplete) Do not have good compiler support (efficiency issues) BUT: they provide high productivity compared to traditional approaches AND better lifecycle support is in the business of evaluating new technologies for DoD signal and image processing applications Middleware, languages, FPGAs, VLSI, DSP, PCA, Slide-16

18 PETE Review Step 1: Operators Form expression A= + A=B+C*D B * BinaryNode<OpAdd, float*, Binary Node <OpMul, float*, float*>> C D Vector Operator = it=begin (); for (int i=0; i<size(); i++) { Step 2: = operator evaluates expression Action performed at leaves Action performed at internal nodes OpAdd + OpCombine User specifies what to store at the leaves Dereference Leaf } *it=foreach(expr, DereferenceLeaf(), OpCombine() ); foreach (expr, IncrementLeaf(), NullCombine() ); it++; *it= *bit + *cit * *dit; bit++; cit++; dit++; Increment Leaf PETE ForEach: Recursive descent traversal of expression - User defines action performed at leaves - User defines action performed at internal nodes Translation at compile time - Template specialization - Inlining Slide-17

19 PETE and the Comma Operator Step 1: Operators form expression list, out = in > thresh, thresh = 0.9*thresh + 0.1*in; out = in > thresh thresh = * + * 0.9 thresh 0.1 in Expression list destructor if (isbasenode) { for (int i=0; i<size; i++) { } } Step 2: Expression list destructor triggers expression evaluation Comma orders evaluation from left to right = operator + OpCombine foreach(expr, DereferenceLeaf(), OpCombine() ); foreach(expr, IncrementLeaf(), NullCombine() ); *out = *in1 > *thresh1; *thresh2 = 0.9 * *thresh * *in2; out++; in1++; thresh1++; thresh2++; thresh3++; in2++; PETE ForEach changes: - = operator behavior with OpCombine: assign from RHS to LHS -, operator behavior: first evaluate left expression, then evaluate right expression Slide-18

20 Scalable Approach #include <Vector.h> #include <AddPvl.h> void addvectors(amap, bmap, cmap) { Vector< Complex<Float> > a( a, amap, LENGTH); Vector< Complex<Float> > b( b, bmap, LENGTH); Vector< Complex<Float> > c( c, cmap, LENGTH); b = 1; c = 2; a=b+c; Single Processor Mapping A = B + C Multi Processor Mapping A = B + C } Single processor and multi-processor code are the same Maps can be changed without changing software High level code is is compact Slide-19

21 Corner Turn Operation Filter Beamform Original Data Matrix Channels Processor Channels Cornerturned Data Matrix Time Samples Time Samples All-to-all communication Slide-20 Half the data moves across the bisection of the machine

22 Anatomy of a PVL Map Vector/Matrix Stream Function Conduit All PVL objects contain maps Map PVL Maps contain Grid List of nodes Distribution Overlap {0,2,4,6,8,10} List of Nodes Grid Distribution Overlap Slide-21

Kernel Benchmarks and Metrics for Polymorphous Computer Architectures

PCAKernels-1 Kernel Benchmarks and Metrics for Polymorphous Computer Architectures Hank Hoffmann James Lebak (Presenter) Janice McMahon Seventh Annual High-Performance Embedded Computing Workshop (HPEC)