Kernel Benchmarks and Metrics for Polymorphous Computer Architectures

1 Kernel Benchmarks and Metrics for Polymorphous Computer Architectures (PCAKernels-1)
Hank Hoffmann, James Lebak (Presenter), Janice McMahon
Seventh Annual High-Performance Embedded Computing Workshop (HPEC), 24 September 2003
This work is sponsored by the Defense Advanced Research Projects Agency under Air Force Contract F C. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

2 Future Warfighting Scenarios
Targeting mission cycle: Detection, Location, Identification, Target Nomination, Weapon Selection, Targeting, Attack, Assessment
Examples: SIGINT, Communication Satellite, Airborne Vehicle, Surveillance Satellite, Communication Antenna, Aegis Cruiser, Personal Terminals, Micro UAV
Key program goal: re-configurable embedded processor approach to provide multi-mission capability

3 Polymorphous Computing
Stream processing: regular, deterministic operations; constant flow of input data
Threaded processing: complex operations; dynamic data movement
Set of homogeneous computing tiles (diagram labels: P, M, SIMD, Distributed Cache, Systolic, Dedicated co-processors)
1 morph \morf\ n : re-structuring of tiles for optimized processing
2 morph \morf\ vt : to re-structure tiles for optimized processing

4 Architectural Flexibility
Radar processing flow: Front end (Signal Processing, Detection/Estimation) -> Back end (Discrimination/Identification, Command Control)
Benchmarks: Signal Processing Benchmark 1, Signal Processing Benchmark 2, Information Processing Benchmark, Knowledge Processing Benchmark, Intelligence Processing Benchmark
Operation types: Structured Bit-operations, Vectors/Streaming, Dynamic/Threading, Symbolic Operations
Processor classes: Specialized Class, DSP Class, PPC Class, Server Class, PCA
[Figure: performance of each processor class across the benchmarks]

5 Outline
Introduction
Kernel Benchmarks and Metrics
Programming PCA Architectures
Case Study: SVD Kernel
Conclusions

6 Kernel Synthesis from Application Survey
Specific application areas: Radar, Sonar, Infrared, Hyper-Spectral, SIGINT, Communication, Data Fusion
Broad processing categories:
  Front-end processing: data independent, stream-oriented; signal processing, image processing, high-speed network communication. Examples: pulse compression, adaptive beamforming, target detection
  Back-end processing: data dependent, thread-oriented; information processing, knowledge processing. Examples: workload optimization, target classification
Specific kernels:
  Signal/Image Processing: FIR Filter, SVD, CFAR Detection
  Communication: Corner Turn
  Information/Knowledge Processing: Graph Optimization, Pattern Recognition, Real-time Database Operations
MIT-LL surveyed DoD applications to provide kernel benchmark definitions and example requirements and data sets

7 Kernel Performance Evaluation
Kernel benchmarks:
  Signal/Image Processing: FIR Filter, SVD, CFAR Detection
  Communication: Corner Turn
  Information/Knowledge Processing: Graph Optimization, Pattern Recognition, Real-time Database Operations
Performance metrics: floating point and integer ops; latency; throughput; efficiency; stability; density and cost (size, weight, power)
Definitions:
  Throughput = Workload (FLOPS or OPS) / Execution time (seconds)
  Efficiency = Throughput / Hardware Peak
  Stability = MIN(Throughput) / MAX(Throughput)
Architectures: PowerPC (G4), RAW, Smart Memory, TRIPS, MONARCH

8 Throughput-Stability Product: A New Kernel Metric
Throughput = Workload (FLOPS or OPS) / Execution time (seconds)
Interval stability = MIN_I(Throughput) / MAX_I(Throughput)
Throughput x Stability
  rewards consistent high performance
  penalizes lack of performance or lack of consistency
For a given application, PCA processors should achieve a higher product of throughput and stability than conventional processors
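
The definitions on this slide can be turned into a few lines of code. The sketch below is only an illustration: the function names and the two throughput series are hypothetical, and pairing the mean throughput of an interval with that interval's stability is an assumption, since the slide does not say which throughput value enters the product.

```python
def throughput(workload_ops, exec_time_s):
    """Throughput = workload (operations) / execution time (seconds)."""
    return workload_ops / exec_time_s


def interval_stability(throughputs):
    """Stability over an interval I: MIN_I(throughput) / MAX_I(throughput)."""
    return min(throughputs) / max(throughputs)


def throughput_stability_product(throughputs):
    """Mean throughput over the interval times the interval stability.

    Using the mean as the throughput term is an assumption of this sketch.
    """
    mean_throughput = sum(throughputs) / len(throughputs)
    return mean_throughput * interval_stability(throughputs)


# Hypothetical throughput measurements (MFLOPS) at three data set sizes.
steady = [900.0, 1000.0, 950.0]   # consistent performance
peaky = [1800.0, 400.0, 1700.0]   # higher peak, but a large drop

print(throughput_stability_product(steady))
print(throughput_stability_product(peaky))
```

With these made-up numbers the steady series scores higher than the peaky one even though its peak is lower, which is the behavior the metric is meant to reward.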

9 Outline
Introduction
Kernel Benchmarks and Metrics
Programming PCA Architectures
Case Study: SVD Kernel
Conclusions

10 High Performance Programming: Conventional vs. PCA Processors
PowerPC (G4)
  Characteristics: rigid memory hierarchy; rigid datapath; specialized structures
  High performance programming: change algorithm to match memory hierarchy; one degree of freedom; can only work with blocking factor
Raw
  Characteristics: flexible memory hierarchy; flexible datapath(s); generic structures
  High performance programming: co-optimize algorithm and architecture; many degrees of freedom; optimize time/space tradeoff
PCA provides more degrees of freedom, and thus greater flexibility (morphability) and greater performance over a range of applications

11 Kernel Benchmarks and the PowerPC G4
PowerPC G4 7410 specs: 500 MHz clock rate; 4 Gflop/s peak; 125 MHz main memory bus; L1 cache: 32 KB, on chip; L2 cache: 2 MB, 250 MHz bus; Mercury daughtercard
Two predictors of kernel performance:
  Programmer's maximization of data reuse and locality (blocking factor)
  Memory hierarchy of G4
Blocking factor determines max achieved performance
Memory hierarchy determines shape of performance curve
Want to maximize blocking factor to limit memory hierarchy bottleneck

12 FIR Filter (G4)
[Figure: FIR Filter Throughput (MFLOPS/sec) and FIR Throughput x Stability versus vector length, with Level 1 and Level 2 cache boundaries marked; number of filters = 4, filter size = 16]
PowerPC G4 (Mercury), 500 MHz, peak: 4 GFLOPS/sec
Mean efficiency: 29% (implemented with VSIPL real FIR filter)
Caches are performance bottlenecks
  Performance curve changes when cache is full
  Product metric penalizes G4 for performance drop at cache boundaries
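
For readers unfamiliar with the kernel, a real FIR filter can be sketched in plain Python as below. This is not the VSIPL routine that was measured; it is a direct-form loop written for illustration. The workload formula (one multiply and one add per tap per output sample) is an assumption of this sketch and not necessarily the operation count used in the benchmark definition.

```python
def fir_filter(x, h):
    """Direct-form real FIR filter: y[n] = sum_k h[k] * x[n - k], zero initial state."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


def fir_workload_flops(num_filters, filter_size, vector_length):
    """Assumed workload: 2 flops (multiply + add) per tap per output sample."""
    return 2 * num_filters * filter_size * vector_length


def efficiency(achieved_flops_per_s, peak_flops_per_s):
    """Efficiency = achieved throughput / hardware peak."""
    return achieved_flops_per_s / peak_flops_per_s


print(fir_filter([1.0, 2.0, 3.0], [1.0, 1.0]))
# Slide parameters: 4 filters of size 16; the vector length 1024 is a hypothetical choice.
print(fir_workload_flops(4, 16, 1024))
```

Under the efficiency definition used in this talk, the 29% mean efficiency quoted above corresponds to a mean throughput of about 1.16 GFLOPS against the 4 GFLOPS peak.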

13 Baseline Performance Measurements: Throughput and Stability
[Figure: throughput, data set stability, and overall stability for the baseline kernels]
PowerPC G4 (Mercury), 500 MHz, 32 KB L1, 2 MB L2, peak: 4 GFLOPS/sec
Data set stability: ratio of minimum to maximum over all data set sizes for a particular kernel
Overall stability: ratio of minimum to maximum over all floating-point kernels and all data set sizes
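
The difference between the two stability measures is only the set of measurements over which the minimum and maximum are taken. A minimal sketch, assuming throughput is the quantity being compared and using hypothetical kernel names and numbers:

```python
def data_set_stability(throughputs_by_size):
    """Min/max throughput over all data set sizes of one kernel."""
    return min(throughputs_by_size) / max(throughputs_by_size)


def overall_stability(throughputs_by_kernel):
    """Min/max throughput over every kernel and every data set size."""
    all_values = [t for values in throughputs_by_kernel.values() for t in values]
    return min(all_values) / max(all_values)


# Hypothetical throughputs (MFLOPS) per kernel, one entry per data set size.
measurements = {
    "fir": [1200.0, 600.0, 800.0],
    "svd": [400.0, 300.0, 360.0],
}

for kernel, values in measurements.items():
    print(kernel, data_set_stability(values))
print("overall", overall_stability(measurements))
```

Because the overall figure compares the slowest point of any kernel with the fastest point of any kernel, it can never exceed the smallest data set stability.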

14 Stream Algorithms for Tiled Architectures
Systolic morph (diagram axes: time, space):
  M(R) edge tiles are allocated to memory management
  P(R) inner tiles perform computation systolically using registers and static network
Stream algorithm efficiency:
  $E(N,R) = \frac{C(N)}{T(N,R)\,\bigl(P(R) + M(R)\bigr)}$
  where N = problem size, R = edge length of tile array, C(N) = number of operations, T(N,R) = number of time steps, P(R) + M(R) = total number of processors
Compute efficiency condition:
  $\lim_{\sigma, R \to \infty} E(\sigma, R) = 1$, where $\sigma = N/R$
Stream algorithms achieve high efficiency by optimizing the time/space tradeoff, tailoring memory hierarchy and datapaths to specific needs of application

15 Time Domain Convolution on RAW
RAW chip with R rows and R+2 columns:
  Number of filters = R
  Number of memory tiles: M = 2*R
  Number of processing tiles: P = R^2
Diagram: manage input vectors -> systolic array for K-tap filter -> manage output vectors
Each row performs a number of K-tap filters
Stream algorithms achieve high performance by removing memory access bottleneck from computational critical path
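
The tile counts on this slide show why the memory-management overhead shrinks as the array grows: the compute share of the array is R^2 / (R^2 + 2R) = R / (R + 2). The short sketch below tabulates this; the function name and the chosen values of R are illustrative only.

```python
def tile_allocation(R):
    """Tile counts for an array with R rows and R + 2 columns.

    Returns (processing tiles, memory tiles, fraction of tiles computing).
    """
    processing = R * R      # P = R^2 inner tiles
    memory = 2 * R          # M = 2*R edge tiles
    total = R * (R + 2)     # equals P + M
    return processing, memory, processing / total


for R in (2, 4, 8, 16, 32):
    p, m, frac = tile_allocation(R)
    print(f"R={R:2d}  P={p:4d}  M={m:3d}  compute fraction={frac:.3f}")
```

The fraction approaches 1 as R grows, which is consistent with the limiting efficiency condition stated for stream algorithms.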

16 FIR Filter (RAW)
[Figure: Throughput (GFLOPS/sec) and Throughput x Stability versus vector length for RAW (250 MHz, 4 GFLOPS/sec) and G4 (500 MHz, 4 GFLOPS/sec)]
Raw implements the appropriate memory hierarchy for the problem
Raw's Throughput x Stability score stays high

17 Outline
Introduction
Kernel Benchmarks and Metrics
Programming PCA Architectures
Case Study: SVD Kernel
Conclusions

18 Singular Value Decomposition (SVD)
Reduction sequence (M rows, N columns): Input Matrix -> Upper-Triangular Matrix -> Bidiagonal Matrix -> Diagonal Matrix ($\Sigma$)
$X = U \Sigma V^H$, with U, V unitary and $\Sigma$ diagonal
SVD is becoming more widely used in signal and image processing
  Important for spectral analysis
  Can also be used for adaptive beamforming, especially for ill-conditioned problems
SVD kernel implementation is a Reduced SVD that begins with a QR factorization if M > N
  Uses Modified Gram-Schmidt QR factorization
  Many possible optimizations, especially block factorization

19 SVD Results (G4)
[Figure: SVD Throughput (Mflop/s) and SVD Throughput x Stability; markers 1 and 2 are explained below]
PowerPC G4 (Mercury), 500 MHz, peak: 4 GFLOPS/sec
Mean efficiency: 16%
Reduced SVD of a 16-column complex matrix
Begins with MGS QR factorization (needs A+R)
L1 cache drives inner loop performance
  1: A+R fills L1 cache
  2: One column of A is half of L1 cache
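
The two cache markers can be located with simple arithmetic once an element size is fixed. The sketch below assumes single-precision complex data (8 bytes per element) and that R is stored as a full 16 x 16 array; neither is stated on the slide, so the resulting row counts are illustrative. The 32 KB L1 size is the figure quoted on the G4 slides.

```python
def rows_where_a_plus_r_fills_cache(cache_bytes, n_cols, elem_bytes):
    """Rows M at which an M x n_cols matrix A plus an n_cols x n_cols R fill the cache."""
    r_bytes = n_cols * n_cols * elem_bytes
    return (cache_bytes - r_bytes) // (n_cols * elem_bytes)


def rows_where_column_is_half_cache(cache_bytes, elem_bytes):
    """Rows M at which a single column of A occupies half of the cache."""
    return (cache_bytes // 2) // elem_bytes


L1_BYTES = 32 * 1024   # L1 cache size quoted for the G4
N_COLS = 16            # 16-column matrix, as on the slide
ELEM_BYTES = 8         # assumption: single-precision complex elements

print(rows_where_a_plus_r_fills_cache(L1_BYTES, N_COLS, ELEM_BYTES))
print(rows_where_column_is_half_cache(L1_BYTES, ELEM_BYTES))
```

Under these assumptions the first marker falls at a few hundred rows and the second at a couple of thousand rows; a different element size scales both values proportionally.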

20 Modified Gram-Schmidt QR Results (G4)
[Figure: MGS Throughput (Mflop/s) and MGS Throughput x Stability; markers 1 and 2 are explained below]
PowerPC G4 (Mercury), 500 MHz, peak: 4 GFLOPS/sec
Mean efficiency: 12%
Modified Gram-Schmidt QR factorization of a 16-column complex matrix
MGS is about 60% of SVD time
L1 cache drives inner loop performance
  1: A+R fills L1 cache
  2: One column of A is half of L1 cache
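
For reference, Modified Gram-Schmidt QR can be sketched in plain Python as follows. This is a textbook column-oriented formulation for a full-column-rank complex matrix, not the benchmarked implementation; the storage layout (Q held as a list of columns) is a choice made for this sketch. It does show why the working set is A plus R: the columns of A are overwritten by Q while R is built alongside.

```python
def mgs_qr(A):
    """Modified Gram-Schmidt QR of an m x n matrix (list of rows), m >= n, full column rank.

    Returns (Q, R): Q is a list of n orthonormal columns, each of length m;
    R is an n x n upper-triangular matrix stored as a list of rows.
    """
    m, n = len(A), len(A[0])
    # Copy A column by column; these columns are overwritten with Q.
    Q = [[complex(A[i][j]) for i in range(m)] for j in range(n)]
    R = [[0j] * n for _ in range(n)]
    for j in range(n):
        norm = sum(abs(v) ** 2 for v in Q[j]) ** 0.5
        R[j][j] = norm
        Q[j] = [v / norm for v in Q[j]]
        # Immediately remove the q_j component from every remaining column.
        for k in range(j + 1, n):
            r = sum(q.conjugate() * a for q, a in zip(Q[j], Q[k]))
            R[j][k] = r
            Q[k] = [a - r * q for a, q in zip(Q[k], Q[j])]
    return Q, R


A = [[3, 1], [4, 2]]
Q, R = mgs_qr(A)
print(R)
```

The inner loop over k sweeps every remaining column once per step j, which matches the slide's observation that inner loop performance depends on how many columns of A fit in the L1 cache.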

21 SVD for RAW Architecture
Reduction sequence (M rows, N columns): Input Matrix -> Banded Matrix -> Bidiagonal Matrix -> Diagonal Matrix ($\Sigma$)
Goal is to match problem size and architecture
Use 2D systolic morph
  maximizes time/space efficiency
  uses architecture in a scalable way
Uses efficient QR/LQ approach to get to banded form
  Fast Givens approach for QR/LQ
  Decoupled algorithm with good parallelism
Banded form matches array dimension of systolic morph
  provides high locality for reduction to bidiagonal form
Diagram labels: Memory Tiles, Compute Tiles
Raw implementation seeks to efficiently match the many possible algorithms to the many possible architectural configurations

22 RAW and G4 Results: Fast Givens QR Factorization
The QR is a key sub-kernel of the SVD
[Figure: Throughput (GFLOPS/sec) and Throughput x Stability versus N (for N by N matrices)]
The QR performance demonstrates the benefit of the PCA approach on matrix algebra operations

23 Lincoln Laboratory PCA Testbed
Test bed architecture:
  Intel PC: dual processor; 66 MHz/64-bit wide PCI bus; running Linux
  Clusters on LLAN; Ethernet LAN; PCI bus
  Mercury RACE/VME: Solaris/MCOS; SBC; G4; DSP
  Annapolis Wildstar: high speed I/O; DSP/FPGA
  Unit under test: RAW Test Board (October 2003); 2 MB DRAM; high speed I/O; USB interface; daughtercard; high speed A/D
Test bed objectives:
  Kernel performance evaluation
  Application morphing demonstration
  High-level software prototyping

24 Outline
Introduction
Kernel Benchmarks and Metrics
Programming PCA Architectures
Case Study: SVD Kernel
Conclusions

25 Conclusions
MIT Lincoln Laboratory has defined kernel benchmarks for the PCA program
  Multiple categories of processing
  Based on DoD application needs
Establishing a performance baseline on conventional architectures
  Performance is limited by the blocking factor and by the memory hierarchy
  Example: CFAR has low ops/byte and 3% efficiency; FIR has high ops/byte and 29% efficiency
PCA processors allow opportunities for high performance
  Performance achieved through co-optimization of the algorithm and the architecture
  Example: unusual SVD algorithm leads to high performance on Raw
  The greater degree of freedom allows greater optimization across a variety of problem domains

26 PCA Team
Hector Chan, Bill Coate, Jim Daly, Ryan Haney, Hank Hoffmann, Preston Jackson, James Lebak, Janice McMahon, Eddie Rutledge, Glenn Schrader, Edmund Wong

Kernel Benchmarks and Metrics for Polymorphous Computer Architectures James Lebak Hank Hoffmann Janice McMahon MIT Lincoln Laboratory

Polymorphous computer architectures (PCA) are new computer architectures being developed under