Evaluating the Potential of Graphics Processors for High Performance Embedded Computing
Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng
Department of Micro-/Nano-electronics, Tsinghua University
Outline
- Motivation
- HPEC Implementation and Evaluation
  - Kernel Benchmarks
  - Synthetic Aperture Radar
- Performance Comparison
- Conclusion
HPEC: High Performance Embedded Computing
- Future IT infrastructure demands even higher computing power
  - High performance radar: 800 GFLOPS (giga floating-point operations per second)
  - 4G wireless base station: 1 Gbit/s data rate per customer, with up to 200 subscribers in a service area
  - CMU driverless car: 270 GFLOPS
Implication
- An increasing number of high performance embedded applications will be implemented on multi-core devices
  - Intel: cluster-based Internet routers
  - IBM: signal processing and radar applications on the Cell processor
  - Huawei: multi-core base stations
- This work: systematically evaluating the potential of GPUs
  - Performance
  - Scalability
HPEC Challenge Benchmark
- Developed by MIT Lincoln Laboratory*
- Quantitatively evaluates HPEC systems
- Kernel benchmarks: extracted from a broad range of signal and image processing applications

* The HPEC Challenge Benchmark Suite, R. Haney, T. Meuse, J. Kepner, HPEC 2006
Kernel Benchmarks

| Category | Benchmark | Description |
|---|---|---|
| Signal Processing | TDFIR | Time-domain finite impulse response filtering |
| Signal Processing | FDFIR | Frequency-domain finite impulse response filtering |
| Signal Processing | QR | QR factorization: prevalent in target recognition algorithms |
| Signal Processing | SVD | Singular value decomposition: produces a basis for the matrix as well as its rank, for reducing interference |
| Signal Processing | CFAR | Constant false-alarm rate detection: finds targets in an environment with varying background noise |
| Communication | CT | Corner turn (matrix transpose) to place radar data into contiguous rows for efficient FFT |
| Information Processing | PM | Pattern matching: identifies stored tracks that match a target |
| Information Processing | GA | Graph optimization via genetic algorithm: removes uncorrelated data relations |
| Information Processing | DB | Database operations to store and query target tracks |
Benchmark Properties

| Benchmark | Data Set | Workload (MFLOP)* | Task-Level Parallelism | Data Structure | Data Size | Data Correlation | Memory Access |
|---|---|---|---|---|---|---|---|
| TDFIR | Set 1 | 268.4 | 64 | Vector | 4096 | Low | Low |
| TDFIR | Set 2 | 1.97 | 20 | Vector | 1024 | Low | Low |
| FDFIR | Set 1 | 34 | 64 | Vector | 4096 | Low | Low |
| FDFIR | Set 2 | 2.21 | 20 | Vector | 1024 | Low | Low |
| CT | Set 1 | 2 | 1 | Matrix | 50x5000 | Very Low | Very High |
| CT | Set 2 | 30 | 1 | Matrix | 750x5000 | Very Low | Very High |
| PM | Set 1 | 1.21 | 72 | Vector | 64 | Low | Low |
| PM | Set 2 | 13.59 | 256 | Vector | 128 | Low | Low |
| CFAR | Set 1 | 0.17 | 384 | Vector | 64 | Medium | Low |
| CFAR | Set 2 | 150.5 | 6144 | Vector | 3500 | Medium | Low |
| CFAR | Set 3 | 41.1 | 3072 | Vector | 1909 | Medium | Low |
| CFAR | Set 4 | 17.7 | 480 | Vector | 9900 | Medium | Low |
| GA | Set 1 | 0.011 | 50 | Vector | 8 | Medium | High |
| GA | Set 2 | 0.51 | 200 | Vector | 96 | Medium | High |
| GA | Set 3 | 0.015 | 100 | Vector | 5 | Medium | High |
| GA | Set 4 | 0.11 | 400 | Vector | 10 | Medium | High |
| QR | Set 1 | 397 | 1 | Matrix | 500x100 | High | Medium |
| QR | Set 2 | 30.5 | 1 | Matrix | 180x60 | High | Medium |
| QR | Set 3 | 45 | 1 | Matrix | 150x150 | High | Medium |
| SVD | Set 1 | 0.24 | 1 | Matrix | 500x100 | High | Medium |
| SVD | Set 2 | 0.88 | 1 | Matrix | 180x60 | High | Medium |
| DB | Set 1 | 440 | 1 | Tree | 440 | High | Very High |
| DB | Set 2 | 700 | 1 | Tree | 700 | High | Very High |

* The workloads of CT and DB are measured in MB and Transactions, respectively.
Implementation on GPU (1)
- Plenty of data-level parallelism: exploits the GPU's raw computing power
- Core computation: loops of multiplication and accumulation (MAC)
- Kernels in this class: TDFIR, FDFIR, CFAR (see the sketch below)
(Figures: TDFIR, FDFIR, and CFAR kernel mappings)
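To make the mapping concrete, here is a minimal CUDA sketch of a time-domain FIR kernel in the spirit of TDFIR: one thread per output sample, with the MAC loop running over the filter taps. It is illustrative only; the kernel name, the real-valued data (the HPEC kernel is complex-valued), and the launch configuration are assumptions rather than the benchmark implementation.

```cuda
// Illustrative sketch (not the benchmark code): time-domain FIR filtering.
// Each thread produces one output sample via a multiply-accumulate loop.
// Real-valued data is used for brevity; the HPEC kernel is complex-valued.
__global__ void tdfir_kernel(const float *x,   // input samples (length n + taps - 1)
                             const float *h,   // filter coefficients (length taps)
                             float *y,         // output samples (length n)
                             int n, int taps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < taps; ++k)             // MAC loop over the filter taps
        acc += h[k] * x[i + k];
    y[i] = acc;
}

// Hypothetical host-side launch: one thread per output sample.
// tdfir_kernel<<<(n + 255) / 256, 256>>>(d_x, d_h, d_y, n, taps);
```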
Implementation on GPU (2)
- Plenty of task-level parallelism
- Synchronization between blocks (see the sketch below)
(Figures: PM and GA kernel mappings)
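A common way to exploit this task-level parallelism while avoiding inter-block synchronization inside a kernel launch is to assign each independent task to its own thread block and combine per-thread partial results with a shared-memory reduction. The sketch below does this for a pattern-matching-style distance computation; the kernel name, the squared-Euclidean metric, and the launch parameters are assumptions, not the benchmark code.

```cuda
// Illustrative sketch: task-level parallelism for pattern matching.
// One thread block per independent task (one library pattern); threads in the
// block cooperate through shared memory, so no inter-block sync is needed.
__global__ void pm_distance_kernel(const float *patterns, // [numPatterns][len]
                                   const float *test,     // [len]
                                   float *dist,            // [numPatterns]
                                   int len)
{
    extern __shared__ float partial[];           // one slot per thread
    const float *p = patterns + blockIdx.x * len;

    float sum = 0.0f;
    for (int i = threadIdx.x; i < len; i += blockDim.x) {
        float d = p[i] - test[i];
        sum += d * d;                            // per-thread partial distance
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Block-level tree reduction; blockDim.x assumed to be a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) dist[blockIdx.x] = partial[0];
}

// Hypothetical launch: one block per pattern.
// pm_distance_kernel<<<numPatterns, 256, 256 * sizeof(float)>>>(d_pat, d_test, d_dist, len);
```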
Implementation on GPU (3)
- Memory access operations
- Global memory access coalescing
- Shared memory for local operations (see the sketch below)
(Figures: CT and DB (database) kernel mappings)
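For the corner turn, coalescing is usually achieved by staging a tile of the matrix through shared memory so that both the global read and the global write touch consecutive addresses. A standard tiled-transpose sketch along those lines; the tile size, the +1 padding, and the names are conventional choices rather than the benchmark's exact code.

```cuda
// Illustrative sketch of a corner turn (matrix transpose) with coalesced
// global-memory accesses: a tile is staged through shared memory. The +1
// padding avoids shared-memory bank conflicts.
#define TILE 32

__global__ void corner_turn(const float *in, float *out, int rows, int cols)
{
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;   // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;   // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;       // column in the output
    y = blockIdx.x * TILE + threadIdx.y;       // row in the output
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}

// Hypothetical launch:
// dim3 block(TILE, TILE);
// dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
// corner_turn<<<grid, block>>>(d_in, d_out, rows, cols);
```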
Implementation on GPU (4)
- Advanced linear algebra operations: QR, SVD
- Hard to exploit parallelism
- Pipelining the row updates of the matrix (see the sketch below)
(Figures: (a) thread assignment, (b) step 1, (c) steps 2 and 3)
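Most of the arithmetic in Householder-based QR is the trailing-matrix update A := A - tau * v * (v^T A), which is the row-update work being pipelined here. A hedged sketch of that update with one thread per column follows; the flat row-major layout and names are assumptions, and a full implementation would also overlap successive updates as the slide describes.

```cuda
// Illustrative sketch of the dominant step in Householder QR: the rank-1
// trailing-matrix update A := A - tau * v * (v^T A). One thread handles one
// column, so adjacent threads read adjacent addresses (coalesced).
__global__ void householder_update(float *A,        // m x n trailing block, row-major
                                   const float *v,  // Householder vector, length m
                                   float tau, int m, int n, int lda)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    if (j >= n) return;

    float w = 0.0f;                                  // w = v^T * A(:, j)
    for (int i = 0; i < m; ++i)
        w += v[i] * A[i * lda + j];

    for (int i = 0; i < m; ++i)                      // A(:, j) -= tau * w * v
        A[i * lda + j] -= tau * w * v[i];
}

// Hypothetical launch, one thread per column of the trailing matrix:
// householder_update<<<(n + 255) / 256, 256>>>(d_A, d_v, tau, m, n, lda);
```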
Experiment Environment
- CPU: Intel Core 2 Duo, 2.66 GHz, 4 GB memory
- GPU: NVidia Tesla C2050, 448 cores, 1.15 GHz, 3 GB memory
- DSP: ADSP-TS101S TigerSHARC, T2-PCI board with 8 DSP processors, 600 MHz, 24 Mbit on-chip memory per DSP
Performance Comparison

| Kernel | Data Set | DSP Throughput (GFLOPS)* | CPU Throughput (GFLOPS)* | GPU Throughput (GFLOPS)* | Speedup (GPU vs DSP / GPU vs CPU) |
|---|---|---|---|---|---|
| TDFIR | Set 1 | 6.865 | 3.382 | 97.506 | 14.2 / 28.8 |
| TDFIR | Set 2 | 0.84 | 3.326 | 23.130 | 27.5 / 6.9 |
| FDFIR | Set 1 | 3.144 | 0.541 | 61.681 | 19.6 / 114.1 |
| FDFIR | Set 2 | 0.588 | 0.542 | 11.955 | 20.3 / 22.1 |
| CT | Set 1 | - | 1.194 | 17.177 | 14.3 |
| CT | Set 2 | - | 0.501 | 35.545 | 70.9 |
| PM | Set 1 | - | 0.871 | 7.761 | 8.9 |
| PM | Set 2 | - | 0.281 | 21.241 | 75.6 |
| CFAR | Set 1 | 0.488 | 1.154 | 2.234 | 4.5 / 1.9 |
| CFAR | Set 2 | 2.568 | 1.314 | 17.319 | 6.7 / 13.1 |
| CFAR | Set 3 | 2.408 | 1.313 | 13.962 | 5.8 / 10.6 |
| CFAR | Set 4 | 2.088 | 1.261 | 8.301 | 3.9 / 6.6 |
| GA | Set 1 | - | 0.562 | 1.177 | 2.1 |
| GA | Set 2 | - | 0.683 | 8.571 | 12.5 |
| GA | Set 3 | - | 0.441 | 0.589 | 1.4 |
| GA | Set 4 | - | 0.373 | 2.249 | 6.0 |
| QR | Set 1 | 1.552 | 1.704 | 54.309 | 34.9 / 31.8 |
| QR | Set 2 | 3.056 | 0.901 | 5.679 | 1.8 / 6.3 |
| QR | Set 3 | 2.408 | 0.904 | 6.686 | 2.7 / 7.4 |
| SVD | Set 1 | 2.576 | 0.747 | 4.175 | 1.6 / 5.6 |
| SVD | Set 2 | 0.6 | 0.791 | 2.684 | 4.5 / 3.4 |
| DB | Set 1 | - | 112.3 | 126.8 | 1.13 |
| DB | Set 2 | - | 5.794 | 8.459 | 1.46 |

* The throughputs of CT and DB are measured in MBytes/s and Transactions/s, respectively. Single-value speedups are GPU vs CPU (no DSP result for that kernel).
Power Efficiency Comparison
- Power consumption: CPU 65 W, GPU 238 W, DSP 10 W
- The GPU suffers from low power efficiency
Synthetic Aperture Radar Benchmark
- Simulates a sensor processing chain

| Data Set | Image Size | FFT/IFFT (MFLOP) | Match Filtering (MFLOP) | Interpolation (MFLOP) | Miscellaneous (MFLOP) | Total (MFLOP) |
|---|---|---|---|---|---|---|
| Set 1 | 382x266 | 28.61 | 6.42 | 56.88 | 1.23 | 93.14 |
| Set 2 | 762x512 | 113.38 | 22.06 | 195.34 | 4.43 | 335.21 |
| Set 3 | 1144x756 | 259.92 | 47.08 | 416.96 | 9.62 | 733.58 |
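FFT/IFFT and matched filtering dominate the SAR workload above, and both map naturally onto cuFFT plus a small point-wise kernel. A hedged host-side sketch of frequency-domain matched filtering (forward FFT, multiply by the conjugate of the reference filter, inverse FFT); the buffer names, batching, and normalization are assumptions, not the benchmark's actual code.

```cuda
// Illustrative sketch (assumed buffers/sizes): frequency-domain matched
// filtering for a batch of SAR range lines using cuFFT.
#include <cufft.h>
#include <cuComplex.h>

__global__ void multiply_conj(cufftComplex *data, const cufftComplex *filt,
                              int n, int len, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // X(f) * conj(H(f)); the same reference filter is reused for each line.
        cufftComplex r = cuCmulf(data[i], cuConjf(filt[i % len]));
        data[i] = make_cuFloatComplex(r.x * scale, r.y * scale); // undo FFT scaling
    }
}

void matched_filter(cufftComplex *d_data, const cufftComplex *d_filt,
                    int len, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, len, CUFFT_C2C, batch);                 // batched 1-D FFTs

    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);         // to frequency domain
    int n = len * batch;
    multiply_conj<<<(n + 255) / 256, 256>>>(d_data, d_filt, n, len, 1.0f / len);
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);         // back to time domain

    cufftDestroy(plan);
}
```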
Performance Result

| Data Set | Kernel | CPU Throughput (GFLOPS) | GPU Throughput (GFLOPS) | Speedup |
|---|---|---|---|---|
| Set 1 | FFT/IFFT | 0.463 | 5.259 | 11.3 |
| Set 1 | Filtering | 0.538 | 17.165 | 31.8 |
| Set 1 | Interpolation | 0.256 | 19.274 | 75.1 |
| Set 1 | Overall | 0.312 | 8.316 | 26.6 |
| Set 2 | FFT/IFFT | 0.581 | 9.252 | 15.9 |
| Set 2 | Filtering | 0.545 | 25.241 | 46.3 |
| Set 2 | Interpolation | 0.252 | 17.332 | 68.8 |
| Set 2 | Overall | 0.327 | 9.507 | 29.1 |
| Set 3 | FFT/IFFT | 0.832 | 15.155 | 18.2 |
| Set 3 | Filtering | 0.523 | 26.856 | 51.3 |
| Set 3 | Interpolation | 0.248 | 18.569 | 74.7 |
| Set 3 | Overall | 0.346 | 11.403 | 32.8 |
Overview of Optimization Techniques
- Maximizing the usage of on-chip resources: shared memory, registers
- Reducing memory access time: coalesced global memory accesses, overlapping transfers with computation (see the sketch below)
- Reducing divergence: warp-level parallelism
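Overlapping transfers with computation is typically done by splitting the data into chunks, using pinned host memory, and issuing the copy-kernel-copy sequence for each chunk on its own CUDA stream. A minimal sketch under those assumptions; the `process` kernel and all names are placeholders, not the benchmark implementation.

```cuda
// Illustrative sketch: overlap PCIe transfers with kernel execution using
// pinned host memory and CUDA streams.
#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;            // stand-in for real work
}

void run_chunked(const float *h_in, float *h_out,   // pinned host buffers
                 float *d_in, float *d_out,         // device buffers
                 int numChunks, int chunkElems)
{
    const int NSTREAMS = 4;
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    size_t chunkBytes = chunkElems * sizeof(float);
    for (int c = 0; c < numChunks; ++c) {
        cudaStream_t st = streams[c % NSTREAMS];
        size_t off = (size_t)c * chunkElems;
        cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                        cudaMemcpyHostToDevice, st);              // H2D copy
        process<<<(chunkElems + 255) / 256, 256, 0, st>>>(d_in + off,
                                                          d_out + off, chunkElems);
        cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes,
                        cudaMemcpyDeviceToHost, st);              // D2H copy
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);
}
```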
Architecture Implications
- SIMD width: suitable for large vector computations
  - A SIMD width that can be configured dynamically per application would help
- Shared memory is superior to cache for embedded applications
  - Data prefetch is preferred
- Special function units for specific applications
  - e.g., a dedicated, efficient shuffle network for FFT
- Power efficiency is currently quite low
  - Reorganizing memory access patterns
  - New interconnect technologies: 3D stacking
Conclusion
- Efficient implementations of the HPEC benchmarks on NVidia's Fermi
- Performance comparison with CPU
  - Kernels: 10X speedup
  - SAR: 30X speedup
- A detailed analysis provides key insights
  - Optimizing data-parallel algorithms
  - Bottlenecks of the GPU architecture for HPEC
- Publications:
  - Design Automation and Test in Europe (DATE), March 2011
  - Journal of Parallel and Distributed Computing, submitted and under review
Thank You!