Evaluating the Potential of Graphics Processors for High Performance Embedded Computing


1 Evaluating the Potential of Graphics Processors for High Performance Embedded Computing
Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng
Department of Micro-/Nano-electronics, Tsinghua University

2 Outline
- Motivation
- HPEC Implementation and Evaluation
- Kernel Benchmarks
- Synthetic Aperture Radar
- Performance Comparison
- Conclusion

3 HPEC: High Performance Embedded Computing
- Future IT infrastructure demands even higher computing power
  - High performance radar: 800 GFLOPS (giga floating-point operations per second)
  - 4G wireless base station: 1 Gbit/s data rate per customer and up to 200 subscribers in the service area
  - CMU driverless car: 270 GFLOPS

4 Implication
- An increasing number of high performance embedded applications will be implemented with multi-core devices
  - Intel: cluster-based Internet routers
  - IBM: signal processing and radar applications on the Cell processor
  - Huawei: multi-core base stations
- Systematically evaluating the potential of GPUs
  - Performance
  - Scalability

5 HPEC Challenge Benchmark
- Developed by MIT Lincoln Laboratory*
- Quantitatively evaluates HPEC systems
- Kernel benchmarks: extracted from a broad range of signal and image processing applications
* The HPEC Challenge Benchmark Suite, R. Haney, T. Meuse, J. Kepner, HPEC

6 Kernel Benchmarks

| Category | Benchmark | Description |
|---|---|---|
| Signal Processing | TDFIR | Time-domain finite impulse response filtering |
| Signal Processing | FDFIR | Frequency-domain finite impulse response filtering |
| Signal Processing | QR | QR factorization: prevalent in target recognition algorithms |
| Signal Processing | SVD | Singular value decomposition: produces a basis for the matrix as well as the rank for reducing interference |
| Signal Processing | CFAR | Constant false-alarm rate detection: finds targets in an environment with varying background noise |
| Communication | CT | Corner turn (matrix transpose) to place radar data into contiguous rows for efficient FFT |
| Information Processing | PM | Pattern matching: identifies stored tracks that match a target |
| Information Processing | GA | Graph optimization via genetic algorithm: removes uncorrelated data relations |
| Information Processing | DB | Database operations to store and query target tracks |
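The CFAR entry above can be made concrete with a small sketch. This is a minimal cell-averaging CFAR in plain Python, not the benchmark's reference code: the function name `ca_cfar` and the parameters `num_ref`, `num_guard`, and `threshold` are hypothetical names chosen for illustration. Each cell is compared against the mean power of its neighbours, skipping a few guard cells on each side, so the detection threshold follows the local noise level.

```python
def ca_cfar(power, num_ref, num_guard, threshold):
    """Return indices of cells whose power exceeds threshold * local noise.

    power     : list of per-cell power values (assumed non-negative)
    num_ref   : reference cells used on each side (hypothetical parameter)
    num_guard : guard cells skipped on each side (hypothetical parameter)
    threshold : required ratio of cell power to estimated noise
    """
    hits = []
    n = len(power)
    for i in range(n):
        ref = []
        # Collect reference cells on both sides, beyond the guard cells.
        for off in range(num_guard + 1, num_guard + num_ref + 1):
            if i - off >= 0:
                ref.append(power[i - off])
            if i + off < n:
                ref.append(power[i + off])
        if not ref:
            continue
        noise = sum(ref) / len(ref)  # local noise estimate
        if noise > 0 and power[i] / noise > threshold:
            hits.append(i)
    return hits


# Flat noise floor with one strong return at cell 10.
cells = [1.0] * 20
cells[10] = 50.0
detections = ca_cfar(cells, num_ref=4, num_guard=1, threshold=5.0)
```

Every cell is tested independently of the others, which is why this kernel is grouped with the data-parallel benchmarks.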

7 Benchmark Properties

| Benchmark | Data Sets | Data Structure | Data Correlation | Memory Access |
|---|---|---|---|---|
| TDFIR | Set1, Set2 | Vector | Low | Low |
| FDFIR | Set1, Set2 | Vector | Low | Low |
| CT | Set1, Set2 | Matrix | Very Low | Very High |
| PM | Set1, Set2 | Vector | Low | Low |
| CFAR | Set1, Set2, Set3, Set4 | Vector | Medium | Low |
| GA | Set1, Set2, Set3, Set4 | Vector | Medium | High |
| QR | Set1, Set2, Set3 | Matrix | High | Medium |
| SVD | Set1, Set2 | Matrix | High | Medium |
| DB | Set1, Set2 | Tree | High | Very High |

[Columns whose values were not preserved: Workload (MFLOPS)*, Task-Level Parallelism, Data Size]
* The workloads of CT and DB are measured in MB and Transactions, respectively.

8 Implementation on GPU (1)
- Plenty of data-level parallelism
- Raw computing power
- Loops of multiplication and accumulation (MAC)
- Benchmarks: TDFIR, FDFIR, CFAR
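The MAC loop structure mentioned on this slide can be sketched for the time-domain FIR case. The following is a plain Python reference of the arithmetic only, not the authors' GPU kernel; the name `tdfir` is a hypothetical label. The point it illustrates is that every output sample is computed by its own independent inner loop, so each iteration of the outer loop could be handed to a separate GPU thread.

```python
def tdfir(signal, taps):
    """Full convolution of `signal` with `taps` using explicit MAC loops."""
    n_out = len(signal) + len(taps) - 1
    out = [0j] * n_out
    # Outer loop: one independent output sample per iteration.
    # On a GPU this is the loop that would be spread across threads.
    for n in range(n_out):
        acc = 0j
        # Inner loop: multiply-accumulate over the filter taps.
        for k, h in enumerate(taps):
            i = n - k
            if 0 <= i < len(signal):
                acc += h * signal[i]
        out[n] = acc
    return out


filtered = tdfir([1, 2, 3], [1, 1])
```

With a two-tap filter of ones, each output is the sum of two adjacent inputs, which makes the result easy to check by hand.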

9 Implementation on GPU (2)
- Plenty of task-level parallelism
- Synchronization between blocks
- Benchmarks: PM, GA

10 Implementation on GPU (3)
- Memory access operations
- Coalesced global memory access
- Shared memory for local operations
- Benchmarks: CT, DB (database)
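The role of shared memory in the corner turn can be illustrated with a tiled transpose. This Python sketch only models the access pattern and is not the authors' kernel: `corner_turn_tiled` and the tile edge `TILE = 4` are assumptions for illustration. Each tile is read row by row into a small staging buffer (standing in for shared memory) and then written out so that every output row is also filled contiguously.

```python
TILE = 4  # hypothetical tile edge; a real kernel would tune this


def corner_turn_tiled(matrix, tile=TILE):
    """Transpose `matrix` (list of equal-length rows) one tile at a time."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[None] * rows for _ in range(cols)]
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # Stage the tile with contiguous row reads, as a thread block
            # would when loading from global into shared memory.
            staged = [matrix[r][c0:c0 + tile]
                      for r in range(r0, min(r0 + tile, rows))]
            # Write the tile back transposed; each output row segment is
            # contiguous, so the writes can be coalesced as well.
            for j in range(len(staged[0])):
                for i in range(len(staged)):
                    out[c0 + j][r0 + i] = staged[i][j]
    return out


sample = [[r * 5 + c for c in range(5)] for r in range(6)]
turned = corner_turn_tiled(sample)
```

The matrix dimensions in the example are deliberately not multiples of the tile size, so the edge tiles are exercised too.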

11 Implementation on GPU (4)
- Advanced linear algebra operations
- Parallelism is hard to exploit
- Pipelining the row updates of the matrix
- Benchmarks: QR, SVD
[Figure: (a) Thread assignment, (b) Step 1, (c) Steps 2 and 3]
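To show what a "row update" looks like in a QR factorization, here is a minimal sketch using Givens rotations. The slide does not say which QR formulation the authors implemented, so treating it as Givens-based is an assumption, and `givens_qr` is a hypothetical name. Each rotation touches only two adjacent rows, which is the property that lets rotations on disjoint row pairs be overlapped or pipelined.

```python
import math


def givens_qr(a):
    """Return the upper-triangular factor R of `a` via Givens rotations."""
    r = [list(map(float, row)) for row in a]
    m, n = len(r), len(r[0])
    for col in range(n):
        # Zero the entries below the diagonal, working from the bottom up.
        for row in range(m - 1, col, -1):
            x, y = r[row - 1][col], r[row][col]
            h = math.hypot(x, y)
            if h == 0.0:
                continue
            c, s = x / h, y / h
            # Row update: only rows `row - 1` and `row` change.
            # Columns left of `col` are already zero in both rows.
            for j in range(col, n):
                u, v = r[row - 1][j], r[row][j]
                r[row - 1][j] = c * u + s * v
                r[row][j] = -s * u + c * v
    return r


r_factor = givens_qr([[3.0, 1.0], [4.0, 2.0]])
```

Because rotations are orthogonal, the column norms of the input are preserved in R, which gives a quick sanity check on the result.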

12 Experiment Environment
- CPU: Intel Core2 Duo, 2.66 GHz, 4 GB memory
- GPU: NVIDIA Tesla C2050, 448 cores, 1.15 GHz, 3 GB memory
- DSP: ADSP-TS101S TigerSHARC, T2-PCI with 8 DSP processors, 600 MHz, 24 Mbit on-chip memory per DSP

13 Performance Comparison
[Table: DSP, CPU, and GPU throughput (GFLOPS)* and speedup for each kernel and data set: TDFIR (Set 1, Set 2), FDFIR (Set 1, Set 2), CT (Set 1, Set 2), PM (Set 1, Set 2), CFAR (Set 1 to Set 4), GA (Set 1 to Set 4), QR (Set 1 to Set 3), SVD (Set 1, Set 2), DB (Set 1, Set 2); values not preserved]
* The throughputs of CT and DB are measured in Mbytes/s and Transactions/s, respectively.

14 Power Efficiency Comparison
- Power: CPU 65 W, GPU 238 W, DSP 10 W
- GPU suffers from low power efficiency
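The power figures on this slide imply a simple break-even calculation: for the GPU to match another device in throughput per watt, its speedup over that device has to exceed the ratio of their power draws. The sketch below works that out from the three wattages quoted above; the function names and the assumption that each device draws its quoted power for the whole run are illustrative simplifications.

```python
# Power figures as quoted on the slide, in watts.
POWER_W = {"CPU": 65.0, "GPU": 238.0, "DSP": 10.0}


def break_even_speedup(device, baseline):
    """Speedup `device` needs over `baseline` to match its GFLOPS per watt."""
    return POWER_W[device] / POWER_W[baseline]


def gflops_per_watt(throughput_gflops, device):
    """Power efficiency for a given measured throughput (caller-supplied)."""
    return throughput_gflops / POWER_W[device]


gpu_vs_cpu = break_even_speedup("GPU", "CPU")  # 238 / 65
gpu_vs_dsp = break_even_speedup("GPU", "DSP")  # 238 / 10
```

Under these assumptions the GPU must be roughly 3.7 times faster than the CPU, and 23.8 times faster than the DSP, just to reach the same efficiency.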

15 Synthetic Aperture Radar Benchmark
- Simulating a sensor processing chain
[Table: image size and workload (MFLOP) for Set 1, Set 2, and Set 3, with workload broken down into FFT/IFFT, Match Filtering, Interpolation, Miscellaneous, and Total; values not preserved]

16 Performance Result
[Table: CPU throughput (GFLOPS), GPU throughput (GFLOPS), and speedup for the FFT/IFFT, Filtering, Interpolation, and Overall rows of Set 1, Set 2, and Set 3; values not preserved]

17 Overview of Optimization Techniques
- Maximizing the usage of on-chip resources
  - Shared memory
  - Registers
- Reducing memory access time
  - Coalesced global memory access
  - Overlapping transfers with computation
- Reducing divergence
  - Warp-level parallelism
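The benefit of coalesced global memory access can be estimated with a toy model that counts how many memory segments one warp touches. The model below assumes 32 threads per warp and 128-byte segments; the segment size and the function name `segments_touched` are modeling assumptions for illustration, not measurements from the presentation.

```python
WARP_SIZE = 32        # threads per warp
SEGMENT_BYTES = 128   # assumed size of one memory transaction


def segments_touched(stride_elems, elem_bytes=4, base=0):
    """Count distinct segments hit when thread t reads element t * stride."""
    addrs = [base + t * stride_elems * elem_bytes for t in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addrs})


unit_stride = segments_touched(1)    # neighbouring threads, neighbouring words
row_stride = segments_touched(32)    # e.g. walking down a column of floats
```

In this model a unit-stride access by the whole warp fits in one segment, while a stride of 32 floats needs a separate segment for every thread, which is the pattern a shared-memory staged corner turn is meant to avoid.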

18 Architecture Implication
- SIMD width: suitable for large vector computing
  - Dynamically configurable SIMD width according to the application
- Shared memory is superior to cache for embedded applications
  - Data prefetch is preferred
- Special functions for specific applications
  - Dedicated efficient shuffle network for FFT, etc.
- Power efficiency is quite low now
  - Reorganizing memory access patterns
  - New interconnection technologies: 3D stacking

19 Conclusion
- Efficient implementations of the HPEC benchmarks on NVIDIA's Fermi
- Performance comparison with CPU
  - Kernels: 10X speedup
  - SAR: 30X speedup
- A detailed analysis provides key insight
  - Optimizing data-parallel algorithms
  - Bottlenecks of the GPU's architecture for HPEC
- Publications:
  - Design, Automation and Test in Europe (DATE), March 2011
  - Journal of Parallel and Distributed Computing, submitted, under review

20 Thank You!
