Powering Real-time Radio Astronomy Signal Processing with latest GPU architectures

1 Powering Real-time Radio Astronomy Signal Processing with latest GPU architectures
Harshavardhan Reddy Suda, NCRA, India
Vinay Deshpande, NVIDIA, India
Bharat Kumar, NVIDIA, India

2 What signals are we processing?
Digitized baseband signals from 30 dual polarized antennas of GMRT
- The Giant Meter-wave Radio Telescope (GMRT) is a world class instrument for studying astrophysical phenomena at low radio frequencies
- Located 80 km north of Pune, 160 km east of Mumbai
- Array telescope with 30 antennas of 45 m diameter, operating at meter wavelengths

3 GMRT
Supports two modes of operation :
- Interferometry (correlator)
- Array mode (beamformer)
Frequency bands :
- 130 to 260 MHz
- 250 to 500 MHz
- 550 to 900 MHz
- 1050 to 1600 MHz
Maximum instantaneous bandwidth : 400 MHz (Legacy GMRT = 32 MHz)
Effective collecting area (2-3% of SKA) :
- 30,000 sq m at lower frequencies
- 20,000 sq m at higher frequencies

4 The Giant Meter-wave Radio Telescope A Google eye view

5 GMRT receiver chain
Signal processing in digital back-end
Image courtesy : Ajith Kumar, NCRA

6 Computation requirements
Antenna signals (M = 64), sampled at a maximum bandwidth of 400 MHz
- Fourier Transform, O(N log N), 16k spectral channels : 3 TFlops
- Phase Correction : 0.1 TFlops
- MAC over M(M+1)/2 signal pairs : 6.6 TFlops
Total : ~10 TFlops
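
The budget above can be rescaled for other array sizes with a few lines of arithmetic. The sketch below is a rough estimate under stated assumptions, not the authors' calculation: it assumes 8 floating-point operations per complex multiply-accumulate and, for the FFT, the common 2.5 N log2 N operation count for a real-input transform of N = 2 x channels samples. The function names are hypothetical. Under these assumptions the MAC term comes to about 6.66 TFlops, in line with the slide. The FFT term comes to about 1.9 TFlops, below the slide's 3 TFlops, which presumably rests on a different operation-count convention.

```python
import math


def mac_flops(n_signals, bandwidth_hz, flops_per_cmac=8):
    """Flops per second for the correlator MAC stage.

    Every one of the M(M+1)/2 signal pairs needs one complex
    multiply-accumulate per spectral sample, and a bandwidth of B Hz
    yields B complex spectral samples per second per signal.
    flops_per_cmac=8 is an assumption (4 multiplies + 4 adds).
    """
    pairs = n_signals * (n_signals + 1) // 2
    return pairs * flops_per_cmac * bandwidth_hz


def fft_flops(n_signals, bandwidth_hz, n_channels, coeff=2.5):
    """Flops per second for the FFT stage (assumed real-input FFT).

    A real FFT of n_real = 2 * n_channels samples is costed at
    coeff * n_real * log2(n_real); coeff=2.5 is an assumed convention.
    """
    n_real = 2 * n_channels
    ffts_per_second = 2 * bandwidth_hz / n_real  # Nyquist sampling at 2B
    return n_signals * ffts_per_second * coeff * n_real * math.log2(n_real)


mac = mac_flops(n_signals=64, bandwidth_hz=400e6)
fft = fft_flops(n_signals=64, bandwidth_hz=400e6, n_channels=16384)
print(f"MAC: {mac / 1e12:.2f} TFlops, FFT: {fft / 1e12:.2f} TFlops")
```

Because the MAC term grows as M squared while the FFT term grows as M, the correlation stage dominates the budget for any large array.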

7 Design : Time slicing model

8 Design : Time slicing model
A 4-node example
Ant 1 to Ant 16 : Digitized data of baseband signals of Antennas
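
The figure for this slide did not survive, so the following is only a sketch of how a time slicing exchange could work for the 4-node example. It assumes that each node acquires 4 of the 16 antennas, cuts its data into 4 consecutive time slices per cycle, and sends slice k to node k, so that every node ends up with all antennas for one time slice and can correlate them independently. The function name and the block layout are hypothetical.

```python
def exchange_time_slices(n_nodes, antennas_per_node):
    """Return the (antenna, time_slice) blocks each node holds after
    one exchange cycle of n_nodes time slices (assumed scheme)."""
    held = {node: [] for node in range(n_nodes)}
    for source in range(n_nodes):
        for local in range(antennas_per_node):
            # Antennas are numbered Ant 1 .. Ant 16 across the nodes.
            antenna = source * antennas_per_node + local + 1
            for time_slice in range(n_nodes):
                destination = time_slice  # slice k is processed on node k
                held[destination].append((antenna, time_slice))
    return held


held = exchange_time_slices(n_nodes=4, antennas_per_node=4)
for node, blocks in held.items():
    antennas = sorted(antenna for antenna, _ in blocks)
    print(f"node {node}: slice {node} of Ant {antennas[0]} to Ant {antennas[-1]}")
```

In this scheme each node sends one block to every other node per cycle, an all-to-all pattern that depends on a fast interconnect between the nodes.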

9 Implementation
- 16 Dell T630 machines as Compute Nodes
- 16 ROACH (FPGA) boards with Atmel/e2v based ADCs, developed by the CASPER group, Berkeley, for digitization and packetization
- 32 Tesla K40c GPU cards for processing
- 36-port Mellanox InfiniBand switch for data sharing between Compute Nodes and Host Nodes
- Software : C/C++ and CUDA C programming with OpenMPI and OpenMP directives
- Developed in collaboration with Swinburne University, Australia

10 Implementation Image courtesy : Irappa Halagalli, NCRA

11 Sample result
Image of Coma cluster
- Legacy GMRT 325 MHz : 350 μJy
- Upgraded GMRT MHz : 28 μJy
Significantly lower noise RMS and better image quality with upgraded GMRT
Dharam Vir Lal and Ishwar Chandra, NCRA

12 Computation Performance : K40
Table columns : Channels, FFT (Gflops), MAC (Gflops)
No. of antennas : 32 (dual pol)
CUDA 7.5

13 Motivation for next generation GPUs
Adding more compute intensive applications
- Multi-beamforming
- Processing on each beam (beam steering)
- Gated correlator
- FIR filtering with many taps for narrow-band mode implementation
The working GMRT system and code provide an excellent testing ground for the features of next generation GPUs
Performance measured and compared on GP100 and V100

14 Computation performance : K40 vs GP100
CUDA 7.5, ECC off
Performance follows CUFFT benchmarks for K40 and P100
- Reference for K40 benchmark : CUDA 6.5 performance report, September 2014
- Reference for P100 benchmark : CUDA 8 PERFORMANCE OVERVIEW, November 2016

15 Computation performance : K40 vs GP100
CUDA 7.5, ECC off
No. of antennas : 32 (dual pol)

16 Computation performance : K40 vs GP100
CUDA 7.5, ECC off
Peak Performance (TFlops) : K40, GP100
Peak Global Memory Bandwidth (GB / sec) : K40, GP100

17 Computation performance as % of Real-time Bandwidth : 200 MHz No. of antennas : 32 (dual pol) Spectral Channels : 16384

18 Computation performance : GP100 vs V100
GP100 on CUDA 7.5
V100 on CUDA 9.1 (using PSG cluster)

19 Computation performance : GP100 vs V100
GP100 on CUDA 7.5
V100 on CUDA 9.1 (using PSG cluster)
No. of antennas : 32 (dual pol)

20 Computation performance : GP100 vs V100
GP100 on CUDA 7.5
V100 on CUDA 9.1 (using PSG cluster)
Peak Performance (TFlops) : GP100, V100
Peak Global Memory Bandwidth (GB / sec) : GP100, V100

21 Reasons behind relatively low performance of MAC
- Non-contiguous Global Memory access at block level
- Low Arithmetic Intensity
- MAC input data format
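
To make the arithmetic intensity point concrete, the sketch below writes out the MAC for one spectral channel and one time sample, then estimates flops per byte of input in two limiting cases. It is illustrative only: it assumes single-precision complex inputs (8 bytes each) and 8 flops per complex multiply-accumulate, and it ignores the traffic of the accumulated output. The function names are hypothetical, and the slides do not state the actual GMRT input format.

```python
def correlate(spectrum):
    """MAC for one channel and one time sample.

    Returns V[(i, j)] = x_i * conj(x_j) for all pairs i <= j; a real
    correlator accumulates these products over many time samples.
    """
    n = len(spectrum)
    return {
        (i, j): spectrum[i] * spectrum[j].conjugate()
        for i in range(n)
        for j in range(i, n)
    }


def mac_arithmetic_intensity(n_signals, bytes_per_complex=8, flops_per_cmac=8):
    """Flops per input byte with no data reuse and with full reuse."""
    pairs = n_signals * (n_signals + 1) // 2
    flops = pairs * flops_per_cmac
    # No reuse: every pair reloads both of its inputs from global memory.
    no_reuse = flops / (pairs * 2 * bytes_per_complex)
    # Full reuse: each input is loaded once and shared by all pairs.
    full_reuse = flops / (n_signals * bytes_per_complex)
    return no_reuse, full_reuse


no_reuse, full_reuse = mac_arithmetic_intensity(64)
print(f"flops/byte: {no_reuse} without reuse, {full_reuse} with full reuse")
```

The gap between the two cases, 0.5 versus 32.5 flops per byte for 64 signals, suggests why caching and access pattern matter so much for this kernel.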

22 GPU kernel improvements
FFT :
- Single Precision to Half Precision floating point
MAC :
- Simplified Index Arithmetic
- Improved the L2 hit ratio : less than 5% to nearly 86%
- Vectorized loads
- Increased ILP (float4)
- Exposing more parallelism by increasing the occupancy
- Single Precision to Half Precision floating point : No performance gain

23 MAC : Performance gain with optimizations on V100 V100 on Cuda 9.1 (using PSG cluster) No. of antennas : 32 (dual pol)

24 FFT : Performance gain with half precision on V100 V100 on Cuda 9.1 (using PSG cluster)

25 FFT : Error analysis with half precision in power spectrum Spectral Channels : 2048 Batch size : 128

26 FFT : Error analysis with half precision in phase spectrum Spectral Channels : 2048 Batch size : 128
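
The plots for the two error-analysis slides are not reproduced here. As a rough illustration of this kind of comparison, the sketch below quantizes a noise-like input to IEEE 754 half precision, transforms it, and compares the power and phase spectra against a reference computed from the unquantized input. It models only the input rounding; a half-precision FFT on the GPU also rounds intermediate results, so its errors would be larger. The Gaussian test signal, the single spectrum instead of a batch of 128, and the error metrics are all assumptions.

```python
import cmath
import random
import statistics
import struct


def to_half(x):
    """Round a float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def half_precision_errors(channels=2048, seed=0):
    """Return (max power error / mean power, median phase error in rad)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(2 * channels)]
    ref = fft(x)[:channels]
    low = fft([to_half(v) for v in x])[:channels]
    p_ref = [abs(v) ** 2 for v in ref]
    p_low = [abs(v) ** 2 for v in low]
    mean_power = sum(p_ref) / channels
    power_err = max(abs(a - b) for a, b in zip(p_ref, p_low)) / mean_power
    # The phase of low * conj(ref) is the wrapped phase difference.
    phase_err = statistics.median(
        abs(cmath.phase(b * a.conjugate())) for a, b in zip(ref, low)
    )
    return power_err, phase_err


power_err, phase_err = half_precision_errors()
print(f"power error: {power_err:.2e}, median phase error: {phase_err:.2e} rad")
```

The median is used for the phase error because channels with very little power have poorly defined phase and would otherwise dominate the statistic.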

27 Going forward
- Improving MAC using Tensor cores : potential 2x improvement
- Implementing the MAC optimizations and half-precision floating point FFT in the GMRT code
- Optimized FIR filtering routines in CUDA for narrow-band mode implementation
- Implementing multi-beamforming, beam steering and gated correlator

28 Acknowledgements
- Prof. Yashwant Gupta, Centre Director, NCRA
- Ajith Kumar B., Back-end group co-ordinator, GMRT, NCRA
- Sanjay Kudale, GMRT, NCRA
- Shelton Gnanaraj, GMRT, NCRA
- Andrew Jameson, Swinburne University, Australia
- Benjamin Barsdel, Swinburne University, Australia (now at Nvidia)
- CASPER Group, Berkeley
- Digital Back-end Group, GMRT, NCRA
- Computer Group, GMRT, NCRA
- Control Room, GMRT

29 Thank You
