Hardware Acceleration of Pulsar Search on FPGAs using OpenCL

1 Hardware Acceleration of Pulsar Search on FPGAs using OpenCL
Oliver Sinnen, Haomiao Wang & Prabu Thiagaraj (Manchester Uni)
Parallel and Reconfigurable Computing, Department of Electrical and Computer Engineering, University of Auckland
Computing for SKA, 2017

2 Strong-field Test of Gravity using Pulsars
Image credits: NASA; NASA/Tod Strohmayer (GSFC)/Dana Berry (Chandra X-Ray Observatory)

3 Outline
1 Overview and Task

5 Pulsar and Pulsar Search
- Observed radiation is a pulse
- Binary pulsar (Doppler effect)
- Acceleration search: 1) time-domain, 2) frequency-domain

6 Pulsar and Pulsar Search: Frequency-domain
A matched filtering technique is used in the Fourier domain to recover the signal into a single bin:
$A_{r_0} = \sum_{k=[r_0]-m/2}^{[r_0]+m/2} A_k \, A_{r_0-k}$
where the frequency $r_0$ is unknown, so the summation is computed at a range of frequencies $r$.

7 Block Overview of Pulsar Search Engine
[Block diagram. Legend: Common, Single Pulse, Time domain Acc, Freq domain Acc. Interfaces: To SDP, From SDP]
Data products: Beamformed Data (BFD), Filterbank Data Chunks (FDC), Dedispersion Buffers (DB), Flagged DB (FDB), Dedispersed Data Buffer (DDB), Periodicity Search Buffer (PSB), Filterbank Data for Selected SP Candidates, Filterbank Data for Candidate Folding
Modules:
- Data Receptor (RCPT)
- Dedispersion Buffer Creator (DDBC)
- RFI Mitigation (RFIM)
- Dedispersion Transform (DDTR)
- Periodicity Search Buffer Creator (PSBC)
- Candidate Data Output Streamer (CDOS)
- Single Pulse Detector (SPCT)
- Single Pulse Optimiser (SPOPT)
- Single Pulse Sifter (SPSIFT)
- Full Filterbank Buffer Creator (FFBC)
- Candidate Folding and Optimisation (FLDO)
- Complex Fourier Transform (CXFT)
- Inverse Complex Fourier Transform (ICXFT)
- Birdie Zapping (BRDZ)
- Dereddening Spectrum (DRED)
- Time Domain Resampler Transform (TDRT)
- Fourier Transform and Power Spectrum (PWFT)
- Time Domain Candidate Optimisation (TDAO)
- Fourier Domain Acceleration Search (FDAS)
- Fourier Domain Candidate Optimisation (FDAO)
- Harmonic Summing (HRMS)
- Candidate Sifting (SIFT)

8 Fourier-domain Acceleration Search (FDAS)
The FDAS module is applied to search for (binary) pulsars with constant frequency derivatives in the frequency domain.
- Over 2,000 beams (Beam2 ... Beami ... BeamN) are formed at 4,096 channels/beam
- Beami signals are de-dispersed for 6,000 DMs (DM1, DM2, ... DMj, ... DM6000)
- PSS Engine_i: pre-processing, Single Pulse Search Modules, Time Domain Acceleration or FDAS Module, postprocessing
- Pre-processing: RFIM, DDTR, PSBC, CXFT, BRDZ, DRED
- FDAS Module: FT Convolution Module (FIR_1 ... FIR_k ... FIR_85) and Harmonic-sum Module
- 85 FIR filters; maximum length is 421 taps

9 Specification of Task
Parameter | Description | Value
B | number of beams |
DM | number of de-dispersion measure (DM) trials | 6000
T_obs | observation period | 540 s
t_limit | time limit for executing one sample group | 88 ms
N | number of complex samples per group | 2^22
M | number of templates/filters | 85
K | average template/filter length | > 200
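The specification can be turned into a rough workload estimate. The sketch below assumes a naive time-domain implementation in which every output sample of every filter costs K complex multiply-accumulates, and it counts one complex multiply-accumulate as 8 real floating-point operations. Both are assumptions made for illustration, and K = 200 is only the lower bound stated on the slide, so the result is a lower bound as well.

```python
# Back-of-envelope cost of naive time-domain filtering for one sample group.
N = 2 ** 22        # complex samples per group (from the specification)
M = 85             # number of templates/filters (from the specification)
K = 200            # average filter length; the slide only says "> 200"
T_LIMIT_S = 0.088  # time limit per sample group, 88 ms

# Assumption: one complex multiply-accumulate = 4 real multiplies + 4 real adds.
FLOPS_PER_COMPLEX_MAC = 8

complex_macs = N * K * M
real_flops = complex_macs * FLOPS_PER_COMPLEX_MAC
required_tflops = real_flops / T_LIMIT_S / 1e12

print(f"complex MACs per group: {complex_macs:,}")
print(f"required throughput: {required_tflops:.2f} TFLOP/s (lower bound)")
```

Under these assumptions the direct approach needs several TFLOP/s of sustained throughput per sample group, which is one way to see why the later slides look at frequency-domain filtering and at several filters running in parallel.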

11 FT Convolution
- Complex floating-point operations
- Multiple long FIR filters
- Large input size
- Strict time limit
- Number of acceleration devices (CapEx)
- Energy consumption (OpEx)

12 Basic Element
Time-domain FIR Filter (TDFIR):
$y_m[i] = \sum_{k=0}^{K-1} x_m[i-k] \, h_m[k]$, for $i = 0, 1, \ldots, N-1$
Frequency-domain FIR Filter (FDFIR):
$\mathcal{F}\{f * h\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{h\}$
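The two basic elements compute the same result, which a small sketch can check numerically. The function names `tdfir`, `fdfir`, and `fft` are hypothetical, the FFT is a plain recursive radix-2 routine rather than the FFT engine used on the FPGA, and input samples before index 0 are assumed to be zero.

```python
import cmath


def tdfir(x, h):
    """Direct form: y[i] = sum_{k=0}^{K-1} x[i-k] * h[k], with x[j] = 0 for j < 0."""
    y = []
    for i in range(len(x)):
        acc = 0j
        for k in range(len(h)):
            if i - k >= 0:
                acc += x[i - k] * h[k]
        y.append(acc)
    return y


def fft(a, inverse=False):
    """Recursive radix-2 FFT (length must be a power of two); the inverse is unscaled."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], inverse), fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def fdfir(x, h):
    """Convolution theorem: F{x * h} = F{x} . F{h}, zero-padded to avoid wrap-around."""
    size = 1
    while size < len(x) + len(h) - 1:
        size *= 2
    X = fft(list(x) + [0j] * (size - len(x)))
    H = fft(list(h) + [0j] * (size - len(h)))
    y = fft([a * b for a, b in zip(X, H)], inverse=True)
    return [v / size for v in y[:len(x)]]  # scale the inverse, keep N outputs


x = [complex(i % 5, -(i % 3)) for i in range(16)]
h = [complex(1, 1), 0.5, -2j]
max_error = max(abs(a - b) for a, b in zip(tdfir(x, h), fdfir(x, h)))
print(f"max |TDFIR - FDFIR| = {max_error:.2e}")
```

The direct form costs about N·K complex multiplications per filter, while the FFT route costs on the order of N·log N regardless of K, which matters for filters that are hundreds of taps long.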

13 Hardware Limitation
Naïve Time Domain:
- DSP blocks: single precision floating-point (SPF) multiplications
- $(A + iB)(C + iD) = (AC - BD) + i(AD + BC)$
Naïve Frequency Domain:
- Off-chip (global) memory: off-chip memory bandwidth
- RAM blocks: on-chip (local) memory size
- 4-Million elements = 32 MBytes
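Both limits can be made concrete with a few lines of arithmetic. The sketch below spells out the complex multiplication as four real multiplications and two real additions, and reproduces the 32 MBytes figure under the assumption that each of the 2^22 elements is a complex value made of two 4-byte SPF numbers. The helper name `complex_multiply` is hypothetical.

```python
def complex_multiply(a, b, c, d):
    """(a + ib)(c + id) using four real multiplications and two real additions."""
    real = a * c - b * d
    imag = a * d + b * c
    return real, imag


BYTES_PER_SPF = 4                             # single precision float
n_elements = 2 ** 22                          # "4-Million elements"
input_bytes = n_elements * 2 * BYTES_PER_SPF  # real + imaginary part per element
input_mib = input_bytes / 2 ** 20

print(complex_multiply(1.0, 2.0, 3.0, 4.0))
print(f"one input array: {input_mib:.0f} MiB")
```

Each complex multiplication therefore occupies several SPF multipliers in the DSP blocks, and a single 32 MBytes input array is well beyond the 50Mb of on-chip memory listed for the FPGA on the platform slide, so the data has to be streamed from off-chip memory.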

14 Decomposition Algorithms
- Overlap-add algorithm: split the coefficient array -> OLA-TD
- Overlap-save algorithm: split the input array -> OLS-FD
[Figure (a) OLA: the coefficients are split into N subsets C_1 ... C_N of length Ncoef/N; the input data is convolved with subset coefficient group i to give output data_i; output data_1 ... output data_n are combined into the output data]
[Figure (b) OLS: the input is split into N small groups ID_1 ... ID_N; each ID_i is convolved with the FIR filter to give PD_i; the Ncoef-1 elements are discarded; PD_1 ... PD_N form the output data]
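A minimal sketch of the two decompositions as the slide describes them: OLA-TD splits the coefficient array and adds shifted partial outputs, and OLS-FD splits the input into overlapping blocks, filters each block with an FFT, and discards the first Ncoef-1 outputs of every block. All function names, the block size, and the group count are hypothetical, and the recursive FFT stands in for the hardware FFT engine.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT (length must be a power of two); the inverse is unscaled."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], inverse), fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def direct_fir(x, h):
    """Reference: y[i] = sum_k x[i-k] * h[k], with x[j] = 0 for j < 0."""
    return [sum(x[i - k] * h[k] for k in range(len(h)) if i - k >= 0)
            for i in range(len(x))]


def ola_td(x, h, groups):
    """Split the coefficients into subsets, filter with each, add the shifted results."""
    size = -(-len(h) // groups)  # ceiling division
    y = [0j] * len(x)
    for g in range(groups):
        subset = h[g * size:(g + 1) * size]
        if not subset:
            continue
        partial = direct_fir(x, subset)
        for i in range(g * size, len(x)):
            y[i] += partial[i - g * size]  # subset g is delayed by g * size samples
    return y


def ols_fd(x, h, fft_size):
    """Split the input into overlapping blocks and filter each in the frequency domain."""
    overlap = len(h) - 1
    step = fft_size - overlap                 # new samples consumed per block
    H = fft(list(h) + [0j] * (fft_size - len(h)))
    padded = [0j] * overlap + list(x)         # history before the first sample is zero
    y = []
    for start in range(0, len(x), step):
        block = padded[start:start + fft_size]
        block += [0j] * (fft_size - len(block))
        out = fft([b * c for b, c in zip(fft(block), H)], inverse=True)
        y.extend(v / fft_size for v in out[overlap:])  # discard wrapped-around outputs
    return y[:len(x)]


x = [complex(i % 7, (i * 3) % 5) for i in range(50)]
h = [complex(1, -1), 0.5, 2j, -1, 0.25j, 3, complex(-2, 1)]
reference = direct_fir(x, h)
err_ola = max(abs(a - b) for a, b in zip(reference, ola_td(x, h, groups=3)))
err_ols = max(abs(a - b) for a, b in zip(reference, ols_fd(x, h, fft_size=16)))
print(f"OLA-TD error {err_ola:.2e}, OLS-FD error {err_ols:.2e}")
```

In the overlap-save variant the FFT size sets how much of each block is useful output: with a filter of Ncoef taps, only fft_size - (Ncoef - 1) outputs per block are kept, which is one reason the choice between 1024, 2048, and 4096 points affects latency.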

16 High-level Techniques
- Maxeler MaxCompiler, using Java to develop for FPGAs (HPCC2016)
- Open Computing Language (OpenCL) for FPGAs (Intel FPGA cards), GPUs, and CPUs (FPT2016, best paper candidate)
[Figure: host and FPGA platform. Host: Core 1 ... Core 4, memory (DDR3 and SSD), PCIe. FPGA_0 ... FPGA_i: PCIe, kernel pipelines, local memory interconnect to block RAM, global memory interconnect with DDR controller & PHY to 2GB DDR3 x 2 memory (global memory)]

17 Kernel Structures OLA

18 Kernel Structures OLS
[Figure: AOLS and TOLS kernel structures; kernels are connected through channels]
AOLS:
- Data Fetch and Multiplication Kernel (NDRange)
- FFT/IFFT Kernel (Single): 1st launch FFT, 2nd launch IFFT
- Bit-Reverse Kernel (NDRange)
- Global memory banks are switched between launches: the 1st launch reads Bank1 and writes Bank2, the 2nd launch reads Bank2 and writes Bank1
- Inputs: input data and processed coefficients
TOLS:
- FFT Data Fetch Kernel (NDRange)
- FFT and Multiplication Kernel (Single)
- FFT Bit-Reverse Kernel (NDRange)
- IFFT Data Fetch Kernel (NDRange)
- IFFT Kernel (Single)
- IFFT Bit-Reverse Kernel (NDRange)
- Global Memory (Bank1) and Global Memory (Bank2); input data, processed coefficients, output

20 Platform
Table: Details of FPGA and GPU Platforms
Attribute | FPGA platform | GPU platform
Device (Board) | Terasic DE5-Net | Sapphire Nitro R7 370
Hardware | Intel Stratix V 5SGXA7 | AMD Radeon R7 370
Technology | 28nm | 28nm
Compute resource | 622,000 LEs, 256 DSP blocks | 1024 Stream Processors
On-chip memory size | 50Mb |
Global memory size | 2 x 2GB DDR3 | 3GB GDDR5
Global memory frequency | 800MHz | 5,600MHz
Memory interface width | 2 x 64-bit | 256-bit
Rows with a single surviving value (column unclear): Max clock frequency 985MHz; OpenCL; Max power consumption 150W

21 Latency TDFIR vs FDFIR
4 TDFIR kernels:
- Naïve: TD-Naïve-64S, TD-Naïve-64N
- OLA: OLA-64S, OLA-64N
5 FDFIR kernels:
- Naïve: FD-Naïve
- OLS, AOLS: AOLS-1024, AOLS-2048, AOLS-4096
- OLS, TOLS: TOLS-1024

22 Latency TDFIR vs FDFIR
Latencies of a single FPGA (Intel Stratix V A7) in processing the same input array using 9 different OpenCL kernels.
[Plot: Kernel Execution Latency (ms) vs FIR Filter Length; series: TD-Naïve-64S, TD-Naïve-64N, OLA-64S, OLA-64N, FD-Naïve, AOLS-1024, AOLS-2048, AOLS-4096, TOLS]

23 Multiple FIR Filters
Even the fastest kernel cannot meet the time limit => implement multiple FIR filters in parallel.
- Problem: the bandwidth of off-chip memory is the main bottleneck.
  Solution: do more processing on chip! Calculate the power of the complex values (needed as input in the next stage).
- Problem: the number of DSP blocks limits the number of parallelisable filters.
  Solution: downscale the FFT engine input size: 8 points -> 4 points.
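One way to read the first solution: if the power is computed before the result leaves the chip, each output element shrinks from a complex value to a single real value. The sketch below assumes SPF storage (8 bytes per complex sample, 4 bytes per power value); the helper name `power` and the traffic figures are illustrative, not measurements from the talk.

```python
def power(samples):
    """|z|^2 = re^2 + im^2 for each complex sample."""
    return [z.real * z.real + z.imag * z.imag for z in samples]


BYTES_COMPLEX_SPF = 8   # two single precision floats
BYTES_REAL_SPF = 4      # one single precision float
n = 2 ** 22             # output samples of one filter

bytes_complex_out = n * BYTES_COMPLEX_SPF
bytes_power_out = n * BYTES_REAL_SPF

print(power([3 + 4j, 1j]))
print(f"write traffic per filter: {bytes_complex_out // 2**20} MiB complex "
      f"vs {bytes_power_out // 2**20} MiB power")
```

Under this assumption the extra two multiplications and one addition per sample halve the data written to off-chip memory for every filter, at the price of more DSP blocks, which is where the second problem on the slide comes from.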

26 Multiple FIR Filters
1) Multiple FIR filters
2) Power of complex values
Becomes an optimisation problem!
[Figure: DSP block usage per kernel configuration: TOLS, AOLS, 2 x AOLS, AOLS-1024-P 8 points, 2 x AOLS-1024-P 8 points, AOLS-1024-P 4 points, 3 x AOLS-1024-P 4 points, AOLS-2048-P 4 points, 3 x AOLS-2048-P 4 points. Legend: unused DSP blocks, FFT engine, element-wise multiplications, power]

27 Multiple FIR Filters and FPGAs
[Four plots, each comparing 1 FPGA, 2 FPGAs, and 3 FPGAs for the OpenCL kernels 3xAOLS-1024-P, 3xAOLS-2048-P, and 3xAOLS-4096-P:
- Latency of 84 FIR filters (ms)
- Kernel performance (GFLOPS)
- Power efficiency (GFLOPS/watt)
- Energy dissipation (Joule)]

28 Latency FPGA vs GPU
Latencies of a single GPU (AMD Radeon R7 370) and 3 FPGAs in processing 2 and 4 Million points.
[Plot: Kernel Execution Latency (ms) vs Total FIR Filters; series: 3xAOLS-2048-P (4 Million), 3xAOLS-2048-P (2 Million), GPU FD (4 Million), GPU FD (2 Million)]

29 Energy FPGA vs GPU
Energy dissipation of a single FPGA and GPU in executing the same task with different kernels.
[Plot: Energy Dissipation (Joule) per kernel: TOLS-1024, AOLS, 3xAOLS-4096-P, 3xAOLS-2048-P, 3xAOLS-1024-P, GPU FD]

31 Harmonic-summing
Input: filter-output-plane (FOP, SPF points, ~1.33 GBytes)
Processing flow:
- 8 harmonic planes are generated based on the FOP and the stretched planes
- One threshold for each row of each harmonic plane (overall: 85 x 8 = 680)
- Hundreds of candidates are recorded
Output: candidates; each candidate contains the indexes of filter, harmonic, and bin, and the amplitude (up to 64-bit)
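The processing flow can be sketched on a toy plane. The version below is a simplification under stated assumptions: it uses one common formulation of harmonic summing, in which harmonic plane h adds the power found at bins 2i, 3i, ... h·i to fundamental bin i, and it keeps the filter (row) index fixed across harmonics. The stretching scheme actually used in the pipeline is not given on the slide and may differ. The names `harmonic_planes` and `find_candidates` are hypothetical.

```python
def harmonic_planes(fop, n_harmonics):
    """Return a list of planes; plane h-1 holds, per row, sum_{k=1..h} row[k*i]."""
    running = [list(row) for row in fop]          # harmonic plane 1 is the FOP itself
    planes = [[list(row) for row in running]]
    for h in range(2, n_harmonics + 1):
        for m, row in enumerate(fop):
            for i in range(len(row)):
                if h * i < len(row):              # harmonic bin must exist
                    running[m][i] += row[h * i]
        planes.append([list(row) for row in running])
    return planes


def find_candidates(planes, thresholds):
    """thresholds[h][m] is the threshold for row m of harmonic plane h+1."""
    candidates = []
    for h, plane in enumerate(planes):
        for m, row in enumerate(plane):
            for i, amplitude in enumerate(row):
                if amplitude > thresholds[h][m]:
                    # candidate = (filter index, harmonic, bin, amplitude)
                    candidates.append((m, h + 1, i, amplitude))
    return candidates


# Toy plane: one filter row, signal power at bins 1, 2 and 4.
fop = [[0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]]
planes = harmonic_planes(fop, n_harmonics=4)
thresholds = [[2.5] for _ in planes]              # one threshold per row per plane
candidates = find_candidates(planes, thresholds)
print(candidates)
```

With 85 filter rows and 8 harmonic planes, the `thresholds` table in this layout would have the 85 x 8 = 680 entries mentioned on the slide. The work per element is a single addition and a comparison, so the cost is dominated by how often the planes are read.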

32 Harmonic-summing
Problems:
- Input data size is too large (~1.33 GBytes)
- On-chip memory size is too small for all planes
- The computation is very cheap (SPF adds) and easy to parallelise
Challenge:
- Off-chip memory bandwidth is the issue
- Optimise data use (and reuse); computation is not an issue
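The ~1.33 GBytes figure is consistent with a plane of 85 filter outputs of 2^22 SPF points each. That plane shape is an assumption here, since the point count is not legible on the slide, but the arithmetic matches. The on-chip figure takes the 50Mb from the platform slide as 50 x 10^6 bits, which is also an assumption.

```python
M = 85                 # filters (rows of the filter-output-plane)
N = 2 ** 22            # points per row, assumed equal to the samples per group
BYTES_PER_SPF = 4      # single precision float

fop_bytes = M * N * BYTES_PER_SPF
fop_gib = fop_bytes / 2 ** 30

ON_CHIP_BITS = 50 * 10 ** 6          # 50Mb on-chip memory, read as 10^6 bits per Mb
on_chip_bytes = ON_CHIP_BITS // 8
ratio = fop_bytes / on_chip_bytes

print(f"FOP size: {fop_bytes:,} bytes = {fop_gib:.2f} GiB")
print(f"FOP is about {ratio:.0f}x larger than on-chip memory")
```

With a plane a couple of hundred times larger than the on-chip memory, every harmonic plane has to be built from data streamed through the device, so how often each FOP element is fetched matters more than how many additions are performed.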

33 Conclusion
- FPGA-based implementation and optimisation of FT convolution (FIR filter), based on the OLA and OLS algorithms
- High-level approaches to such tasks work well: a large design space was covered, with easy porting and sharing with partners
- With multiple FPGAs, the FPGA implementation has an advantage over the GPU in both performance (GFLOPS) and energy efficiency
