
1 Signal Processing on GPUs for Radio Telescopes John W. Romein Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands 1

2 Overview: radio telescopes; six radio telescope algorithms on GPUs. Part 1: real-time processing of telescope data: 1) FIR filter, 2) FFT, 3) bandpass correction, 4) delay compensation, 5) correlator. Part 2: creation of sky images: 6) gridding (new GPU algorithm!)

3 Intro: Radio Telescopes 3

4 LOFAR Radio Telescope largest low-frequency telescope distributed sensor network ~85,000 sensors 4

5 LOFAR: A Software Telescope. Different observation modes require flexibility: standard imaging, pulsar survey, known pulsar, epoch of reionization, transients, ultra-high energy particles. Needs a supercomputer, in real time.

6 LOFAR Data Processing Blue Gene/P supercomputer 6

7 Square Kilometre Array: future radio telescope, huge processing requirements. TFLOPS: LOFAR (2012) ~30; SKA 10% (2016) ~30,000; Full SKA (2020) ~1,000,000.

8 Part 1: Real-Time Processing of Telescope Data 8

9 Rationale 2005: LOFAR needed supercomputer 2012: can GPUs do this work? 9

10 Blue Gene/P Algorithms on GPUs. The BG/P software is complex: several processing pipelines. Try the imaging pipeline on a GPU, computational kernels only; other pipelines + control software: later.

11 CUDA or OpenCL? OpenCL advantages: vendor independent; runtime compilation: easier programming (parameters become constants), allowing declarations such as float2 samples[nr_stations][nr_channels][nr_times][nr_polarizations];. OpenCL disadvantages: less mature (e.g., poor support for FFTs); cannot use all GPU features. Decision: go for OpenCL.
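To make the runtime-compilation point concrete, here is a minimal sketch of the idea in CUDA (the talk uses OpenCL, where clBuildProgram plays the same role): observation parameters are baked in as compile-time constants, e.g. by passing -DNR_STATIONS=77 and friends to NVRTC or nvcc, so a kernel can index fixed-size multi-dimensional arrays directly. The macro names, fallback values, and the scaleSamples kernel are illustrative assumptions, not the ASTRON code.

```cuda
// Parameters become compile-time constants via runtime compilation, e.g.
//   -DNR_STATIONS=77 -DNR_CHANNELS=256 -DNR_TIMES=768 -DNR_POLARIZATIONS=2
// Fallback values below only let this sketch compile stand-alone.
#ifndef NR_STATIONS
#define NR_STATIONS      77
#define NR_CHANNELS      256
#define NR_TIMES         768
#define NR_POLARIZATIONS 2
#endif

// Fixed-size multi-dimensional type: indexing needs no explicit stride math.
typedef float2 Samples[NR_STATIONS][NR_CHANNELS][NR_TIMES][NR_POLARIZATIONS];

// Illustrative kernel: scale every sample of one (station, channel) pair.
// Launch with gridDim = (NR_CHANNELS, NR_STATIONS), blockDim = NR_TIMES.
__global__ void scaleSamples(Samples samples, float weight)
{
  unsigned station = blockIdx.y;
  unsigned channel = blockIdx.x;
  unsigned time    = threadIdx.x;

  for (unsigned pol = 0; pol < NR_POLARIZATIONS; pol++) {
    samples[station][channel][time][pol].x *= weight;
    samples[station][channel][time][pol].y *= weight;
  }
}
```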

12 Poly-Phase Filter (PPF) bank: splits a frequency band into channels, like a prism; trades time resolution for frequency resolution.

13 Poly-Phase Filter (PPF) bank FIR filter + FFT 13

14 1) Finite Impulse Response (FIR) Filter: history & weights kept in registers, no physical shift; many FMAs; operational intensity = 32 ops / 5 bytes.
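A minimal FIR sketch (not the production kernel): one thread filters one channel/polarization stream. The 16-tap history and the weights live in registers because all indices become compile-time constants after unrolling; nothing is physically shifted, the write position simply rotates. NR_TAPS, the data layout, and nrSamples being a multiple of NR_TAPS are assumptions made for this sketch.

```cuda
#define NR_TAPS 16

// Launch with enough threads so that every stream (channel x polarization) gets one.
__global__ void firFilter(const float *__restrict__ in,      // [stream][nrSamples]
                          float       *__restrict__ out,     // [stream][nrSamples]
                          const float *__restrict__ weights, // [NR_TAPS]
                          unsigned nrStreams, unsigned nrSamples)
{
  unsigned stream = blockIdx.x * blockDim.x + threadIdx.x;

  if (stream >= nrStreams)
    return;

  const float *streamIn  = in  + (size_t) stream * nrSamples;
  float       *streamOut = out + (size_t) stream * nrSamples;

  float w[NR_TAPS], history[NR_TAPS] = { 0 };

#pragma unroll
  for (unsigned tap = 0; tap < NR_TAPS; tap++)   // weights into registers
    w[tap] = weights[tap];

  for (unsigned t = 0; t < nrSamples; t += NR_TAPS) {
#pragma unroll
    for (unsigned i = 0; i < NR_TAPS; i++) {
      history[i] = streamIn[t + i];              // overwrite the oldest sample: no shift

      float sum = 0;
#pragma unroll
      for (unsigned tap = 0; tap < NR_TAPS; tap++)      // pure FMA chain
        sum = fmaf(w[tap], history[(i + NR_TAPS - tap) % NR_TAPS], sum);

      streamOut[t + i] = sum;
    }
  }
}
```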

15 Performance Measurements. Maximum foreseen LOFAR load: 77 stations, … kHz, dual pol, 2x8 bits/sample, 240 Gb/s. GPUs tested: GTX 580, GTX 680, HD 6970, HD 7970; would need Tesla quality for real use.

16 FIR Filter Performance GTX 580 performs best restricted by memory bandwidth 16

17 2) FFT: 1D complex-to-complex, 256 points; tweaked Apple FFT library; 64 work items: 1 FFT; 256 work items: 4 FFTs.

18 FFT Performance: N=256, tweaked library; FLOP count assumes 5 n log(n).
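For reference, the throughput numbers presumably follow the customary 5 N log2(N) FLOP-count convention for a complex-to-complex FFT of size N; a trivial host-side helper (names illustrative):

```cuda
#include <math.h>

// GFLOP/s under the 5 N log2(N) convention; ~10,240 FLOPs per 256-point FFT.
static double fftGflops(unsigned n, unsigned long long nrFFTs, double seconds)
{
  double flopsPerFFT = 5.0 * n * log2((double) n);
  return nrFFTs * flopsPerFFT / seconds * 1e-9;
}
```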

19 Clock Correction corrects cable length errors merge with next step (phase delay) 19

20 3) Delay Compensation (a.k.a. Tracking): track the observed source by delaying the telescope data; the delay changes due to earth rotation; shift samples, and rotate the phase for the remainder (= complex multiplication); 18 FLOPs / 32 bytes.

21 4) BandPass Correction: powers in the channels are unequal, an artifact of station processing; multiply by channel-dependent weights; 1 FLOP / 8 bytes.

22 Transpose: reorder data for the next step (the correlator), through local memory; see talk S…

23 Combined Kernel: combine delay compensation, bandpass correction, and the transpose; reduces global memory accesses; 18 FLOPs / 32 bytes.
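A hedged sketch of what the combined kernel fuses (not the ASTRON code): one pass over global memory that applies the residual delay as a phase rotation, multiplies by the channel's band-pass weight, and stores the sample in the transposed order the correlator wants. The real kernel stages the transpose through local memory for coalescing and derives the phase increments from the delays; the layout and parameter names here are assumptions.

```cuda
__global__ void delayBandPassTranspose(
    const float2 *__restrict__ in,        // [station][channel][time]
    float2       *__restrict__ out,       // [channel][station][time] (transposed)
    const float  *__restrict__ bandPass,  // [channel]
    const float  *__restrict__ phi0,      // [station][channel]: phase at first sample
    const float  *__restrict__ dPhi,      // [station][channel]: phase step per sample
    unsigned nrStations, unsigned nrChannels, unsigned nrTimes)
{
  unsigned time    = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned channel = blockIdx.y;
  unsigned station = blockIdx.z;

  if (time >= nrTimes)
    return;

  float2 sample = in[(station * nrChannels + channel) * nrTimes + time];

  // delay compensation remainder: rotate the phase (one complex multiplication)
  float phi = phi0[station * nrChannels + channel] +
              time * dPhi[station * nrChannels + channel];
  float s, c;
  __sincosf(phi, &s, &c);
  float2 rotated = make_float2(sample.x * c - sample.y * s,
                               sample.x * s + sample.y * c);

  // band-pass correction: one multiplication by a channel-dependent weight
  float weight = bandPass[channel];
  rotated.x *= weight;
  rotated.y *= weight;

  // transpose: write in [channel][station][time] order (direct store for brevity;
  // the real kernel goes through shared/local memory)
  out[(channel * nrStations + station) * nrTimes + time] = rotated;
}
```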

24 Delay / Band Pass Performance poor operational intensity 156 GB/s! 24

25 5) Correlator see previous talk (S0347) multiply samples from each pair of stations integrate ~1s 25

26 Correlator Implementation: samples staged from global memory through local memory; 1 thread computes 2x2 stations (dual pol): 4 float4 loads, 64 FMAs, 32 accumulator registers per thread.
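A sketch of that 2x2-station tile, simplified (the real kernel stages samples through local memory, covers many tiles per work-group, and uses a triangular station layout): one thread accumulates 4 baselines x 4 polarization products = 16 complex visibilities in 32 registers; per time step it issues 4 float4 loads and 16 complex multiply-accumulates = 64 FMAs. The layout and helper names are illustrative assumptions.

```cuda
struct cvis { float2 xx, xy, yx, yy; };   // 4 complex polarization products

__device__ inline void cmac(float2 &acc, float2 a, float2 b)
{
  // acc += a * conj(b): 4 fused multiply-adds
  acc.x = fmaf(a.x, b.x, fmaf(a.y,  b.y, acc.x));
  acc.y = fmaf(a.y, b.x, fmaf(-a.x, b.y, acc.y));
}

// Launch: blockIdx.x/y select the 2x2 station tile; station indices assumed in range.
__global__ void correlate2x2(const float4 *__restrict__ samples,   // [time][station]: (Xre,Xim,Yre,Yim)
                             cvis *__restrict__ visibilities,      // [stationA][stationB], simplified
                             unsigned nrStations, unsigned nrTimes)
{
  unsigned statA = 2 * blockIdx.x;
  unsigned statB = 2 * blockIdx.y;

  cvis acc[2][2] = {};                    // 16 complex = 32 accumulator registers

  for (unsigned t = 0; t < nrTimes; t++) {
    float4 a[2], b[2];                    // 4 float4 loads per time step
    a[0] = samples[t * nrStations + statA];
    a[1] = samples[t * nrStations + statA + 1];
    b[0] = samples[t * nrStations + statB];
    b[1] = samples[t * nrStations + statB + 1];

#pragma unroll
    for (unsigned i = 0; i < 2; i++)
#pragma unroll
      for (unsigned j = 0; j < 2; j++) {  // 16 complex MACs = 64 FMAs
        cmac(acc[i][j].xx, make_float2(a[i].x, a[i].y), make_float2(b[j].x, b[j].y));
        cmac(acc[i][j].xy, make_float2(a[i].x, a[i].y), make_float2(b[j].z, b[j].w));
        cmac(acc[i][j].yx, make_float2(a[i].z, a[i].w), make_float2(b[j].x, b[j].y));
        cmac(acc[i][j].yy, make_float2(a[i].z, a[i].w), make_float2(b[j].z, b[j].w));
      }
  }

  for (unsigned i = 0; i < 2; i++)        // store the 4 baselines of this tile
    for (unsigned j = 0; j < 2; j++)
      visibilities[(statA + i) * nrStations + (statB + j)] = acc[i][j];
}
```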

27 Correlator #Threads. (Chart: required #threads vs. #stations for the GTX and HD cards.) At 77 stations the HD 6970 / HD 7970 exceed their maximum #threads and need multiple passes!

28 Correlator Performance: HD 7970 needs multiple passes; register usage limits occupancy.

29 Combined Pipeline: full pipeline; 2 host threads, each with its own queue and its own buffers; overlap I/O & computations; easy model! (Diagram: host-to-device copy, FIR, FFT, Delay&BandPass, Correlate, device-to-host copy, interleaved between the two queues.)
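The overlap idea, sketched here with two CUDA streams standing in for the talk's two OpenCL command queues (and a single host thread alternating between them instead of two host threads): each stream owns its own pinned host buffers and device buffers, so the input transfer, the kernels, and the output transfer of consecutive data blocks overlap. The dummy kernel stands in for FIR + FFT + Delay/BandPass + Correlate; all names and sizes are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstring>

__global__ void pipelineKernels(const float2 *in, float2 *out, size_t n)
{
  size_t i = blockIdx.x * (size_t) blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = in[i];          // stand-in for FIR, FFT, delay/band-pass, correlator
}

int main()
{
  const size_t   n = 1 << 20, bytes = n * sizeof(float2);
  const unsigned nrBlocks = 16;

  cudaStream_t stream[2];
  float2 *hostIn[2], *hostOut[2], *devIn[2], *devOut[2];

  for (int s = 0; s < 2; s++) {
    cudaStreamCreate(&stream[s]);
    cudaMallocHost((void **) &hostIn[s], bytes);   // pinned: required for async copies
    cudaMallocHost((void **) &hostOut[s], bytes);
    cudaMalloc((void **) &devIn[s], bytes);
    cudaMalloc((void **) &devOut[s], bytes);
  }

  for (unsigned block = 0; block < nrBlocks; block++) {
    int s = block % 2;                             // alternate between the two streams
    cudaStreamSynchronize(stream[s]);              // previous block in this stream done
    memset(hostIn[s], 0, bytes);                   // stand-in for receiving station data

    cudaMemcpyAsync(devIn[s], hostIn[s], bytes, cudaMemcpyHostToDevice, stream[s]);
    pipelineKernels<<<(unsigned) ((n + 255) / 256), 256, 0, stream[s]>>>(devIn[s], devOut[s], n);
    cudaMemcpyAsync(hostOut[s], devOut[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
  }

  cudaDeviceSynchronize();                         // (cleanup omitted for brevity)
  return 0;
}
```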

30 Overall Performance Imaging Pipeline: #GPUs needed for LOFAR; GTX 680 (marginally) fastest, ~13 GPUs; HD 7970 a real improvement over the HD 6970.

31 Performance Breakdown GTX 580 dominated by correlator correlator: compute bound others: memory I/O bound PCIe I/O overlapped 31

32 Performance Breakdown GTX 680: ~20% faster than the GTX 580.

33 Performance Breakdown HD 7970: multiple correlator passes visible; poor I/O overlap.

34 Performance Breakdown HD 6970: …x slower.

35 Are GPUs Efficient? % of FPU peak performance:
FIR filter: GTX 680 ~21%, Blue Gene/P 85%
FFT: GTX 680 ~17%, Blue Gene/P 44%
Delay / BandPass: GTX 680 ~2.6%, Blue Gene/P 26%
Correlator: GTX 680 ~35%, Blue Gene/P 96%
Blue Gene/P has a better compute-I/O balance & an integrated network; still, a few tens of GPUs are as powerful as 2 BG/P racks.

36 Feasible? Imaging pipeline: ~13 GTX 680s (~8 Tesla K10). Still open: RFI detection? other pipelines? the 240 Gb/s transpose over FDR InfiniBand.

37 Future Optimizations combine more kernels fewer passes over global memory FFT: difficult invoke FFT from GPU kernel, not CPU 37

38 Conclusions Part 1 OpenCL ok FFT support = minimal GTX 680 (Kepler) marginally faster than HD 7970 (GCN) <35% of FPU peak: memory I/O bottleneck heavy use of FMA instructions LOFAR imaging pipeline on GPUs = feasible 38

39 Part 2: Creation of Sky Images 39

40 Context: after the observation: remove RFI, calibrate, create a sky image; the calibration/imaging loop is possibly repeated.

41 Creating a Sky Image: convolve the correlations and add them to a grid; a 2D FFT of the grid yields the sky image.

42 Gridding: for all correlations, convolve the correlation (conv. matrix ~100x100) and add it to the grid (~4096x4096).
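For orientation, the operation in its naive form (plain host code, no GPU yet): every correlation (visibility) is multiplied by a ~100x100 convolution matrix and added into a ~4096x4096 complex grid around its (u,v) position. Types, names and sizes are illustrative.

```cuda
#include <vector>
#include <cuComplex.h>

struct Vis { float u, v; cuFloatComplex value; };

void gridNaive(std::vector<cuFloatComplex> &grid, unsigned gridSize,   // gridSize x gridSize
               const cuFloatComplex *conv, unsigned convSize,          // convSize x convSize
               const std::vector<Vis> &visibilities)
{
  for (const Vis &vis : visibilities) {
    unsigned u0 = (unsigned) vis.u, v0 = (unsigned) vis.v;             // grid corner

    for (unsigned y = 0; y < convSize; y++)
      for (unsigned x = 0; x < convSize; x++) {
        cuFloatComplex add = cuCmulf(vis.value, conv[y * convSize + x]);
        cuFloatComplex &g  = grid[(v0 + y) * gridSize + (u0 + x)];
        g = cuCaddf(g, add);   // the expensive part: a read-modify-write per grid point
      }
  }
}
```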

43 Two Problems: 1. lots of FLOPS; 2. adding to (grid) memory: slow!

44 Two Solutions: 1. lots of FLOPS: use GPUs; 2. adding to memory is slow: avoid it.

45 This Is A Hard Problem. Literature: 4 other GPU gridders; performance estimated on a GTX 680, compensated for faster hardware (bandwidth difference + 50%). (Chart: GFLOPS / giga-pixel-updates-per-second vs. conv. matrix size, 16x16 to 128x128.)

46 This Is A Hard Problem. The four gridders: 1) MWA (Edgar et al. [CPC'11]): searches correlations; 2) Cell BE (Varbanescu [PhD '10]): uses the local store; 3) van Amesfoort et al. [CF'09]: adds directly to the grid in memory; 4) Humphreys & Cornwell [SKA memo 132, '11]: private grid per block, very small grids. (Same chart as before.)

47 This Is A Hard Problem: ~3% of FPU peak performance! SKA: exascale. (Same chart.)

48 W-Projection Gridding: each correlation has associated (u,v,w) coordinates; (u,v) are not exact grid points; use different convolution matrices, depending on frac(u), frac(v), and w, and choose the most appropriate one; the matrix is placed on the grid at (int(u), int(v)).
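How the "most appropriate" convolution matrix might be selected (a sketch; the 8x oversampling matches the test setup later in the talk, but the layout of the matrix set and the function name are assumptions): int(u), int(v) place the matrix on the grid, frac(u), frac(v) pick one of the sub-pixel-shifted variants, and w picks the w-plane.

```cuda
#include <math.h>

#define OVERSAMPLING 8   // sub-pixel-shifted variants per axis (8x in the test setup)

__host__ __device__ inline unsigned convMatrixIndex(float u, float v, unsigned wPlane)
{
  unsigned oversampU = (unsigned) ((u - floorf(u)) * OVERSAMPLING);  // 0 .. OVERSAMPLING-1
  unsigned oversampV = (unsigned) ((v - floorf(v)) * OVERSAMPLING);

  // index into a [wPlane][oversampV][oversampU] set of convolution matrices
  return (wPlane * OVERSAMPLING + oversampV) * OVERSAMPLING + oversampU;
}
```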

49 Where Is The Data? Grid (~4096x4096): device memory; conv. matrices (~100x100): texture; correlations + (u,v,w) coords: shared (local) memory.

50 Placement & Movement: per baseline, the (u,v,w) coordinates change only slowly over time and frequency, so the conv. matrix moves slowly over the grid: grid locality.

51 Use Locality: reduce the #memory accesses; X: one thread; accumulate additions in a register until the conv. matrix slides off.

52 But How??? 1 thread / grid point: which correlations contribute? severe load imbalance.

53 An Unintuitive Approach: divide the grid into conceptual blocks of conv. matrix size.

54 An Unintuitive Approach: 1 thread monitors all X (the same position in every block); at any time, exactly 1 X is covered by the conv. matrix!!!

55 An Unintuitive Approach: the thread computes its current grid point X and the corresponding conv. matrix entry.

56 An Unintuitive Approach: the (u,v) coords change.

57 An Unintuitive Approach: the (u,v) coords change more.

58 An Unintuitive Approach: (atomically) add the accumulated data to the grid when switching to another X.

59 An Unintuitive Approach: #threads = block size; if that is too many threads, do it in parts.

60 (Dis)Advantages: some overhead, but < 1% of grid-point updates go to memory.
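A hedged sketch of the work distribution described on these slides (see the ICS'12 paper for the real implementation): each thread owns one offset (i, j) inside the conv.-matrix footprint and therefore, per visibility, exactly one grid point with x congruent to i and y congruent to j modulo the conv. size. Because (u,v) moves slowly, that grid point rarely changes, so contributions accumulate in a register and only reach device memory, via atomicAdd, when the grid point switches. One block per baseline; CONV_SIZE, the data layout, and the handling of oversampling are simplified assumptions.

```cuda
#define CONV_SIZE 32   // illustrative; blockDim.x = CONV_SIZE * CONV_SIZE = 1024 threads

struct Vis { float u, v; unsigned wPlane; float2 value; };

// Launch: one block per baseline, CONV_SIZE*CONV_SIZE threads per block.
__global__ void gridRomein(float2 *grid, unsigned gridSize,
                           const float2 *__restrict__ conv,   // [wPlane][CONV_SIZE][CONV_SIZE]
                           const Vis *__restrict__ vis, unsigned nrVisPerBaseline)
{
  unsigned i = threadIdx.x % CONV_SIZE;                  // my offset within the footprint
  unsigned j = threadIdx.x / CONV_SIZE;
  const Vis *myVis = vis + blockIdx.x * nrVisPerBaseline;

  float2   sum  = make_float2(0, 0);                     // register accumulator
  unsigned curX = 0, curY = 0;                           // grid point the sum belongs to
  bool     any  = false;

  for (unsigned t = 0; t < nrVisPerBaseline; t++) {
    Vis v = myVis[t];
    unsigned u0 = (unsigned) v.u, v0 = (unsigned) v.v;

    // the unique grid point in [u0, u0+CONV_SIZE) x [v0, v0+CONV_SIZE)
    // whose coordinates are congruent to (i, j) mod CONV_SIZE
    unsigned x = u0 + (i + CONV_SIZE - u0 % CONV_SIZE) % CONV_SIZE;
    unsigned y = v0 + (j + CONV_SIZE - v0 % CONV_SIZE) % CONV_SIZE;

    if (any && (x != curX || y != curY)) {               // switched to another grid point:
      atomicAdd(&grid[curY * gridSize + curX].x, sum.x); // flush the register sum
      atomicAdd(&grid[curY * gridSize + curX].y, sum.y);
      sum = make_float2(0, 0);
    }
    curX = x, curY = y, any = true;

    float2 w = conv[(v.wPlane * CONV_SIZE + (y - v0)) * CONV_SIZE + (x - u0)];
    sum.x += v.value.x * w.x - v.value.y * w.y;          // complex multiply-accumulate
    sum.y += v.value.x * w.y + v.value.y * w.x;
  }

  if (any) {                                             // flush the final accumulation
    atomicAdd(&grid[curY * gridSize + curX].x, sum.x);
    atomicAdd(&grid[curY * gridSize + curX].y, sum.y);
  }
}
```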

61 Performance Measurements 61

62 Performance Tests Setup: #stations: …; #channels: …; integration time: … s; observation time: 6 h; conv. matrix size: up to 256x256; oversampling: 8x; #W-planes: …; grid size: 2048x2048; (u,v,w) from a real LOFAR observation (6 hours).

63 GTX 680 Performance (CUDA): up to 25% of FPU peak; overhead: index computations; most additions stay in registers, only 0.23%-0.55% of the grid-point updates reach memory; still, the atomic add = 26% of the total run time! occupancy: …; texture hit rate: >0.872. (Chart: GFLOPS / giga-pixel-updates-per-second vs. conv. matrix size.)

64 GTX 680 Performance (OpenCL): OpenCL is slower than CUDA; no atomic FP add! use atomic cmpxchg; OpenCL V1.1 has no 1D images (added in V1.2), and a 2D image is slower. (Chart: GFLOPS vs. conv. matrix size, GTX 680 CUDA vs. OpenCL.)
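What "use atomic cmpxchg" amounts to: emulating a floating-point atomic add with an integer compare-and-swap loop. Shown here as the classic CUDA pattern (where a hardware atomicAdd on float also exists); the OpenCL 1.1 version is the same idea with atomic_cmpxchg() and as_int()/as_float() reinterpretation.

```cuda
__device__ inline float atomicAddViaCAS(float *addr, float value)
{
  int *iaddr = (int *) addr;
  int  old   = *iaddr, assumed;

  do {
    assumed = old;
    // reinterpret, add, and try to swap the result in
    old = atomicCAS(iaddr, assumed, __float_as_int(__int_as_float(assumed) + value));
  } while (old != assumed);   // retry if another thread updated the word meanwhile

  return __int_as_float(old);
}
```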

65 HD 7970 Performance (OpenCL): at medium & large conv. sizes it outperforms the GTX 680 by ~25% (more bandwidth, FPU, power); at small conv. sizes: poor computation-I/O overlap; remedy: map host memory into the device. (Chart: GFLOPS vs. conv. matrix size, GTX 680 CUDA/OpenCL vs. HD 7970.)

66 2 x Xeon E Performance (C++/AVX): C++ & AVX vector intrinsics; adds directly to the grid, relying on the L1 cache; works well on the CPU, but GPUs have insufficient cache for this; 48-79% of FPU peak. (Chart: GFLOPS vs. conv. matrix size, GPUs vs. 2 x Xeon.)

67 Multi-GPU Scaling: eight Nvidia GTX 580s, up to 131,072 threads! scales well. (Chart: GFLOPS vs. nr. of GPUs for 256x256, 64x64, and 16x16 conv. matrices.)

68 Green Computing: up to 1.94 GFLOP/W (with previous-generation hardware!). (Charts: power efficiency (GFLOP/W) and power consumption (kW) vs. nr. of GPUs, for 256x256, 64x64, and 16x16 conv. matrices.)

69 Compared To Other GPU Gridders: the new method is ~10x faster than 1) MWA (Edgar et al. [CPC'11]), 2) Cell BE (Varbanescu [PhD '10]), 3) van Amesfoort et al. [CF'09], and 4) Humphreys & Cornwell [SKA memo 132, '11]. (Chart: GFLOPS / giga-pixel-updates-per-second vs. conv. matrix size.)

70 See Also An Efficient Work-Distribution Strategy for Gridding Radio-Telescope Data on GPUs, John W. Romein, ACM International Conference on Supercomputing (ICS'12), June 25-29, 2012, Venice, Italy 70

71 Future Work LOFAR gridder combine with A-projection time-dependent conv. function compute on GPU 71

72 Conclusions Part 2 efficient GPU gridding algorithm minimizes memory accesses OpenCL lacks atomic floating-point add ~10x faster than other gridders scales well on 8 GPUs energy efficient 72
