OSKAR: Simulating data from the SKA

Oxford e-Research Centre, 4 June 2014
Fred Dulwich, Ben Mort, Stef Salvini

Overview
Simulating interferometer data for SKA:
- Radio interferometry basics.
- Measurement equation basics.
- Structure of OSKAR.
- Experiences moving from Fermi to Kepler GPU architecture.
- Some recent simulation results.

Radio interferometry
- VLA (1973-1980).
- One-Mile Telescope (1964): first to use Earth-rotation aperture synthesis.

Comparison with optical system
Traditional optical telescope records image of the sky formed by lens (or mirror).
[Diagram: sky, EM radiation from the sky, lens, image plane of lens.]

Comparison with optical system
A radio interferometer samples the wavefront in the Fourier domain: image formation is done electronically.
[Diagram: sky, EM radiation from the sky, array of detectors, processing, image formed by FT.]
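
The statement that the image is "formed by FT" can be written out as a worked relation. This is a hedged sketch in the small-field approximation, ignoring instrumental effects. The symbols are standard notation but are assumptions here, since the slides do not define them: $(u, v)$ are baseline coordinates in wavelengths, $(l, m)$ are direction cosines on the sky, and $I(l, m)$ is the sky brightness.

```latex
V(u, v) = \iint I(l, m)\, e^{-2\pi i (u l + v m)}\, dl\, dm
```

Each pair of detectors samples $V$ at one $(u, v)$ point, and an image estimate follows from an inverse Fourier transform of the sampled visibilities.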

Aperture arrays as stations
Omni-directional antennas measure voltage signals from whole sky. Spatial filtering (electronic beamforming) to isolate a direction of interest.
Advantages:
- Cost-effective at low frequency.
- No moving parts.
- Fast scanning.
- Multi-beaming capability.
Disadvantages:
- Sparse at high frequency.
- Relatively high sidelobe levels.
- Continually variable beam shape.
- Continually variable beam polarisation.
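
Electronic beamforming can be illustrated with a minimal sketch. It is not OSKAR code. It assumes a hypothetical 1D line of identical omnidirectional antennas and evaluates the normalised array response by direct summation of phase terms. The function name `array_factor`, the 0.75 m spacing and the 200 MHz frequency are all illustrative choices.

```python
import cmath
import math

C_M_PER_S = 299792458.0  # speed of light

def array_factor(positions_m, freq_hz, steer_deg, look_deg):
    """Normalised response of a 1D phased array.

    The array is steered electronically towards steer_deg (angle from
    zenith) and the response is evaluated in direction look_deg.
    """
    k = 2.0 * math.pi * freq_hz / C_M_PER_S  # wavenumber
    s_steer = math.sin(math.radians(steer_deg))
    s_look = math.sin(math.radians(look_deg))
    # Each antenna contributes a phase set by its position and the
    # difference between the look direction and the steering direction.
    total = sum(cmath.exp(1j * k * x * (s_look - s_steer)) for x in positions_m)
    return abs(total) / len(positions_m)

# Hypothetical station: 16 antennas on a line, 0.75 m apart, at 200 MHz.
positions = [0.75 * n for n in range(16)]
peak = array_factor(positions, 200e6, 20.0, 20.0)  # on the steered direction
off = array_factor(positions, 200e6, 20.0, 60.0)   # well away from it
print(peak, off)
```

The response is unity in the steered direction and much lower elsewhere, which is the spatial filtering described above. The non-zero response away from the beam is the sidelobe level listed among the disadvantages.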

Modelling Challenges (1)
Aperture arrays (AA) have complex beam patterns that have to be modelled across the whole sky.

Modelling Challenges (2)
- Science goals demand very high sensitivity.
- Require good understanding of instrumental characteristics.
- Need comprehensive models of sky and telescope.
- Very large instruments and sky models require HPC.
- Design of SKA not yet finalised: simulator has to be flexible.

Why simulate the SKA?
- Imaging performance depends strongly on how the detector elements are arranged.
- Aperture arrays have unique problems.
- Assess performance of evolving system design.
- Simulations can produce data challenges for pipeline developers.
- Ideas for SKA design have changed in recent years:
  - Few large stations (11200 elements per station).
  - Many small stations (256 elements per station).

Measurement Equation formalism
A radio interferometer makes measurements of radiation in the Fourier domain (visibilities) for the true sky after various corruption effects, for example:
- Sky rotation (parallactic angle).
- Ionosphere.
- Antenna pattern and shape of station beam.
The Hamaker-Bregman-Sault Measurement Equation of a radio interferometer can be used to simulate measured visibility data. Relies on concepts of:
- Source coherency matrix.
- Jones matrix.

Source coherency (brightness) matrix
Source coherency matrix encapsulates source properties. Stokes parameters I, Q, U and V completely describe average polarisation of radiation from a source. Coherency matrix defined as 2x2 complex quantity for each source, s. Using linear polarisation basis:
$$ B_s = \begin{bmatrix} I + Q & U + iV \\ U - iV & I - Q \end{bmatrix} $$
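
As a small illustration, the coherency matrix can be built from the four Stokes parameters. This sketch assumes the linear-basis convention with rows (I + Q, U + iV) and (U - iV, I - Q). The function name `coherency_matrix` and the example Stokes values are hypothetical.

```python
def coherency_matrix(stokes_i, stokes_q, stokes_u, stokes_v):
    """Return the 2x2 complex coherency matrix as nested lists.

    Assumed convention (linear polarisation basis):
        [[I + Q, U + iV],
         [U - iV, I - Q]]
    """
    return [
        [complex(stokes_i + stokes_q, 0.0), complex(stokes_u, stokes_v)],
        [complex(stokes_u, -stokes_v), complex(stokes_i - stokes_q, 0.0)],
    ]

# Hypothetical partially polarised source.
b = coherency_matrix(1.0, 0.2, 0.1, 0.05)
# Unpolarised source of unit flux: the identity matrix.
unpolarised = coherency_matrix(1.0, 0.0, 0.0, 0.0)
print(b)
print(unpolarised)
```

The matrix is Hermitian by construction, and its trace is twice Stokes I.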

Jones matrix
Describes some physical effect on the radiation, for a single source, s, at a single receiving station, i.
Jones matrix is another 2x2 complex quantity:
$$ J_{s,i} = \begin{bmatrix} a_1 + i a_2 & b_1 + i b_2 \\ c_1 + i c_2 & d_1 + i d_2 \end{bmatrix} $$
- Allows intermixing of polarisations.
- Allows modification of amplitude and phase of received electromagnetic wave.

Jones matrix and Measurement Equation
Gives modularity and makes complex simulations tractable:
- Jones matrices can be chained together.
- Allows us to separate different physical effects.
- Multiply matrices in order in which things actually happen:
$$ J_{s,i} = X_{s,i} \, Y_{s,i} \, Z_{s,i} $$
Visibility on baseline between stations i and j for all visible sources (s) is then:
$$ V_{i,j} = \sum_s J_{s,i} \, B_s \, J_{s,j}^H $$
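
The sum over sources can be sketched in a few lines of plain Python. This is a minimal illustration rather than the OSKAR implementation. It evaluates the product of the Jones matrix at station i, the source coherency matrix, and the Hermitian transpose of the Jones matrix at station j, summed over sources. The helper names (`matmul2`, `herm2`, `visibility`) and all numerical values are hypothetical.

```python
def matmul2(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def herm2(a):
    """Hermitian (conjugate) transpose of a 2x2 matrix."""
    return [[a[c][r].conjugate() for c in range(2)] for r in range(2)]

def visibility(jones_i, jones_j, brightness):
    """Sum J_i B J_j^H over sources; each argument is a list with one 2x2 matrix per source."""
    v = [[0j, 0j], [0j, 0j]]
    for j_i, j_j, b in zip(jones_i, jones_j, brightness):
        term = matmul2(matmul2(j_i, b), herm2(j_j))
        for r in range(2):
            for c in range(2):
                v[r][c] += term[r][c]
    return v

# Two hypothetical sources and two hypothetical Jones matrices.
IDENT = [[1 + 0j, 0j], [0j, 1 + 0j]]
GAIN = [[0.5 + 0.5j, 0.1j], [0j, 1 - 0.2j]]
B = [
    [[2 + 0j, 0j], [0j, 2 + 0j]],
    [[1.5 + 0j, 0.2 + 0.1j], [0.2 - 0.1j, 0.5 + 0j]],
]
v_ident = visibility([IDENT, IDENT], [IDENT, IDENT], B)  # no corruption
v_ij = visibility([GAIN, IDENT], [IDENT, GAIN], B)
v_ji = visibility([IDENT, GAIN], [GAIN, IDENT], B)
print(v_ident)
```

With identity Jones matrices the visibility reduces to the sum of the coherency matrices, and swapping the two stations gives the Hermitian transpose of the result.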

A pictorial Measurement Equation
[Diagram labels: B, R, Z, E, K.]
$$ V_{i,j} = \sum_s K_{s,i} \, E_{s,i} \, Z_{s,i} \, R_{s,i} \, B_s \, R_{s,j}^H \, Z_{s,j}^H \, E_{s,j}^H \, K_{s,j}^H $$

OSKAR overview (1)
- GPU-enabled software to produce simulated visibilities by direct evaluation of a measurement equation.
- Currently ~120000 lines of code, mostly C (some C++).
- Currently ~40 CUDA kernels/functions.
- Single or double precision computation available.
- Balance between highest performance and highest flexibility.
- Problem sizes vary hugely.
- Simulations need to run on many different systems.
- Minimize PCIe traffic:
  - Copy input sky and telescope models to GPU memory.
  - Intermediate data generated on the GPU and used without transfer to host.
  - Host keeps track of pointers to GPU memory.
  - Use GPU memory effectively as a giant cache.

OSKAR overview (2)
- Each source is independent with respect to all other sources.
- There are many sources in the sky: can trivially parallelise over sources.
- In general, each GPU thread works on one source.
- Easily guaranteed $10^4$ to $10^5$ threads for any given kernel launch.
Most expensive steps:
- Station beam evaluation, for all stations. Compute limited (DFT).
- Cross-correlation step (visibility evaluation per baseline). Bandwidth limited (Kepler); register limited (double precision, Fermi).

Jones matrix data structure
[Diagram: array of Jones matrices; axes: station i (slowest varying), source s (fastest varying).]

Jones matrix data structure
[Diagram: array of Jones matrices; axes: station i (slowest varying), source s (fastest varying).]
$$ J_{s,i} = \begin{bmatrix} a_1 + i a_2 & b_1 + i b_2 \\ c_1 + i c_2 & d_1 + i d_2 \end{bmatrix} $$
OSKAR functions calculate each Jones matrix for each source at each station in GPU memory (used as "scratchpad").
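
The axis ordering can be made concrete with a short indexing sketch. It assumes a flat array in which each Jones matrix is stored as four consecutive complex elements, i.e. eight scalars. The slides give only the axis order, so this layout, the function names and the example sizes are illustrative assumptions.

```python
VALUES_PER_MATRIX = 8  # 4 complex elements, each a (real, imaginary) pair

def jones_index(station, source, num_sources):
    """Index of the Jones matrix for (station, source) in a flat array.

    Source is the fastest-varying dimension, station the slowest.
    """
    return station * num_sources + source

def element_offset(station, source, num_sources, element):
    """Offset of the real part of complex element 0..3 in a flat array of scalars."""
    return jones_index(station, source, num_sources) * VALUES_PER_MATRIX + 2 * element

num_sources = 5  # hypothetical sky model size
a = jones_index(2, 0, num_sources)
b = jones_index(2, 1, num_sources)
print(a, b)
```

With this ordering, matrices for consecutive sources at the same station are adjacent in memory. That suits a scheme in which neighbouring GPU threads work on neighbouring sources.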

Joining Jones matrices
[Diagram: arrays of Jones matrices X and Y multiplied cell by cell to give J; axes: station i (slowest varying), source s (fastest varying).]
$$ J_{s,i} = X_{s,i} \, Y_{s,i} $$
Trivially parallel: each thread does one colour.

Forming visibilities ("correlator")
[Diagram: Jones matrices for station i (1) and station j (3), and source brightness B (2), laid out along the source axis s (fastest varying).]
$$ V_{i,j} = \sum_s J_{s,i} \, B_s \, J_{s,j}^H $$
Exploits the fact that $(XY)^H = Y^H X^H$.
- Each thread block computes result for one baseline, or one correlation between two stations, for all sources.
- Each thread does a subset of sources.
- Accumulates partial sum into shared memory.
- Result of final accumulation into global memory.

Forming visibilities ("correlator")
[Diagram: Jones matrices for station i (1) and station j (3), and source brightness B (2), laid out along the source axis s (fastest varying).]
$$ V_{i,j} = \sum_s J_{s,i} \, B_s \, J_{s,j}^H $$
- Multiply together numbered cells.
- Accumulate results.
- One shared memory location per colour/thread (partial sum).
- Final step adds different colours, putting result into global memory.
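
The accumulation pattern for one baseline can be emulated serially. This is a hedged sketch, not the CUDA kernel. To keep it short it uses scalar complex values in place of 2x2 matrices, a Python list in place of shared memory, and an assumed strided assignment of sources to threads. The name `correlate_block` and all values are hypothetical.

```python
def correlate_block(jones_i, jones_j, brightness, num_threads=4):
    """Emulate one thread block computing the visibility for one baseline."""
    num_sources = len(brightness)
    shared = [0j] * num_threads  # one partial sum per emulated thread
    for t in range(num_threads):
        # Each emulated thread handles a strided subset of the sources.
        for s in range(t, num_sources, num_threads):
            shared[t] += jones_i[s] * brightness[s] * jones_j[s].conjugate()
    # Final accumulation of the partial sums (written to global memory on a GPU).
    return sum(shared)

# Hypothetical scalar Jones terms for two stations and five sources.
jones_i = [1 + 0j, 0.5 + 0.5j, 0.2 - 0.1j, 1j, 0.3 + 0j]
jones_j = [0.9 + 0.1j, 1 + 0j, -0.4j, 0.7 + 0.2j, 1j]
brightness = [1.0, 2.0, 0.5, 0.1, 3.0]

direct = sum(a * b * c.conjugate() for a, b, c in zip(jones_i, brightness, jones_j))
blocked = correlate_block(jones_i, jones_j, brightness, num_threads=2)
print(direct, blocked)
```

The blocked result matches the direct sum up to floating-point rounding, whatever the number of emulated threads.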

Forming visibilities ("correlator")
[Diagram: Jones matrices for station i (1) and station j (3), and source brightness B (2), laid out along the source axis s (fastest varying).]
$$ V_{i,j} = \sum_s J_{s,i} \, B_s \, J_{s,j}^H $$
Next thread block does same again for another station pair.
Why not just use some matrix math library?

Forming visibilities ("correlator")
[Diagram: Jones matrices for station i (1) and station j (3), and source brightness B (2), laid out along the source axis s (fastest varying).]
$$ V_{i,j} = \sum_s J_{s,i} \, B_s \, J_{s,j}^H $$
Not quite the whole story... Non-separable baseline-dependent effects, $f(s, i, j)$, must be modelled here too:
- Smearing terms.
- Extended sources.

Fermi to Kepler
Correlate kernel (on compute 3.5 architecture, using CUDA 5.5):
- 43 registers (single precision).
- 68 registers (double precision).
Must load from global memory:
- Stokes parameters (4 values per source).
- Direction cosines (3 values per source).
- Extended source parameters (3 values per source).
- Station coordinates (8 values per thread block).
- Jones matrices (2 x 8 values per source).
Computes rotation matrix, two sinc functions, one exponential, three vector products, and two Jones complex matrix products.
Not very operationally dense, but lots of data to store in registers.
Global memory load is bandwidth heavy! ($N^2$ reads for $N$ stations.) Current baseline design makes this worse: 1024 stations!
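
The scaling with station count can be put into numbers with a short calculation. It uses the per-source and per-block value counts listed above. It also assumes one thread block per baseline, cross-correlations only, and no reuse of loaded data between blocks. The last two are simplifying assumptions, and the function names are hypothetical.

```python
# Values loaded per source: Stokes (4) + direction cosines (3)
# + extended source parameters (3) + two Jones matrices (2 x 8).
VALUES_PER_SOURCE = 4 + 3 + 3 + 2 * 8
VALUES_PER_BLOCK = 8  # station coordinates, loaded once per thread block

def num_baselines(num_stations):
    """Number of distinct station pairs (cross-correlations only)."""
    return num_stations * (num_stations - 1) // 2

def values_loaded(num_stations, num_sources):
    """Total values read from global memory, assuming no reuse between blocks."""
    per_block = VALUES_PER_SOURCE * num_sources + VALUES_PER_BLOCK
    return num_baselines(num_stations) * per_block

baselines_1024 = num_baselines(1024)
print(VALUES_PER_SOURCE, baselines_1024)
```

For 1024 stations this gives 523776 baselines, each reading 26 values per source. The traffic therefore grows roughly as the square of the station count.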

Fermi to Kepler
Expecting big performance gains from reduced register pressure.

                  Kernel time       Simulation time  Simulation time
Precision         double            double           single
M2090 (Emerald)   9.44 s (ECC off)  1125 s (ECC on)  197 s (ECC on)
K20c (Ruby)       5.02 s            516 s            231 s
Speedup           1.9x              2.2x             0.85x (?)

Inside Kepler K20 family (slide from NVIDIA GTC 2012)

Inside Kepler K20 family (slide from NVIDIA GTC 2012)
L1 cache in Kepler no longer used for global memory loads! Profiler showed that performance was limited by bandwidth to L2 cache.

Jones matrix data structure
[Diagram: array of Jones matrices; axes: station i (slowest varying), source s (fastest varying); each complex element is a float2.]
$$ J_{s,i} = \begin{bmatrix} a_1 + i a_2 & b_1 + i b_2 \\ c_1 + i c_2 & d_1 + i d_2 \end{bmatrix} $$
Using const __restrict__ not enough! Data structure too complex for compiler to optimize load from global memory.
Needed four explicit __ldg(float2) or __ldg(double2) instructions to make use of Kepler's read-only data cache.

Fermi to Kepler
Expecting big performance gains from reduced register pressure.
Profiler showing >150 GB/s global memory bandwidth on K20c (theoretical max 208 GB/s).

                  Kernel time       Simulation time  Simulation time  Simulation time  Simulation time
Precision         double            double           single           double           single
M2090 (Emerald)   9.44 s (ECC off)  1125 s (ECC on)  197 s (ECC on)   1125 s (ECC on)  197 s (ECC on)
K20c (Ruby)       5.02 s            516 s            231 s            292 s            124 s
Speedup           1.9x              2.2x             0.85x (?)        3.9x             1.6x
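
The speedup row follows directly from the timings, as this short check shows. The numbers are copied from the table. The comments describing each column are descriptive labels added here, not text from the slide.

```python
# (M2090 time [s], K20c time [s], speedup quoted on the slide)
timings = [
    (9.44, 5.02, 1.9),     # kernel time, double precision
    (1125.0, 516.0, 2.2),  # simulation time, double precision
    (197.0, 231.0, 0.85),  # simulation time, single precision
    (1125.0, 292.0, 3.9),  # simulation time, double precision (fourth column)
    (197.0, 124.0, 1.6),   # simulation time, single precision (fifth column)
]

# Speedup is the M2090 time divided by the K20c time.
speedups = [m2090 / k20c for m2090, k20c, _ in timings]

for (m2090, k20c, quoted), s in zip(timings, speedups):
    print(f"{m2090:8.2f} s / {k20c:7.2f} s = {s:.2f}x (slide: {quoted}x)")
```

A ratio below one, as in the third column, means the K20c run was slower than the M2090 run.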

Example study: Modelling the impact of distant interfering sources
- AA have considerable sensitivity to sources outside primary beam.
- Strong function of frequency: can we image at 600 MHz?
- Understand impact of interfering sources on an AA snapshot observation.
- Metric called (far) side-lobe confusion noise.
- With AA beams the signal from sources outside the field of interest is non-zero.
- The power from these sources is spread into the main field through their PSF side-lobes.
- Both the PSF and beam are a function of frequency and time.
- Known as confusion noise: millions of point sources which cannot be individually corrected for.
- This is an important limit to the imaging performance of AAs.
[Diagram labels: region of interest, side lobes, interfering sources.]

AA telescope configuration
[Figure: two layout plots, x (East) [metres] against y (North) [metres]: 256 antennas (courtesy N. Razavi) and 693 stations (courtesy K. Grainge).]

AA station beams
[Figure: station beams at 100 MHz and 600 MHz.]

Sky model
The SKA will be more sensitive than any current telescope, so no all-sky models exist with enough sources.
Generate a 2M source sky model with the correct statistics extrapolated from the VLSS catalogue (~68k sources).
[Figure: log10 cumulative number count against log10 flux bin [Jy].]
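
One way to draw source fluxes with prescribed statistics is inverse-transform sampling from a power-law differential source count. The sketch below is an illustration only. The slope, flux limits and function name are hypothetical placeholders, not values fitted to VLSS, and the slides do not describe the method actually used.

```python
import random

def sample_fluxes(num, s_min, s_max, alpha, seed=0):
    """Draw fluxes from dN/dS proportional to S**(-alpha) on [s_min, s_max].

    Uses inverse-transform sampling; requires alpha != 1.
    """
    rng = random.Random(seed)
    p = 1.0 - alpha
    lo, hi = s_min ** p, s_max ** p
    # Invert the cumulative distribution for each uniform random draw.
    return [(lo + rng.random() * (hi - lo)) ** (1.0 / p) for _ in range(num)]

# Placeholder parameters: Euclidean slope 2.5, fluxes from 0.1 Jy to 100 Jy.
fluxes = sample_fluxes(10000, 0.1, 100.0, 2.5)
faint_fraction = sum(1 for f in fluxes if f < 1.0) / len(fluxes)
print(min(fluxes), max(fluxes), faint_fraction)
```

With a steep slope nearly all sources are faint. This is why a model reaching lower flux limits needs far more sources than an existing catalogue provides.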

Image of sidelobe confusion noise
[Figure: images of sidelobe confusion noise at 100 MHz and 600 MHz, in sky coordinates; scale labels: 15 deg and 2.5 deg.]

Interfering (FSC) snapshot noise as a function of frequency
[Figure: FSCN RMS [Jy/beam] on a logarithmic axis against frequency [MHz], with points at 100, 220, 350, 500 and 600 MHz.]

Summary
- Large scale SKA simulations are challenging.
- GPUs make them possible.
Simulations are vital to:
- Assess the evolving system design.
- Generate semi-realistic data products for tool-chain developers and for data flow testing.