OSKAR: Simulating data from the SKA


OSKAR: Simulating data from the SKA
Oxford e-Research Centre, 4 June 2014
Fred Dulwich, Ben Mort, Stef Salvini

Overview
Simulating interferometer data for the SKA:
- Radio interferometry basics.
- Measurement equation basics.
- Structure of OSKAR.
- Experiences moving from the Fermi to the Kepler GPU architecture.
- Some recent simulation results.

Radio interferometry
[Photos: the VLA (1973-1980), and the One-Mile Telescope (1964), the first to use Earth-rotation aperture synthesis.]

Comparison with an optical system
A traditional optical telescope records an image of the sky formed by a lens (or mirror).
[Diagram: EM radiation from the sky passes through a lens to the image plane of the lens.]

Comparison with an optical system
A radio interferometer samples the wavefront in the Fourier domain: image formation is done electronically.
[Diagram: EM radiation from the sky is received by an array of detectors; after processing, the image is formed by a Fourier transform.]

Aperture arrays as stations
Omni-directional antennas measure voltage signals from the whole sky. Spatial filtering (electronic beamforming) isolates a direction of interest.
Advantages:
- Cost-effective at low frequency
- No moving parts
- Fast scanning
- Multi-beaming capability
Disadvantages:
- Sparse at high frequency
- Relatively high sidelobe levels
- Continually variable beam shape
- Continually variable beam polarisation

Modelling Challenges (1)
Aperture arrays (AAs) have complex beam patterns that must be modelled across the whole sky.

Modelling Challenges (2)
- Science goals demand very high sensitivity, which requires a good understanding of the instrumental characteristics.
- We need comprehensive models of the sky and the telescope.
- Very large instruments and sky models require HPC.
- The design of the SKA is not yet finalised: the simulator has to be flexible.

Why simulate the SKA?
- Imaging performance depends strongly on how the detector elements are arranged, and aperture arrays have unique problems.
- To assess the performance of the evolving system design.
- Simulations can produce data challenges for pipeline developers.
- Ideas for the SKA design have changed in recent years: from few large stations (11200 elements per station) to many small stations (256 elements per station).

Measurement Equation formalism
A radio interferometer measures radiation in the Fourier domain (visibilities) for the true sky after various corruption effects, for example:
- Sky rotation (parallactic angle)
- Ionosphere
- Antenna pattern and the shape of the station beam
The Hamaker-Bregman-Sault Measurement Equation of a radio interferometer can be used to simulate measured visibility data. It relies on the concepts of:
- the source coherency matrix
- the Jones matrix

Source coherency (brightness) matrix
The source coherency matrix encapsulates the source properties. The Stokes parameters I, Q, U and V completely describe the average polarisation of radiation from a source. The coherency matrix is a 2x2 complex quantity defined for each source s. Using a linear polarisation basis:

B_s = \begin{pmatrix} I + Q & U - iV \\ U + iV & I - Q \end{pmatrix}
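As a minimal sketch (not OSKAR's actual code), the coherency matrix B_s can be built directly from the Stokes parameters; the function name below is illustrative:

```python
# Illustrative sketch: build the 2x2 source coherency (brightness) matrix
# B_s from Stokes parameters I, Q, U, V in a linear polarisation basis,
# as a nested list of Python complex numbers.

def brightness_matrix(I, Q, U, V):
    """Return B_s = [[I+Q, U-iV], [U+iV, I-Q]]."""
    return [[complex(I + Q, 0.0), complex(U, -V)],
            [complex(U, V), complex(I - Q, 0.0)]]

# An unpolarised 1 Jy source has Q = U = V = 0, so B_s is the identity (in Jy):
B = brightness_matrix(1.0, 0.0, 0.0, 0.0)
# B == [[(1+0j), 0j], [0j, (1+0j)]]
```

Note that B_s is Hermitian by construction: the off-diagonal terms U - iV and U + iV are complex conjugates.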

Jones matrix
A Jones matrix describes some physical effect on the radiation, for a single source s at a single receiving station i. It is another 2x2 complex quantity: it allows intermixing of the polarisations, and modification of the amplitude and phase of the received electromagnetic wave.

J_{s,i} = \begin{pmatrix} a_1 + i a_2 & b_1 + i b_2 \\ c_1 + i c_2 & d_1 + i d_2 \end{pmatrix}

Jones matrix and Measurement Equation
Jones matrices give modularity and make complex simulations tractable: they can be chained together, which allows us to separate different physical effects. Matrices are multiplied in the order in which the effects actually happen:

J_{s,i} = X_{s,i} Y_{s,i} Z_{s,i}

The visibility on the baseline between stations i and j, summed over all visible sources s, is then:

V_{i,j} = \sum_s J_{s,i} B_s J_{s,j}^H
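A minimal sketch of this chaining and the visibility sum, using nested 2x2 lists of complex numbers (illustrative only, not OSKAR's data structures):

```python
# Illustrative sketch: chain 2x2 Jones matrices per source/station and
# form a visibility V_ij = sum_s J_si B_s J_sj^H.

def matmul(A, B):
    """2x2 complex matrix product."""
    return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
            [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

def herm(A):
    """Hermitian (conjugate) transpose of a 2x2 matrix."""
    return [[A[0][0].conjugate(), A[1][0].conjugate()],
            [A[0][1].conjugate(), A[1][1].conjugate()]]

def chain(*jones):
    """Multiply Jones matrices in the order the effects happen."""
    out = [[1+0j, 0j], [0j, 1+0j]]   # identity
    for J in jones:
        out = matmul(out, J)
    return out

def visibility(J_i, J_j, B):
    """V_ij = sum over sources s of J_si B_s J_sj^H."""
    V = [[0j, 0j], [0j, 0j]]
    for Jsi, Jsj, Bs in zip(J_i, J_j, B):
        T = matmul(matmul(Jsi, Bs), herm(Jsj))
        for r in range(2):
            for c in range(2):
                V[r][c] += T[r][c]
    return V

# One unpolarised 1 Jy source, a gain of 2 at both stations:
I2 = [[1+0j, 0j], [0j, 1+0j]]
J = chain([[2+0j, 0j], [0j, 2+0j]], I2)
V = visibility([J], [J], [I2])
# Diagonal terms are |g|^2 = 4: V == [[(4+0j), 0j], [0j, (4+0j)]]
```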

A pictorial Measurement Equation

V_{i,j} = \sum_s K_{s,i} E_{s,i} Z_{s,i} R_{s,i} B_s R_{s,j}^H Z_{s,j}^H E_{s,j}^H K_{s,j}^H

[Diagram: each term B, R, Z, E, K shown as a 2x2 matrix block in the chain.]

OSKAR overview (1)
GPU-enabled software to produce simulated visibilities by direct evaluation of a measurement equation.
- Currently ~120,000 lines of code, mostly C (some C++).
- Currently ~40 CUDA kernels/functions.
- Single or double precision computation available.
- A balance between highest performance and highest flexibility: problem sizes vary hugely, and simulations need to run on many different systems.
- Minimize PCIe traffic: the input sky and telescope models are copied to GPU memory; intermediate data are generated on the GPU and used without transfer to the host; the host keeps track of pointers to GPU memory. GPU memory is effectively used as a giant cache.

OSKAR overview (2)
- Each source is independent of all other sources, and there are many sources in the sky, so we can trivially parallelise over sources. In general, each GPU thread works on one source; this easily guarantees 10^4 to 10^5 threads for any given kernel launch.
- The most expensive steps are:
  - Station beam evaluation, for all stations: compute limited (DFT).
  - The cross-correlation step (visibility evaluation per baseline): bandwidth limited (Kepler); register limited (double precision, Fermi).
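The compute-limited step, evaluating a station beam by direct DFT over the antenna elements, can be sketched as follows. This is a simplified scalar model; the element positions and wavelength are made-up values, and OSKAR's actual implementation is a CUDA kernel:

```python
# Illustrative sketch of a DFT beamformer: for each sky direction with
# direction cosines (l, m), sum the phased element contributions
#   E(l, m) = (1/N) * sum_a exp(i * 2*pi/lambda * (x_a*l + y_a*m)).

import cmath

def station_beam(xy, wavelength, directions):
    """xy: list of (x, y) element positions in metres;
    directions: list of (l, m) direction cosines. Returns complex gains."""
    n = len(xy)
    k = 2.0 * cmath.pi / wavelength
    beams = []
    for l, m in directions:
        acc = 0j
        for x, y in xy:
            acc += cmath.exp(1j * k * (x * l + y * m))
        beams.append(acc / n)
    return beams

# Toy 2-element "station" at a 3 m wavelength (100 MHz):
gains = station_beam([(0.0, 0.0), (1.5, 0.0)], 3.0, [(0.0, 0.0)])
# Towards the phase centre (l = m = 0) every element adds in phase: gain = 1.
```

The cost is O(elements x directions) per station, which is why this step is compute limited.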

Jones matrix data structure
[Diagram: a 2D array of Jones matrices, with station i as the slowest-varying dimension and source s as the fastest-varying dimension.]

Jones matrix data structure
OSKAR functions calculate each Jones matrix

J_{s,i} = \begin{pmatrix} a_1 + i a_2 & b_1 + i b_2 \\ c_1 + i c_2 & d_1 + i d_2 \end{pmatrix}

for each source at each station in GPU memory (used as a scratchpad). Station i is the slowest-varying dimension; source s is the fastest-varying.

Joining Jones matrices

J_{s,i} = X_{s,i} Y_{s,i}

[Diagram: element-wise product of two arrays of Jones matrices; station i slowest varying, source s fastest varying.]
Trivially parallel: each thread computes one product (one colour in the diagram).

Forming visibilities ("correlator")

V_{i,j} = \sum_s J_{s,i} B_s J_{s,j}^H

This exploits the fact that (XY)^H = Y^H X^H. Each thread block computes the result for one baseline (one correlation between two stations), for all sources. Each thread handles a subset of sources, accumulating a partial sum into shared memory; the final accumulation writes the result to global memory.

Forming visibilities ("correlator")

V_{i,j} = \sum_s J_{s,i} B_s J_{s,j}^H

[Diagram: multiply together the numbered cells of J_{s,i}, B_s and J_{s,j}^H and accumulate the results. One shared memory location per colour/thread holds a partial sum; the final step adds the different colours, putting the result into global memory.]

Forming visibilities ("correlator")

V_{i,j} = \sum_s J_{s,i} B_s J_{s,j}^H

The next thread block does the same again for another station pair. Why not just use some matrix maths library?

Forming visibilities ("correlator")
Not quite the whole story: non-separable, baseline-dependent effects must be modelled here too, via a per-source, per-baseline factor f(s,i,j) inside the sum:

V_{i,j} = \sum_s f(s,i,j) \, J_{s,i} B_s J_{s,j}^H

- Smearing terms
- Extended sources
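A minimal scalar (unpolarised) sketch of the correlator sum with such a factor. The sinc form used for f below is a simple stand-in for a bandwidth-smearing term, chosen purely for illustration; it is not the exact expression OSKAR uses:

```python
# Illustrative sketch: scalar correlator sum with a baseline-dependent
# smearing factor f(s,i,j). Here f = sinc(dnu * tau_s), a toy
# bandwidth-smearing model (an assumption for illustration).

import math

def sinc(x):
    """Normalised sinc: sin(pi*x)/(pi*x), with sinc(0) = 1."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)

def visibility_scalar(gains_i, gains_j, fluxes, delays, dnu):
    """V_ij = sum_s f(s) * g_si * I_s * conj(g_sj),
    with f(s) = sinc(dnu * delay_s)."""
    V = 0j
    for g_i, g_j, I, tau in zip(gains_i, gains_j, fluxes, delays):
        V += sinc(dnu * tau) * g_i * I * g_j.conjugate()
    return V

# Two 1 Jy sources with unit gains; the second is strongly smeared:
V = visibility_scalar([1+0j, 1+0j], [1+0j, 1+0j], [1.0, 1.0], [0.0, 1.0], 1.0)
# The un-smeared source contributes 1; sinc(1) ~ 0 suppresses the second.
```

Because f depends on the baseline (i, j) as well as the source s, the sum cannot be written as a single precomputed matrix product per station, which is one reason a stock matrix library does not fit here.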

Fermi to Kepler
Correlate kernel (on compute capability 3.5, using CUDA 5.5):
- 43 registers (single precision); 68 registers (double precision).
- Must load from global memory:
  - Stokes parameters (4 values per source)
  - Direction cosines (3 values per source)
  - Extended source parameters (3 values per source)
  - Station coordinates (8 values per thread block)
  - Jones matrices (2 x 8 values per source)
- Computes a rotation matrix, two sinc functions, one exponential, three vector products, and two complex Jones matrix products.
- Not very operationally dense, but lots of data to keep in registers. The global memory load is bandwidth heavy: N^2 reads for N stations. (The current baseline design makes this worse: 1024 stations!)

Fermi to Kepler
Expecting big performance gains from reduced register pressure.

                          Precision | M2090 (Emerald)  | K20c (Ruby) | Speedup
Kernel time               double    | 9.44 s (ECC off) | 5.02 s      | 1.9x
Simulation time           double    | 1125 s (ECC on)  | 516 s       | 2.2x
Simulation time           single    | 197 s (ECC on)   | 231 s       | 0.85x (?)

Inside the Kepler K20 family (slide from NVIDIA GTC 2012)

Inside the Kepler K20 family (slide from NVIDIA GTC 2012)
The L1 cache in Kepler is no longer used for global memory loads. The profiler showed that performance was limited by bandwidth to the L2 cache.

Jones matrix data structure
Each Jones matrix

J_{s,i} = \begin{pmatrix} a_1 + i a_2 & b_1 + i b_2 \\ c_1 + i c_2 & d_1 + i d_2 \end{pmatrix}

is stored as four float2 (or double2) values, with station i the slowest-varying dimension and source s the fastest-varying. Using const __restrict__ was not enough: the data structure is too complex for the compiler to optimize the load from global memory. Four explicit __ldg(float2) or __ldg(double2) instructions were needed to make use of Kepler's read-only data cache.

Fermi to Kepler
Expecting big performance gains from reduced register pressure. The profiler now shows >150 GB/s global memory bandwidth on the K20c (theoretical max 208 GB/s).

                                    Precision | M2090 (Emerald)  | K20c (Ruby) | Speedup
Kernel time                         double    | 9.44 s (ECC off) | 5.02 s      | 1.9x
Simulation time                     double    | 1125 s (ECC on)  | 516 s       | 2.2x
Simulation time                     single    | 197 s (ECC on)   | 231 s       | 0.85x (?)
Simulation time (with __ldg fix)    double    | 1125 s (ECC on)  | 292 s       | 3.9x
Simulation time (with __ldg fix)    single    | 197 s (ECC on)   | 124 s       | 1.6x

Example study: modelling the impact of distant interfering sources
Aperture arrays have considerable sensitivity to sources outside the primary beam, and this is a strong function of frequency: can we image at 600 MHz? We want to understand the impact of interfering sources on an AA snapshot observation, using a metric called (far) side-lobe confusion noise.
With AA beams, the signal from sources outside the field of interest is non-zero. The power from these sources is spread into the main field through their PSF side-lobes, and both the PSF and the beam are functions of frequency and time. This is known as confusion noise: millions of point sources which cannot be individually corrected for. It is an important limit to the imaging performance of AAs.
[Diagram: region of interest, beam side lobes, and interfering sources.]

AA telescope configuration
[Figure: layout of 693 stations (courtesy K. Grainge), x (East) vs y (North) in metres, spanning roughly +/-800 m; and the layout of one station of 256 antennas (courtesy N. Razavi), spanning roughly +/-20 m.]

AA station beams
[Figure: station beam patterns at 100 MHz and 600 MHz.]

Sky model
The SKA will be more sensitive than any current telescope, so no all-sky model exists with enough sources. We generate a 2-million-source sky model with the correct statistics, extrapolated from the VLSS catalogue (~68k sources).
[Figure: log10 cumulative source count versus log10 flux bin [Jy].]
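One common way to realise such a model is to draw fluxes from a power-law source count by inverse-transform sampling. The sketch below is illustrative only: the slope and flux limits are made-up values, not the statistics fitted to the VLSS catalogue:

```python
# Illustrative sketch: draw source fluxes from a power-law differential
# source count dN/dS proportional to S^-gamma, on [s_min, s_max], by
# inverse-CDF sampling. Parameters here are arbitrary examples.

import random

def sample_fluxes(n, s_min, s_max, gamma, seed=0):
    """Return n flux samples (in Jy) from dN/dS ~ S^-gamma."""
    rng = random.Random(seed)
    a = 1.0 - gamma                      # exponent of the cumulative integral
    lo, hi = s_min**a, s_max**a
    # Uniform in the transformed variable, then invert the power law:
    return [(lo + rng.random() * (hi - lo)) ** (1.0 / a) for _ in range(n)]

fluxes = sample_fluxes(10000, 0.1, 100.0, 2.5)
# All samples lie within the flux limits; faint sources dominate the counts.
```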

Image of sidelobe confusion noise
[Figure: simulated images of sidelobe confusion noise at 100 MHz and 600 MHz, in RA/Dec coordinates, shown at 15 deg and 2.5 deg fields of view.]

Interfering (FSC) snapshot noise as a function of frequency
[Figure: FSC noise RMS [Jy/beam] versus frequency, from 100 to 600 MHz.]

Summary
Large-scale SKA simulations are challenging; GPUs make them possible. Simulations are vital to:
- Assess the evolving system design.
- Generate semi-realistic data products for tool-chain developers and for data-flow testing.