Detection and Visualisation of Radio Frequency Interference


A project for the course MAM4007W Mathematics of Computer Science

Philippa Hillebrand (HLLPHI012)

Supervised by: Michelle Kuttel, Sarah Blyth, and Anja Schroeder

Computer Science, University of Cape Town, South Africa
October 2014

Category                                                    Min  Max  Chosen Mark
Requirement Analysis and Design                               0   20   0
Theoretical Analysis                                          0   25  10
Experiment Design and Execution                               0   20  10
System Development and Implementation
Results, Findings and Conclusion
Aim Formulation and Background Work                          10   15  15
Quality of Report Writing and Presentation                   10   10  10
Adherence to Project Proposal and Quality of Deliverables
Overall General Project Evaluation
Total

Abstract

Radio Frequency Interference (RFI) comprises all the unwanted signals in the radio spectrum detected by a radio telescope, which interfere with the often much fainter astronomical signals. A clear separation of RFI and astronomical signals through detection is necessary for scientific observations. The majority of RFI signals are produced on Earth, although the sun is also a source. Earth-based signals cannot always simply be tracked down and switched off, as they are often major communications channels for systems like television and mobile telephones. Therefore a major requirement in radio astronomy is to detect and characterize, and then mitigate, these signals. This can be done manually, but it is much more efficient to do so computationally. Here we highlight and compare six detection/mitigation algorithms, aiming for their possible combination and implementation for the MeerKAT telescope. MeerKAT is located in a radio-quiet area of the Karoo, the same site as the international Square Kilometre Array (SKA) project. The SKA, of which MeerKAT is a precursor, will be the world's largest radio telescope, consisting of thousands of receivers. We describe the design and implementation of two RFI detection methods based on methods chosen from the literature.

Acknowledgements

Thank you to supervisors Michelle Kuttel, Sarah Blyth and Anja Schroeder for taking the time to read every draft chapter and discuss the design and testing of the system. Thank you to the SKA for funding and supplying data for this project.

Contents

1 Introduction
    Problem Statement
    Research Question
    Approach
2 Background
    Radio Frequency Interference
    Characterization and detection of RFI
    Methods for RFI mitigation
    RFI detection algorithms
        Radio Astronomy Data
        Spectral Kurtosis
        SumThreshold
        AOFlagger
        Morphological Algorithm
        Spatial Filtering
    Characterization Methods
    Conclusions
3 Design
    Goals
    Approach
        SumThreshold Algorithm
        Final SumThreshold algorithm
        Surface fitting and dilation
        Variable window size
        System Architecture
        Software Development
    Input and Output
    Algorithm Analysis
        SumThreshold
        Variable Window
        Comparison
4 Implementation
    Languages and libraries
    SumThreshold Algorithm
        Prototype
        Optimisation
        Optimisation
    Surface and dilation algorithm (discontinued)
    Variable window algorithm
        Prototype
        Optimisation
    Conclusions
5 Validation
    Methods
    Determining success
    Tests
    Discussion
6 Results
    Case Study
    Case Study
    Case Study
    Profiling
    Discussion
7 Conclusions and Future Work
Appendices
A Validation Results
B SumThreshold
C Variable Window
D Supporting Code
    D.1 SaveDataAsImage
    D.2 transpose
    D.3 plotstuff
    D.4 makesmooth
    D.5 noise

List of Figures

2.1 A signal from the LOFAR test station. Top left: Signal with no interference. Top right: Signal with interference. Bottom: RFI removed by spatial filtering using different filter types (see Section 2.4.6). [4]

2.2 Map of frequency restricted regions in the Karoo. [7]

Diagram showing the structure of the detection and visualisation system.

a) Data in the general shape of real data, but with RFI removed, and noise added. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm.

a) Data in the general shape of real data, with a single RFI spike, and noise. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm.

a) Data with a baseline of zero, and noise. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm.

a) Data with a baseline of zero, a family of spikes, and noise. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm.

a) A zoomed view of the family of spikes. b) A zoomed view of the stripes displayed by the variable window mask.

a) The data explored. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm.

The SumThreshold mask searching for transient RFI.

A complete mask, created by combining the SumThreshold (transposed and not) and the variable window masks.

An ordinary data set with typical RFI in the frequency domain, and minimal RFI in the time domain, along with the masks produced by the algorithms designed.

A data set with typical RFI in the frequency domain, and two lines of RFI in the time domain, along with the masks produced by the algorithms designed.

An arbitrary data set which shows the necessity of the detection algorithms to see all the RFI within the data, along with the masks produced by the algorithms designed.

A.1 a) Data with a baseline of zero, and one small section shifted up. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The small shift up is treated as a baseline wiggle by both algorithms.

A.2 a) Data with a baseline of zero, very low noise, with a broadband signal. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The SumThreshold method is not sensitive to broadband RFI.

A.3 a) Data with a baseline of zero, low noise, with a broadband signal. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The SumThreshold method is not sensitive to broadband RFI.

A.4 a) Data with a baseline of zero, and one small section shifted up. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The small shift up is treated as a baseline wiggle by both algorithms.

A.5 a) Data with a baseline of zero, and one narrow spike. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. The spike is accurately flagged by both methods.

Chapter 1 Introduction

The MeerKAT project in Carnarvon in the Karoo is a radio telescope which forms the precursor to the Square Kilometre Array (SKA) South Africa project. This telescope picks up radio frequency signals from celestial bodies further away than any we have previously observed, and will consist of an array of telescope dishes larger than any combined before. Unfortunately, radio signals are not produced only by celestial bodies, but also by man-made objects, and are used extensively for communication. These man-made signals, which interfere with the signals being observed from outer space, are known as Radio Frequency Interference (RFI) and can be observed in the data as amplitude spikes in a frequency spectrum. If these spikes are not noticed, the data is treated as trustworthy and astronomers may assume that the spikes are an interesting phenomenon, when actually they are just the neighbour starting his car.

For this reason, we apply signal processing techniques to the data, trying to find the signals which differ to a statistically significant degree from the underlying noise. This underlying noise is the actual astronomical data and so it is particularly important that the noise is not marked as RFI. The output is some form of mask, which allows the astronomers to know which channels are corrupted, and which contain viable information.

The simplest form of RFI detection is known as thresholding. This means setting some value above which the data is flagged as RFI. In some circumstances this is done symmetrically, so if the data is lower than some value it is also flagged. There are more advanced forms of detection, which mostly build on the idea of thresholding.

1.1 Problem Statement

The aim of this project is to adapt and compare two methods of RFI detection which can then be used in characterisation of the RFI, and to determine the type of RFI which is being produced in the environment.
The effectiveness of the algorithms will be evaluated according to how fast they are able to run, how sensitive they are to changes in the data, how much known RFI they are able to detect and how many false positives there are in the output mask.

1.2 Research Question

RFI and astronomical signals (radio waves produced by a source) both come in many different forms, which makes detection of RFI difficult. Also, the amount of data recorded by a radio telescope is very large, so any detection algorithm is required to be as efficient as possible. As such the following question will be investigated:

Is it possible to adapt an existing detection algorithm to the supplied data, and add any form of characterization to that algorithm?

As seen in the past work, Offringa et al. [15, 16, 17] have worked extensively on detecting RFI in array type telescopes. The data for this project, however, is collected, formatted, and stored differently. The challenge is therefore to apply existing methods to the new data. The characterization of a particular signal has not been researched in as great depth, and so designing an algorithm to appropriately characterize the signals may be beyond the time scale of this project.

1.3 Approach

The approach taken to solve this problem follows a simple route. We begin by looking into the solutions produced by others on similar problems, examine the specifics of the SKA project, and relate the solutions to the problem. We then choose two appropriate methods to implement for this project. These algorithms are described in detail in Chapter 2. The next step is to design the algorithms to work with the data collected on the MeerKAT site. This process is shown in Chapter 3. In Chapter 4 we document the process of building up the system, and developing the chosen algorithms. This includes the development of a new algorithm which makes use of previous ideas, but implements them differently. From there we move to the validation of the code in Chapter 5, and a discussion of the results of running the code on real data in Chapter 6.

Chapter 2 Background

The very first radio map of the skies was produced in 1942 by Reber, an amateur, who was intrigued by Jansky's observations of the Milky Way in 1932 [2]. Since then radio telescopes have developed to the point where there are two main types: large single dish telescopes (such as Arecibo [2]) and arrays of smaller dishes (such as the Low-Frequency Array (LOFAR), which has recently become fully operational [4, 16]). These telescopes make two different types of observations: active, utilizing RADAR technology (Radio Detection And Ranging, used originally for detecting aircraft), and passive, picking up radio waves emitted by astronomical sources.

As radio telescopes become larger and more sensitive, more data on astronomical objects can be collected, leading to a much better understanding of the universe [2]. To this end, the Square Kilometre Array (SKA) telescope has been commissioned, which will be the largest radio telescope in the world. The SKA project, first discussed in 1993, has grown into a global project, located in South Africa and Australia [21]. The MeerKAT project in the Karoo is the precursor to the South African part and will become a part of the SKA. MeerKAT will consist of 64 antennae, with the maximum distance between the dishes being 8 km. The first dish was raised on the 27th of March 2014 [24].

2.1 Radio Frequency Interference

Radio frequency interference (RFI) is electromagnetic interference (EMI) from signals in the radio frequencies of the electromagnetic spectrum (Figure 2.1). As EMI can be caused by any type of electrical circuit, sources of RFI are abundant. What is considered RFI is subjective, and dependent on the type of observation being made [5]. Because RFI signals (transmitted by a source) are mostly much stronger than the astronomical signals observed, they can overload the sensitive receivers, causing errors in the calibration of the signal. RFI can also occur at the same frequency as an astronomical signal, causing ambiguities and ripples in the observed spectrum.

From the antenna, the radio signal is converted from analogue to digital, and then correlated with the signals from other antennae to create a complete picture of the observations. RFI can be created anywhere along this path. RFI can be categorized into two broad groups: narrow-band RFI (intentional transmissions, such as television signals or FM radio signals) and broadband RFI (unintentional transmissions, such as those emitted by electric circuits and power lines) [19]. It may be possible to find and shield broadband sources more easily than narrow-band sources.

Strong RFI signals (a signal is transmitted by a source) can completely drown out weaker signals of astronomical importance in the same channel (a channel is a set of frequencies grouped together to make data storage easier). This can cause a significant loss of data, as it can be necessary to completely ignore all signals found on the channel. The L-Band (around 1420 MHz) is important because this is where spectral lines denoting neutral hydrogen in a celestial body can be observed. Unfortunately, there are many RFI signals in this band [10], which make it difficult to differentiate valid signals from interference. The effects of the interference are shown clearly in Figure 2.1. The diversity of radio signals makes the detection of RFI challenging.

Figure 2.1: A signal from the LOFAR test station. Top left: Signal with no interference. Top right: Signal with interference. Bottom: RFI removed by spatial filtering using different filter types (see Section 2.4.6). [4]

2.2 Characterization and detection of RFI

Every RFI signal has unique characteristics which can be used to characterize the signal, such as strength, geographical location or position of the source, polarization, direction, orientation, periodicity over time, bandwidth, frequency distributions, modulation and encoding [18, 5]. Some characteristics, such as strength, are easy to identify for a single source, while others, such as polarization, are more difficult to determine. Characterizing a signal is useful, as it becomes much easier to locate the source and either shield it, have it switched off, or deal with the signal during the processing of the data collected.

Knowing the polarization of the signal is useful, because astronomical signals are very weakly polarized, if at all, whereas RFI is usually strongly polarized. Characterization also affects the detection algorithms: two signals can be compared once they have been characterized, so it is possible to identify RFI signals through their similarity to known RFI. It is also good to be aware of the radio environment around the sensitive equipment, and to know when something changes, to make prediction of behaviour easier [18].

RFI detection and characterization algorithms aim to detect RFI, characterize and identify the signals for ease of management, flag the signals [19] and then mitigate the RFI in a manner that will lose the least possible astronomical data. This can be done by removing a point (frequency, time) which has been flagged [4].

2.3 Methods for RFI mitigation

One of the easiest ways to minimize RFI around a radio telescope is to declare the region to be radio quiet, which means that no transmitting or receiving radio devices are permitted within a certain distance of the telescope. This is difficult to enforce, as discovered at the Medicina telescope in Italy [3]: the growth of nearby cities cannot be curbed, and often the radio quiet region is encroached upon. For this reason, the MeerKAT and SKA projects are based in the Karoo, far from any large settlements. The Astronomical Advantage Act [7] enforces restrictions on frequencies by region (shown in Figure 2.2). These regions surround the core of the MeerKAT and SKA projects. Unfortunately, it is not possible to find a region with absolute radio quiet, whatever regulations are set in place. Satellites and aeroplanes still pass overhead and some signals, such as television signals, travel very long distances.
So, beyond radio quiet regions, the International Telecommunication Union (ITU) has released a table specifying frequency allocations for different types of communication. This table is then specialized by the communications authority of each country to be applicable there. The Independent Communications Authority of South Africa (ICASA) has released the relevant table for South Africa [8]. This allocates a relatively small number of narrow frequency bands to radio astronomy, and, commonly, these bands are shared with other communications areas. It is illegal for a signal to be transmitted outside of the allocated frequency, so signals detected in these areas can be turned off by the authorities (ICASA). If radio astronomy wishes to make use of a wide bandwidth of frequencies, there will be a large amount of RFI present which is entirely legal [5].

If it is impossible to avoid RFI, detection and mitigation schemes need to be developed. The many different types of RFI lead to many different detection algorithms. Many of these are designed for specific instruments or projects, and so are not directly suitable for all astronomical data. These algorithms can be compared, combined, and modified to provide a situation-specific solution.

Figure 2.2: Map of frequency restricted regions in the Karoo. [7]

2.4 RFI detection algorithms

The simplest form of detection is thresholding, which tests the strength of a signal against a predefined threshold value and, if the signal is above that value, flags it as RFI. This can be done with any kind of data, but is often done after the Fast Fourier Transform (FFT) part of the correlation process. The algorithms we consider here all work post-correlation, meaning they can work on saved data. A reference antenna (as is used at the MeerKAT site) is used to compare signals to aid in the detection.

2.4.1 Radio Astronomy Data

Radio astronomy data is collected for many different purposes and in many different ways, with different emphases. The data could be collected on a satellite (the SMOS project [18]) or by Earth-based radio telescopes. These telescopes vary in design, data collected, and collection method. They can be single- or multi-dish (or beam) and they can observe actively or passively. Telescopes with either a multi-beam feed, or an array of dishes, have their data correlated and so calculate a covariance matrix, as used in the spatial filtering technique. The SKA project will be made up of a large array of passively observing dishes [21], so we may use these techniques. Currently on the MeerKAT site, there is a single antenna observing the environment for RFI. This antenna is used to detect and characterize, as well as visualize, RFI before the full telescope becomes operational, which will make RFI mitigation easier later. Therefore at this stage techniques which require multiple antennae will not be usable.

2.4.2 Spectral Kurtosis

Spectral Kurtosis (SK) is a statistical method used for RFI detection, which is usually applied to time-averaged, non-Gaussian data, but can be extended to other data types [1].
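As a rough illustration of the estimator that this section goes on to define (the sums $S_1$ and $S_2$ of Equations 2.2 and 2.3, combined as in Equation 2.6), the following Python sketch computes $V_k^2$ for each channel and flags channels whose value strays from 1. The function names, the row-per-time data layout, and the default tolerance of three times $\sqrt{4/M}$ (using the expected variance $4/M$ from Equation 2.7) are assumptions made for this example, not part of the project code.

```python
import math


def spectral_kurtosis(power_estimates):
    """Return V_k^2 for one channel from its M power estimates P_k1..P_kM."""
    m = len(power_estimates)
    if m < 2:
        raise ValueError("need at least two spectral estimates")
    s1 = sum(power_estimates)                 # S_1: sum of the power estimates
    s2 = sum(p * p for p in power_estimates)  # S_2: sum of the squared estimates
    if s1 == 0:
        raise ValueError("total power is zero, V_k^2 is undefined")
    # Variance over squared mean, written in terms of S_1 and S_2
    return (m / (m - 1)) * (m * s2 / (s1 * s1) - 1.0)


def flag_channels(spectra, tolerance=None):
    """Flag channels (1 = suspected RFI) whose V_k^2 is far from 1.

    spectra: list of rows, one row per time step, one value per channel.
    tolerance: hypothetical cut-off on |V_k^2 - 1|; by default three times
    the square root of the expected variance 4/M (an assumption).
    """
    m = len(spectra)
    if tolerance is None:
        tolerance = 3.0 * math.sqrt(4.0 / m)
    flags = []
    for k in range(len(spectra[0])):
        v2 = spectral_kurtosis([row[k] for row in spectra])
        flags.append(1 if abs(v2 - 1.0) > tolerance else 0)
    return flags


# Two time steps, two channels: channel 0 varies, channel 1 is constant.
example = [[1.0, 2.0],
           [3.0, 2.0]]
print(flag_channels(example, tolerance=0.75))
```

A perfectly constant channel gives $V_k^2 = 0$, which is as suspicious as a very large value, so the sketch tests the deviation from 1 in both directions.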

SK is a thresholding method, applied either during or after the FFT [11], and is applied equally well in frequency and time domains. The spectral kurtosis can be calculated using

\[ V_k^2 = \frac{\sigma_k^2}{\mu_k^2}, \tag{2.1} \]

where $\sigma_k^2$ is the variance and $\mu_k$ is the mean of the power spectral density (PSD). A sample with no RFI will have $V_k^2 = 1$. The mean and the variance are computed over $M$ spectral estimates $P_{ki}$, where $k$ is the channel number and $i = 1, \ldots, M$. These are used to calculate the instantaneous power spectral density (PSD) $S_1$ and the squared spectral power $S_2$,

\[ S_1 = \sum_{i=1}^{M} P_{ki} \tag{2.2} \]

\[ S_2 = \sum_{i=1}^{M} P_{ki}^2. \tag{2.3} \]

Then the mean and variance are given by

\[ \mu_k = \frac{S_1}{M} \tag{2.4} \]

and

\[ \sigma_k^2 = \frac{M S_2 - S_1^2}{M(M-1)}. \tag{2.5} \]

This gives

\[ V_k^2 = \frac{M}{M-1}\left(\frac{M S_2}{S_1^2} - 1\right). \tag{2.6} \]

The variance of $V_k^2$ is then calculated, and compared to the expected value of

\[ \mathrm{var}(V_k^2) = \begin{cases} 24/M, & k = 0, N \\ 4/M, & k = 1, \ldots, (N-1) \end{cases} \quad [12], \tag{2.7} \]

where $N$ is the Nyquist rate associated with the sampling rate. The Nyquist rate is the minimum rate at which a signal can be sampled without introducing errors; it is twice the highest frequency in the signal [23]. If the variance is significantly different from a baseline level such as the median, the signal can be considered to be RFI.

A good implementation of the SK method requires a full understanding of all the statistical techniques involved. The complexity of the algorithm depends on how many windows of size $M$ are used, giving a worst case $O(N^2)$ complexity. The SK method is suitable for use on any type of data, but, as a purely statistical method, does not hold much interest from a computing perspective.

2.4.3 SumThreshold

The SumThreshold method is a form of combinatorial thresholding, which means that samples are not only checked individually for high values, but are also combined to check whether two or more neighbouring samples are all above a slightly lower threshold value.
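A minimal one-dimensional Python sketch of this kind of combinatorial thresholding is given here, using the windowed-sum form of the test: for each window size the sum of the still-unflagged samples in a window is compared with the window threshold times the number of those samples, which has the same effect as replacing already-flagged samples by the threshold value. The function name, the dictionary of thresholds per window size, and the example values are assumptions made for illustration; choosing suitable thresholds for real data is a separate problem.

```python
def sum_threshold(values, thresholds):
    """Flag samples with a SumThreshold-style windowed test.

    values: list of sample amplitudes along one axis (time or frequency).
    thresholds: dict mapping window size M to its threshold chi_M
                (hypothetical values supplied by the caller).
    Returns a list of 0/1 flags, 1 meaning suspected RFI.
    """
    n = len(values)
    flags = [False] * n
    for m in sorted(thresholds):
        chi = thresholds[m]
        # Flags raised at this window size only take effect at the next size.
        new_flags = list(flags)
        for start in range(0, n - m + 1):
            window = range(start, start + m)
            unflagged = [values[i] for i in window if not flags[i]]
            count = len(unflagged)
            if count == 0:
                continue
            total = sum(unflagged)
            # Symmetric test: flag strongly positive or strongly negative sums.
            if total > count * chi or total < -count * chi:
                for i in window:
                    new_flags[i] = True
        flags = new_flags
    return [1 if f else 0 for f in flags]


data = [0, 0, 3, 3, 3, 0, 0, 10, 0]
print(sum_threshold(data, {1: 5.0, 2: 2.5}))
# prints [0, 0, 1, 1, 1, 0, 0, 1, 0]
```

In this example a single threshold of 5 would only catch the isolated spike of 10, whereas the window of two samples with the lower threshold of 2.5 also catches the weaker run of three neighbouring samples.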
The flagging function for frequency and time can then be given as

\[ \mathrm{flag}_\nu^M \ \text{if} \ \exists\, i \in \{0 \ldots M-1\} : \forall\, j \in \{0 \ldots M-1\} \ R(\nu + (i-j)\Delta\nu,\, t) > \chi_M \tag{2.8} \]

\[ \mathrm{flag}_t^M \ \text{if} \ \exists\, i \in \{0 \ldots M-1\} : \forall\, j \in \{0 \ldots M-1\} \ R(\nu,\, t + (i-j)\Delta t) > \chi_M, \tag{2.9} \]

where $M$ is the number of samples in a combination, $\chi_M$ is the threshold value for that combination size, and $R(\nu, t)$ is the value of the sample at time $t$ and frequency $\nu$. A sample can be flagged in either time or frequency. Once a sample has been flagged, its value is changed for future combinations to be the average threshold size ($\chi_M$). This lowers the frequency of false positives in the flagged data [14].

The difficulty in this approach lies in calculating appropriate $\chi$ values, although it may be possible to make use of Spectral Kurtosis to do this. Much as for SK, the complexity depends on both the number of samples and the iterations through combinations up to size $M$, giving a worst case $O(N^2)$ time. The method can be used on any type of data, making it suitable for this project, and the main interest would be in comparison with SK.

2.4.4 AOFlagger

The AOFlagger is an algorithm which was implemented at LOFAR in 2010 [16]. As input, it takes information on a single polarization or set of Stokes I data (an integration technique used to join all the data into one spectrum). The amplitudes are calculated, and a thresholding technique is used to generate the first flags. The channels (frequency) or time steps (time) are then compared based on root-mean-square (rms) values, to flag the outliers. The data are then fitted to a 2D Gaussian surface, again to smooth out outliers. The process is then iterated, increasing the strictness of the threshold until the data converges on the surface. A dilation is then performed on the data, flagging further RFI around the edges of the channels or time steps, on the supposition that not all the RFI was found. At this point the flags can be compared with the original data [13].

The most difficult part of the AOFlagger is the dilation step: ensuring that the flags are not spread too far, thus flagging channels or time steps unnecessarily.
The complexity is certainly above linear time, as the data are fitted to a 2D surface, which requires at the least $O(N \log N)$. The algorithm is certainly suitable for the data produced, and is of interest when considered in conjunction with basic thresholding techniques as well as more advanced techniques.

2.4.5 Morphological Algorithm

This algorithm was designed for the LOFAR telescope, and so is suitable for extension to the SKA, as the two telescopes are similar. It combines a number of techniques, such as thresholding and the use of reference antennae, which give good estimates of frequencies in which there is RFI. The algorithm utilizes the fact that most RFI signals are parallel either to the time or the frequency axis. It builds particularly on the AOFlagger algorithm (Section 2.4.4).

The key concepts used are morphology, and the idea of a scale invariant rank (SIR) operator. An SIR operator is a mathematical operator $\rho$ for which $\rho(\lambda x) = \lambda \rho(x)$, where $\lambda$ is a constant. The operator must be of the SIR type, because RFI signals are themselves scale invariant, meaning that they are not affected by scaling the data. The SIR operator is applied after a basic flagging method and is applied separately to time and frequency. The operator can be defined as

\[ \rho(X) = \bigcup \left\{ [Y_1, Y_2) : \#\big(X \cap [Y_1, Y_2)\big) \geq (1 - \eta)(Y_2 - Y_1) \right\}, \tag{2.10} \]

where $X$ is the set of samples already flagged, $\#$ counts the samples in a set, $[Y_1, Y_2)$ is a half open interval in either frequency or time and $\eta$ gives the aggressiveness of the operator (meaning that $\eta = 0$ will flag nothing further, and $\eta = 1$ will flag everything). To recombine the time and frequency channels, either a union of the two can be taken, or the operator can be applied sequentially in each channel. The sequential combination is more aggressive than the union and the order of the sequence will influence what is flagged [17].

A full proof of the scale invariance of the operator $\rho$, together with an implementation of the full algorithm in $O(N)$ time, was given by Offringa et al. [17]. The algorithm is predominantly of theoretical interest and uses interesting mathematical concepts.

2.4.6 Spatial Filtering

Spatial filtering aims to reduce the RFI levels in a sample to the point where the astronomical signals can be seen through them. Thus it is a mitigation method, although it can be used for combined detection and mitigation. The spatial filtering technique is based on the manipulation of the covariance matrix $C$ formed by correlation of the data from multiple channels (dishes or beams). The background astronomical signals and system noise are considered to be Gaussian noise [4]. The eigenvector and eigenvalue matrices are found, giving $C = U \Lambda U^H$, where $\Lambda$ is a diagonal matrix containing the eigenvalues in descending order, $U$ is the matrix of eigenvectors and $U^H$ is its Hermitian conjugate. The Hermitian conjugate is found by taking the transpose of the matrix and replacing each value with its complex conjugate.

Either it is assumed that the RFI has the strongest signal in the system and the first value in $\Lambda$ is given a null value, or a filter is applied. The filter can be either a projection filter, which gives a projection of $C$ onto the noise subspace (giving $\tilde{C} = P_N C P_N$, where $P_N$ is the projection) or a subtraction filter, where the projection onto the interference subspace is subtracted from the system (giving $\tilde{C} = C - P_I C P_I$) [9].
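To make the projection filter of Section 2.4.6 concrete, here is a small Python sketch under strong simplifying assumptions: the covariance matrix is real and symmetric rather than complex Hermitian, there is a single interferer, and that interferer is the strongest component, so the interference subspace is spanned by the dominant eigenvector. Power iteration is swapped in here for a full eigendecomposition purely to keep the example free of external libraries. All function names are hypothetical.

```python
import math


def mat_vec(a, v):
    """Multiply matrix a (list of rows) by vector v."""
    return [sum(a[i][j] * v[j] for j in range(len(v))) for i in range(len(a))]


def mat_mul(a, b):
    """Multiply matrices a and b (lists of rows)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def dominant_eigenvector(c, iterations=200):
    """Estimate the unit eigenvector of the largest eigenvalue by power iteration.

    Assumes c is real and symmetric with one clearly dominant positive
    eigenvalue, and that the all-ones start vector is not orthogonal to
    the dominant eigenvector.
    """
    v = [1.0] * len(c)
    for _ in range(iterations):
        w = mat_vec(c, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("power iteration collapsed to the zero vector")
        v = [x / norm for x in w]
    return v


def projection_filter(c):
    """Return P_N C P_N, with P_N = I - u u^T and u the dominant eigenvector."""
    n = len(c)
    u = dominant_eigenvector(c)
    p_n = [[(1.0 if i == j else 0.0) - u[i] * u[j] for j in range(n)]
           for i in range(n)]
    return mat_mul(mat_mul(p_n, c), p_n)


# Unit-variance noise plus an interferer of power 10 along the direction (0.6, 0.8).
covariance = [[4.6, 4.8],
              [4.8, 7.4]]
for row in projection_filter(covariance):
    print([round(x, 6) for x in row])
```

In this toy case the filtered matrix keeps only the noise component orthogonal to the interferer direction. The subtraction filter would instead compute $C - P_I C P_I$ with $P_I = u u^T$.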
2.5 Characterization Methods

RFI characterization methods draw heavily on the detection methods, as a signal cannot be characterized before it has been detected, and many of the principles in detection and characterization are the same. Some characteristics are easy to find. The SMOS project [18], which measures the brightness temperature (BT) of Earth, found the power of the RFI signal to be directly proportional to the BT. They also suggested that the direction of a pulsating source can be found by analyzing the pulses. SMOS is satellite-based, so it is not directly applicable to the SKA, but many of the principles remain the same. Another group working with synthetic aperture radar [10] matched the frequency and time stability of a signal to a known signal, from a specific type of radar tower. They also correlated geographical position. Unfortunately, the majority of their characterization is done as part of the detection of the signals.

2.6 Conclusions

In Table 2.1, the six algorithms in Section 2.4 are compared based on a number of factors. In this table it can be seen that some algorithms are more suitable to the data collected at MeerKAT than others, and some are more complete (or higher level) than others. The morphological algorithm (Section 2.4.5) is an example of a high-level algorithm suitable for the data. This algorithm does have room for extension, however, as the sub-algorithm of SumThreshold (Section 2.4.3) could be replaced with another, and characterization methods could be added to it. The spatial filtering algorithm (Section 2.4.6) is even higher level, going so far as to mitigate the RFI. This could quite easily be extended, by only applying the algorithm to samples already flagged, but it is not suitable for this data, as it requires an array of inputs.

The main methods of interest are the flagging methods and the characterization methods. It would be interesting to combine these methods to flag data not just as RFI, but as a specific type of RFI, which could then be visualized, so that the radio environment of the MeerKAT area can be more intuitively understood.

Table 2.1: Comparison of algorithms discussed in Section 2.4, with scores given from 1-3 for each section, where a higher score means a higher value in that section.

Algorithm                  Features  Difficulty  Complexity  Suitability  Interest
Spectral Kurtosis
Morphological Algorithm
AO Flagger
SumThreshold
Spatial Filtering

Chapter 3 Design

3.1 Goals

In this project, we aim to determine an effective method for detecting and possibly characterising Radio Frequency Interference (RFI) in radio signals, particularly focussing on signals received from radio telescopes. As the data files are large ( values per file), the method should be efficient in terms of both time and space.

3.2 Approach

In discussion with the two Astronomy supervisors, we select two algorithms to be implemented and compared. We choose algorithms based on Table 3.1, with a focus on high suitability and low difficulty.

Table 3.1: Comparison of algorithms discussed in Chapter 2, with scores given from 1-3 for each area, where a higher score means a higher value in that area.

Algorithm                  Features  Difficulty  Complexity  Suitability  Interest
Spectral Kurtosis
Morphological Algorithm
AO Flagger
SumThreshold
Spatial Filtering

The two algorithms chosen are the SumThreshold method and the AOFlagger method. From these chosen algorithms the final methods are developed.

SumThreshold Algorithm

The SumThreshold method is a combinatorial thresholding method which, rather than simply checking if a value is above a specific threshold, includes the surrounding values in the computation. The flagging part can be given in equation form as

\[ \mathrm{flag}_\nu^M \ \text{if} \ \exists\, i \in \{0 \ldots M-1\} : \forall\, j \in \{0 \ldots M-1\} \ R(\nu + (i-j)\Delta\nu,\, t) > \chi_M \]

\[ \mathrm{flag}_t^M \ \text{if} \ \exists\, i \in \{0 \ldots M-1\} : \forall\, j \in \{0 \ldots M-1\} \ R(\nu,\, t + (i-j)\Delta t) > \chi_M. \]

This can be put into pseudo code as follows:

    Set M, sum, threshold, maxm
    For each window of size M_i (from M to maxm stepping 2, 4, 8,...)
        set count = number of unflagged values in the window
        set sum = sum of all these values
        if (sum > count * threshold) OR (sum < -count * threshold)
            set a flag on unflagged values
            set values to be an average
        move the window to the right
        set the threshold for the new window position

Final SumThreshold algorithm

After a few optimisations during the implementation phase (Chapter 4) a final algorithm is left which is slightly changed from the original. The pseudo code is as follows:

    Set M, sum, threshold, maxm
    For each window of size M_i (from M to maxm stepping 2, 4, 8,...)
        set sum = sum over j in window (value at j) - chi
        if (sum > 0)
            set a flag on unflagged values
            set values to be an average
        move the window to the right
        set the threshold for the new window position

Surface fitting and dilation

The AOFlagger method is an extension of a thresholding method which adds surface fitting and dilation to the algorithm. The initial algorithm attempted was:

    Row-wise repeat:
        Do (at least twice):
            - Replace flagged data with median value
            - Create spline interpolated surfaces
            - Compare values between interpolations, flagging those beyond a certain level.
        end do
    end repeat

However, after beginning the implementation of the system, this algorithm was discarded, and a brand new one developed: the variable window method.

Variable window size

The variable window algorithm was developed in discussion with supervisors Sarah and Anja, and is an attempt to find an efficient way of checking all the data. This method makes use of a smoothed surface which underlies the data at every time period. This surface is used as a comparison, or base threshold value, and then a two-dimensional window is placed over the data. The size of this window depends on the rate of change of the standard deviation of the data.
So, when the standard deviation is changing quickly, we assume that there are larger spikes in the data, and so use a smaller window. If the standard deviation changes slowly, we assume that there are fewer large spikes, and so use a larger window. The algorithm in pseudo code is:

    repeat process a number of times
        Set window size and position
        loop through entire surface
            find standard deviation (s.d)
            find change in s.d (average over three)
            flag window (look for points 5 * s.d out)
            vary window size
            shift window on
        end loop
    end repeat

System Architecture

The system is originally described by the following diagram, where the greyed out parts deal with visualisation, which is to be implemented by Gerard Nothnagel and so is not discussed in this work:

The modifications to the algorithms lead to a modification of the system architecture, and so the final system is described by the following diagram:

The RATTY data is data collected on the MeerKAT site, and is provided for use by Christopher Schollar. The smoothed surface included in the grey oval is the underlying surface which the data is compared to in the variable window algorithm. It is required as an input to the system.

Software Development

We follow an iterative approach to the development of the software, focussing first on the requirements, then producing a detailed design, then implementing the design before testing and validation. This cycle is then repeated until a satisfactory result is achieved. We follow this process because the original algorithms have already been documented and so the initial design phase consists predominantly of adapting the algorithm to the situation. This means that the design should be finished before implementation begins, which is based on the waterfall process. The implementation is managed using version control through Git. This allows for more flexible implementation and experimentation. Sections are tested as they are developed, drawing from the concept of unit testing to ensure code integrity.

3.3 Input and Output

Input is in the format of HDF5 files (Chapter 2, Section 2.4.4) containing data collected on the MeerKAT site, which have a row for every time at which data was collected and a column for every frequency channel. The output is a new HDF5 file which contains a mask for the original file. This means that, if a value is flagged with 0 it has no RFI, and if it is flagged with 1 there is RFI of some type.

3.4 Algorithm Analysis

3.4.1 SumThreshold

Input: array height m, width n

    1. load into memory
    2. Create matching mask
    3. loop m times
    4.     while run < r
    5.         while pos + l/2 <= n (step size = l/2)
    6.             set chi
    7.             flag window of length l
    8. save and close files

This gives a very basic description of the algorithm which can be used to find the complexity. Lines 1, 2 and 8 add a term of O(3nm) to the complexity. The loop beginning in line 3 adds a factor of m. The loop in line 4 adds a constant factor r, bringing the complexity up to O(rm + 3nm). The choosing of the threshold value can be viewed as a non-trivial constant time calculation which takes time c. Flagging the window depends on its length l, and takes O(2l) when the window must be flagged. The loop in line 5 contributes a factor of 2n/l. So overall the complexity of the algorithm is:

\[
\begin{aligned}
\text{Complexity} &= O\left(r\,m\,\frac{2n}{l}\,(2l + c) + 3nm\right) \\
&= O\left(r\,m\,2n\,(2 + c) + 3nm\right) \\
&= O\left((4r + 2rc + 3)\,mn\right) \\
&= O(k\,mn)
\end{aligned}
\]

The second line uses c/l <= c, since l >= 1. Here k is some fairly large constant. It is this factor k which must be optimised to improve the performance of the algorithm.

3.4.2 Variable Window

Input: array height m, width n

    1. load two files into memory
    2. Create matching mask
    3. loop k times
    4.     while time position + 1/2 time dimension <= m (step 1/2 time dimension)
    5.         while frequency position + 1/2 frequency dimension <= n (step 1/2 frequency dimension)
    6.             calculate sigma
    7.             flag window
    8.             change window size if appropriate (time never changes)
    9. save and close files

As in 3.4.1, lines 1, 2, and 9 give a single term for the complexity, of O(4mn). Line 3 gives a factor of k. Line 4 gives a factor of 2m/c_1, where c_1 is the smallest value for the time dimension. Line 5 gives a factor of 2n/c_2, where c_2 is the smallest value for the frequency dimension. Lines 6-8 can be calculated in some constant time, say c_3. This gives the overall complexity as:

\[
\begin{aligned}
\text{Complexity} &= O\left(4mn + k \cdot \frac{2m}{c_1} \cdot \frac{2n}{c_2} \cdot c_3\right) \\
&= O\left(\left(4 + \frac{4 c_3 k}{c_1 c_2}\right) mn\right) \\
&= O(K\,mn)
\end{aligned}
\]

where K is a non-trivial constant factor. This factor K is what must be optimised for best performance.

3.4.3 Comparison

To properly compare the algorithms analysed in 3.4.1 and 3.4.2 we look at their constant factors. To do this we assign values to the various constants, which can be found in the code listings in Appendix B and C. We have for the SumThreshold method:

\[
k = 4r + 2rc + 3 = 28 + 14c + 3 = 31 + 14c
\]

And for the variable window method:

\[
K = 4 + \frac{4 c_3 k}{c_1 c_2} = 4 + 3c
\]

We can assume that the values c and c_3 are comparable, as they are both constant factors which contain the focus of the code. Thus we can show the difference between k and K as:

\[
\begin{aligned}
K - k &= 4 + 3c - (31 + 14c) \\
&= -27 + c\,(3 - 14) \\
&= -27 - 11c \\
K &= k - 27 - 11c
\end{aligned}
\]

Since c must be a positive value, K is smaller than k, and it should come as no surprise that the variable window method runs significantly faster than the SumThreshold method.
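The comparison of constant factors can be checked numerically. The following is a minimal sketch that evaluates the expressions k = 4r + 2rc + 3 and K = 4 + 4c_3k/(c_1c_2) from the analysis above. The function names and the sample values passed to them are assumptions made for illustration, and are not taken from the code listings in the appendices.

```python
def sumthreshold_constant(r, c):
    """Constant factor k = 4r + 2rc + 3 from the SumThreshold analysis."""
    return 4 * r + 2 * r * c + 3


def variable_window_constant(repeats, c1, c2, c3):
    """Constant factor K = 4 + 4*c3*k/(c1*c2) from the variable window analysis.

    `repeats` is the loop count that the analysis calls k in line 3.
    """
    return 4 + 4 * c3 * repeats / (c1 * c2)


# Hypothetical sample values, chosen only to show how the two factors scale
# with the per-window cost c (taken to be equal to c3).
for c in (1, 10, 100):
    k = sumthreshold_constant(r=7, c=c)
    K = variable_window_constant(repeats=3, c1=2, c2=2, c3=c)
    print(f"c={c}: k={k}, K={K}")
```

With these sample values the gap between the two factors grows linearly in c, which is the behaviour the comparison relies on.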

Chapter 4 Implementation

Here we discuss the implementation of the two algorithms chosen for development. These are further tested and validated with simulated data (Chapter 5), and case studies are then performed to see how they react to real data (Chapter 6).

Some aspects of the system design changed during the implementation process. The original design can be seen in Figure 4.1. The first algorithm, SumThreshold, incorporated a separate script to transpose the data file before inputting it to the algorithm. The second algorithm underwent major changes over the course of the implementation, as it is a more complex system. The original design of fitting the data to a surface was modified into a system which uses a smoothed surface and the standard deviation of a window to search out the larger and smaller RFI in different ways, whilst allowing for noise. Thus the shaded oval was added to Figure 4.1. More details on the implementation of each algorithm follow below.

4.1 Languages and libraries

The algorithms are all implemented in the Python programming language. This language was chosen as the developers at the SKA already work predominantly in Python, and there are many very powerful scientific libraries written for Python[6], such as the h5py library[20] which allows a Python script to read a file in the HDF5 file format. This is necessary since the astronomical data is all stored in HDF5 files, which compress the data to a storable size. Another library used extensively is Numpy[22], a library which allows advanced manipulation of arrays of data, making it simple to find statistical values for a section of data.

4.2 SumThreshold Algorithm

This is the first algorithm implemented. A full description of the original algorithm can be found in Chapter 2. This algorithm is a combinatorial thresholding method, which means that, rather than only checking if every data point is above some threshold value χ, a window is moved across the data. Then, for every pass, the sum of unflagged data points is compared to a lowered threshold value. This can be shown by the equations

\[
\mathrm{flag}^{\nu}_{M} \ \text{if}\ \exists\, i \in \{0 \ldots M-1\} : \sum_{j \in \{0 \ldots M-1\}} R(\nu + (i-j)\,\Delta\nu,\ t) > \chi_M
\]

\[
\mathrm{flag}^{t}_{M} \ \text{if}\ \exists\, i \in \{0 \ldots M-1\} : \sum_{j \in \{0 \ldots M-1\}} R(\nu,\ t + (i-j)\,\Delta t) > \chi_M .
\]
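To make the combinatorial step concrete, the sketch below applies the flagging rule to a single row for one window size M: a sample is flagged if it lies in any run of M consecutive samples whose sum exceeds the threshold. This is an illustration only, not the project code. The function name `sum_threshold_1d` is hypothetical, and the window sum is compared against M times a per-sample threshold `chi`, as in the pseudo code of Chapter 3; whether χ_M in the equations already absorbs that factor is an assumption. For brevity it also ignores previously flagged samples.

```python
import numpy as np


def sum_threshold_1d(values, m, chi):
    """Flag every sample that lies in a run of m consecutive samples
    whose sum exceeds m * chi (chi is a per-sample threshold)."""
    values = np.asarray(values, dtype=float)
    flags = np.zeros(values.size, dtype=bool)
    if m < 1 or m > values.size:
        return flags
    # window_sums[s] = values[s] + ... + values[s + m - 1]
    csum = np.concatenate(([0.0], np.cumsum(values)))
    window_sums = csum[m:] - csum[:-m]
    # Flag all m samples of every window that exceeds the threshold.
    for start in np.flatnonzero(window_sums > m * chi):
        flags[start:start + m] = True
    return flags


row = [0.0, 0.0, 3.0, 3.0, 3.0, 0.0]
single_pass = sum_threshold_1d(row, m=1, chi=4.0)    # plain thresholding
combined_pass = sum_threshold_1d(row, m=3, chi=2.5)  # lowered threshold, wider window
print(single_pass, combined_pass)
```

In this example the three samples of height 3 stay below the single-sample threshold of 4, but are flagged together once the window covers all three and the threshold is lowered to 2.5.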

Figure 4.1: Diagram showing the structure of the detection and visualisation system

Prototype 1

To begin, we created a crude implementation of the algorithm as described in Chapter 3. In this process, some issues were discovered, such as:

1. It can be tricky to decide on a suitable thresholding value (χ) above which all data points are flagged. We decided to use statistical relevance checks. So the χ value is set to be the median value increased by 5σ. This is then decreased with each pass to 3σ, which gives the lowered χ value for the combinatorial step. This was discovered to be necessary when performing validation tests with a smooth increasing surface: the surface was flagged as RFI when the slope was positive.

2. The algorithm proves to be unreliable on the edges of the data, an acknowledged issue in signal processing, as there is insufficient data around the specific points to get an accurate median value. This issue can only be solved by counting the fringe values as unreliable, and measuring a little wider than is required for measurement.

3. Part of the optimization of this algorithm is determining the initial window size, as well as the rate of growth and the number of passes to be made. It is unreasonable to begin with a window of size one, which checks every point, as this will slow the algorithm to worse than real time. To achieve real time, a single row of data should be processed in a second or less. This problem is considered in the optimisations listed below.
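The threshold choice described in item 1 can be sketched as follows. The text states only that χ starts at the median plus 5σ and is lowered to 3σ over the passes, so the linear decrease used here is an assumption, as are the function name `chi_schedule` and its parameters.

```python
import numpy as np


def chi_schedule(row, n_passes, start_sigmas=5.0, end_sigmas=3.0):
    """Return one threshold per pass: median + 5 sigma on the first pass,
    lowered in equal steps (an assumption) to median + 3 sigma on the last."""
    row = np.asarray(row, dtype=float)
    median = np.median(row)
    sigma = np.std(row)
    multipliers = np.linspace(start_sigmas, end_sigmas, n_passes)
    return [float(median + mult * sigma) for mult in multipliers]


# Toy row with median 1 and standard deviation 1.
thresholds = chi_schedule([0.0, 2.0], n_passes=3)
print(thresholds)
```

Because the threshold is tied to the median and spread of the row rather than to a fixed value, a smooth rise in the baseline moves the threshold with it.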

The original implementation took a long time to run.

Optimisation 1

The χ calculation was modified to be independent of the window size, which reduced the χ calculation to constant time. This reduces the complexity of the algorithm, and increases its speed. The second optimisation changed the step size of the window. As stepping through every point multiple times is inefficient, this was changed to begin with a step size of 6, which then increases with every pass so that larger windows have a larger step size. This optimisation cut running time down to below 30 minutes on average for data collected over one hour, that is, half of real time. This also allows a user to decide whether accuracy or time is more important. The step size and number of passes can be parametrized to allow a user to set their own values: a user looking for high accuracy will set the step size very low and the number of passes higher.

Optimisation 2

Further testing after optimisation 1 brought some glaring errors to light. Optimisation 1 was tested before the correct version of χ was used. Changing the value of χ meant that the subroutine for the combinatorial flagging in the window needed to be reviewed. The original method was:

    flag window:
        for point in array:
            if point not flagged:
                add to sum
                increase counter
        if abs(sum) > abs(counter * chi):
            flag entire window

This does not work, as the majority of the data is negative, but there are some RFI spikes which are positive. To account for these negatives the algorithm was modified to

    flag window:
        for point in array:
            if point not flagged:
                sum += (point - chi)
        if sum > 0:
            flag entire window

This gives the sum of the distances of the points from the threshold value. So points which are below the value will have a negative impact, and those which are above will have a positive impact on the sum.
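The modified subroutine can be written out as a short runnable sketch. It is not the code from the appendices: the names `flag_window` and `sum_threshold_pass`, the boolean flag array and the sample values are assumptions made for illustration. It also shows why the absolute-value test misbehaves on negative data: a quiet point at -90 satisfies abs(-90) > abs(-80) even though it lies below the threshold.

```python
import numpy as np


def flag_window(values, flags, start, length, chi):
    """Flag the whole window if the unflagged points, summed as distances
    from chi, come out positive. Modifies `flags` in place."""
    window = slice(start, start + length)
    unflagged = ~flags[window]
    total = np.sum(values[window][unflagged] - chi)
    if total > 0:
        flags[window] = True
        return True
    return False


def sum_threshold_pass(values, flags, length, chi):
    """One pass over a row, stepping by the window length so that
    no point is skipped."""
    for start in range(0, len(values), length):
        flag_window(values, flags, start, length, chi)
    return flags


# Mostly negative data with one spike that is still negative in absolute terms.
values = np.array([-90.0, -91.0, -60.0, -89.0, -90.0, -92.0])
flags = np.zeros(values.size, dtype=bool)
sum_threshold_pass(values, flags, length=1, chi=-80.0)
print(flags)
```

Only the sample at -60, which lies above the threshold of -80, is flagged; the quieter samples contribute negative distances and are left alone.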
The main method was also modified to force the step size to be equal to the length of the window, ensuring that no points are ever missed.

4.3 Surface and dilation algorithm (discontinued)

This algorithm was originally going to be based on the AOFlagger model explained in Chapter 2, which fits the data to some surface and then expands all the flagged areas based on the assumption that RFI will occur in larger regions than are actually detected. Using spline interpolation it is possible to create a smoothed version of the data against which to perform checks. The pseudo code for this original algorithm is as follows:

    Row-wise repeat:
        Do (at least twice):
            - Replace flagged data with median value
            - Create spline interpolated surfaces
            - Compare values between interpolations, flagging those beyond a certain level.
        end do
    end repeat

This method, which interpolates every row of data, takes a very long time to run (around 2 days). This time is unacceptable, and the method does not give sufficient accuracy to warrant longer than real time processing. After discovering that it takes about 20s to perform a spline interpolation on a single row of data, the algorithm was rethought a little. This involved preprocessing the data to act as a smoothed surface, which in itself takes a long time, but that one file can be used to process many different data sets. This improves the speed greatly, so that a detection for an hour's worth of data takes only a few minutes. Unfortunately the algorithm has moved away from the original idea, and is no longer very different from a basic thresholding algorithm. At this point we discarded this algorithm and moved to the variable window method.

4.4 Variable window algorithm

Prototype 1

A system with a fixed window size which ran through the entire surface was initially built. Some noteworthy errors were made during the implementation. The first error was that the window moved diagonally through the data, only looking at a band from the top left to the bottom right. The second thing that required some time to solve was the necessity of a standard deviation calculation with a predefined mean value. There is a standard method for doing this in Python 3, but not in Python 2.
Porting the algorithm to Python 3 was considered, but the difference between Python 3 and Python 2 is sufficiently large that this became infeasible very quickly. So it was necessary to write a standard deviation method.

Optimisation

From the system with a fixed window, adding in the window variations was fairly simple. The system takes three steps to produce an accurate representation of the rate of change in the standard deviation, and uses an average over the last three steps to calculate this value. A look up is then used to determine the window size of the next step. The smallest window is , which is for a rate of change greater than 2. The middle size is , for a rate above 1. The largest size is for all smaller rates of change.
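The two pieces described above, a standard deviation about a predefined value and the look up of the next window size, can be sketched as below. This is an illustration under stated assumptions and not the project code: the function names are hypothetical, the three window sizes are placeholder values, and the rate of change is taken to be the mean absolute difference between the last four standard deviations, which is one reading of "an average over the last three steps".

```python
import numpy as np


def std_about(values, baseline):
    """Standard deviation of `values` about a given baseline (a scalar or an
    array of the same shape) instead of about their own mean."""
    values = np.asarray(values, dtype=float)
    return float(np.sqrt(np.mean((values - baseline) ** 2)))


def next_window_size(recent_sigmas, sizes=(8, 32, 128)):
    """Choose the next window size from the average change in sigma over the
    last three steps. `sizes` holds hypothetical (small, middle, large) values."""
    changes = np.abs(np.diff(recent_sigmas[-4:]))
    rate = changes.mean() if changes.size else 0.0
    small, middle, large = sizes
    if rate > 2:
        return small
    if rate > 1:
        return middle
    return large


def flag_outliers(window, baseline, n_sigma=5.0):
    """Flag points lying more than n_sigma standard deviations from the baseline."""
    window = np.asarray(window, dtype=float)
    sigma = std_about(window, baseline)
    return np.abs(window - baseline) > n_sigma * sigma


# A flat window of 100 points with one spike, compared against a baseline of zero.
window = np.zeros(100)
window[10] = 10.0
mask = flag_outliers(window, baseline=0.0)
print(int(mask.sum()), next_window_size([1.0, 4.0, 7.0, 10.0]))
```

In the variable window method the baseline would be the corresponding part of the smoothed surface rather than a constant.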

The process is repeated three times, which gives reasonable accuracy, and only requires about 20 minutes of processing time on data collected in one hour.

4.5 Conclusions

The implementation necessitated adaptation of the original design, which allowed for better algorithms to be developed. These algorithms are based on the ideas used in the originals, but put the ideas together in a slightly different way which is more appropriate for the data being processed. Through this procedure, we ended up with two viable algorithms: the SumThreshold algorithm and the variable window algorithm. These algorithms were then thoroughly checked, as is discussed in the next chapter.

Chapter 5 Validation

Here we discuss validation of the two algorithms developed for RFI detection, the SumThreshold method and the variable window method. By running simulations we are able to determine accurately which RFI each algorithm is able to detect, and to what extent that RFI is detected. We are also able to determine the sensitivity of each algorithm, and the accuracy of the flagging, which gives us a feel for when we can expect false positives from the algorithm. The tests discussed in this chapter contain the most important information found through the simulations. Further test results are provided in Appendix A.

5.1 Methods

To check that the output is correct, we create specific test data containing values similar to the real input data, with RFI signals in known positions. This is done by generating Gaussian noise with fake RFI signals added in known places. If the implementation correctly flags this data, it can be considered to be working correctly.

We made use of spline fitting and medians to smooth data. This gives a realistic smooth surface which can be used for testing, as the real data will be based on such a shape. The method of smoothing the data was as follows:

- Run a window across all data, finding the median.
- Create a data file containing these median values.
- Perform a Bivariate Spline on the data file, smoothing value: s=0.5
- Save the new spline surface as the smoothed data surface.

On top of this smoothed surface, white noise is added, which emulates astronomical data, which is often treated as Gaussian noise[12].

We also create a different type of surface to test the methods on. This is done with a perfectly flat surface, where the baseline of the values is set to zero.
We then set specific values to be RFI spikes which should be picked up by the detection method.

Determining success

To determine success we will first compare the results of running each algorithm over time and frequency separately. We will then compare the results of the algorithms to each other. We will also compare each algorithm to a kurtosis algorithm (supplied). We will then decide if the differences in performance allow for the combination of the algorithms to create a better method, and include characterisation of the signals.

5.2 Tests

Figure 5.1: a) Data in the general shape of real data, but with RFI removed, and noise added. b) The mask produced by the SumThreshold Algorithm. c) The mask produced by the variable window algorithm.

Test one tested the algorithms on a non-uniform surface with no RFI. This was done with data in the same shape as the real data, but which was smoothed and then had noise added, as can be seen in Fig 5.1a. The expected outcome for this test was two perfectly empty masks. The SumThreshold method provided exactly that (Fig 5.1b), but the variable window method flagged areas of the data (Fig 5.1c). The bands marked 2 and 3 in Figure 5.1c correspond to points where the data steps steeply, suggesting that there is a weakness in the variable window method when the data steps. This leads to the inclusion of false positives in the mask in these areas. This means that the method should be validated either with another detection algorithm, or by observing the data. The bands marked 1 and 4 have less obvious causes, but the cause is similar. They are both on a steep upward slope of the data, and so the algorithm is very sensitive to this type of change.

Test two is designed to test the sensitivity of both algorithms to narrow band, isolated RFI. This is done by using the same surface as in test one, with a single frequency channel including RFI. This is shown in Figure 5.2a, at the point labelled RFI.
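Test data of the kind described in Section 5.1 can be generated along the following lines. This is a minimal sketch for the flat, zero-baseline surface only: Gaussian noise with RFI added to chosen frequency channels or time rows, together with a truth mask of the known positions. The function name `make_test_data`, its parameters and the amplitude used are assumptions made for illustration, not the values used in the tests.

```python
import numpy as np


def make_test_data(n_times, n_channels, rfi_channels=(), rfi_rows=(),
                   amplitude=10.0, noise_sigma=1.0, seed=0):
    """Return (data, truth): Gaussian noise about a zero baseline with RFI of
    the given amplitude added to whole channels (columns) or whole rows."""
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, noise_sigma, size=(n_times, n_channels))
    truth = np.zeros((n_times, n_channels), dtype=np.uint8)
    for channel in rfi_channels:
        data[:, channel] += amplitude   # narrow band RFI, present at all times
        truth[:, channel] = 1
    for row in rfi_rows:
        data[row, :] += amplitude       # RFI present at a single time
        truth[row, :] = 1
    return data, truth


# One contaminated frequency channel, as in the narrow band test.
data, truth = make_test_data(n_times=20, n_channels=50, rfi_channels=[7])
print(data.shape, int(truth.sum()))
```

The truth mask uses the same convention as the output masks (1 for RFI, 0 otherwise), so a detection mask can be compared against it element by element.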

Figure 5.2: a) Data in the general shape of real data, with a single RFI spike, and noise. b) The mask produced by the SumThreshold Algorithm. c) The mask produced by the variable window algorithm.

Figure 5.2 shows the increased sensitivity of the variable window method, as the single spike is flagged in a very narrow band, whereas the SumThreshold method picks it up with a wider band. This is because the SumThreshold method is unable to pick up the spike until its window has expanded to a size larger than the width of the spike, and the entire spike is enclosed by the window. This shows that it would be possible to increase the accuracy of the SumThreshold method by decreasing the step size of the window position, although this would also increase the run time.

The third test makes use of the second surface. Data with a baseline of zero and Gaussian noise is tested. The expected outcome of this test is that neither algorithm will flag any data points. Figure 5.3 shows that this test produced the expected results. Both masks are completely empty. This is a good thing, as it means that the algorithms are checking only for signals which differ from the median value by a statistically significant amount. This is relevant as one of the earliest iterations of the development did not have this property, and would have found RFI in this surface.

Figure 5.3: a) Data with a baseline of zero, and noise. b) The mask produced by the SumThreshold Algorithm. c) The mask produced by the variable window algorithm.

The fourth test is designed to test the sensitivity of the algorithms to a group of spikes close together. Such a group is in danger of being treated by the algorithms as noise with a very high standard deviation. We expect that the SumThreshold method will flag the entire band in which the group is found, and the variable window method will flag the individual spikes. Figure 5.4b shows that the SumThreshold method did not flag any values. This means that the algorithm falls into the trap of treating a group of spikes as noise with a very high standard deviation. The variable window method, however, acts as expected, flagging something in the same place as the RFI. A zoomed in view (Fig 5.5) shows that the variable window in fact flagged exactly the RFI, and so created a distinctive striped pattern.

Figure 5.4: a) Data with a baseline of zero, a family of spikes, and noise. b) The mask produced by the SumThreshold Algorithm. c) The mask produced by the variable window algorithm.

The final test is designed to test the sensitivity of the algorithms in the time dimension,

as there is some RFI which is visible only in the time domain. To perform this test, the surface resembling real data is used, and three rows are seeded with RFI, uniformly along the row. This produces the three horizontal lines marked RFI in Figure 5.6a.

Figure 5.5: a) A zoomed view of the family of spikes. b) A zoomed view of the stripes displayed by the variable window mask.

Figure 5.6: a) The data explored. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm.

We expect that the SumThreshold algorithm will be unable to detect these lines because it processes the data set one row at a time. However, we also test the performance of the SumThreshold method on a transpose of the dataset, and expect to see that the RFI is detected, as it is narrow band in the time domain. We also expect that the variable window method will successfully flag the three lines.

We see the results of this test in Figure 5.6. As expected, there is nothing flagged in Fig 5.6b, which is the SumThreshold mask. We can also see that the variable window method performed almost as expected. There are three lines marked RFI in Fig 5.6c; however, these lines have gaps in them, which seem to be related to the false positive band just to their left. Figure 5.7 shows the results of running the SumThreshold algorithm on the transposed data. We can see that it performed as expected, flagging the three lines accurately. We can also see in Figure 5.8 that the two masks that detected the horizontal lines detected them in the same place.

Figure 5.7: The SumThreshold mask searching for transient RFI

Figure 5.8: A complete mask, created by combining the SumThreshold (transposed and not) and the variable window masks.

5.3 Discussion

We can see through this validation process that both the SumThreshold and the variable window methods have some shortcomings. The SumThreshold method is not as sensitive as it is expected to be, and does not deal appropriately with groups of RFI. The variable window method is perhaps too sensitive, as it is liable to show false positives when the data increases steeply. The overall sensitivity of the variable window method makes it a very good first method to use for a broad understanding of where there is RFI in the data, but a second method should be used to validate the broader bands of flagged data, as this is where the false positives appear. The two methods are both capable of finding RFI in both the time

domain and the frequency domain, although the variable window method does so more efficiently. The SumThreshold method, while requiring that the algorithm is run on the transpose of the data, does find the time based RFI as accurately as it finds frequency based RFI.

These tests reveal some of the characteristics of the RFI detected by the two algorithms. The SumThreshold algorithm detects predominantly isolated, narrowband RFI. The variable window method is very sensitive, and is able to detect almost any RFI, but has some false positives which could be mistaken for broadband RFI. This can be used when combating the source of the RFI. Knowing the type of the RFI being detected is helpful in narrowing down the possible sources.
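The combination of masks and the use of agreement between the methods as a rough characterisation can be sketched as follows. This is an illustration only: the function names are hypothetical, and it is an assumption that the mask from the transposed run is stored in transposed orientation and must be transposed back before combining.

```python
import numpy as np


def combine_masks(sumthreshold_mask, sumthreshold_transposed_mask,
                  variable_window_mask):
    """Union of the three masks: a point is flagged if any method flags it."""
    combined = np.logical_or(sumthreshold_mask, sumthreshold_transposed_mask.T)
    combined = np.logical_or(combined, variable_window_mask)
    return combined.astype(np.uint8)


def agreement(mask_a, mask_b):
    """Points flagged by both methods, for example isolated narrowband RFI."""
    return np.logical_and(mask_a, mask_b).astype(np.uint8)


# Toy masks for two time rows and three frequency channels.
st = np.array([[1, 0, 0], [0, 0, 0]])
st_transposed = np.array([[0, 0], [0, 1], [0, 0]])   # shape (channels, times)
vw = np.array([[1, 0, 1], [0, 0, 0]])

complete = combine_masks(st, st_transposed, vw)
both = agreement(st, vw)
print(complete.tolist(), both.tolist())
```

Points in the agreement mask are candidates for narrow, isolated RFI, while points flagged only by the variable window method are the ones that most need a second check for false positives.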

Chapter 6 Results

In this chapter we discuss some case studies for the two RFI detection algorithms developed, to determine the qualitative difference between the algorithms. Each case highlights some feature of, or difference between, the algorithms. The first study is a standard case, with very few features in the time dimension and standard features in the frequency dimension. The next study adds RFI in the time dimension. The third study is a fairly arbitrary choice of data, to show interesting effects. All three case studies are taken from real data collected on the site of the MeerKAT telescope, and made available by the SKA offices. We then compare the performance of the two algorithms, and relate this back to the analysis in Chapter 3.

6.1 Case Study 1

In this instance we consider a fairly typical data set (Figure 6.1(a)a). There are no major discrepancies in the time domain, and the RFI seen in the frequency domain is present in most of the other data sets as well. The first thing to note is that there are some lines on the data image which are clearly RFI. One set of such lines is marked on the figure. These lines show up as much darker than the rest of the image as they have a higher intensity. There are also some broad bands where the intensity increases. These are not broadband RFI, but rather baseline wiggles, also marked on the figure. These should not be flagged, as they correspond to trends in the baseline noise, rather than unusual occurrences.

The second mask (Figure 6.1(a)c), produced by the variable window algorithm, contains more flagged points. The first bar of masked points in the variable window mask (labelled 1) is not shared by the SumThreshold mask, and is also not visible in the data. The variable window method does create false positives under certain circumstances (see Chapter 5), and it is possible that this data includes those. After that follow some faint lines, marked 2. There are more of these lines on the variable window mask (Fig 6.1(a)c), but there are a few on the SumThreshold mask (Fig 6.1(a)b) as well. Those which are on both masks are tall, thin, isolated spikes. These are picked up very effectively by both algorithms, and can be used as a type of characterisation, as we know that if both methods flag the spike it must be a narrow and isolated type of RFI.

The SumThreshold algorithm cannot detect a family of spikes (marked 3), because the system sees them as simply noise with a very high standard deviation. However, the variable window method is able to pick them up, as a specific type of RFI, leaving a

distinctive pattern of stripes. This is one of the greatest failings of the SumThreshold method in its current form. A better version would be able to detect that the group of

Figure 6.1: An ordinary data set with typical RFI in the frequency domain, and minimal RFI in the time domain, along with the masks produced by the algorithms designed. (a) a) The data explored. b) The mask produced by the SumThreshold algorithm. c) The mask produced by the variable window algorithm. (b) The SumThreshold mask searching for transient RFI. (c) A complete mask, created by combining the SumThreshold (transposed and not) and the variable window masks.
