Machine Learning & Science Data Processing

Machine Learning & Science Data Processing Rob Lyon robert.lyon@manchester.ac.uk SKA Group University of Manchester

Machine Learning (1) Collective term for branch of A.I. Uses statistical tools to make decisions over data intelligently. Appearance of intelligence is an illusion backed up by functions. So how does it work?

Machine Learning (2) Redesign Intensity Phase Training Validation MAP Design Evaluate Build

Machine Learning & SDP

Machine Learning & SDP SDP converts / filters CSP data in to products useful for science.

Machine Learning & SDP SDP converts / filters CSP data in to products useful for science. Includes pulsar timing, single pulse search (transients signals, FRBs) and periodicity search (pulsars).

Machine Learning & SDP SDP converts / filters CSP data in to products useful for science. Includes pulsar timing, single pulse search (transients signals, FRBs) and periodicity search (pulsars). For single pulse and periodicity search, CSP data products describe potential observations of astrophysical phenomena - new discoveries?

Existing Approaches Applied to candidate selection for single pulse and periodicity search. Supervised machine learning algorithms. Learn from fixed-size training sets of examples. Variety of algorithms used, with varying computational requirements. 15 10 5 0 Sub-Band Number 30 20 10 0 Sub-Int Number 4.0e-032.0e-03 0 Amplitude Sub-Bands Bin Number Sub-Integrations Bin Number Profile A 0 30 60 90 120 150 180 B 0 30 60 90 120 150 180 C 0 0.20 0.40 0.60 0.80 1 1.20 1.40 Phase 100 80 60 40 30 20 10 0 DM SNR Period-DM Plane -4.0e+10-2.0e+10 0 2.0e+10 4.0e+10 6.0e+10 Topo Period Offset from 361.91384238ms DM Curve 0 100 200 300 400 500 600 700 800 900 DM Bary Period Bary Freq Topo Period Topo Freq Acc Jerk DM width SNR Initial 361.921792414 2.763027872 361.91384238 2.763088566 0.00 0.00 74.91 N/A 34.36 D E Optimised 361.922122747 2.76302535 361.91384238 2.763088566 0.00 0.00 72.73 0.01 45.04 ms Hz ms Hz m/s/s m/s/s/s cm/pc periods periods

Which method?

Issues With ML at Scale

Issues with ML at SKA Scale ML typically very accurate if training data is good. Problems: 1. Not optimised to minimise resource use. 2. Non-adaptive, and retraining with more examples can be expensive (depending on the algorithm). Other issues: training data hard to obtain, classifier decisions often hard to audit.

Practical Issues with ML at SKA Scale (1) Adapting to distributional change advantageous. Rapidly adapting to new training examples important for discovery. 1 Period SNR Mean value (normalised) 0.8 0.6 0.4 0.2 Feature 1 Feature 2 Feature 12 Feature 13 Feature 14 0 DM 200 400 600 800 1000 Sample number

Structural Issues with ML at SKA Scale (2) Performance issues 3000 Small sample problem Small disjuncts due to imbalance. How to acquire training examples? How to incorporate expert feedback? How to audit Chi squared value from fitting sine-squared curve to pulse profile 2500 2000 1500 Class overlap classifications? 1000 500 Pulsar Non-pulsar 500 1000 1500 Chi squared value from fitting sine curve to pulse profile. 1500

Exploring solutions

Possible SDP Approach Data stream learning methods. Very low resource requirements. Able to adapt to concept drift. Able to learn from new training examples observed over time. Classifier Predict Store a) Incremental Model Train Test Report Performance b) Batch Model Report Performance

Incremental Stream Prototype Process OCLDs Pre-process data ML prediction Source matching 1 1 1 1 1 1 Archive 1 1 1 OCLD Spouts Key n 0 n 1 2 n 2 2 n 3 2 n 4 2 n 5 2 n 6 2 n 7 n 8 Telescope Manager / Science Teams Bolts Sift per beam Feature extraction Second sift Alerts Spouts Auditing data Primary data Auditing nodes (deactivated during normal execution)

Stream Classifier: GH-VFDT

Algorithm Performance See Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach, Lyon et al, accepted for publication in MNRAS, 2016. Other results in: Hellinger Distance Trees for Imbalanced Streams, Lyon et al., ICPR, 2014.

Prototype Performance Local tests on a single machine (Quad Core i7) 1,000 candidates per second Approx. 2 seconds for a candidate to move through the system. Relatively easy to configure & program. Possible problems: you may want to send more than a tuple. you may want data to go backwards.

Open Questions How to acquire unlimited supply of accurately labelled data? To what extent does pulsar data drift? Will a multi-class approach improve classification performance? What will the final compute environment look like? How do we keep track of training data and validate our approaches? Are our features good enough?

Summary Structural issues with ML at scale Practical issues with ML at scale Success depends on understanding the data Prototype pipelines under development Thanks for listening! robert.lyon@manchester.ac.uk

Candidate Numbers

Tree Classification Example x 1 x 2 x 3 x 4 x 5 x n h(cm) w(kg) l(cm) h > 50? Predictions x 1 x 2 x 3 32.1 4.8 41.5 67.9 14.2 109 35.3 5.1 48.5 No Yes x 1 x 2 x 3 A B A No w > 5? l > 49? Yes No Yes A B A B

Apache Storm Cluster Master node Nimbus Zookeeper Zookeeper cluster Zookeeper Zookeeper JVM JVM Worker Process Worker Process Executor Executor Spout Spout Bolt Bolt JVM Worker Process Executor Spout Bolt Executor Executor Executor Bolt Bolt Bolt Worker node Worker node Worker node Supervisor Supervisor Supervisor JVM JVM Worker Process Worker Process Executor Executor Spout Spout Bolt Bolt JVM Worker Process Executor Spout Bolt Executor Executor Executor Bolt Bolt Bolt 1 n