Machine Learning & Science Data Processing

Similar documents
Streaming & Apache Storm

FROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà

Apache Storm. Hortonworks Inc Page 1

Scalable Streaming Analytics

Paper Presented by Harsha Yeddanapudy

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors.

Tutorial: Apache Storm

SDP Memo 40: PSRFITS Overview for NIP

Apache Storm: Hands-on Session A.A. 2016/17

REAL-TIME ANALYTICS WITH APACHE STORM

Practical Machine Learning Agenda

Twitter Heron: Stream Processing at Scale

Real-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b

Adaptive selfcalibration for Allen Telescope Array imaging

Installing and Configuring Apache Storm

Flying Faster with Heron

Data Processing for the Square Kilometre Array Telescope

10/26/2017 Sangmi Lee Pallickara Week 10- B. CS535 Big Data Fall 2017 Colorado State University

Data Analytics with HPC. Data Streaming

SKA Regional Centre Activities in Australasia

Strategies for real-time event processing

R-Storm: A Resource-Aware Scheduler for STORM. Mohammad Hosseini Boyang Peng Zhihao Hong Reza Farivar Roy Campbell

FACT-Tools. Processing High- Volume Telescope Data. Jens Buß, Christian Bockermann, Kai Brügge, Maximilian Nöthe. for the FACT Collaboration

Hardware Acceleration of Pulsar Search on FPGAs using OpenCL

Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework

REAL-TIME PEDESTRIAN DETECTION USING APACHE STORM IN A DISTRIBUTED ENVIRONMENT

A Study of Load Balancing between Sensors and the Cloud for a Real-Time Video Streaming Analysis Application Framework

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

Time Domain Alerts from LSST & ZTF

SKA SDP : A snapshot of recent technical directions and conclusions

PARALLEL CLASSIFICATION ALGORITHMS

Computational issues for HI

Hortonworks Cybersecurity Package

Data mining and Knowledge Discovery Resources for Astronomy in the Web 2.0 Age

A Decision Support System for Automated Customer Assistance in E-Commerce Websites

UMP Alert Engine. Status. Requirements

Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

Square Kilometre Array

PaaS SAE Top3 SuperAPP

10/24/2017 Sangmi Lee Pallickara Week 10- A. CS535 Big Data Fall 2017 Colorado State University

Working with Storm Topologies

New Zealand Involvement in Solving the SKA Computing Challenges

Priority Based Resource Scheduling Techniques for a Multitenant Stream Processing Platform

Scale-Out Algorithm For Apache Storm In SaaS Environment

This Time. fmri Data analysis

CorrelX: A Cloud-Based VLBI Correlator

A Systematic Overview of Data Mining Algorithms

STORM AND LOW-LATENCY PROCESSING.

Big Data Infrastructures & Technologies

5 member consortium o University of Strathclyde o Wideblue o National Nuclear Laboratory o Sellafield Ltd o Inspectahire

2/20/2019 Week 5-B Sangmi Lee Pallickara

A Systematic Overview of Data Mining Algorithms. Sargur Srihari University at Buffalo The State University of New York

ASKAP Pipeline processing and simulations. Dr Matthew Whiting ASKAP Computing, CSIRO May 5th, 2010

Evaluation of Apache Kafka in Real-Time Big Data Pipeline Architecture

Applying Model Predictive Control Architecture for Efficient Autonomous Data Collection and Operations on Planetary Missions

Functional Comparison and Performance Evaluation. Huafeng Wang Tianlun Zhang Wei Mao 2016/11/14

NOWADAYS, many modern applications require largescale

An Evaluation of Data Stream Processing Systems for Data Driven Applications

Community edition(open-source) Enterprise edition

Accelerating Convolutional Neural Nets. Yunming Zhang

PARALLEL PROGRAMMING MANY-CORE COMPUTING: THE LOFAR SOFTWARE TELESCOPE (5/5)

The PRISM infrastructure System Architecture and User Interface

Source, Sink, and Processor Configuration Values

INTERFERENCE. where, m = 0, 1, 2,... (1.2) otherwise, if it is half integral multiple of wavelength, the interference would be destructive.

Functional Comparison and Performance Evaluation 毛玮王华峰张天伦 2016/9/10

Accelerating the acceleration search a case study. By Chris Laidler

Adaptive Online Scheduling in Storm

Introduction to Data Mining and Data Analytics

PARALLEL ID3. Jeremy Dominijanni CSE633, Dr. Russ Miller

Mining Data Streams. From Data-Streams Management System Queries to Knowledge Discovery from continuous and fast-evolving Data Records.

Introduction to Parallel Programming

Computer organization by G. Naveen kumar, Asst Prof, C.S.E Department 1

WP 14 and Timing Sync

Developing A Universal Radio Astronomy Backend. Dr. Ewan Barr, MPIfR Backend Development Group

Over the last few years, we have seen a disruption in the data management

Specialist ICT Learning

ME scope Application Note 36 ODS & Mode Shape Animation

AudioSet: Real-world Audio Event Classification

Data Quality Monitoring at CMS with Machine Learning

Chapter 5: Stream Processing. Big Data Management and Analytics 193

Cloudline Autonomous Driving Solutions. Accelerating insights through a new generation of Data and Analytics October, 2018

Apache Flink. Alessandro Margara

OPERATIONALIZING MACHINE LEARNING USING GPU ACCELERATED, IN-DATABASE ANALYTICS

ksa 400 Growth Rate Analysis Routines

SKA Technical developments relevant to the National Facility. Keith Grainge University of Manchester

Audio-coding standards

Status hardware, software and what next? Harry Keizer 1, Simon Bijlsma 1, Daniela de Paulis 1,3,Marc Wolf 1,2.

Apache Flink. Fuchkina Ekaterina with Material from Andreas Kunft -TU Berlin / DIMA; dataartisans slides

Computing seismic QC attributes, fold and offset sampling in RadExPro software Rev

Challenges in Data Stream Processing

Articulated Pose Estimation with Flexible Mixtures-of-Parts

Hadoop ecosystem. Nikos Parlavantzas

DESIGN AND IMPLEMENTATION OF VISUAL FEEDBACK FOR AN ACTIVE TRACKING

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

1 Case study of SVM (Rob)

Hortonworks Cybersecurity Platform

Intelligent Services Serving Machine Learning

IPTA student workshop 2017: Day 1, preparing timing data with psrchive

Spark Overview. Professor Sasu Tarkoma.

GeoLas Consulting All rights reserved. LiDAR GeocodeWF Rev. 03/2012 Specifications subject to change without notice.

Transcription:

Machine Learning & Science Data Processing Rob Lyon robert.lyon@manchester.ac.uk SKA Group University of Manchester

Machine Learning (1) Collective term for branch of A.I. Uses statistical tools to make decisions over data intelligently. Appearance of intelligence is an illusion backed up by functions. So how does it work?

Machine Learning (1) Collective term for branch of A.I. Uses statistical tools to make decisions over data intelligently. Appearance of intelligence is an illusion backed up by functions. So how does it work? f(x) f(x) 0 f(x) 00 Observe Collect Data Build Model

Machine Learning (2) Redesign Intensity Phase Training Validation MAP Design Evaluate Build

Machine Learning & SDP

Machine Learning & SDP SDP converts / filters CSP data in to products useful for science.

Machine Learning & SDP SDP converts / filters CSP data in to products useful for science.

Machine Learning & SDP SDP converts / filters CSP data in to products useful for science.

Machine Learning & SDP SDP converts / filters CSP data in to products useful for science. Includes pulsar timing, single pulse search (transients signals, FRBs) and periodicity search (pulsars).

Machine Learning & SDP SDP converts / filters CSP data in to products useful for science. Includes pulsar timing, single pulse search (transients signals, FRBs) and periodicity search (pulsars). For single pulse and periodicity search, CSP data products describe potential observations of astrophysical phenomena - new discoveries?

Existing Approaches Applied to candidate selection for single pulse and periodicity search. Supervised machine learning algorithms. Learn from fixed-size training sets of examples. Variety of algorithms used, with varying computational requirements. 15 10 5 0 Sub-Band Number 30 20 10 0 Sub-Int Number 4.0e-032.0e-03 0 Amplitude Sub-Bands Bin Number Sub-Integrations Bin Number Profile A 0 30 60 90 120 150 180 B 0 30 60 90 120 150 180 C 0 0.20 0.40 0.60 0.80 1 1.20 1.40 Phase 100 80 60 40 30 20 10 0 DM SNR Period-DM Plane -4.0e+10-2.0e+10 0 2.0e+10 4.0e+10 6.0e+10 Topo Period Offset from 361.91384238ms DM Curve 0 100 200 300 400 500 600 700 800 900 DM Bary Period Bary Freq Topo Period Topo Freq Acc Jerk DM width SNR Initial 361.921792414 2.763027872 361.91384238 2.763088566 0.00 0.00 74.91 N/A 34.36 D E Optimised 361.922122747 2.76302535 361.91384238 2.763088566 0.00 0.00 72.73 0.01 45.04 ms Hz ms Hz m/s/s m/s/s/s cm/pc periods periods

Existing Approaches Applied to candidate selection for single pulse and periodicity search. Supervised machine learning algorithms. Learn from fixed-size training sets of examples. Variety of algorithms used, with varying computational requirements.

Which method?

Issues With ML at Scale

Issues with ML at SKA Scale ML typically very accurate if training data is good. Problems: 1. Not optimised to minimise resource use. 2. Non-adaptive, and retraining with more examples can be expensive (depending on the algorithm). Other issues: training data hard to obtain, classifier decisions often hard to audit.

Practical Issues with ML at SKA Scale (1) Adapting to distributional change advantageous. Rapidly adapting to new training examples important for discovery. 1 Period SNR Mean value (normalised) 0.8 0.6 0.4 0.2 Feature 1 Feature 2 Feature 12 Feature 13 Feature 14 0 DM 200 400 600 800 1000 Sample number

Structural Issues with ML at SKA Scale (2) Performance issues 3000 Small sample problem Small disjuncts due to imbalance. How to acquire training examples? How to incorporate expert feedback? How to audit Chi squared value from fitting sine-squared curve to pulse profile 2500 2000 1500 Class overlap classifications? 1000 500 Pulsar Non-pulsar 500 1000 1500 Chi squared value from fitting sine curve to pulse profile. 1500

Exploring solutions

Possible SDP Approach Data stream learning methods. Very low resource requirements. Able to adapt to concept drift. Able to learn from new training examples observed over time. Classifier Predict Store a) Incremental Model Train Test Report Performance b) Batch Model Report Performance

Incremental Stream Prototype Process OCLDs Pre-process data ML prediction Source matching 1 1 1 1 1 1 Archive 1 1 1 OCLD Spouts Key n 0 n 1 2 n 2 2 n 3 2 n 4 2 n 5 2 n 6 2 n 7 n 8 Telescope Manager / Science Teams Bolts Sift per beam Feature extraction Second sift Alerts Spouts Auditing data Primary data Auditing nodes (deactivated during normal execution)

Stream Classifier: GH-VFDT

Algorithm Performance See Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach, Lyon et al, accepted for publication in MNRAS, 2016. Other results in: Hellinger Distance Trees for Imbalanced Streams, Lyon et al., ICPR, 2014.

Prototype Performance Local tests on a single machine (Quad Core i7) 1,000 candidates per second Approx. 2 seconds for a candidate to move through the system. Relatively easy to configure & program. Possible problems: you may want to send more than a tuple. you may want data to go backwards.

Open Questions How to acquire unlimited supply of accurately labelled data? To what extent does pulsar data drift? Will a multi-class approach improve classification performance? What will the final compute environment look like? How do we keep track of training data and validate our approaches? Are our features good enough?

Summary Structural issues with ML at scale Practical issues with ML at scale Success depends on understanding the data Prototype pipelines under development Thanks for listening! robert.lyon@manchester.ac.uk

Candidate Numbers

Tree Classification Example x 1 x 2 x 3 x 4 x 5 x n h(cm) w(kg) l(cm) h > 50? Predictions x 1 x 2 x 3 32.1 4.8 41.5 67.9 14.2 109 35.3 5.1 48.5 No Yes x 1 x 2 x 3 A B A No w > 5? l > 49? Yes No Yes A B A B

Apache Storm Cluster Master node Nimbus Zookeeper Zookeeper cluster Zookeeper Zookeeper JVM JVM Worker Process Worker Process Executor Executor Spout Spout Bolt Bolt JVM Worker Process Executor Spout Bolt Executor Executor Executor Bolt Bolt Bolt Worker node Worker node Worker node Supervisor Supervisor Supervisor JVM JVM Worker Process Worker Process Executor Executor Spout Spout Bolt Bolt JVM Worker Process Executor Spout Bolt Executor Executor Executor Bolt Bolt Bolt 1 n