FACT-Tools. Processing High- Volume Telescope Data. Jens Buß, Christian Bockermann, Kai Brügge, Maximilian Nöthe. for the FACT Collaboration

Similar documents
FACT-Tools Streamed Real-Time Data Analysis

The Evolution of Big Data Platforms and Data Science

Performance quality monitoring system (PQM) for the Daya Bay experiment

Modern Data Warehouse The New Approach to Azure BI

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

Unifying Big Data Workloads in Apache Spark

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Hadoop, Yarn and Beyond

SparkStreaming. Large scale near- realtime stream processing. Tathagata Das (TD) UC Berkeley UC BERKELEY

Distributed Archive System for the Cherenkov Telescope Array

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Webinar Series TMIP VISION

The age of Big Data Big Data for Oracle Database Professionals

BIG DATA COURSE CONTENT

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Big Streaming Data Processing. How to Process Big Streaming Data 2016/10/11. Fraud detection in bank transactions. Anomalies in sensor data

Practical Big Data Processing An Overview of Apache Flink

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

The ATLAS EventIndex: Full chain deployment and first operation

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING

Lecture 11 Hadoop & Spark

Big Data Architect.

Specialist ICT Learning

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

Streaming analytics better than batch - when and why? _Adam Kawa - Dawid Wysakowicz_

Chapter 4: Apache Spark

Spark, Shark and Spark Streaming Introduction

Social Network Analytics on Cray Urika-XA

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

Apache Kylin. OLAP on Hadoop

Scalable Tools - Part I Introduction to Scalable Tools

DATA SCIENCE USING SPARK: AN INTRODUCTION

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Employing HPC DEEP-EST for HEP Data Analysis. Viktor Khristenko (CERN, DEEP-EST), Maria Girone (CERN)

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

Oracle Big Data Connectors

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors.

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Towards a Real- time Processing Pipeline: Running Apache Flink on AWS

HDInsight > Hadoop. October 12, 2017

Exam Questions

Databases 2 (VU) ( / )

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

April Copyright 2013 Cloudera Inc. All rights reserved.

Data Analytics with HPC. Data Streaming

Modern Stream Processing with Apache Flink

Apache Spark 2.0. Matei

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Discretized Streams. An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

Processing of big data with Apache Spark

Data reduction for CORSIKA. Dominik Baack. Technical Report 06/2016. technische universität dortmund

Prototyping Data Intensive Apps: TrendingTopics.org

Cloud Computing at Yahoo! Thomas Kwan Director, Research Operations Yahoo! Labs

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

Processing big data with modern applications: Hadoop as DWH backend at Pro7. Dr. Kathrin Spreyer Big data engineer

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Data Architectures in Azure for Analytics & Big Data

Big Data Applications with Spring XD

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Processing 11 billions events a day with Spark. Alexander Krasheninnikov

A web portal to analyze and distribute cosmology data

Application-Aware SDN Routing for Big-Data Processing

iway iway Big Data Integrator New Features Bulletin and Release Notes Version DN

Talend Big Data Sandbox. Big Data Insights Cookbook

Machine Learning in Data Quality Monitoring

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Deep Learning Photon Identification in a SuperGranular Calorimeter

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Innovatus Technologies

Data Platforms and Pattern Mining

Hadoop course content

CSE 444: Database Internals. Lecture 23 Spark

Cloud Analytics and Business Intelligence on AWS

Big Data Infrastructure at Spotify

Welcome to the New Era of Cloud Computing

Digital Enterprise Platform for Live Business. Kevin Liu SAP Greater China, Vice President General Manager of Big Data and Platform BU

CSC 261/461 Database Systems Lecture 24. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

Lambda Architecture for Batch and Stream Processing. October 2018

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Spark Streaming. Guido Salvaneschi

MapR Enterprise Hadoop

Big Data Infrastructures & Technologies

Syncsort DMX-h. Simplifying Big Data Integration. Goals of the Modern Data Architecture SOLUTION SHEET

Oracle R Technologies

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM

/ Cloud Computing. Recitation 15 December 6 th 2016

Autopsy as a Service Distributed Forensic Compute That Combines Evidence Acquisition and Analysis

microsoft

Monitoring system for geographically distributed datacenters based on Openstack. Gioacchino Vino

Transcription:

FACT-Tools Processing High- Volume Telescope Data Jens Buß, Christian Bockermann, Kai Brügge, Maximilian Nöthe for the FACT Collaboration

Outline FACT telescope and challenges Analysis requirements Fact-Tools and Streams Big Data scalability 2

FACT - First G-APD Cherenkov Telescope Type: IACT with 1440 Pixels Location: La Palma, 2200m a.s.l. First Light: October 2011 Specialty: Semiconductor based Pixels (SiPM) Goal: Monitoring of bright TeV Blazars 3

FACT - First G-APD Cherenkov Telescope Type: IACT with 1440 Pixels Location: La Palma, 2200m a.s.l. First Light: October 2011 Specialty: Semiconductor based Pixels (SiPM) up to 1 TB of data per night Maximized observation time Goal: Monitoring of bright TeV Blazars 4

FACT - First G-APD Cherenkov Telescope Type: IACT with 1440 Pixels Location: La Palma, 2200m a.s.l. First Light: October 2011 Specialty: Semiconductor based Pixels (SiPM) Goal: Monitoring of bright TeV Blazars up to 1 TB of data per night Maximized observation time Low latency Analysis Results more than 50 MB/s 5

IACT Detection Principle Gammas Charged Particles 6

FACT Analysis Chain 7

FACT Analysis Chain Photon Image Cleaning Feature- Energy- Estimation Signal Separation 8

FACT Analysis Chain Photon Image Cleaning Feature- Energy- Estimation Signal Separation Arrival Time Num. Photons 9

FACT Analysis Chain Photon Image Cleaning Feature- Energy- Estimation Signal Separation 10

FACT Analysis Chain Photon Image Cleaning Feature- Energy- Estimation Signal Separation 11

FACT Analysis Chain Photon Image Cleaning Feature- Energy- Estimation Signal Separation Feature Representation ID M B F1 F2 Z #Isles Width 1 3,54 4,93 7,53 2,99 6,78 4,52 5,64 2 4,05 0,62 4,13 1,41 3,13 1,40 1,98 3 3,68 8,88 4,23 5,05 4,78 7,75 8,48 4 6,75 4,58 0,48 3,79 6,37 3,81 1,76 5 7,59 3,68 5,52 1,23 5,76 6,70 1,10 6 0,25 3,77 7,06 5,63 0,25 9,02 5,51 7 5,29 3,44 7,93 6,00 4,05 0,08 8,35 8 8,34 1,71 4,63 7,07 0,63 7,71 4,84 12

FACT Analysis Chain Photon Image Cleaning Feature- Energy- Estimation Signal Separation Feature Representation ID M B F1 F2 Z #Isles Width 1 3,54 4,93 7,53 2,99 6,78 4,52 5,64 2 4,05 0,62 4,13 1,41 3,13 1,40 1,98 3 3,68 8,88 4,23 5,05 4,78 7,75 8,48 4 6,75 4,58 0,48 3,79 6,37 3,81 1,76 5 7,59 3,68 5,52 1,23 5,76 6,70 1,10 MC: Train Random Forest Data: Apply 6 0,25 3,77 7,06 5,63 0,25 9,02 5,51 7 5,29 3,44 7,93 6,00 4,05 0,08 8,35 8 8,34 1,71 4,63 7,07 0,63 7,71 4,84 13

FACT Analysis Chain Photon Image Cleaning Feature- Energy- Estimation Signal Separation Machine Learning 14

FACT Analysis Chain Photon Image Cleaning Feature- Energy- Estimation Signal Separation Fact-Tools Library Machine Learning Streams Framework 15

FACT Analysis Chain Photon Image Cleaning Feature- Energy- Estimation Signal Separation Fact-Tools Library Machine Learning Streams Framework 16

FACT Analysis Chain Photon Image Cleaning Feature- Energy- Estimation Signal Separation Online Application of ML Fact-Tools Library Machine Learning Streams Framework 17

The Streams framework Provide middle layer framework -> streams Simple API for data stream processes Model data flow graphs to describe process Embeddable user functions.e.g Fact-Tools Algorithms + Process Design 18 https://sfb876.de/streams/ https://bitbucket.org/cbockermann/streams

The Fact-Tools Library Collection of IACT specific Streams processors Methods for e.g.: Pixel feature extraction (# Photons, Arrival time) Image cleaning Image feature extraction (Width Length) Event visualization Rapid Prototyping of processing pipeline test new features, different classifiers 19 https://sfb876.de/fact-tools/ https://github.com/fact-project/fact-tools

Example: Fact-Tools RAW Data Processing Calibration Photon Image Cleaning Image Features 20

Example: Fact-Tools RAW Data Processing Calibration Photon Image Cleaning Image Features <stream id="fact" class="fact.zfitsstream url="file:/data.fz"> <process input="fact" copies="1"> <Calibration someparameter="foobar"/> < outputkey="photonsperpixel" /> <Cleaning input="photonsperpixel" output="shower" /> <ImageParameter input="shower what= size" /> <WriteResult where="file:/some/url" /> </process> 21

Example: Fact-Tools RAW Data Processing New Calibration Photon Image Cleaning Image Features <stream id="fact" class="fact.zfitsstream url="file:/data.fz"> <process input="fact" copies="1"> <MyFancyNewCalibration evenbetterparameter="foobar2"/> < outputkey="photonsperpixel" /> <Cleaning input="photonsperpixel" output="shower" /> <ImageParameter input="shower what= size" /> <WriteResult where="file:/some/url" /> </process> 22

How to scale this to streaming and Big Data? 23

Requirements Offline Analysis Batch ~ 5 Years > 350 TB (source raw data) > 240 k Files Real Time Stream 2 byte * 1440 pixel * 300 samples * 60 Hz > 50 MB/s 24

Lambda architecture Batch Layer Speed Layer Serving Layer 25

Lambda architecture Batch Layer Speed Layer Serving Layer 26

Batch Layer Batch Layer Fast reprocessing of historic data parallel batch processing Develop and test new features Data locality -> Map&Reduce Speed Layer Serving Layer 27

Speed Layer Batch Layer Real-Time Analysis Low latency results Distributed load -> speed-up Speed Layer Serving Layer 28

Serving Layer Batch Layer Queries on the data archive Book keeping of tasks Insert jobs to cluster Database Speed Layer Serving Layer 29

Lambda architecture Batch Layer Speed Layer Serving Layer 30

How to connect these tool to Fact-Tools? 31

Extention for Streams Use Streams as middle layer to adapt e.g. SPARK topology + Streams Runtime Spark 32

First approach for integration of batch system User REST-API + Streams + libs ML Integration Distributed FACT-Tools Mongo -DB Event-Index HDFS Raw Data Apache Spark Apache Hadoop Data Storage 33

Summary Used streams as modeling tool for data stream processes XML provides flexible way for rapid-prototyping (new pipelines, test different preprocessing steps, sharing of processes) Same code base for speed layer (RTA) and batch layer (Archive) Proof of concept for streams as Middle layer framework to use modern cluster and streaming technology (e.g. Hadoop, Spark, Storm) Joined effort of computer scientists and physicist is fruitful for both fields 34

Backup 35

Performance 36

Robust Detectors 37 Jens Björn Buß TU Dortmund jens.buss@udo.edu

Motivation Charged Particles Neutrinos Gammas 38

The streams framework Provide middle layer framework -> streams Simple API for data stream processes Model data flow graphs to describe process Embeddable user functions Twitter FACT Tools 39 https://sfb876.de/streams/

Signal and Background Charged Particles Gammas 40

IACT Detection Principle 41

Data Rate 90 Chart Title 67,5 45 22,5 0 2011 2012 2013 2014 2015 2016 Volume [TB] Num Files [k Files] RaRo [GB/File] 42