Machine Learning in Data Quality Monitoring


CERN openlab workshop on Machine Learning and Data Analytics, April 27th, 2017
Machine Learning in Data Quality Monitoring: a point of view

Goal: maximize the amount of high-quality data available for physics analysis.

Data Quality Monitoring (DQM) and Data Certification monitor and ensure the quality of each dataset by:
- measuring data properties
- detecting anomalies
- certifying the data

LHC delivered luminosity vs. CMS recorded luminosity: minimize this gap and maximize the good data (the "Golden JSON", good for all physics analyses).

[Diagram: where the DQM system sits in the CMS data flow, with rates from the 40 MHz collision rate down to 100 Hz, 50 Hz, and 1 kHz, and the fractions of the physics stream (~5%, ~8%, 100%) monitored at each stage.]

Data Quality Assessment spans from automatic QA through operational QA to science QA:

1) Near-real-time applications: a fraction of the events, at a rate of about 100 Hz; automatic tests are validated via visual human inspection; the aim is to identify problems in the detector and trigger system.

2) Fast reconstruction on part of the data: a subset of the data is promptly reconstructed and monitored with ~1 h latency; assesses the goodness of the data, including the reconstruction software and the alignment and calibration constants.

3) Fully reconstructed data: the full set of data taken, promptly reconstructed and monitored with ~48 h latency; same aim as 2), but typically better alignment and calibration constants are available.

3-bis) Reprocessed data, once per year or as needed: the data are monitored and certified again; same aim as 2) and 3), but typically better reconstruction software and better alignment and calibration constants are available.

On the side: release validation on Monte Carlo production, to validate the functionality and performance of the reconstruction software.

DQM GUI, Summary Workspace: one plot per subsystem.

What we monitor for Quality Assessment

Online DQM: mostly focused on hardware-level checks
- integrity of the data format; errors from the read-out electronics (count errors, classify errors, monitor the number of errors vs. lumi-section)
- occupancy of signals (hits) in the various channels: maps and distributions across the detector; presence of noisy/dead read-out channels (a minimal sketch of such a check follows below)
- distributions of energy/momentum/time of the signals; resolution plots, pulls

Offline Data Certification: principally focused on physics
- detector subsystems: certify the correctness of the detector calibration and alignment; these conditions are recalculated only occasionally, because they are statistics dependent; almost the same distributions as online
- physics objects (muons, electrons, photons, tracks, jets): monitored quantities are products of the reconstruction and ingredients of future analyses (number of vertices and tracks, energy, type and topology of the particles, other key quantities)
- summary and occupancy maps; distributions of the quantities used to characterize the candidate particles
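As a concrete illustration of the occupancy-based hardware checks listed above, here is a minimal sketch (not CMS code; the thresholds, array shapes, and channel indices are illustrative assumptions) that flags dead and noisy read-out channels from a per-channel hit-count map:

```python
# Minimal sketch: flag dead/noisy read-out channels from an occupancy map.
# Thresholds (low_frac, high_frac) and the toy data are illustrative assumptions.
import numpy as np

def flag_channels(occupancy, low_frac=0.05, high_frac=20.0):
    """occupancy: 1D array of hit counts per channel for one monitoring interval."""
    median = np.median(occupancy[occupancy > 0])           # typical channel response
    dead = np.where(occupancy < low_frac * median)[0]      # far below typical -> dead/inefficient
    noisy = np.where(occupancy > high_frac * median)[0]    # far above typical -> noisy
    return dead, noisy

# Toy example: 1000 channels with a few artificially dead/noisy ones
rng = np.random.default_rng(0)
occ = rng.poisson(200, size=1000).astype(float)
occ[[10, 42]] = 0          # dead channels
occ[[7, 99]] = 10000       # noisy channels
dead, noisy = flag_channels(occ)
print("dead:", dead, "noisy:", noisy)
```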

Limits of a human-based QA

- Volume budget: there is a limited number of quantities that a human can process in a finite time interval.
- Time delay. Online: automatic tests plus human intervention take ~minutes before data taking is stopped or continued by the trigger. Offline: reconstruction time plus human intervention takes ~1 week, so an intermediate step is needed.
- Expensive in terms of human resources. Online: a 24/7 shifter, plus the effort to train them, maintain instructions, etc. Offline: duplication of effort (many detector and physics-object experts) on a weekly basis.
- Makes assumptions on potential failure scenarios. The QA paradigm is the scrutiny of a large, but chosen, number of histograms in comparison with a reference: a conservative strategy that could prevent unforeseen anomalies from being detected.
- Missing time dimension. The granularity of certification is the lumi-section* in order to reduce the data lost to short-lived problem conditions, but the evaluation often relies on integrated quantities, so a localized anomaly may not surface.

The good news is that the current system works, but the volume of data has grown so large that it is becoming increasingly difficult to QA all of it. We aim to incorporate modern ML techniques to perform quality assessment in future intelligent archives. (A sketch of the histogram-vs-reference comparison, applied at lumi-section granularity, follows below.)

*the minimum time interval within a run
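The current paradigm compares chosen histograms against a reference; the sketch below (assumptions: per-lumisection samples are available as plain arrays, and the 0.01 p-value threshold is purely illustrative) shows how such a comparison could be automated at lumi-section granularity, so that short-lived problems are not washed out by integrated quantities:

```python
# Minimal sketch: compare each lumi-section's distribution against a reference run,
# mimicking the histogram-vs-reference scrutiny described above.
import numpy as np
from scipy import stats

def certify_lumisections(reference_sample, lumisection_samples, p_threshold=0.01):
    flags = {}
    for ls, sample in lumisection_samples.items():
        # Two-sample Kolmogorov-Smirnov test against the reference distribution
        p_value = stats.ks_2samp(reference_sample, sample).pvalue
        flags[ls] = "good" if p_value > p_threshold else "suspect"
    return flags

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, size=50_000)
lumis = {1: rng.normal(0.0, 1.0, 2000),   # compatible with the reference
         2: rng.normal(0.3, 1.0, 2000)}   # shifted -> should be flagged
print(certify_lumisections(reference, lumis))
```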

Technical characteristics required for QA: mapping QA tasks onto ML techniques

WE NEED a system that can:
- operate effectively with minimal human guidance
- anticipate important events
- adapt its behavior in response to changes in data content, user needs, or available resources

WE HOPE that the machine learning and analytics-based solutions developed will:
- improve the accuracy of data quality assessment
- establish new data quality rules to sharpen data error detection and correction
- increase the speed at which this is achieved

Technical characteristics required for QA

REQUIREMENTS: to do this, the intelligent data archive will need to
- Learn from its own experience: recognize data quality problems solely from its experience with past data, rather than having to be told explicitly the rules for recognizing such problems. E.g. flag suspect data by recognizing departures from past norms; e.g. categorize data based on the type or severity of the data quality problem.
- Recognize hidden patterns in incoming data streams: it could learn to recognize problems either from explicit examples or simply from its own observation of different types of data.
- Respond automatically to data quality problems, e.g. in data access requests: a significant increase in the amount of data flagged as bad or missing might indicate that the data are exceeding the bounds expected by the science QA algorithms.

The intelligent archive could:
- notify DQ experts so that the issue can be further examined and resolved
- retrieve old data to confirm a data quality problem, obtain data from an alternate source, or request that the data be recollected or reprocessed in response to a confirmed DQ problem (from automatic QA to autonomous data QA)
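The "learn from past data rather than from explicit rules" requirement is naturally expressed as anomaly detection. The sketch below is illustrative only (the per-lumisection feature names and the choice of an IsolationForest are assumptions, not the CMS implementation): a model is fit on summary features of previously certified-good lumi-sections and then flags new ones that depart from past norms:

```python
# Sketch: learn "normal" behaviour from past good data, flag departures from it.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
# Hypothetical per-lumisection summary features: [mean occupancy, # vertices, mean pT]
past_good = rng.normal([200.0, 25.0, 30.0], [10.0, 3.0, 2.0], size=(5000, 3))

model = IsolationForest(contamination=0.01, random_state=0).fit(past_good)

new_lumis = np.array([[201.0, 24.0, 30.5],    # looks like past data
                      [120.0, 25.0, 30.0]])   # occupancy far below past norms
labels = model.predict(new_lumis)             # +1 = inlier, -1 = anomaly
print(["good" if label == 1 else "suspect" for label in labels])
```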

Technical characteristics required for QA

REQUIREMENTS: fast and efficient operation.
Some current DQ applications require data to be delivered in less than one hour, hence the need to perform the DQ assessment of a relatively large amount of data within a few minutes. Machine learning algorithms are compute intensive, but we could still meet this requirement: the computational effort is associated with deriving the rules, i.e. training the system, while the rules themselves are generally computationally easy to apply in operational mode. If QA is a matter of applying the rules, rather than deriving them, then the computational complexity associated with deriving the rules is not an issue, because this can be done on a subset of the data in an offline process.
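The offline-train / online-apply split argued above could look like the following sketch (the file name, the choice of joblib for persistence, and the stand-in labels are assumptions): the expensive rule-derivation step runs offline on a certified subset, and the resulting model is applied cheaply inside the monitoring latency budget:

```python
# Sketch: derive the rules offline (expensive), apply them online (cheap).
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# --- offline (compute-intensive, done once on a certified subset of the data) ---
X_train = np.random.default_rng(3).normal(size=(10_000, 5))   # past per-lumisection features
y_train = (X_train[:, 0] > 1.5).astype(int)                   # stand-in for human good/bad labels
clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
joblib.dump(clf, "dqm_model.joblib")

# --- online (fast, runs within the few-minutes latency budget) ---
clf = joblib.load("dqm_model.joblib")
def assess(lumisection_features):
    return "bad" if clf.predict(lumisection_features.reshape(1, -1))[0] else "good"

print(assess(np.array([0.1, 0.0, 0.2, -0.3, 0.5])))
```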

Which approach?

Numerous ML methods, techniques, and algorithms can be applied to data QA. A tool-only approach would probably not suffice as a complete and manageable solution: we aim, certainly at the beginning, for a mixed situation where the intelligent automatic QA is integrated with human expertise.

The QA problem is primarily one of classification: the deliverable is a good/bad flag rather than a numerical quality indicator. We therefore focus on classifiers, though numerical predictors may be useful in intermediate steps of the QA process.
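A tiny illustration of this last point (the threshold value and the score range are assumptions): a numerical quality score can serve as an intermediate quantity, but the deliverable collapses to the binary certification flag:

```python
# A continuous quality indicator as an intermediate step, a good/bad flag as output.
def to_flag(quality_score, threshold=0.8):
    """Collapse a quality score in [0, 1] into the binary certification flag."""
    return "good" if quality_score >= threshold else "bad"

print([to_flag(s) for s in (0.95, 0.42, 0.81)])
```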

Which approach?

Supervised classifiers: require a training set of data previously classified (by human interpretation or direct observation).
- Direct application: train the classifier on data with known quality signatures.
  Pros: can incorporate information not available in the data to be classified, e.g. metadata (a set of data can be judged good or bad based on derived data products or on human inspection); additional flexibility, plus the opportunity to refine the QA process over time by adding new examples to the training set.
  Cons: accumulating a sufficient training set presents a challenge.
- Indirect application: e.g. classify all of the data according to a set of positive categories; behavior that cannot easily be assigned to any category may then be a sign of quality problems such as random events or mixed contributions. Important question: is it a bug or is it a feature?

Unsupervised classifiers: generate classes directly from the observed data (clustering).
- Pros: can identify new classes that were not defined a priori; can identify anomalous datasets without explicit training, either by directly identifying separate clusters for typical and unusual data, or by identifying normal clusters to use as references for spotting outliers in a data stream; less human effort, since there is no need to identify the different classes of DQ problems and good training examples of each.
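A side-by-side sketch of the two approaches on the same toy per-lumisection features (the features, labels, and model choices are illustrative assumptions, not a recommendation): the supervised classifier needs the human-certified labels accumulated from past runs, while clustering finds the normal population and exposes outliers without any labels:

```python
# Supervised vs. unsupervised classification on toy per-lumisection features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
good = rng.normal(0.0, 1.0, size=(2000, 4))
bad = rng.normal(3.0, 1.0, size=(40, 4))     # a rare failure mode
X = np.vstack([good, bad])

# Supervised: requires the labels accumulated from past human certification
y = np.array([0] * len(good) + [1] * len(bad))
supervised = GradientBoostingClassifier().fit(X, y)
print("supervised flags:", supervised.predict(bad[:3]))

# Unsupervised: clusters the data directly; points outside the normal cluster
# (label -1 in DBSCAN) are candidate anomalies, with no training labels needed
labels = DBSCAN(eps=1.0, min_samples=10).fit_predict(X)
print("unsupervised outlier fraction:", np.mean(labels == -1))
```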

CONCLUSIONS

- Goal: maximize the amount of high-quality data available for physics analysis.
- Introduction to the CMS multi-step, human-based system for data monitoring and certification: satisfied with how well it works, we can also see its limits: human effort, time delay, growing data volume, a protective umbrella that may be too blind.
- Looking toward automatic QA. Needs: minimal human guidance, the ability to anticipate events, adjustable behavior. Requirements: be intelligent (learn from experience), be fast but efficient, recognize the unknown, be proactive.
- Which approach? Bringing together computing experts, data scientists, and physicists is the winning approach.
- Ongoing projects: the IBM CMS-DQM project for near-real-time anomaly detection, and the Yandex CMS DC project for automatic certification for science QA.

Thank you
Virginia

Machine Learning in Data Quality Monitoring: a point of view