Big Data Analytics Tools Applied to ATLAS Event Data

Ilija Vukotic, University of Chicago
CHEP 2016, San Francisco

Idea

Big Data technologies have proven to be very useful for storage, visualization and analysis of ATLAS distributed computing (meta)data. If we could apply the same technologies to real/MC event data, we could:
- Do quick investigations without writing any code, directly in a web browser
- Deliver data to users: only the events passing cuts, and only the variables a user needs
- Remove the need for experiment-specific libraries, making it easier to use machine learning tools like Spark, Jupyter, R, SciPy, Caffe and TensorFlow. This would simplify machine learning challenges (the Higgs Boson ML Challenge, the Tracking challenge, ...)
- Serve as a backend to event viewers (VP1, ATLANTIS, ATLASrift, ...), making them platform independent, and to all kinds of education and outreach tools

The question is whether the same technology maps well onto event data: would the event data structure map correctly, would the data size be acceptable, and what indexing and retrieval rates could we achieve?

Related work

EventIndex - a complete catalogue of all ATLAS events. Collection: ActiveMQ; storage: Hadoop; indexing: HBase; custom WebUI and CLI. ~1300 B/event: event identifier, trigger pattern, hit counts, and pointers to the event at all processing stages. Used for event picking, event skimming and the PanDA event service. Contains ~40*10^9 real-data and ~25*10^9 MC event records; searches run at 10 ms/event.

Open data - http://atlas-opendata.web.cern.ch/ - provides access to real and simulated data, and educational tools, to the public. Three levels of sophistication: simple visualizations, Jupyter-based analysis, and a full set of analysis tools to download (documentation, datasets, software, VMs).

Lukas Heinrich's project Aretha (https://www.youtube.com/watch?v=s5sktxhdcng) streams generated HepMC data from SHERPA, Herwig or MadGraph (running locally or in the cloud) to ZeroMQ; runs the ZeroMQ data collector, Elasticsearch and Kibana, all in Docker; and makes plots while the jobs are running, validating the data using e.g. Rivet.

ATLAS distributed computing analytics platform

There are a number of possible workflows for storing and processing event data on the already existing ADC analytics platform:
- Store data in Hadoop (Avro), process using Pig or Spark. Not interactive, and not a data access solution.
- Store in HDF5, use Spark. Performant analysis, but does not solve data access.
- ...

Our selected workflow:
- Data indexed in Elasticsearch
- Fast visualization in Kibana
- Open data access through the Elasticsearch REST API
- Data analysis on a co-located Jupyter cluster
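
In the selected workflow, open data access means nothing more than HTTP requests against the standard Elasticsearch `_search` endpoint. A minimal sketch of building such a request (the host and index name are hypothetical, not from the slides; the endpoint and query DSL are standard Elasticsearch):

```python
import json

# Hypothetical cluster address and index name -- not given on the slides.
ES_HOST = "http://localhost:9200"
INDEX = "daod_exot2"

def search_request(index, n_events=5, fields=None):
    """Build the URL and JSON body for a simple Elasticsearch search."""
    url = "%s/%s/_search" % (ES_HOST, index)
    body = {"size": n_events, "query": {"match_all": {}}}
    if fields:
        body["fields"] = fields  # deliver only the listed variables
    return url, json.dumps(body)

url, body = search_request(INDEX, n_events=2,
                           fields=["EventInfoAuxDyn.runNumber"])
# A real call would be: requests.get(url, data=body).json()
```

Because the interface is plain JSON over HTTP, any client (a browser, curl, a Jupyter notebook) can consume the data without experiment-specific libraries.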

Elasticsearch cluster

Elasticsearch is a distributed, real-time, highly available open source search and analytics engine built on top of Apache Lucene. It works with structured JSON documents and is schema-free; by default all fields are indexed, and data are compressed.

Our cluster:
- 5 data + 3 head nodes
- E5-2620 v4, 64 GB RAM, 4x800 GB SSD, 10 Gbps NIC per node
- Used for ATLAS Distributed Computing analytics
- Quite high base load

Jupyter

At University of Chicago:
- Intel Xeon E5-2650 v2 @ 2.60 GHz x2, Tesla K20c x2
- 128 GB RAM, 4.5 TB RAID-5 for /scratch, 1 TB RAID-1 for /
- If needed, one more machine with 2 x Xeon Phi (64 cores)
- Simple universal user/password authentication; notebooks are automatically checked into a git repository

At CERN, SWAN (Service for Web based Analysis):
- Currently in beta; code lives in CERNBox
- Advantages: most ML libraries are installed, and anything missing can easily be added; it is local to Elasticsearch; you can learn from the code of others and see what was already done; it is managed by CERN
- Your notebooks are yours only
- One can have the same environment anywhere CVMFS is available, just do:
    setupATLAS
    lsetup "lcgenv -p LCG_85swan3 x86_64-slc6-gcc49-opt ROOT rootpy..."

ATLAS data formats

We consider only the user-oriented DxAOD format. xAOD is a common, centrally produced data format, ~PB in size. DxAODs are ~TB-scale derivations, centrally produced, skimmed/slimmed and customized for the needs of the different physics groups. Mini-xAODs or flat n-tuples are subgroup specific and not considered here (~GB scale).

Type               Event size [kB/event]   Branches
DAOD data15_13TeV   30                      647
AOD  data15_13TeV  184                     1742
DAOD mc15_13TeV     25                      999
AOD  mc15_13TeV    216                     2512

ROOT structure - mapping events

We used root_numpy and rootpy. We create a document mapping template:
- One per data format
- It describes an event in terms of native Elasticsearch types and structures
- The tree is not very deep (5 levels) but needs care to get right (array vs. object vs. nested object; Elasticsearch limits a document to 50 nested objects)

There is no need to index all of the variables: unindexed variables can still be delivered, we just cannot search/filter on them. This allows a possibly large improvement in data size (not used for this study).

A fragment of the mapping template:

"EventInfoAuxDyn": {
    "type": "object",
    "properties": {
        "HLT_xe70_mht": {"type": "boolean"},
        "HLT_xe70_tc_lcw": {"type": "boolean"},
        "HLT_xe80": {"type": "boolean"},
        "HLT_xe80_mht": {"type": "boolean"},
        "HLT_xe80_tc_lcw": {"type": "boolean"},
        "mpx": {"type": "double"},
        "name": {"type": "string", "index": "not_analyzed"}
    }
}

[The slide also shows the corresponding ROOT branch listing (e.g. EventInfoAux.runNumber as UInt_t, with compression factors of 13-24) next to an indexed event, whose fields are numpy float32, int32 and object values such as AntiKt4EMTopoJetsAuxDyn.Split23 and SumPtTrkPt1000.]
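
Turning an event read with root_numpy into an Elasticsearch document mostly means converting numpy scalars and arrays into plain JSON-serializable types. A minimal sketch of such a conversion (the helper is our own, not part of root_numpy; the run number value is made up, while the branch names and Split23 values follow the slide):

```python
import json

def to_jsonable(value):
    """Recursively convert numpy-style values to plain JSON-serializable types."""
    if hasattr(value, "tolist"):          # numpy scalar or array
        return value.tolist()
    if isinstance(value, dict):
        return {k: to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    return value

# An event as it might come out of root_numpy (runNumber is a made-up value).
event = {
    "EventInfoAuxDyn": {"runNumber": 284285, "HLT_xe80": False},
    "AntiKt4EMTopoJetsAuxDyn": {"Split23": [570.70159912, 184.66441345]},
}
doc = json.dumps(to_jsonable(event))
```

The resulting JSON string is what gets indexed; the mapping template above tells Elasticsearch which of its fields are booleans, doubles, or unanalyzed strings.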

Indexing events

We indexed di-jet event xAODs, and DxAODs from the DAOD_EXOT2 derivation. The indexing method used proved rather inefficient (it parses the full event structure for every event), and the process was not tuned (batch size, etc.).

Type               Indexing rate [Hz]   Event size [kB/event]   Increase
DAOD data15_13TeV  146                  5.4                     21
AOD  data15_13TeV  377                  2.0                      8
DAOD mc15_13TeV    116                  4.6                     19
AOD  mc15_13TeV    385                  1.8                      9
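
One of the tuning options mentioned (batch size) means switching from per-event indexing to the Elasticsearch bulk API. A sketch of building bulk actions in the form expected by elasticsearch-py's `helpers.bulk` (the index name is hypothetical and the events are placeholders):

```python
def make_actions(events, index="daod_exot2", doc_type="event"):
    """Yield bulk-index actions in the form expected by elasticsearch.helpers.bulk()."""
    for i, event in enumerate(events):
        yield {
            "_index": index,
            "_type": doc_type,  # one document type per data format
            "_id": i,           # here simply the event counter
            "_source": event,
        }

events = [{"EventInfoAuxDyn": {"HLT_xe80": True}},
          {"EventInfoAuxDyn": {"HLT_xe80": False}}]
actions = list(make_actions(events))

# With a live cluster, the actions would be sent in large batches:
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch(["localhost:9200"]), make_actions(events))
```

Batching many events per HTTP request avoids paying the request round-trip cost per event, which is one reason the measured rates above should be seen as a lower bound.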

Data visualizations

Kibana can be used for quick visualizations: use the filter bar to make cuts, and define additional derived variables.

Bad: the visualizations are not made with scientists in mind, and only simple operations are supported.

Good: visualizations can be organized and saved in nice reusable dashboards, and since all computations are distributed over all the ES nodes, they are very fast even with billions of events.

Filtering and aggregations

Streaming full events:
    "query": { "match_all": {} }

To return only some variables, add:
    "fields": ["AntiKt4EMTopoJetsAuxDyn.ActiveArea", ...]

To select on a variable (a cut):
    "query": { "term": { "EventInfoAuxDyn.HLT_xe80": true } }

Complex conditions (must, must_not, should, filter):
    "bool": {
        "must": [
            {"term": {"EventInfoAuxDyn.HLT_xe80": false}},
            {"range": {"AntiKt4EMTopoJetsAuxDyn.GhostAntiKt2TrackJetCount": {"gt": 2}}},
            {"range": {"AntiKt4LCTopoJetsAuxDyn.JetEMScaleMomentum_m": {"gt": 10000}}},
            {"exists": {"field": "HLT_xAOD__JetContainer_a4tcemsubjesFSAuxDyn.HECQuality"}}
        ]
    }

Aggregations (cut-and-count analysis) are very powerful, fast, and can be nested. In our tests all were sub-second operations.

Simple count:
    "aggs": { "docs": { "value_count": { "field": "_type" } } }

Returning stats:
    "aggs": { "ptstats": { "stats": { "field": "TrackPt" } } }

Stats in bins:
    "aggs": { "ptstats": { "histogram": { "field": "TrackPt", "interval": 50 } } }
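
From a Jupyter notebook, the same bodies can be assembled as plain Python dictionaries and sent with elasticsearch-py's `search` call. A sketch combining a cut with a binned aggregation (the cluster address and index name are hypothetical; the field names follow the slide):

```python
def histogram_body(field, interval, query=None):
    """Build a search body that histograms `field`, optionally after a cut."""
    body = {
        "size": 0,  # aggregations only; return no event documents
        "aggs": {"ptstats": {"histogram": {"field": field,
                                           "interval": interval}}},
    }
    if query is not None:
        body["query"] = query
    return body

cut = {"bool": {"must": [
    {"term": {"EventInfoAuxDyn.HLT_xe80": True}},
    {"range": {"AntiKt4EMTopoJetsAuxDyn.GhostAntiKt2TrackJetCount": {"gt": 2}}},
]}}
body = histogram_body("TrackPt", 50, query=cut)

# With a live cluster:
#   from elasticsearch import Elasticsearch
#   Elasticsearch(["localhost:9200"]).search(index="daod_exot2", body=body)
```

Setting `"size": 0` is what makes aggregation queries so cheap: only the histogram counts travel back to the client, not the events themselves.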

Skimming & slimming performance

Read test (rate [Hz] / client CPU [%]):

                                   DAOD data15_13TeV    AOD data15_13TeV    DAOD mc15_13TeV     AOD mc15_13TeV
1 variable from all events         486 / 1              264 / 1             520 / 1             225 / 1
10 variables from all events       453 / 3              263 / 1             491 / 3             227 / 1
10 variables, events passing
  cut (~2% of events)              606 (32 kHz) / 22    251 (13 kHz) / 10   971 (50 kHz) / 15   226 (11 kHz) / 20
Full events passing cut
  (~2% of events)                  161 (6.3 kHz) / 77   35 (1.8 kHz) / 84   121 (6.2 kHz) / 80  32 (1.5 kHz) / 82
Streaming all full events          116 / 86             39 / 87             120 / 88            31 / 86

For the rows with a cut, the rate in parentheses is the effective scan rate over all events (the delivery rate divided by the ~2% selection efficiency). Selections are basically free; the cost is in transferring data back to the client and in the client parsing the returned JSON.

Data modifications

Simple fixes can be done by adding a corrected variable or by defining scripted variables. Re-calibration from existing data would be performed as a central batch re-indexing. Any kind of reweighting can be done by defining a custom scoring function.

Weighting:
    "script_score": {
        "script": {
            "lang": lang,
            "params": { "param1": value1, "param2": value2 },
            "inline": "_score * doc['orig_pt'].value / pow(param1, param2)"
        }
    }
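
Embedded in a full query, the weighting above sits inside a `function_score` query. A sketch building it as a Python dict (the wrapper follows the Elasticsearch 2.x query DSL; the `orig_pt` field and parameter names come from the slide's example, while the `groovy` default language and the parameter values are assumptions):

```python
def weight_query(base_query, script, lang="groovy", **params):
    """Wrap a query in a function_score whose script reweights each event."""
    return {"query": {"function_score": {
        "query": base_query,
        "script_score": {"script": {
            "lang": lang,           # scripting language; groovy assumed here
            "params": params,       # reweighting parameters, e.g. param1/param2
            "inline": script,       # the per-event weight expression
        }},
    }}}

q = weight_query({"match_all": {}},
                 "_score * doc['orig_pt'].value / pow(param1, param2)",
                 param1=2.0, param2=3.0)
```

Keeping the constants in `params` rather than inlining them lets Elasticsearch cache the compiled script across queries with different parameter values.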

Conclusions

The feasibility of actual event indexing has been tested. Input/output rates are reasonable, and the increase in data volume is at the expected level. All kinds of optimizations are still possible:
- Indexing directly from the ATLAS analysis framework
- An optimized mapping
- Compressed responses

Visualizations, skimming and slimming are very fast and very convenient for cut-flow type analyses. Full data retrieval, while only moderately slow, consumes too much client CPU in parsing the returned JSON.

Elasticsearch scales easily up to the data-center level (~ a few hundred servers); beyond that, one needs to add tribe nodes.