Toward real-time data query systems in HEP

Jim Pivarski, Princeton University, DIANA-HEP. August 22, 2017.

The dream...

LHC experiments centrally manage data from readout to Analysis Object Datasets (AODs), but the next stage, from AODs to final plots, is handled separately by each physics group, often copying the data several times. Wouldn't it be nice if physicists could directly turn AOD into plots, quickly enough for interactive analysis (seconds per scan at most)?

Why this would be a Good Thing

Users could plot the data before deciding what to keep. (Even if they do need skims for maximum-likelihood fits, etc., these can be better streamlined after exploratory plotting.)
Focus on physics and statistical issues, not data handling.
Centralization facilitates provenance and reproducibility.
Shared CPU, disk, and memory can be more efficient.
Small institutions would not be priced out of analysis for lack of resources to copy and locally process the data.

Existence proof

In some industries it is possible to process petabytes of data and trillions of records in seconds [1], usually as SQL. In fact, there are many low-latency query server engines available, mostly open source.

Apache Drill comes closest to fitting our needs, but it is
SQL (not expressive enough for HEP),
Java (hard to link to HEP software),
and more suited to flat-ntuple analysis (see below).

[1] https://wiki.apache.org/incubator/drillproposal

Toward a query system for HEP

This talk is about what we would need to make (or alter) a query system for HEP analysis:
fast execution on nested, non-flat data;
distributed processing: caching and data locality, uneven and changing dataset popularity, aggregation;
a HEP-specific query language (in the abstract for this talk, but I'm going to focus more on the points above).

Fast execution

The core issue

SQL-like query engines are optimized for what we'd call a "flat ntuple" analysis: a rectangular table of numbers, sliced, filtered, aggregated, and joined. Only some late-stage HEP analyses fit this model, not AOD. In general, physics analysis requires arbitrary-length lists of objects: e.g. events containing jets containing tracks containing hits. But frameworks that create physics objects at runtime would be too slow to serve as query engines.
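
To make the mismatch concrete, here is a minimal sketch (not from the talk; the field names and toy values are made up) of the same events viewed as nested objects versus as a rectangular table. A flat table can only represent variable-length jet lists by repeating the event index per jet, which is what an SQL-like engine would have to work with:

    # Hypothetical toy data: three events with 2, 0, and 1 jets.
    events = [
        {"jets": [{"pt": 50.3, "eta": 0.1}, {"pt": 31.7, "eta": -1.2}]},
        {"jets": []},
        {"jets": [{"pt": 84.9, "eta": 2.0}]},
    ]

    # The natural HEP-style loop: events containing jets
    # (and in real AOD, jets containing tracks containing hits).
    for event in events:
        for jet in event["jets"]:
            print(jet["pt"])

    # The "exploded" rectangular view: one row per jet, with the event
    # structure reduced to a repeated event index.
    flat_table = [
        (event_index, jet["pt"], jet["eta"])
        for event_index, event in enumerate(events)
        for jet in event["jets"]
    ]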

Illustration with a microbenchmark

Query: fill a histogram with the jet pT of all jets.

    0.018 MHz   full framework (CMSSW, single-threaded C++)
    0.029 MHz   load all 95 jet branches in ROOT
    2.8 MHz     load the jet pT branch (and no others) in ROOT
    12 MHz      allocate C++ objects on heap, fill histogram, delete
    31 MHz      allocate C++ objects on stack, fill histogram
    250 MHz     minimal for loop in memory (single-threaded C)

Four orders of magnitude of performance are lost to provide an object-oriented view of the jets, with all attributes filled.

Alternative

Instead of turning TBranch data into objects, translate the code into loops over the raw TBranch arrays ("BulkIO").

The user writes code with event and jet objects:

    histogram = numpy.zeros(100, dtype=numpy.int32)

    def fcn(roottree, histogram):
        for event in roottree:
            for jet in event.jets:
                bin = int(jet.pt)
                if bin >= 0 and bin < 100:
                    histogram[bin] += 1

which is translated into indexes over raw arrays:

    void fcn(int* events, int* jets, float* jetptdata, int* histogram) {
        int event, jet;
        for (event = 0; event < events[1]; event++)
            for (jet = jets[event]; jet < jets[event + 1]; jet++) {
                int bin = (int)jetptdata[jet];
                if (bin >= 0 && bin < 100)
                    histogram[bin] += 1;
            }
    }
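
The same translated loop can also be compiled from Python itself. Below is a minimal sketch, not taken from the slides, of the jet-pT histogram written directly against flat offset and content arrays (in the C version above, the jets[] array plays the role of the per-event offsets) and compiled with Numba, which is cited later in this talk as the compiler used by PLUR. The array names and toy values are assumptions for illustration, not PLUR's actual data layout or API.

    import numpy
    import numba

    @numba.njit
    def fill_jet_pt(jet_offsets, jet_pt, histogram):
        # jet_offsets[i]..jet_offsets[i+1] delimit the jets belonging to event i,
        # so the nested "for event / for jet" loops become index arithmetic
        # over flat arrays, with no objects allocated per event or per jet.
        for event in range(len(jet_offsets) - 1):
            for j in range(jet_offsets[event], jet_offsets[event + 1]):
                bin = int(jet_pt[j])
                if bin >= 0 and bin < 100:
                    histogram[bin] += 1

    # Toy data: two events with 2 and 1 jets (values made up).
    jet_offsets = numpy.array([0, 2, 3], dtype=numpy.int64)
    jet_pt = numpy.array([50.3, 31.7, 84.9], dtype=numpy.float32)
    histogram = numpy.zeros(100, dtype=numpy.int32)
    fill_jet_pt(jet_offsets, jet_pt, histogram)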

Performance on real queries

Four sample queries (see backup) were run with ROOT (1) without dropping branches, (2) dropping unused branches with SetBranchStatus, (3) on a slimmer file, and (4) with BulkIO and transformed Python code.

[Bar chart: rate in MHz (higher is better, axis up to 25 MHz) for the four queries (max pT, eta of best by pT, mass of pairs, pT sum of pairs), comparing ROOT full dataset, ROOT selective on full, ROOT slim dataset, and PLUR full dataset.]

Performance on real queries

The same four sample queries (see backup) were run (4) with BulkIO and transformed Python code, (5) on raw arrays on SSD disk, and (6) on raw arrays in memory. (Cases 5 and 6 correspond to the caching system for the query server.)

[Bar chart: rate in MHz (higher is better, axis up to 80 MHz) for the four queries (max pT, eta of best by pT, mass of pairs, pT sum of pairs), comparing PLUR full dataset, PLUR raw SSD disk, and PLUR raw memory.]

You can try this out

Our PLUR library [1] puts Primitive, List, Union, and Record data into flat arrays and translates Python code to do array lookups. The translated Python is then compiled by Numba [2] (a Python compiler for numerical algorithms) into fast machine code. BulkIO contributions to ROOT [3] allow ROOT TBranch data to be rapidly viewed as Numpy arrays (10× faster than root_numpy). This is being gathered into a HEPQuery application [4] to provide a query service. See the READMEs for more realistic examples.

[1] https://github.com/diana-hep/plur
[2] http://numba.pydata.org/
[3] https://github.com/bbockelm/root/tree/root-bulkapi-fastread-v2
[4] https://github.com/diana-hep/hepquery/

Distributed processing

(The rates of tens of MHz on the previous pages were all single-threaded.)

Challenges: not a standard batch job

Overhead latency should be small enough for interactive use (less than a second).
Query-to-plot requires two phases, splitting into subtasks (map) and combining partial results (reduce), and these must be coordinated.
Input data should be cached with column granularity.
Subtasks should preferentially be sent where their input data are cached (for data locality), but not exclusively (to elastically scale with dataset popularity).
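
As a minimal sketch of the two map/reduce phases (not part of the talk; the function names and the choice of plain Numpy arrays as partial results are assumptions), each subtask histograms its chunk of the needed column, and the partial results combine by bin-wise addition:

    import numpy

    def map_subtask(jet_pt_chunk, nbins=100):
        # Map phase: one subtask fills a partial histogram over its chunk of events,
        # ideally running on the worker where that chunk of the column is cached.
        histogram, _ = numpy.histogram(jet_pt_chunk, bins=nbins, range=(0.0, 100.0))
        return histogram

    def reduce_partials(partial_histograms):
        # Reduce phase: partial histograms add bin by bin, so they can arrive
        # in any order and be accumulated incrementally as subtasks finish.
        return sum(partial_histograms)

    # Toy chunks standing in for cached columnar data on different workers.
    chunks = [numpy.random.uniform(0.0, 100.0, size=1000) for _ in range(4)]
    final_histogram = reduce_partials([map_subtask(chunk) for chunk in chunks])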

Distributed state

To track ongoing jobs, accumulate partial histograms, and know where to find cached inputs, the query server will need distributed, mutable state. Use third-party tools:
Apache Zookeeper for rapidly changing task assignments;
MongoDB for JSON-structured partial aggregations;
an object store (Ceph?) for user-generated columns;
...?

Although we can't find exactly what we want on the open-source market, we're finding most of the pieces.

Prototype

Thanat Jatuphattharachat (CERN summer student) explored third-party tools [1] and built a prototype system [2].

[1] https://cds.cern.ch/record/2278211
[2] https://github.com/jthanat/femto-mesos/tree/master

See Igor Mandrichenko's poster!

Query language

Femtocode

Femtocode [1] is a mini-language based on Python, but with sufficient limitations to allow SQL-like query planning:
dependent type-checking to ensure that every query completes without runtime errors;
automated vectorization/GPU translation by controlling the loop structure;
integration of database-style indexing with processing.

But these are all "2.0" features.

[1] https://github.com/diana-hep/femtocode/
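
For intuition only: the following is plain Python, not Femtocode syntax, and the field names are made up. It shows the kind of runtime failure that dependent type-checking is meant to rule out, namely an expression that is valid only for some events, such as taking the leading muon of an event whose muon list may be empty.

    # Plain Python for illustration; not actual Femtocode.
    events = [
        {"muons": [{"pt": 45.0}, {"pt": 22.0}]},
        {"muons": []},  # an event with no muons
    ]

    for event in events:
        try:
            # event["muons"][0] is valid only when the muon list is non-empty;
            # on the second event this raises IndexError at runtime.
            leading_pt = event["muons"][0]["pt"]
        except IndexError:
            leading_pt = None
        # A dependently typed query language could instead reject the unguarded
        # expression at compile time unless a filter establishes non-emptiness.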

Conclusions

AOD-to-plot in seconds is possible: they're doing it in industry (though with SQL and Java).
Python-based queries can be computed at single-threaded rates of 10-100 MHz by translating code rather than deserializing data.
Columnar data granularity has useful consequences; see Igor's poster!
We are prototyping a distributed architecture, relying on third-party components wherever possible.

Backup

Sample queries

max pT in Python:

    maximum = 0.0
    for muon in event.muons:
        if muon.pt > maximum:
            maximum = muon.pt
    fill_histogram(maximum)

max pT in C++:

    float maximum = 0.0;
    for (int i = 0; i < muons.size(); i++)
        if (muons[i]->pt > maximum)
            maximum = muons[i]->pt;
    fill_histogram(maximum);

eta of best by pT in Python:

    maximum = 0.0
    best = None
    for muon in event.muons:
        if muon.pt > maximum:
            maximum = muon.pt
            best = muon
    if best is not None:
        fill_histogram(best.eta)

eta of best by pT in C++:

    float maximum = 0.0;
    Muon* best = nullptr;
    for (int i = 0; i < muons.size(); i++)
        if (muons[i]->pt > maximum) {
            maximum = muons[i]->pt;
            best = muons[i];
        }
    if (best != nullptr)
        fill_histogram(best->eta);

mass of pairs in Python:

    n = len(event.muons)
    for i in range(n):
        for j in range(i + 1, n):
            m1 = event.muons[i]
            m2 = event.muons[j]
            mass = sqrt(2*m1.pt*m2.pt*(
                cosh(m1.eta - m2.eta) -
                cos(m1.phi - m2.phi)))
            fill_histogram(mass)

pT sum of pairs in Python:

    n = len(event.muons)
    for i in range(n):
        for j in range(i + 1, n):
            m1 = event.muons[i]
            m2 = event.muons[j]
            s = m1.pt + m2.pt
            fill_histogram(s)

mass of pairs in C++:

    int n = muons.size();
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            Muon* m1 = muons[i];
            Muon* m2 = muons[j];
            double mass = sqrt(2*m1->pt*m2->pt*(
                cosh(m1->eta - m2->eta) -
                cos(m1->phi - m2->phi)));
            fill_histogram(mass);
        }

pT sum of pairs in C++:

    int n = muons.size();
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            Muon* m1 = muons[i];
            Muon* m2 = muons[j];
            double s = m1->pt + m2->pt;
            fill_histogram(s);
        }