Spark and HPC for High Energy Physics Data Analyses


Marc Paterno, Jim Kowalkowski, and Saba Sehrish
2017 IEEE International Workshop on High-Performance Big Data Computing (HPBDC 2017)

Introduction

High energy physics (HEP) data analyses are data-intensive: roughly 3 × 10^14 particle collisions at the Large Hadron Collider (LHC) were analyzed in the Higgs boson discovery. Most analyses also involve compute-intensive statistical calculations, and future experiments will generate significantly larger data sets.

Our question: can Big Data tools (e.g. Spark) and HPC resources benefit HEP's data- and compute-intensive statistical analyses and improve time-to-physics?

A physics use case: search for Dark Matter

[Image from http://cdms.phy.queensu.ca/publicdocs/dm_intro.html]

The Large Hadron Collider (LHC) at CERN

The Compact Muon Solenoid (CMS) detector at the LHC

A particle collision in the CMS detector

How particles are detected

Statistical analysis: a search for new particles

The current computing solution

Whole-event-based processing with a sequential, file-based solution; batch processing on distributed computing farms:
- 28,000 CPU hours to generate 2 TB of tabular data
- ~1 day of processing to generate GBs of analysis tabular data
- 5-30 minutes to run an end-user analysis

Filters are applied to the analysis data to:
- select interesting events
- reduce each event to a few relevant quantities
- plot the relevant quantities

Data flow:
- Recorded and simulated events (200 TB) → tabular data (2 TB), regenerated ~4× per year
- Tabular data → analysis tabular data (~GBs), ~1 day of processing, regenerated ~1× per week to once every couple of days
- Analysis tabular data → cut-and-count analysis, multi-variate analysis plots and tables, and machine learning, each run several times a day (5-30 minutes per run)

Why Spark might be an attractive option

In-memory, large-scale distributed processing:
- Resilient distributed datasets (RDDs): collections of data partitioned across nodes and operated on in parallel. Transformations map input RDDs into output RDDs; actions return the final result of an RDD calculation.
- Able to use parallel and distributed file systems.
- Code is written in a high-level language, with implicit parallelism.
- Spark SQL: a Spark module for structured data processing. A DataFrame is a distributed collection of rows organized into named columns, an abstraction providing optimized operations for selecting, filtering, aggregating, and plotting structured data.
- Good for repeated analysis performed on the same large data set.
- Lazy evaluation of transformations allows Spark's Catalyst optimizer to optimize the whole graph of transformations before any calculation, as in the sketch below.
- Tuned installations are available on (some) HPC platforms.
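A minimal sketch of the transformation/action distinction (illustrative values only, not taken from the talk):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000)    // an RDD, partitioned across nodes
    val squares = numbers.map(x => x.toLong * x)  // transformation: builds the plan, runs nothing
    val evens   = squares.filter(_ % 2 == 0)      // still lazy
    val count   = evens.count()                   // action: triggers the whole pipeline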

HDF5: essential features

Tabular data are representable as columns (HDF5 datasets) organized into tables (HDF5 groups). HDF5 is a widely used format on HPC systems, which lets us use traditional HPC technologies to process these files. Parallel reading is supported.
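For this analysis, a file layout might look like the following sketch (the column names are taken from the simplified example later in the talk; the exact layout is an assumption, not the authors' actual file format):

    /                    (file root)
      electrons/         (group: one table per particle type)
        pt               (dataset: one column, one entry per electron)
        eta
        qual
        event            (id of the event each electron belongs to)
      events/
        event
        met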

Overview: computing solution using Spark and HDF5

- Read the HDF5 files into multiple DataFrames, one per particle type. (First, we had to translate from the standard HEP format to HDF5.)
- Define filtering operations on a DataFrame as a whole, as opposed to writing loops over events.
- Data are loaded once into memory and processed several times (see the sketch below).
- Make plots; repeat as needed.
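The load-once, process-many pattern maps naturally onto DataFrame caching; a minimal sketch, assuming an existing SparkSession spark and a hypothetical loader loadElectrons standing in for the HDF5 reader:

    import org.apache.spark.sql.functions.col

    // Hypothetical loader standing in for the HDF5 reader:
    val electrons = loadElectrons(spark, "electrons.h5")
    electrons.cache()                                   // keep in memory after first use
    val n1 = electrons.filter(col("pt") > 200).count()  // first action fills the cache
    val n2 = electrons.filter(col("qual") > 5).count()  // reuses the in-memory data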

Simplified example of data

[Figure: the same simplified data shown in the standard HEP event-oriented organization and in a tabular organization.]

Reading HDF5 files into Spark

The columns of data are organized as we want them in the HDF5 file, but Spark provides no API to read them directly into DataFrames. Instead, tasks read chunks of the HDF5 datasets in a group, transpose the column chunks into rows (an RDD[Row]), and apply a schema to convert the result into a Spark DataFrame.

[Diagram: HDF5 datasets 1-4 in an HDF5 group → read chunks (tasks 1-3) → transpose → Spark RDD[Row] → apply schema / convert to DataFrame → Spark DataFrame.]
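A minimal sketch of the transpose-and-apply-schema step, assuming a helper readColumn that wraps an HDF5 binding (the helper, file name, and column set are illustrative, not the authors' middleware):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    // Stand-in for an HDF5 read: a real version would call an HDF5 binding
    // and read one chunk of one dataset (column).
    def readColumn(file: String, dataset: String): Array[Double] =
      Array(210.0, 150.0, 320.0) // stub data

    val spark = SparkSession.builder.appName("hdf5-to-df").getOrCreate()

    val pt   = readColumn("electrons.h5", "/electrons/pt")
    val eta  = readColumn("electrons.h5", "/electrons/eta")
    val qual = readColumn("electrons.h5", "/electrons/qual")

    // Transpose: parallel column arrays become one Row per electron.
    val rows = pt.indices.map(i => Row(pt(i), eta(i), qual(i)))

    val schema = StructType(Seq(
      StructField("pt",   DoubleType, nullable = false),
      StructField("eta",  DoubleType, nullable = false),
      StructField("qual", DoubleType, nullable = false)))

    val electrons = spark.createDataFrame(
      spark.sparkContext.parallelize(rows), schema)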

An example analysis

Find all events that have:
- missing E_T (an event-level feature) greater than 200 GeV
- one or more electron candidates with p_T > 200 GeV, eta in the range -2.5 to 2.5, and good electron quality (qual > 5)

For each selected event, record the missing E_T and the leading electron p_T.

Some observations:
- Range queries across multiple variables are very frequent.
- They are hard to describe using only declarative SQL statements.
- The relational databases we are familiar with cannot deal efficiently with these kinds of queries.

Coding a physics analysis with the DataFrame API

    import org.apache.spark.sql.functions.{abs, col, first, max}

    val good_electrons = electrons
      .filter(col("pt") > 200)
      .filter(abs(col("eta")) < 2.5)
      .filter(col("qual") > 5)
      .groupBy(col("event"))
      .agg(max(col("pt")).as("pt"),          // leading-electron pT per event
           first(col("eid")).as("eid"))      // keep an electron id alongside it

    val good_events = events.filter(col("met") > 200)

    val result_df = good_events.join(good_electrons, "event")

Using result_df, make a histogram of the p_T of the leading electron for each good event.

Measuring the performance

The real analysis we implemented involves much more complicated selection criteria, and many of them; it required the use of user-defined functions (UDFs), of the kind sketched below. To understand where Spark spends its time, we measured:
- the time to read from the HDF5 file into RDDs (step 1)
- the time to transpose the RDDs (step 2)
- the time to create DataFrames from the RDDs (step 3)
- the time to run the analysis code (step 4)

Tests were run on Edison at NERSC, using Spark v2.0, on 8, 16, 32, 64, 128, and 256 nodes. The input data consist of 360 million events and 200 million electrons, 0.5 TB in memory.
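For reference, a Spark SQL UDF looks like the following minimal sketch (a toy quality cut, not one of the real selection criteria; assumes an electrons DataFrame as above):

    import org.apache.spark.sql.functions.{col, udf}

    // A selection too awkward to express with built-in column expressions
    // would be wrapped as a UDF like this toy example:
    val goodQuality = udf((q: Int) => q > 5)
    val selected = electrons.filter(goodQuality(col("qual")))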

Scaling results

[Figure: time for each step (seconds, 0-400) versus number of cores (500-3000), for steps 1-4.]

Steps 1-3 read the files and prepare the data in memory; the different steps exhibit different (or no) scaling. Step 4 performs the analysis on the in-memory data.

Lessons learned

The goal of our explorations is to shorten the time-to-physics for analysis. We have observed good scalability and task distribution; however, absolute performance does not yet meet our needs.

It is hard to tune a Spark system:
- the optimal number of executor cores, executor memory, etc. (see the sketch below)
- the optimal data partitioning for the parallel file system, e.g. Lustre stripe size and OST count
- it is difficult to isolate slow-performing stages because of lazy evaluation

The pyspark and SparkR high-level APIs may be appealing to the HEP community. Our understanding of Scala and Spark best practices is still evolving, and documentation and error reporting could be improved.
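The executor settings in question are ordinary Spark configuration; a sketch for illustration only (the values are placeholders, not the settings used in this study):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("hep-analysis")
      .config("spark.executor.cores", "8")     // placeholder value
      .config("spark.executor.memory", "16g")  // placeholder value
      .config("spark.sql.shuffle.partitions", "256")
      .getOrCreate()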

Future work

- Scale up to multi-TB data sets.
- Compare performance with a Python+MPI approach.
- Improve our HDF5/Spark middleware.
- Evaluate the I/O performance of different file organizations, e.g. all backgrounds in one HDF5 file.
- Optimize the workflow used to filter the data: try to remove UDFs, which prevent Catalyst from performing optimizations.

References

1. Jim Kowalkowski, Marc Paterno, Saba Sehrish: Exploring the Performance of Spark for a Scientific Use Case. In: IEEE International Workshop on High-Performance Big Data Computing, in conjunction with the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2016).
2. LHC: http://home.cern/topics/large-hadron-collider
3. CMS: http://cms.web.cern.ch
4. HDF5: https://www.hdfgroup.org
5. Spark at NERSC: http://www.nersc.gov/users/data-analytics/data-analytics/spark-distributed-analytic-framework
6. Traditional analysis code: https://github.com/mcremone/baconanalyzer
7. Our approach: https://github.com/sabasehrish/spark-hdf5-cms

Acknowledgments

We would like to thank Lisa Gerhardt for guidance in using Spark optimally at NERSC. This research was supported through Contract No. DE-AC02-07CH11359 with the United States Department of Energy, under the 2016 ASCR Leadership Computing Challenge award titled "An End-Station for Intensity and Energy Frontier Experiments and Calculations". This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.