Big Data Analytics and the LHC

Maria Girone, CERN openlab CTO. Computing Frontiers 2016, Como, May 2016. DOI: 10.5281/zenodo.45449, CC-BY-SA; images courtesy of CERN.

Big bang in the laboratory. We gain insight by colliding particles at the highest energies possible to measure production rates, masses and lifetimes, and decay rates. From this we derive the spectroscopy as well as the dynamics of elementary particles. Progress is made by going to higher energies and brighter beams: higher energies allow higher-mass particles to be produced, while brighter beams allow rarer phenomena to be probed. Brighter beams also mean more collisions per event, and therefore more complex events.

CERN: A Unique Environment to Push Technologies to Their Limits. In its 60-year life CERN has made some of the most important discoveries in particle physics: the observation of the W and Z bosons, the measurement of the number of neutrino families, and the discovery of the Higgs boson.

CERN: where the Web was born.

The Large Hadron Collider (LHC)

The CMS Experiment. 80 million electronic channels x 4 bytes x 40 MHz gives ~10 Petabytes/sec of information; a factor of 1/1000 from zero-suppression and 1/100,000 from online event filtering reduces this to ~1000 Megabytes/sec of raw data to tape, or about 50 Petabytes of data per year written to tape, including simulations. The collaboration counts about 2000 scientists (1200 with a Ph.D. in physics) from ~180 institutions in ~40 countries; the detector weighs 12,500 tons and measures 21 m in length and 16 m in diameter.

CERN Computer Centre (Tier-0): acquisition, first-pass reconstruction, storage and distribution. Data rates from the experiments: LHCb ~50 MB/sec (2015: ~700 MB/sec), ATLAS ~320 MB/sec (2015: ~1 GB/sec), CMS ~220 MB/sec (2015: ~600 MB/sec), ALICE 1.25 GB/sec for ions (2015: ~4 GB/sec).

A Needle in a Haystack

Data Mining. Selecting a new physics event at the LHC is like choosing one grain of sand in 20 volleyball courts.

LHC Computing Scale. 1 PB/s of data is generated by the detectors, with up to 100 PB/year of stored data. A distributed computing infrastructure of half a million cores works 24/7, running an average of 40M jobs/month. A sample equivalent to the accumulated data and simulation of the 10-year LEP program is produced 5 times a day; placed centrally, this capacity would rank among the top supercomputers. More than 100 PB are moved and accessed by 10k people, with a continuous data transfer rate of 6 GB/s (600 TB/day) across the Worldwide LHC Computing Grid (WLCG). [Chart: LHC disk and tape storage (TB), 2013-2016, split into disk and tape.]

Global Distribution. Open Science Grid: the largest national contribution is only 24% of total resources.

OSG since Inception. [Chart: OSG CPU usage from inception, rising from ~40 to ~120 million hours/month between 2010 and 2015, dominated by CMS and ATLAS; accounting was not available at inception.] The Large Hadron Collider experiments ATLAS and CMS operate on a 180,000-core distributed infrastructure in the US alone.

Big Data Analytics

Processing and people: data analytics at the LHC. Data Analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information. At the LHC (Run 2) this is a complicated series of steps, each with its own data volume, computing scale and community: from the detector (1 PB/s) through DAQ and trigger (fewer than 200 people, 20k cores) and the HLT (40k cores), producing selected RAW data at 1 GB/s after the hardware trigger (TB/s); reconstruction (30k cores) and reprocessing (60k cores), each operated by fewer than 100 people, producing derived data at 2 GB/s; and organized analysis, with an analysis selection of 100 MB/s feeding the final selection and serving more than 1000 analysis users.

Data to Manage. Datasets, calibration releases and software releases are all distributed globally. A typical physicist doing data analysis uses custom software and configurations on top of a standardized software release, re-applies some high-level calibrations, and does so uniformly across all datasets used in the analysis.

Software & Calibrations. Both are distributed via systems that use Squid caches: calibrations through the Frontier system, backed by an Oracle database, and software through CVMFS, backed by a filesystem. Data distribution is achieved via a globally distributed caching infrastructure.
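As an illustration of this cache-based distribution, the sketch below shows how a client might fetch a payload through a site-local Squid proxy. This is a minimal sketch only; the proxy host and payload URL are hypothetical placeholders, not the actual Frontier or CVMFS endpoints.

    # Minimal sketch: fetch a payload through a site-local Squid HTTP cache.
    # The proxy host and URL below are hypothetical placeholders.
    import requests

    SITE_SQUID = "http://squid.example-site.org:3128"    # local Squid cache (assumed)
    PAYLOAD_URL = "http://frontier.example.org/payload"  # calibration payload (assumed)

    # Route the request via the Squid proxy; repeated requests for the same
    # object are then served from the cache instead of the central server.
    response = requests.get(PAYLOAD_URL, proxies={"http": SITE_SQUID}, timeout=30)
    response.raise_for_status()
    print(len(response.content), "bytes received")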

Calibration Data. For details see, e.g., "Operational Experience with the Frontier System in CMS", Journal of Physics: Conference Series 396 (2012) 052014, doi:10.1088/1742-6596/396/5/052014.

Frontier System Architecture. Applications talk to a Squid at each cluster; this infrastructure at ~100 clusters globally serves ~1 billion requests per day in aggregate, delivering ~40 TB of data per day. The central infrastructure at CERN serves ~5 million requests per day, delivering ~40 GB of data.

Software Distribution. For details see, e.g., "The need for a versioned Analysis Software Environment", arXiv:1407.3063 [cs.SE], and references therein.

Software @ LHC. Software is a team sport, distributed via CVMFS. The stack ranges from rapidly changing individual analysis code (~0.1 MLOC) through the experiment software framework and High Energy Physics libraries (~4-5 MLOC each) down to the stable Grid libraries, system libraries and OS kernel (of order 20 MLOC). In CMS, roughly 1/4 of the collaboration, scientists and engineers, contributed to the common source code of ~3.6M C++ SLOC.

CVMFS Architecture. On a CernVM 2 or standard SL5 worker node, CernVM-FS is mounted via FUSE on the SL5 kernel and reads through a hierarchy of HTTP caches: the Linux file-system buffers (~1 GB) and a local CernVM-FS cache (~10 GB) with least-recently-used replacement, holding a single release locally while all releases remain available. Release statistics for 2013 (releases / shared libraries and plug-ins per release): ATLAS 19 / 3900, CMS 40 / 2200, ALICE 49 / 210.
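To make the least-recently-used replacement concrete, here is a simplified cache sketch. It is an illustration only: capacity is counted in entries rather than bytes, unlike the real 10 GB CernVM-FS cache, and the "fetch" is faked.

    # Simplified LRU cache sketch: evicts the least-recently-used entry when full.
    # The real CernVM-FS cache budgets by file size (e.g. 10 GB), not entry count.
    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()

        def get(self, key):
            if key not in self.entries:
                return None
            self.entries.move_to_end(key)          # mark as most recently used
            return self.entries[key]

        def put(self, key, value):
            if key in self.entries:
                self.entries.move_to_end(key)
            self.entries[key] = value
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)   # drop least recently used

    cache = LRUCache(capacity=3)
    for name in ["libA.so", "libB.so", "libC.so", "libA.so", "libD.so"]:
        if cache.get(name) is None:                # cache miss: pretend to fetch via HTTP
            cache.put(name, "contents of " + name)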

Dataset Distribution: ~50 PB per year per experiment.

Dataset Distribution Strategies. Managed pre-staging of datasets to clusters, steered either by human intelligence or by data popularity. Data transfer integrated with the processing workflow, determining popularity dynamically from pending workloads in the WMS. Remote file opens and reads via data federation. Dynamic caching, just as for calibrations and software.
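A toy sketch of the popularity-based idea, with invented thresholds and replica counts: count recent accesses per dataset and scale the number of replicas accordingly. This is not the experiments' actual placement logic, just an illustration of the principle.

    # Toy popularity-based replication sketch; thresholds and replica counts are invented.
    from collections import Counter

    def plan_replicas(access_log, hot_threshold=100, max_replicas=5):
        """access_log: list of dataset names, one entry per recent access."""
        popularity = Counter(access_log)
        plan = {}
        for dataset, hits in popularity.items():
            # Scale the number of replicas with popularity, capped at max_replicas.
            plan[dataset] = min(max_replicas, 1 + hits // hot_threshold)
        return plan

    log = ["/data/run2015/DS1"] * 250 + ["/data/run2015/DS2"] * 40
    print(plan_replicas(log))   # {'/data/run2015/DS1': 3, '/data/run2015/DS2': 1}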

ATLAS Data Total on Disk. Data is managed across 130 storage clusters worldwide: 170 PB and 700M files on disk today.

"Any Data, Any Time, Anywhere: Global Data Access for Science", http://arxiv.org/abs/1508.01443, making the case for WAN reads.

Optimize Data Structure for Partial Reads. An event has many collections of objects with many components: a Track collection (a few hundred tracks per event), a Jet collection (a few tens per event), an Electron collection, and so on. The data are laid out per branch in clusters of events (Track quantities for events 1 to N, Jet quantities, electron energies, ...), and each basket is compressed separately. This layout is optimized to efficiently read a partial event (e.g., only the Tracks) or a partial object (e.g., only the electron energy, to decide whether the event is interesting at all). For CMS, the cluster size on disk is of order 20 MB, and total file sizes range from roughly 100 MB to 10 GB.
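In Python, this kind of partial read can be illustrated with uproot, which fetches and decompresses only the requested branches of a ROOT file. A minimal sketch; the file path, tree name and branch name are placeholders for illustration.

    # Sketch of a partial read: only the requested branch is fetched and decompressed.
    # File path, tree name and branch name are illustrative placeholders.
    import uproot

    with uproot.open("events.root") as f:
        tree = f["Events"]
        # Read just the electron energy to decide which events are interesting,
        # without touching the (much larger) track collections.
        arrays = tree.arrays(["Electron_E"], entry_stop=10_000)
        high_e = arrays["Electron_E"][arrays["Electron_E"] > 50.0]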

Data Federation for WAN Reads. Applications connect to a local or regional redirector, which redirects upwards only if the file does not exist in the tree below, minimizing WAN read access latency. [Diagram: a global XRootD redirector sitting above US and EU regional redirectors, each above local redirectors and their data servers in many clusters in the US and EU.]
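Reading through the federation looks the same to the application as a local read. A sketch, assuming XRootD support is available in the Python environment; the redirector hostname and file path are illustrative, not a real endpoint.

    # Sketch: open a file by logical name through a redirector; the redirector
    # locates a data server holding the file. Hostname and path are illustrative.
    import uproot

    url = "root://global-redirector.example.org//store/data/Run2015/file.root"
    with uproot.open(url) as f:            # requires XRootD support in the environment
        tree = f["Events"]
        print(tree.num_entries, "events available via the federation")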

HEADING TOWARDS THE FUTURE

LHC Schedule

LHC Run 3 and Run 4: Scale and Challenges. [Timeline, 2009 to ~2030: first run, LS1, second run, LS2, third run, LS3, then HL-LHC and perhaps FCC.] The raw data volume for the LHC increases exponentially, and with it the processing and analysis load. Technology improving at ~20%/year will bring a factor of 6-10 in 10-11 years, but estimates of resource needs at the HL-LHC are a factor of 10 above what is realistic to expect from technology at reasonably constant cost. [Charts: data estimates for the first year of HL-LHC (PB), raw and derived, and CPU needs for the first year of HL-LHC (kHS06), for ALICE, ATLAS, CMS and LHCb.] Technology revolutions are needed to close the resource gap. Data: raw, 50 PB in 2016 to 600 PB in 2027; derived (1 copy), 80 PB in 2016 to 900 PB in 2027. CPU: x60 from 2016. Courtesy of I. Bird.
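Taking the ~20%/year figure at face value, a quick compounding check (an illustration, not a forecast) shows where a factor of roughly 6-7 comes from; reaching the upper end of the quoted 6-10 would require a somewhat faster effective annual improvement.

    % Compound annual improvement of ~20% over 10-11 years:
    \[
      1.20^{10} \approx 6.2, \qquad 1.20^{11} \approx 7.4 .
    \]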

Relative Size of Things. Our data and processing problems are ~1% the size of the largest industry problems. Processing: Amazon has more than 40 million processor cores in EC2. Storage: Amazon has 2x10^12 unique user objects and supports 2M queries per second; Google has 10-15 exabytes under management; Facebook holds 300 PB; eBay collected and accessed the same amount of data as LHC Run 1. [Charts: LHC disk and tape storage, 2013-2016; stored data in EB for Google, Facebook and HEP.]

Data Analytics to the Rescue? Making more effective use of the data collected is critical to maximise scientific discovery and close the resource gap. There are currently ongoing projects in accelerator system controls, data handling and quality optimizations, data reduction, optimized formats, and investigations of machine learning for analysis and event categorization.

Data Reduction. The increase in data volume expected by the experiments in Run 3 and Run 4 changes the operations mode and opens up possibilities for analysis. In ALICE and LHCb, events will leave the detector essentially reconstructed, with final calibrations, so analysis can start immediately (maybe even online). ATLAS and CMS will both have much higher trigger rates and a desire to streamline analysis. [Diagram labels: data analytics, data reduction, data classification, resource optimization, automated analysis; organized analysis (60k cores) and custom user samples producing derived data (2 GB/s), an analysis selection (100 MB/s), a final selection, and more than 1000 analysis users.] There is interest in industry tools for improved data analysis (Spark, Hadoop, etc.) and in community building around analysis challenges, e.g. https://www.kaggle.com/c/flavours-of-physics.

Automated Analysis. [Diagram labels: data analytics, data reduction, data classification, resource optimisation, automated analysis; reconstruction (20k cores), operations (fewer than 50 people).] Data Quality Monitoring (DQM) is key to delivering high-quality data for physics; it is used both in the online and offline environments. It currently involves detector experts scrutinizing a large number of histograms and comparing them with a reference. The aim is to apply recent progress in machine learning techniques to automate the DQM scrutiny. The LHC is the largest piece of scientific apparatus ever built, and there is a tremendous amount of real-time monitoring information to assess its health and diagnose faults; the volume and diversity of this information make it an interesting application of big data analytics.
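As a baseline for automated scrutiny, far simpler than what the experiments actually deploy, one can flag a run whose histogram deviates too much from the reference with a per-bin chi-square comparison. A minimal sketch with an invented threshold:

    # Baseline DQM-style check: flag a histogram that deviates from the reference.
    # Simplistic chi-square per degree of freedom; real DQM automation is more elaborate.
    import numpy as np

    def is_anomalous(hist, reference, threshold=3.0):
        hist = np.asarray(hist, dtype=float)
        reference = np.asarray(reference, dtype=float)
        # Normalize the reference to the same total count as the run histogram.
        expected = reference * hist.sum() / reference.sum()
        mask = expected > 0
        chi2 = np.sum((hist[mask] - expected[mask]) ** 2 / expected[mask])
        return chi2 / mask.sum() > threshold

    reference = np.array([100, 400, 800, 400, 100])
    good_run  = np.array([ 95, 410, 805, 390, 100])
    bad_run   = np.array([ 95, 410, 805, 390, 400])   # excess in the last bin
    print(is_anomalous(good_run, reference), is_anomalous(bad_run, reference))  # False True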

Resource Optimization. Use machine learning techniques to predict how data should be placed and how processing resources should be scheduled, in order to achieve a dramatic reduction in the latency of delivering data samples to analysts. Design a system capable of using information about resource usage (disk access, CPU efficiency, job success rates, data transfer performance, and more) to make more automated decisions about resource allocation. [Diagram labels: data analytics, data reduction, data classification, resource optimization, automated analysis; reconstruction, reprocessing, organized analysis.]
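A minimal sketch of the prediction idea, with invented features and purely synthetic data, using a gradient-boosted regressor to estimate how often a dataset will be accessed next week; nothing here reflects the experiments' actual models or inputs.

    # Sketch: predict next-week dataset accesses from recent usage features.
    # Features, data and model choice are illustrative, not the production system.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    n = 500
    # Hypothetical per-dataset features: accesses last week, dataset age (days), size (TB)
    X = np.column_stack([
        rng.poisson(20, n),
        rng.uniform(1, 1000, n),
        rng.uniform(0.1, 50, n),
    ])
    # Synthetic target: popularity grows with recent accesses and decays with age.
    y = 0.8 * X[:, 0] + 200.0 / (1.0 + X[:, 1] / 30.0) + rng.normal(0, 2, n)

    model = GradientBoostingRegressor().fit(X, y)
    new_dataset = np.array([[35, 10.0, 5.0]])     # busy, recently produced dataset
    print("predicted accesses next week:", model.predict(new_dataset)[0])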

Data Classification. [Diagram labels: data analytics, data reduction, automated analysis, resource optimization, data classification; data volume: all data (TB/s).] Investigate the possibility of performing real-time event classification in the high-level trigger systems of the LHC experiments, extracting information from events that would otherwise be rejected. Uncategorized events might potentially be the most interesting, revealing the presence of new phenomena in the LHC data. Event classification would allow both a more efficient trigger design and an extension of the physics program beyond the boundaries of traditional trigger strategies.
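A toy sketch of event categorization on a handful of event-level features (invented here: jet multiplicity, scalar jet-pT sum, missing ET, lepton count), with synthetic labels; a shallow decision tree keeps per-event inference cheap, as a real-time trigger environment would require. It only illustrates the idea, not any experiment's trigger classifier.

    # Toy event-categorization sketch with invented features and synthetic labels;
    # a shallow tree keeps per-event inference fast, as a trigger environment needs.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)
    n = 2000
    # Columns: n_jets, HT (GeV), missing ET (GeV), n_leptons
    X = np.column_stack([
        rng.integers(0, 10, n),
        rng.exponential(200, n),
        rng.exponential(50, n),
        rng.integers(0, 3, n),
    ])
    # Synthetic categories: 0 = multijet-like, 1 = leptonic, 2 = high-MET
    y = np.where(X[:, 3] > 0, 1, np.where(X[:, 2] > 120, 2, 0))

    clf = DecisionTreeClassifier(max_depth=4).fit(X, y)
    event = np.array([[3, 450.0, 160.0, 0]])
    print("category:", clf.predict(event)[0])     # expected: 2 (high-MET-like)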

Engagement with Industry. Industry has focused on improving data analytics techniques to solve large-scale industrial problems and to analyze social media data. This has provided common ground for developments and activities with industry leaders in the areas of big data, machine learning and data analytics. High Energy Physics has long been a leader in data analysis and data handling, but industry has closed the gap in this area.

Summary and Outlook. The LHC today operates a globally distributed 500,000 x86-core infrastructure ingesting more than 100 PB per year. The LHC plans to dramatically increase the volume and complexity of the data collected by Run 3 and Run 4, which results in an unprecedented computing challenge in the field of High Energy Physics. Meeting this challenge within a realistic budget requires rethinking how we work: computing needs must shrink by roughly a factor of 10 relative to naive extrapolations of what we do today. We are turning to industry and other sciences for improvements in data analytics; data reduction and automated analysis through machine learning techniques need to be investigated.