CMS Analysis Workflow

Sudhir Malik, Fermilab / University of Nebraska-Lincoln, U.S.A.

CMS Software
CMS software (CMSSW) is based on the Event Data Model (EDM): as event data is processed, products are stored in the event as reconstructed (RECO) data objects, each a C++ container held in a ROOT TTree. The structure is designed to be flexible for reconstruction, not for usability; information related to a physics object (e.g. tracks) may be stored in a different location (e.g. the track isolation). Data processing is steered via Python job configurations. CMS also provides a lighter version of the full CMSSW framework called FWLite (FrameWorkLite): plain ROOT plus the data format libraries and dictionaries needed to read CMS data, plus some basic helper classes.
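As a concrete illustration of how such a job is steered, here is a minimal sketch of a Python job configuration; the process name, file names and event count are placeholder assumptions, not values from the talk.

```python
# Minimal sketch of a CMSSW Python job configuration (hypothetical file names
# and process label; a real analysis would add analyzers, producers and
# filters to the path).
import FWCore.ParameterSet.Config as cms

process = cms.Process("ANALYSIS")

# Read EDM events (RECO/AOD) from a ROOT file
process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring("file:input_AOD.root")
)
process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(1000))

# Write (a subset of) the event content back out in EDM format
process.out = cms.OutputModule("PoolOutputModule",
    fileName = cms.untracked.string("output.root")
)
process.outpath = cms.EndPath(process.out)
```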

Guidelines and Boundary Conditions
CMS has set some boundary conditions on the analysis structure, given the geographical spread of the collaboration, the high cost of the programme and the potential for major discoveries and breakthroughs:
- Physics results are of very high quality
- Understood and largely reproducible by other collaboration members
- Reproducible long after the result is published
- Datasets and skims selected on the basis of persistent information and the retention of provenance information
- While most of the tools are officially blessed, the analysis model has enough flexibility to support analyses that require special datasets and tools
- The software platform is Scientific Linux, on which CMS develops and runs validation code
- Physics objects and algorithms are approved by the corresponding Physics Object Groups (POGs, experts in physics-object identification algorithms in CMS)
- The origin of the samples should be fully traceable through the provenance information
- Analysis code run on the user sample is fully reviewable by Physics Analysis Groups (PAGs, experts providing physics leadership in CMS)

Flexible and Distributed Computing

Data Tiers and Flow

Data Analysis Workflow
- Start from RECO/AOD on a T2
- Create private skims with fewer and fewer events, in one or more steps
- Fill histograms and ntuples, perform complex fits, calculate limits, toss toys, ...
- Document what you have done and PUBLISH
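To make the histogram-filling step concrete, below is a sketch of a FWLite-style event loop in PyROOT; the input file name is a placeholder, and "muons" is assumed here to be the label of the reco::Muon collection in AOD.

```python
# Sketch of a FWLite event loop in PyROOT over an AOD/skim file
# (placeholder file name; assumed collection label "muons").
import ROOT
from DataFormats.FWLite import Events, Handle

events = Events("skim_AOD.root")
handle = Handle("std::vector<reco::Muon>")
label  = "muons"

h_pt = ROOT.TH1F("muon_pt", "Muon p_{T};p_{T} [GeV];muons", 100, 0.0, 200.0)

for event in events:
    event.getByLabel(label, handle)
    for mu in handle.product():
        h_pt.Fill(mu.pt())

out = ROOT.TFile("plots.root", "RECREATE")
h_pt.Write()
out.Close()
```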

Physics Analysis Toolkit
To ease this situation while keeping the goal of physics analysis in mind, a special software layer called PAT (Physics Analysis Toolkit) was developed. It facilitates access to the event information and combines flexibility, user friendliness and maximum configurability. It keeps provenance, uses official tools and code, and lets the user slim, trim or drop event information, sitting as a layer between the RECO/AOD data tiers and the user's analysis.
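As a rough sketch of how an analyst might produce PAT objects from AOD, the configuration below loads the stock PAT sequences from PhysicsTools/PatAlgos; the sequence name and the omitted conditions/geometry setup are assumptions that vary between CMSSW releases.

```python
# Sketch: producing PAT objects from AOD with the standard PAT sequences
# (patDefaultSequence is assumed; a real job also needs a global tag,
# geometry and an output module, omitted here for brevity).
import FWCore.ParameterSet.Config as cms

process = cms.Process("PAT")
process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring("file:input_AOD.root")
)
process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(1000))

# Load the standard PAT producers/selectors and schedule them
process.load("PhysicsTools.PatAlgos.patSequences_cff")
process.p = cms.Path(process.patDefaultSequence)
```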

Ways to analyze data
- Directly using RECO/AOD objects (for experts): requires expert knowledge of all available features and keeping up to date with all POGs
- Producing and using PAT objects (for users): all up-to-date features collected from all POGs, with the latest algorithms always available from the same interface, the PAT objects

Ways to store data
Formats, ordered from most flexible and largest to smallest and fastest:
- RECO/AOD skim: full flexibility, needs a lot of space
- PATtuple: full flexibility, saves space by embedding
- EDMtuple: not flexible (need to know the exact set of objects and variables), provenance information is kept
- TNtuple: gives up provenance information, not officially supported

PATtuple
The challenge: the analyst wants all the relevant information from RECO/AOD stored in an intuitive and compressed way, but RECO/AOD can only be skimmed by keeping and dropping complete branches, not selected objects from within branches.
The solution: PAT objects. Embedding allows keeping only the relevant information, and all of it is available from a single interface for each physics object. The data are stored in the form of PAT objects (C++ classes).
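Keeping and dropping complete branches is expressed through the output module's event-content commands; the fragment below is a sketch with assumed module labels, showing the kind of statements used to build a slim PAT tuple.

```python
# Sketch: event-content keep/drop statements for a PoolOutputModule
# (the "selectedPat*" / "patMETs" labels are assumptions for illustration).
import FWCore.ParameterSet.Config as cms

out = cms.OutputModule("PoolOutputModule",
    fileName = cms.untracked.string("patTuple.root"),
    outputCommands = cms.untracked.vstring(
        "drop *",                        # start from nothing ...
        "keep *_selectedPatMuons_*_*",   # ... then keep whole branches by label
        "keep *_selectedPatJets_*_*",
        "keep *_patMETs_*_*",
    )
)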

EDMtuple
The challenge: a TNtuple is the fastest way to store and plot data, but it lacks the EDM provenance information needed for reproducible analyses (e.g. for a proper calculation of the luminosity), and it lacks integration with the Grid software.
The solution: store a set of user-defined doubles in EDM format. This is as fast as a TNtuple while keeping the provenance information. The data are stored in the form of a flat ntuple (a vector of doubles).
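For contrast, the plain-ROOT flat ntuple being compared against can be as simple as the sketch below (dummy variable names and values); it is fast to fill and plot, but nothing in the file records how the numbers were produced.

```python
# Sketch of a plain flat TNtuple of per-event doubles (dummy values);
# fast to fill and plot, but carries no EDM provenance information.
import ROOT

f  = ROOT.TFile("flat_ntuple.root", "RECREATE")
nt = ROOT.TNtuple("nt", "per-event variables", "mu_pt:mu_eta:met")

# In a real analysis these values would come from the event loop
nt.Fill(45.2, 0.7, 32.1)
nt.Fill(61.0, -1.2, 18.4)

f.Write()
f.Close()
```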

Common Workflows
Analysis development phase: keep all relevant information in a PATtuple, staying fully flexible for object and algorithm changes. Run PAT on the Grid to turn RECO/AOD into a PATtuple, then run the analysis on a cluster or locally to produce ntuples and plots.
Analysis update with more data: keep all necessary information in a small EDMtuple and use this highest-performance format for fast updating of the analysis. Either run PAT on the Grid and then the analysis on a cluster to produce the EDMtuple, or run PAT and the analysis together on the Grid; in both cases the final plots are made locally from the EDMtuple.

Analysis Practices
While software and physics tools determine the data preservation plan, its success depends on the analysis practices followed. CMS is a very young experiment, hence fortunate to be able to plan for data preservation. It has a strong culture of documentation (WorkBook, SoftWareGuide) and a strong culture of user support (multiple Physics Analysis Schools per year). All of the above iron out the usage of tools and software and serve as the basis for long-term preservation for ourselves. The following slides give a brief preview of these practices.

n-tuple usage
Ntuples are the popular data format for the final plots of an analysis.
[Survey charts: frequency of the number of analyses sharing an ntuple, and frequency of ntuplization (number of analyses vs. number of times ntuplized); for example, a bar at 10 with height 4 means 4 analyses ntuplized 10 times.]

How long does it take for you to n-tuplize?
[Survey chart: number of analyses vs. days to ntuplize, with answers ranging from 1 to 60 days.]

Reasons to recreate ntuples
- Change of the definition of lepton isolation
- Change of object ID
- Change of jet calibration
- Change of MET definition
- Change of a high-level analysis quantity
- Change of an event pre-selection
- Change of the definition of object cross-cleaning / event interpretation

Number of grid jobs run to create ntuples
Answers range from about 500 up to 100K; some use Condor batch instead. The typical running time of the grid jobs producing an n-tuple (running time of a single job) is from a few hours to less than a week, depending on the health of the grid sites.

Where do you store the output of your n-tuple?
Answers include the store-results option of CRAB, T2/T3 storage, laptop, CASTOR, eos@cern, a local shared filesystem, and other; 10% also store a copy at eos@fnal.

Disk space of the n-tuples (in GB)
[Survey charts: frequency vs. space in GB. MC bins: up to 20, 20 to 227, 227 to 2000, 2000 to 4250, more. Data bins: up to 2, 2 to 130, 130 to 2000, 2000 to 6550, more.]

What is the size of your n-tuple per event (in kB/event)?
[Survey chart: frequency vs. size per event in kB, with bins up to 1, 1 to 33, 33 to 61, and more.]

How do you foresee preserving your analysis? Where do you keep the code with which the analysis was performed?

Do you plan to preserve your analysis code for the future?

Would you store your ntuples for data preservation, and how much disk space would be needed? Answers range from 80 GB to 20 TB.

Summary
- CMS has a flexible software framework that covers all software needs from data taking to physics analysis; this is a major plus for data preservation
- CMS analysis is mostly a two-step process: a computing-intensive part on the Grid, and the final analysis and plots on local computing
- The majority of analyses use PATtuples in the physics analysis; the final plots are made from ntuples, whose shelf life is the duration of the analysis
- Storing the PATtuples and AOD data, however, gives the ability to regenerate the ntuples with modifications and to redo variations of the analysis
- Data storage is space intensive, and so is the analysis computing
- Key analyses should be preserved to the extent of level 3
- Given the current practices in CMS (robust documentation, the CMSSW Reference Manual, Data Analysis Schools, etc.), preserving data analyses for ourselves should be achievable

Useful Links
- CMS public page: http://cms.web.cern.ch/
- CMS code repository: http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/
- CMS Reference Manual: http://cmssdt.cern.ch/sdt/doxygen//
- CMS WorkBook: https://twiki.cern.ch/twiki/bin/view/cmspublic/workbook/
- CMS Data Analysis Schools: https://twiki.cern.ch/twiki/bin/view/cms/workbookexercisescmsdataanalysisschool