Data and Analysis preservation in LHCb


Data and Analysis preservation in LHCb - March 21, 2013 - S.Amerio (Padova), M.Cattaneo (CERN)

Outline 2
Overview of the LHCb computing model in view of long term preservation:
- Data types and software tools
- Data lifecycle and organization
- Data storage
- Software organization
Status of long term data preservation at LHCb:
- Motivation and use cases
- Open access policy

LHCb Data types 3
RAW data
- Format: sequential files of integers and doubles
- Size: 60 kB/event, 3 GB/file, ~250 files/run
FULL.DST (Data Summary Tape; reconstruction output)
- Format: ROOT-based file containing all reconstructed objects for each event, plus a copy of the raw data
- Size: 130 kB/event on average, 3 GB/file
DST streams
- Format: ROOT-based file containing all of the event's information
- Size: 120-150 kB/event, 5 GB/file
MDST (MicroDST) streams
- Format: as DST, but containing only the part of the event contributing to the decay that triggered the stripping line
- Size: 10-15 kB/event, 3 GB/file
Approximately 1.2 million files, 3.3 million including replicas.
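As a rough cross-check of these figures, the sketch below (plain Python, decimal units assumed, per-event sizes taken as the quoted averages or range midpoints) estimates how many events fit in a typical file of each format.

```python
# Rough events-per-file estimates from the average sizes quoted above.
# Back-of-the-envelope figures only; real files vary event by event.

KB = 1e3  # bytes per kB (decimal convention assumed)
GB = 1e9  # bytes per GB

formats = {
    # name: (average event size in bytes, typical file size in bytes)
    "RAW":      (60 * KB, 3 * GB),
    "FULL.DST": (130 * KB, 3 * GB),
    "DST":      (135 * KB, 5 * GB),   # midpoint of the 120-150 kB range
    "MDST":     (12.5 * KB, 3 * GB),  # midpoint of the 10-15 kB range
}

for name, (event_size, file_size) in formats.items():
    print(f"{name:9s} ~ {file_size / event_size:,.0f} events per file")
```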

Software tools (I) 4
Data processing: AIDA, Boost, CLHEP, COOL, CORAL, CppUnit, Frontier_Client, GCCXML, GSL, HepMC, HepPDT, Python, QMtest, Qt, RELAX, ROOT, Reflex, XercesC, cernlib, fastjet, fftw, graphviz, libunwind, neurobayes_expert, oracle, pyanalysis, pygraphics, pytools, sqlite, tcmalloc, uuid, xqilla, xrootd, gfal
Simulation: Geant4, alpgen, herwig++, hijing, lhapdf, photos++, pythia6, pythia8, rivet, tauola++, thepeg
The only proprietary tools are the Oracle client and neurobayes_expert.
Data processing is done on Intel x86 machines (x86_64, with older software running on 32 bits). All of the software runs on the Scientific Linux CERN operating system, currently SLC5; the migration to SLC6 is nearly complete.

Software tools (II) 5
On top of these external tools, a number of applications are developed by LHCb for simulation, data reconstruction, analysis, etc.: Gaudi, Online, LHCb, Lbcom, Boole, Rec, Brunel, Phys, Hlt, Analysis, Stripping, DaVinci, Panoramix, Bender, Moore.
The production system relies on the DIRAC grid engine (customized into LHCbDirac).
The same software is used for data analysis; jobs are sent to the grid using the Ganga grid frontend (see the sketch below).
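A minimal sketch of what such a Ganga submission could look like, assuming a Ganga session with the LHCb plugins loaded; the application version, options file and splitter settings are placeholders, not a recommended configuration.

```python
# To be run inside a Ganga session, where Job, DaVinci, Dirac and
# SplitByFiles are provided by the LHCb Ganga plugins.
# The version number and options file below are placeholders.
j = Job(name="dst-analysis")
j.application = DaVinci(version="vXrY")            # LHCb analysis application
j.application.optsfile = "myAnalysisOptions.py"    # hypothetical job options
j.backend = Dirac()                                # run on the grid via LHCbDirac
j.splitter = SplitByFiles(filesPerJob=10)          # one subjob per 10 input files
j.submit()
```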

Data lifecycle (I) 6
Data rates (per second of stable beam):
- Raw data: 300 MB/s
- Reconstruction (Brunel) produces FULL.DST: ~200% of the raw data size, 650 MB/s
- Stripping and streaming (DaVinci) produce DST/MDST: ~10% of FULL.DST (~20% of raw), 60 MB/s
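The quoted rates follow from the size ratios between the stages; a short cross-check (approximate, since the percentages themselves are rounded):

```python
# Approximate cross-check of the per-second-of-stable-beam rates above.
raw_rate = 300.0                      # MB/s of RAW data
full_dst_rate = 2.00 * raw_rate       # FULL.DST ~200% of RAW -> ~600 MB/s (quoted: 650)
dst_mdst_rate = 0.10 * full_dst_rate  # DST/MDST ~10% of FULL.DST -> ~60 MB/s (quoted: 60)
print(f"FULL.DST ~{full_dst_rate:.0f} MB/s, DST/MDST ~{dst_mdst_rate:.0f} MB/s")
```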

Data lifecycle (II) 7
External packages used at each stage of the chain (Raw data -> Reconstruction (Brunel) -> FULL.DST -> Stripping and Streaming (DaVinci) -> DST/MDST):
- Reconstruction: fftw, neurobayes_expert, RELAX, oracle, sqlite, HepMC, CLHEP, XercesC, uuid, AIDA, GSL, CppUnit, xrootd, GCCXML, libunwind, tcmalloc, Python, pytools, Boost, CORAL, COOL
- Stripping and streaming: blas, lapack, fastjet, fftw, neurobayes_expert, RELAX, oracle, sqlite, HepMC, CLHEP, XercesC, uuid, AIDA, GSL, CppUnit, xrootd, GCCXML, libunwind, tcmalloc, Python, pyanalysis, pygraphics, pytools, Boost, CORAL
(A comparison of the two lists is sketched below.)
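To make the difference between the two package lists easier to see, the sketch below compares them as Python sets; the package names are copied from the lists above, and the grouping by stage follows the slide layout.

```python
# Set comparison of the external packages listed for the two stages above.
reconstruction = {
    "fftw", "neurobayes_expert", "RELAX", "oracle", "sqlite", "HepMC",
    "CLHEP", "XercesC", "uuid", "AIDA", "GSL", "CppUnit", "xrootd",
    "GCCXML", "libunwind", "tcmalloc", "Python", "pytools", "Boost",
    "CORAL", "COOL",
}
stripping = {
    "blas", "lapack", "fastjet", "fftw", "neurobayes_expert", "RELAX",
    "oracle", "sqlite", "HepMC", "CLHEP", "XercesC", "uuid", "AIDA",
    "GSL", "CppUnit", "xrootd", "GCCXML", "libunwind", "tcmalloc",
    "Python", "pyanalysis", "pygraphics", "pytools", "Boost", "CORAL",
}
print("only in stripping:     ", sorted(stripping - reconstruction))
print("only in reconstruction:", sorted(reconstruction - stripping))
```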

Computing and storage resources 8

Data storage 9
Storage
- Data are stored on tape and disk at T0 and T1 sites.
Backup copies:
- Raw data: 2 copies on tape, 1 at T0 and 1 at one of the T1s
- FULL.DST: 1 copy on tape, for the most recent reprocessing
- DST/MDST: 4 copies on disk for the latest reprocessing N (3 on T0D1 storage at selected T1s + 1 at T0); 2 copies for the previous reprocessing (N-1); 1 archive replica on T1D0, either at T0 or at one T1
Safety measures
- Standard protections provided by the mass storage systems via ACLs
Recovery mechanisms
- Raw data: two copies in geographically distinct locations; known losses can be recovered, but there are no systematic checks on tape integrity
- FULL.DST: can be recovered by rerunning the reconstruction
- DST: for live data, mechanisms are in place to recover a lost disk copy from the other disk copies; for archived data we rely on a single tape copy, with no disaster recovery (but DST files can be regenerated)
(The replication policy is summarised as a data structure below.)
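The replication policy above can be written down compactly as a small data structure; this is an illustrative summary only, not the actual LHCbDirac storage configuration.

```python
# Replica counts per data type, as described on the slide above.
replica_policy = {
    "RAW": {
        "tape": 2,   # 1 copy at T0, 1 at a T1
        "disk": 0,
    },
    "FULL.DST": {
        "tape": 1,   # most recent reprocessing only
        "disk": 0,
    },
    "DST/MDST (latest reprocessing N)": {
        "disk": 4,   # 3 on T0D1 at selected T1s + 1 at T0
        "tape": 1,   # archive replica on T1D0 at T0 or a T1
    },
    "DST/MDST (previous reprocessing N-1)": {
        "disk": 2,
        "tape": 1,
    },
}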

Data organization and description 10
The Data Management system is based on two catalogues:
- A replica catalogue contains the complete information on individual files and their replicas. It is used for data manipulation and job brokering.
- A book-keeping catalogue contains information on the provenance of the files; it is used by physicists to select the files they want to analyse. Users can access this catalogue with a standalone GUI or via the LHCbDirac web portal (a hypothetical query is sketched below).
All data other than raw is in ROOT format, but with LHCb-specific data description. Simulation uses HepMC for the interface to generators. Decoding is handled by the LHCb software framework, and the data are described in header files and on the twiki.
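As an illustration of how a dataset is selected through the book-keeping catalogue by its provenance, here is a hypothetical helper; both the function and the example path are assumptions made for illustration, not the real LHCbDirac API or an actual dataset.

```python
# Hypothetical stand-in for a book-keeping catalogue query; in practice the
# query is issued from the standalone GUI or the LHCbDirac web portal.
def select_dataset(bookkeeping_path):
    """Return the logical file names (LFNs) of the dataset at this provenance path."""
    print(f"querying book-keeping catalogue for: {bookkeeping_path}")
    return []  # the real query would return LFNs, resolved to replicas via the replica catalogue

# Illustrative provenance-style path (data-taking conditions / processing pass /
# stripping version / stream); the path structure is assumed for illustration.
lfns = select_dataset("/LHCb/Collision11/SomeConditions/Reco12/Stripping17/DIMUON.DST")
```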

Software organization (I) 11
Software is organized in terms of CMT packages and projects:
- Package: a set of functions with a well defined functionality; the basic unit of software development; C++/Python.
- Project: a set of packages grouped together according to functionality, e.g. three projects for the framework (Gaudi, LHCb, Phys), several component projects for algorithms (e.g. Rec, Hlt, Analysis, ...), and one project per application containing the application configuration (e.g. Brunel, DaVinci, ...). This hierarchy is sketched below.
All software is versioned in SVN repositories.
User-level scripts and macros are often maintained in private areas.
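A compact sketch of this project hierarchy as a plain mapping; the project names are taken from the examples above and the roles paraphrase the slide.

```python
# Project -> role, following the examples given above.
projects = {
    "Gaudi": "framework", "LHCb": "framework", "Phys": "framework",
    "Rec": "component (algorithms)", "Hlt": "component (algorithms)",
    "Analysis": "component (algorithms)",
    "Brunel": "application (reconstruction)", "DaVinci": "application (analysis)",
}
for name, role in projects.items():
    print(f"{name:10s} {role}")
```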

Software organization (II) 12
The book-keeping knows which version of an application was used to produce each dataset. The CMT versioning machinery records the list of every SVN tag used to build a given version of an application. We maintain an informal list of application versions consistent with a given version of the data.
The software production chain (simulation/trigger/reconstruction/stripping) is guaranteed to be self-consistent only for a limited set of versions (one major version plus patches). The latest analysis software can in principle process all existing versions of the data, and the latest version of the reconstruction can process all raw data.

Long term data preservation 13
Goal: LHCb data and complete analysis capabilities should be preserved in the long term future.
Motivation: LHCb data will maintain a high physics potential in the future. As an example, consider the 2010 data; these samples are unique in terms of data taking conditions:
- almost unbiased data, with no pile-up due to the low luminosity;
- three different centre-of-mass energies: 0.9, 2.76 and 7 TeV.
Physicists inside and outside the collaboration may be interested in revisiting LHCb data, to perform new analyses (new theories, new analysis methods, ...) or to cross-check their results.
Data to be preserved:
- FULL.DSTs, as they contain the raw data and are the starting point for producing user-level analysis files (DST/MDST);
- the most recent DST/MDST, for new analyses on existing stripping lines.
Software to be preserved: we are considering different options, see the next slide.

Long term data preservation: use cases 14
1) Read old DSTs (e.g. to make new ntuples with different variables): the latest analysis software (DaVinci) version can be used.
2) New stripping selection on a DST dataset: we need the DaVinci/stripping code corresponding to the latest stripping, so that the new selection can use existing stripping lines for the backgrounds.
3) New MC (due to a new theory, decay mode, ...): we need the trigger emulation corresponding to a given data taking period, plus the latest reconstruction and stripping code.
4) Exactly reproduce an old analysis: the original DSTs and analysis software (DaVinci) version are required; user-level scripts and macros are also needed.
NB: we are already performing software preservation for a specific application: trigger swimming, where old trigger code has to run on more recent DSTs.

LHCb open access policy (I) 15
The LHCb open access policy was approved last February. Key points:
Level-1 data: published results
- All scientific results are public. Data associated with the results will also be made available; the format and repositories will be decided by the Editorial Board.
Level-2 data: outreach and education
- LHCb is already involved in outreach and education activities. Event displays and simple analysis-level ntuples are already available and will continue to be provided to the public. These data are for educational purposes only and are not suitable for publication.

LHCb open access policy (II) 16
Level-3 data: reconstructed data
- LHCb will make reconstructed data (DST) available to the public: 50% five years after the data are taken, 100% after 10 years. The associated software will be available as open source, together with the existing documentation. Any publication that results from data analysis by non-members of the collaboration will require a suitable acknowledgement ("data was collected by LHCb") and disclaimer ("no responsibility is taken by the collaboration for the results").
Level-4 data: raw data
- Due to the complexity of the raw data processing stage, the extensive computing resources required and the heavy access to tape resources, direct access to raw data is not permitted to individuals within the collaboration; raw data processing is performed centrally. For this reason, the collaboration is currently not planning to allow open access to raw data.

Conclusions 17
We are analysing our computing model in view of the long term preservation of data and analysis capabilities.
The first goal is to guarantee that all LHCb data can be accessed and analysed by the collaboration in the long term future; we are identifying the necessary requirements (e.g. new reconstruction and analysis software versions able to run on all data taking periods; preservation of trigger emulations for all data taking periods).
In parallel, work on open access is already ongoing:
- outreach activities on a subset of LHCb data (see Bolek's talk this morning);
- the open access policy, official since February 2013.