ALICE ANALYSIS PRESERVATION. Mihaela Gheata DASPOS/DPHEP7 workshop

2 Outline
- ALICE data flow
- ALICE analysis
- Data & software preservation
- Open access and sharing analysis tools
- Conclusions

3 ALICE data flow (diagram)
ONLINE: detector algorithms -> RAW rootify; SHUTTLE (running the detector algorithms) -> calibration data -> OCDB.
OFFLINE (AliEn + file catalog): RAW.root -> RECONSTRUCTION -> AliESDs.root -> FILTERING (with the AODB analysis database) -> AliAOD.root -> ANALYSIS -> results.root, with QUALITY ASSURANCE alongside.

4 Filtering analysis input data
- Event summary data (ESD) are produced by reconstruction and can be used directly as analysis input, but are not efficient for distributed data processing (large size, much information not relevant for analysis)
- Filtering is the procedure that creates a lighter event format (AOD) from ESD data, reducing the size by a factor of >3
- AOD files contain events organized in ROOT trees; only events coming from physics triggers are kept
- Additional filters for specific physics channels (muons, jets, vertexing) produce AOD tree friends, adding branches (see the friend-tree sketch below)
- Big files (up to 1 GB)
- The first filtering runs in the same process as reconstruction; filtering can be redone at any time on demand
- Backward compatibility: any old AOD file can be reprocessed with the current software version
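To make the AOD-plus-friends layout concrete, here is a minimal plain-ROOT sketch of reading a filtered AOD together with a delta-AOD friend. The tree name aodTree and the delta-AOD file name AliAOD.VertexingHF.root are assumptions based on common ALICE usage, not details given on this slide.

    // Minimal sketch: read filtered AOD events with a tree friend.
    // "aodTree" and the delta-AOD file name are illustrative assumptions.
    #include "TChain.h"

    void readFilteredAOD()
    {
       TChain chain("aodTree");
       chain.Add("AliAOD.root");                              // main AOD events
       chain.AddFriend("aodTree", "AliAOD.VertexingHF.root"); // filter-added branches
       for (Long64_t i = 0; i < chain.GetEntries(); ++i) {
          chain.GetEntry(i);   // loads main and friend branches together
          // ... analysis on the combined event ...
       }
    }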

5 Base ingredients for analysis
- Code describing the AOD data format: the AOD event model (AliAODEvent) class library, containing tracks, vertices, cascades, PID probabilities, centrality, trigger decisions and physics selection flags (a reading sketch follows this list)
- AOD data sets: filtered reconstructed events embedding the best knowledge of detector calibration, classified by year, collision type, accelerator period, ...
- Run conditions table (RCT): all data relevant to the quality and conditions of each individual run, per detector; the basis for quality-based data selections
- Analysis database (OADB): "calibrated" objects used by critical analysis services (centrality, particle identification, trigger analysis), plus analysis-specific configurations (cuts)
- The infrastructure for running distributed analysis: GRID, local clusters with access to data, PROOF farms, ...
- Monte Carlo AODs produced from simulated data: contain both the reconstructed information and the MC truth (kinematics); almost 1/1 ratio with real-data AODs, associated with anchor runs for each data set
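A hedged sketch of how the first two ingredients combine when reading an AOD. It assumes an AliRoot build; the ReadFromTree() connection and the aodTree name follow common ALICE tutorials and should be checked against the release in use.

    // Sketch: connect the AOD event model to the tree (requires AliRoot).
    #include "TFile.h"
    #include "TTree.h"
    #include "AliAODEvent.h"

    void readAODEvent()
    {
       TFile *file = TFile::Open("AliAOD.root");
       TTree *tree = (TTree*)file->Get("aodTree");
       AliAODEvent *event = new AliAODEvent();
       event->ReadFromTree(tree);   // wire the tree branches to the event model
       for (Long64_t i = 0; i < tree->GetEntries(); ++i) {
          tree->GetEntry(i);
          Int_t nTracks = event->GetNumberOfTracks(); // tracks, vertices, PID now accessible
          (void)nTracks;
          // ... event and track selection, histogram filling ...
       }
    }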

6 What will we need to process data after 10+ years?
- The ROOT library
- A minimal, versioned, lightweight library with the data format and analysis framework, allowing the ROOT trees in the AOD files to be processed (see the plain-ROOT sketch below)
- The RCT and cross-links from the ALICE logbook (snapshot)
- The relevant AOD datasets (data & MC), produced by filtering the most recent reconstruction passes, with the corresponding versioned OADB
- Relevant documentation, analysis examples and processing tools
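The special role of ROOT as the survival kernel can be illustrated with a sketch that needs no experiment software at all; the file and tree names are the same assumptions as above.

    // Sketch: inspect the AOD tree structure with plain ROOT only.
    // ROOT schema evolution keeps such files readable long-term.
    #include <cstdio>
    #include "TFile.h"
    #include "TTree.h"
    #include "TBranch.h"
    #include "TCollection.h"

    void inspectAOD()
    {
       TFile *file = TFile::Open("AliAOD.root");
       TTree *tree = (TTree*)file->Get("aodTree");
       TIter next(tree->GetListOfBranches());
       while (TBranch *branch = (TBranch*)next())
          printf("%-30s %s\n", branch->GetName(), branch->GetClassName());
    }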

7 Software preservation
- GPL or LGPL software based on ROOT, assuming that ROOT will still "exist" and provide similar functionality (such as schema evolution, illustrated in the sketch below) in 10-20 years
- A single SVN repository for the experiment code; external packages: ROOT, GEANT3/4, event generators, AliEn GRID
- Several work packages bundled in separate libraries (per detector or analysis group)
- Several production tags with documented changes; multiple platform support
- The documentation describes the analysis procedure clearly and is generally enough for a newcomer doing analysis; the problems show up once the software is no longer maintained
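Schema evolution, the ROOT feature this assumption leans on, is driven by the class version number in the ClassDef macro. A minimal sketch, with an invented class:

    // Sketch: ROOT schema evolution. When a member is added, the
    // ClassDef version is bumped; new code can still read old files
    // (missing members are left at their defaults). Class is invented.
    #include "TObject.h"

    class MyAODSummary : public TObject {
    public:
       Double_t fPt  = 0.;  // existed in version 1
       Double_t fEta = 0.;  // added later, hence version 2
       ClassDef(MyAODSummary, 2);
    };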

8 Data preservation for analysis
- If reprocessing is needed (improved reconstruction algorithms, major bug fixes, AOD data loss), the primary data are needed: raw data, calibration data, conditions metadata
- The most recent AOD datasets
- The metadata related to analysis and run conditions, essential for dataset selection

9 ALICE analysis shareable tools
- A specific framework for processing the main data types (AOD, ESD, MC kinematics)
- Organized in "trains" processing common data
- LEGO-like analysis modules created by users via a simple web interface (a module skeleton is sketched below)
- ALICE internal usage; too complex for outreach
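At the code level, a LEGO module is an analysis task. The skeleton below follows the widely documented AliAnalysisTaskSE pattern from AliRoot, with invented class and histogram names; it is a sketch of the pattern, not the exact train interface.

    // Sketch: a user analysis module as plugged into the trains.
    // Based on the documented AliAnalysisTaskSE pattern (AliRoot);
    // class and histogram names are invented for illustration.
    #include "AliAnalysisTaskSE.h"
    #include "AliAODEvent.h"
    #include "TList.h"
    #include "TH1F.h"

    class AliAnalysisTaskMyPt : public AliAnalysisTaskSE {
    public:
       AliAnalysisTaskMyPt(const char *name = "MyPt")
          : AliAnalysisTaskSE(name), fOutput(0)
       { DefineOutput(1, TList::Class()); }

       virtual void UserCreateOutputObjects()
       {
          fOutput = new TList();
          fOutput->SetOwner();
          fOutput->Add(new TH1F("hPt", "track p_{T};p_{T} (GeV/c)", 100, 0., 10.));
          PostData(1, fOutput);
       }

       virtual void UserExec(Option_t *)
       {
          AliAODEvent *aod = dynamic_cast<AliAODEvent*>(InputEvent());
          if (!aod) return;
          // ... event selection and histogram filling ...
          PostData(1, fOutput);
       }

    private:
       TList *fOutput;   //! output list (transient)
       ClassDef(AliAnalysisTaskMyPt, 1);
    };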

10 Level 3 preservation and open access
- A replication of the current analysis LEGO services on dedicated resources is possible; technically feasible in a short time (on clusters or clouds)
- The only problems are related to partitioning data and services for external usage (in case of sharing the GRID with ALICE users)
- The procedure to process ALICE data is very well documented: several examples, walkthroughs and tutorials
- External code can be compiled on the fly in our framework; simple examples show how to pack it in tarballs (see the sketch below)
- The resources to be assigned for open access are still to be assessed by management, like other important details: when, which data, procedure, access control, outcome validation, publication rules...
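"Compiled on the fly" here refers to ROOT's ACLiC mechanism; a minimal sketch, with illustrative file names:

    // Sketch: compile external user code on the fly with ROOT's ACLiC
    // and pack the sources in a tarball; file names are illustrative.
    #include "TROOT.h"
    #include "TSystem.h"

    void packAndLoad()
    {
       // ACLiC: compile the user task on the fly and load it
       gROOT->LoadMacro("AliAnalysisTaskMyPt.cxx+g");
       // Pack the sources into a tarball, e.g. for shipping to Grid jobs
       gSystem->Exec("tar czf MyTask.tar.gz AliAnalysisTaskMyPt.cxx AliAnalysisTaskMyPt.h");
    }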

11 Data access and sharing
- No external sharing is foreseen for RAW, calibration and conditions data
- For analysis formats, a simplification is considered, both for the data format and the processing software
- Sharing of data has started for level 2 (small samples used for outreach), not yet for the other levels; to be addressed during the software upgrade process
- The time for public data and software releases is not defined at this point

12 Conclusions
- ALICE is aware of the importance of long-term data preservation; a medium- to long-term plan is to be accommodated in the ambitious upgrade projects
- We are already investigating technical solutions allowing results to be reproduced and analyses re-run in 10+ years, using existing tools but also others (e.g. virtualization technology)
- Sharing of data is possible, but the implementation depends on the availability of resources; practical sharing conditions have not been considered yet

13 BACKUP SLIDES

14 Raw and calibration data
- Raw files are "rootified" on the global data concentrators, copied to CASTOR and then replicated to T1 tape
- Path naming: /alice/data/<yyyy>/<period>/<run%09>/raw/<yy><run%09><gdc_no>.<seq>.root (a path-building sketch follows below)
- All runs are copied, but not all are reconstructed; some runs are marked as bad
- The ALICE logbook contains metadata and important observations related to data taking for each run
- Primary calibration data are extracted from the online systems by the "Shuttle" program, which runs the detector algorithms (DA) and copies the calibration to the OCDB
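The <...%09> placeholders denote zero-padded fields. A sketch of composing such a path; all concrete values (run, period, GDC, sequence) are hypothetical, and the GDC field width is an assumption not stated on the slide.

    // Sketch: building a raw-file path from the naming scheme above.
    // All concrete values are hypothetical; the GDC field width (%03d)
    // is an assumption.
    #include <cstdio>

    int main()
    {
       int year = 2010, run = 139510, gdc = 12, seq = 100;
       char path[256];
       std::snprintf(path, sizeof(path),
                     "/alice/data/%04d/LHC10h/%09d/raw/%02d%09d%03d.%d.root",
                     year, run, year % 100, run, gdc, seq);
       std::printf("%s\n", path);
       return 0;
    }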

15 Calibration strategy: CPass0 + CPass1 (diagram)
Every raw-data chunk (chunk 0 ... chunk N) is processed by an AliEn job against an OCDB snapshot. CPass0: reconstruction -> ESDs -> calibration train -> merging + OCDB entries -> OCDB update. CPass1 then repeats the chain with the updated calibration, adding a QA train: reconstruction -> ESDs -> calibration train + QA train -> merging + OCDB entries -> OCDB update.

16 Reconstruction
- Input: raw data + OCDB; output: event summary data (ESD), copied to at least 2 locations (disk)
- Content: the ALICE event model in ROOT trees: events, tracks with extended PID info, vertices, cascades, V0's, ..., trigger information and physics selection masks, detector-specific information (clusters, tracklets)
- The ESD format is used as input by the filtering procedures and by analysis
- ESD files can always be reproduced using the latest software + calibration -> reconstruction passes
- Backward compatibility (old data always readable)