Data handling and processing at the LHC experiments

Transcription:

1 Data handling and processing at the LHC experiments, Astronomy and Bioinformatics. Farida Fassi, CC-IN2P3/CNRS. EPAM 2011, Taza, Morocco

2 The presentation is LHC-centric, which is very relevant for the current phase we are in; less emphasis is given to Astronomy and Bioinformatics. The scope is narrowed to the perspective of the physicists, discussing the issues that affect them directly. Outline: motivation and requirements; data management, from the trigger up to offline processing, passing by the Conditions Database; the reprocessing chain; distributed analysis: analysis data flow, end-user interface descriptions and monitoring; data handling and processing aspects in Astronomy; a brief introduction to Bioinformatics and grid computing.

4 The LHC: to find the Higgs boson and new physics beyond the Standard Model. Nominal working conditions: p-p beams at √s = 14 TeV, L = 10^34 cm^-2 s^-1, bunch crossings every 25 ns; Pb-Pb beams at √s = 5.5 TeV, L = 10^27 cm^-2 s^-1. ALICE is dedicated to heavy-ion physics: the study of QCD under extreme conditions. In 2010: √s = 7 TeV (first collisions on March 30th), peak L ~ 10^32 cm^-2 s^-1 (November), recorded luminosity = 43.17 pb^-1; first ion collisions recorded in November. [Diagram: the LHC ring with the ALICE, ATLAS, CMS and LHCb experiments and the SPS and PS injectors.]

5 The LHC Data Challenge. The LHC generates 40·10^6 collisions/s; combined, the 4 experiments record ~100 interesting collisions per second, i.e. ~10 PB (10^16 B) per year (~10^10 recorded collisions/year). LHC data correspond to ~20·10^6 DVDs per year, a space equivalent to 400,000 large PC disks, and a computing power of ~10^5 of today's PCs. Using parallelism and a hierarchical architecture is the only way to analyse this amount of data in a reasonable amount of time.
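
A rough check of the rates quoted on this slide. The collision and recording rates are the slide's figures; the per-event size and yearly live time are illustrative assumptions, not official numbers.

```python
# Rough arithmetic behind the rates quoted on this slide. The collision and
# recording rates are from the slide; the event size and live time per year
# are illustrative assumptions.

collision_rate_hz = 40e6      # LHC collision rate (40 MHz)
recorded_rate_hz = 100        # interesting collisions kept per second (slide figure)
rejection_factor = collision_rate_hz / recorded_rate_hz
print(f"online selection must reject ~1 in {rejection_factor:,.0f} collisions")

event_size_bytes = 1e6        # assumed ~1 MB per recorded raw event
live_seconds_per_year = 1e7   # assumed LHC live time per year
raw_bytes = recorded_rate_hz * event_size_bytes * live_seconds_per_year
print(f"raw data volume: ~{raw_bytes / 1e15:.0f} PB/year before derived formats and copies")
```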

6 The way the LHC experiments use the Grid. Tier-0: store RAW data; serve RAW data to the Tier-1s; run first-pass calibration/alignment; run first-pass reconstruction; distribute data to the Tier-1s. Tier-1s: store RAW data (forever); re-reconstruction; serve a copy of RECO; archive simulation; distribute data to the Tier-2s. Tier-2s: the primary resources for physics analysis and detector studies by users; MC simulation, with output distributed back to the Tier-1s.
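
A schematic summary of the tiered responsibilities just listed, as a plain data structure. The mapping is illustrative; the exact roles differ slightly between experiments.

```python
# Schematic summary of the tier roles on this slide (illustrative only).
TIER_ROLES = {
    "Tier-0": [
        "store RAW data",
        "serve RAW data to Tier-1s",
        "first-pass calibration/alignment",
        "first-pass reconstruction",
        "distribute data to Tier-1s",
    ],
    "Tier-1": [
        "custodial storage of RAW data",
        "re-reconstruction with improved software and conditions",
        "serve a copy of RECO",
        "archive simulated data",
        "distribute data to Tier-2s",
    ],
    "Tier-2": [
        "user physics analysis and detector studies",
        "Monte Carlo production (output shipped to Tier-1s)",
    ],
}

for tier, roles in TIER_ROLES.items():
    print(f"{tier}: " + "; ".join(roles))
```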

7 LHC Computing Model: the Grid interfaces and main elements. The LHC experiments' Grid tools interface to all middleware types and provide uniform access to the Grid environment. The VOMS (Virtual Organization Membership Service) database contains the privileges of all collaboration members; it is used to allow collaboration jobs to run on experiment resources and to store their output files on disk. The Distributed Data Management system catalogues all collaboration data and manages the data transfers. The Production system schedules all organized data processing and simulation activities. The tool interfaces also allow analysis job submission: jobs go to the sites holding the input data, and the output is stored locally or sent back to the submitting site. Such a complex system is very powerful but presents challenges for ensuring quality: failures are expected and must be managed.
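
A minimal sketch of how an analysis job passes through the services named on this slide (authorisation, data discovery, brokering, failure handling). All class and function names here are hypothetical stand-ins; the real systems (VOMS, the experiment DDM and production systems) expose much richer interfaces.

```python
# Hypothetical sketch of an analysis-job submission path through the Grid
# services named on this slide. The voms_db, data_catalog and broker objects
# are placeholders, not real service clients.

def submit_analysis_job(user, dataset, voms_db, data_catalog, broker):
    # 1. Authorisation: the user must hold a valid VO membership/role.
    if not voms_db.is_authorised(user):
        raise PermissionError(f"{user} is not a registered collaboration member")

    # 2. Data discovery: the DDM catalogue says which sites host the dataset.
    sites = data_catalog.replicas(dataset)
    if not sites:
        raise LookupError(f"no replicas found for {dataset}")

    # 3. Brokering: send the job where the data already are; failures are
    #    expected, so unsuccessful sites are simply tried in turn.
    for site in sites:
        try:
            return broker.run(user=user, dataset=dataset, site=site)
        except RuntimeError:
            continue  # site problem: fall back to the next replica
    raise RuntimeError("job failed at all sites holding the data")
```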

9 Requirements for the reconstruction software. LHC collisions occur at 40 MHz, while data can only be streamed to disk for offline processing at 150-300 Hz. Offline operation workflows. Trigger strategy: a trigger sequence in which, after a Level-1 (hardware-based) decision reduces the rate from 40 MHz to 100 kHz, the offline reconstruction code runs in the high-level trigger to provide the remaining factor ~1000 reduction to 150-300 Hz. The offline reconstruction must provide both prompt feedback on detector status and data quality and the samples for physics analysis, using up-to-date alignment and calibration. Calibration workflows with short latency provide samples for calibration purposes, while data validation and certification for analysis rely on data quality monitoring (DQM).

10 Before the Tier-0. Data are organized into inclusive streams based on trigger chains: ~200 Hz physics streams, ~20 Hz express stream, ~20 Hz calibration/monitoring streams. Several streams are designed to handle calibration and alignment data efficiently; alignment and calibration payloads must be provided in a timely manner for the reconstruction chain to proceed. The luminosity is only known per luminosity section, and the data of a luminosity section are split across multiple streamer files.

11 Trigger system. Level-1 reduces the rate from 40 MHz to 100 kHz; it is hardware-based, with fast decision logic, and uses only coarse reconstruction. If the trigger decision is positive, an L1-Accept is issued. The High Level Trigger (HLT) reduces the rate from 100 kHz to O(100 Hz); it uses the full detector data (including tracker data), with event processing done by programs running in a computer farm. It reconstructs muons, electrons/photons, jets, missing ET, etc., and subdivides the processed data into streams according to physics needs, calibration, alignment and data quality. Typical chain: LVL1 trigger (<100 kHz), coarse-granularity data, calorimeter- and muon-based, identifies Regions of Interest; software-based LVL2 trigger (~3 kHz), partial event reconstruction in the Regions of Interest on full-granularity data, with algorithms optimized for fast rejection; Event Filter (~200 Hz), full event reconstruction seeded by LVL2, with algorithms similar to offline.
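
A toy illustration of the staged rate reduction just described. The rates are the nominal figures quoted on the slide; no real trigger logic is implemented, only the bookkeeping of the cascade.

```python
# Staged rate reduction of a typical three-level trigger, using the nominal
# rates quoted on this slide.

TRIGGER_STAGES = [
    # (stage, output rate in Hz)
    ("LVL1 (hardware, coarse data)",         100_000),
    ("LVL2 (software, Regions of Interest)",   3_000),
    ("Event Filter (full reconstruction)",       200),
]

input_rate = 40_000_000  # 40 MHz collision rate
for name, output_rate in TRIGGER_STAGES:
    print(f"{name:40s} {input_rate:>12,d} Hz -> {output_rate:>9,d} Hz "
          f"(rejects {1 - output_rate / input_rate:.3%})")
    input_rate = output_rate

print(f"overall reduction: factor {40_000_000 // 200:,d}")
```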

12 What do we have to do with the data? The first pass of data reconstruction is done at the Tier-0; software and calibration constants are updated roughly daily. Express stream: a subset of the physics data, used to check the data quality and to calculate calibration constants. Calibration streams: partial events, used by specific subdetectors. Physics streams: based on the trigger. Express processing provides fully reconstructed events within about 1 hour, for monitoring and fast physics analysis. Prompt processing: the first-pass reconstruction is performed on the RAW data; physics datasets can be held for up to 48 hours to allow the prompt-calibration workflows to run and produce new conditions.

13 Distributed database: the Conditions DB. LHC data processing and analysis require access to large amounts of non-event data (detector conditions, calibrations, etc.) stored in relational databases. The Conditions DB is critical for data reconstruction at CERN, which uses alignment and calibration constants produced within 24 hours: the first-pass processing. Conditions which need continuous updates: the beam-spot position, measured every 23 s; tracker problematic channels. Conditions which need monitoring: calorimeter problematic channels (mask hot channels); tracker alignment (monitor movements of large structures). The LHC experiments use different technologies to replicate the Conditions DB to all Tier-1 sites via continuous, real-time updates.
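
A minimal illustration of the interval-of-validity (IOV) idea behind a conditions database: each payload (here a beam-spot position) is valid from a given run onwards, and reconstruction looks up the payload covering the event being processed. The payload values below are invented for illustration.

```python
# IOV lookup sketch: payloads are keyed by their start of validity; the entry
# covering a given run is the last one starting at or before it.

import bisect

BEAMSPOT_IOVS = [
    (1000, {"x": 0.05, "y": -0.02, "z": 1.2}),
    (1500, {"x": 0.06, "y": -0.02, "z": 0.9}),
    (2300, {"x": 0.04, "y": -0.01, "z": 1.5}),
]
_STARTS = [start for start, _ in BEAMSPOT_IOVS]

def beamspot_for_run(run):
    """Return the beam-spot payload whose validity interval covers `run`."""
    index = bisect.bisect_right(_STARTS, run) - 1
    if index < 0:
        raise LookupError(f"no conditions available for run {run}")
    return BEAMSPOT_IOVS[index][1]

print(beamspot_for_run(1700))   # -> the payload valid from run 1500
```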

14 CERN Analysis Facility. The CERN Analysis Facility (CAF) farm is dedicated to the LHC experiments' latency-critical activities, such as calibration and alignment, detector and/or trigger commissioning, or high-priority physics analysis. CAF access is restricted to users dedicated to these activities. CAF-supported workflow: the first workflow being supported is the beam-spot determination. The beam spot is the luminous region produced by the collisions of the LHC proton beams; it needs to be measured precisely for a correct offline data reconstruction. The Tier-0 is the data source for the beam-spot workflow.

15 ALICE and CMS data types. The CMS hierarchy of data tiers: RAW data (~1.5 MB/event), as delivered by the detector; the Full Event contains RAW plus all the objects created by the reconstruction pass. RECO (~500 kB/event) contains a subset of the Full Event, sufficient for reapplying calibrations after reprocessing (refitting, but not re-tracking). AOD (~100 kB/event) is a subset of RECO, sufficient for the large majority of standard physics analyses: it contains tracks, vertices, etc., and in general enough information to (for example) apply a different b-tagging; it can contain very partial hit-level information. ALICE has a broadly similar set of data types, content and formats to CMS.
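
A quick estimate of what these data tiers occupy for a given number of events, using the per-event sizes quoted on the slide; the event count is an assumed round number, not an official figure.

```python
# Storage footprint per data tier, using the per-event sizes on this slide.

EVENT_SIZE_KB = {"RAW": 1500, "RECO": 500, "AOD": 100}

def storage_tb(n_events, tier):
    """Storage in TB for n_events of the given data tier."""
    return n_events * EVENT_SIZE_KB[tier] * 1e3 / 1e12   # kB -> bytes -> TB

n_events = 1_000_000_000   # assumed 10^9 recorded events
for tier in EVENT_SIZE_KB:
    print(f"{tier:4s}: {storage_tb(n_events, tier):8.1f} TB")
# The ~15x reduction from RAW to AOD is what makes iterative user analysis feasible.
```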

16 ATLAS data types. RAW: event data from the TDAQ system. ESD (Event Summary Data): the output of reconstruction (calorimeter cells, track hits, vertices, particle ID, etc.). AOD (Analysis Object Data): physics objects for analysis, such as electrons, muons, jets, etc. DPD (Derived Physics Data): the equivalent of the old ntuples (format to be finalized). TAG: a reduced set of information for event selection. [Diagram: collaboration production covers RAW through ESD/AOD; DPDs are produced by group/user activity.]

17 LHCb data types. Processing chain: distribution of RAW data to the Tier-1s; the RAW data are reconstructed (SDST); stripping and streaming (DST); group-level production (µDST). The reconstruction produces calorimeter energy clusters, particle ID, tracks, etc. At reconstruction, only enough information is stored to allow a physics pre-selection to run at a later stage (the stripping): this is the SDST. User physics analysis is performed on the stripped data; the output of the stripping is self-contained, i.e. there is no need to navigate through files. Analysis generates semi-private data: ntuples and/or personal DSTs.

18 Data Quality: aims. Knowledge of the quality of the data underpins all particle physics results: only good data can be used to produce valid physics results, and careful monitoring is necessary to understand the data-taking conditions and to diagnose and eliminate detector problems. The Data Quality (DQ) system provides the means to: allow experts and shifters to investigate data shortly after they are recorded, in accessible formats; derive calibrations and other necessary reconstruction parameters; mask or fix any detector issues found; rapidly provide a calibrated set of processed physics event streams; determine the data quality for each DQ region (~100 in total) and the suitability of any run for physics analysis, using a flag (good, bad, etc.); and record these flags so that data analysis teams can conveniently make selections on combinations of them, as sketched below.
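
A sketch of flag-based run selection: each run carries one quality flag per DQ region, and an analysis keeps a run only if all the regions it relies on are flagged good. The region names, run numbers and flags below are invented for illustration.

```python
# Good-run selection from per-region data-quality flags (illustrative data).

DQ_FLAGS = {
    # run number -> {DQ region: flag}
    152166: {"tracker": "good", "calorimeter": "good", "muon": "good"},
    152214: {"tracker": "good", "calorimeter": "bad",  "muon": "good"},
    152345: {"tracker": "good", "calorimeter": "good", "muon": "unchecked"},
}

def good_runs(required_regions, flags=DQ_FLAGS):
    """Runs in which every required DQ region is flagged 'good'."""
    return sorted(
        run for run, regions in flags.items()
        if all(regions.get(r) == "good" for r in required_regions)
    )

# A dimuon analysis needs tracker and muon quality; it ignores the calorimeter.
print(good_runs(["tracker", "muon"]))        # -> [152166, 152214]
# A jet analysis also needs the calorimeter.
print(good_runs(["tracker", "calorimeter"])) # -> [152166, 152345]
```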

19 Data Reprocessing (1). When the software and/or the calibration constants improve, the collaborations need to organize data reprocessing for the physics groups in the most efficient way. As the LHC experiments' computing resources are on the Grid, reprocessing is managed by the central production system and needs dedicated effort to ensure high-quality results. The reconstruction results are input to additional physics-specific treatment by the physics working groups; this step also requires massive data access and a lot of CPU, and in addition it often needs a rapid software update. Reconstructing on the Grid and producing and distributing the bulk outputs to the collaboration for analysis requires:

20 Data Reprocessing (2). Efficient usage of the computing resources on the Grid, which needs a stable and flexible production system. Full integration with the Data Management system, allowing automated data delivery to the final destination. Preventing bottlenecks in large-scale access to the conditions DB. Excluding site-dependent failures, like unavailable resources.

21 Monte Carlo (MC) Production. MC production is crucial for detector studies and physics analysis; it is mainly used for identifying backgrounds and for evaluating acceptances and efficiencies. Event simulation and reconstruction are managed by the central production system. The production chain is (see the sketch below): Generation: no input, small output (10 to 50 MB ntuples), pure CPU, a few minutes up to a few hours if hard filtering is present. Simulation (hits): GEANT4, small input, CPU- and memory-intensive (24 to 48 hours), large output (~500 MB; the smallest is ~100 kB). Digitization: lower CPU/memory requirements (5 to 10 hours), I/O-intensive (persistent reading of pile-up through the LAN), large output (similar to simulation). Reconstruction: even less CPU (~5 hours), smaller output (~200 MB).
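
A schematic description of this production chain as data, useful for rough capacity estimates. The per-step CPU times and output sizes are mid-range values picked from the ranges quoted on the slide, so they are indicative rather than exact.

```python
# MC production chain with indicative per-job CPU times and output sizes
# (mid-range values taken from the slide's quoted ranges).

MC_CHAIN = [
    # (step, typical CPU hours per job, typical output in MB)
    ("Generation",      1,  30),
    ("Simulation",     36, 500),
    ("Digitization",    8, 500),
    ("Reconstruction",  5, 200),
]

total_cpu_h = sum(cpu for _, cpu, _ in MC_CHAIN)
total_out_mb = sum(out for _, _, out in MC_CHAIN)
print(f"one full chain: ~{total_cpu_h} CPU hours, ~{total_out_mb} MB of intermediate+final output")

n_jobs = 10_000   # size of a hypothetical production campaign
print(f"{n_jobs:,} jobs: ~{n_jobs * total_cpu_h / 24 / 365:.0f} CPU-years, "
      f"~{n_jobs * total_out_mb / 1e6:.1f} TB of output")
```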

23 Data Analysis and the LHC Analysis Flow. This is the full data processing chain from reconstructed event data up to producing the final plots for publication. Data analysis is an iterative process: reduce the data samples to more interesting subsets (selection); compute higher-level information, redo some reconstruction, etc.; calculate statistical quantities. For the LHC experiments, the data are generated at the experiments, then processed and arranged in geographically distributed Tiers (T1, T2, T3). The analysis processes, reduces, transforms and selects parts of the data iteratively until the result fits on a single computer. How is this realized?
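
A toy picture of this iterative reduction: each pass applies a tighter selection and adds higher-level quantities, so the sample shrinks until it fits on a single machine. The events here are plain dictionaries with invented fields, standing in for a distributed AOD sample.

```python
# Iterative reduction of a toy event sample (all fields and cuts are invented).

import random

random.seed(1)
events = [{"n_muons": random.randint(0, 3), "pt_leading": random.uniform(5, 120)}
          for _ in range(100_000)]            # stand-in for a distributed AOD sample

# Pass 1 (on the Grid): keep events with at least two muons.
skim = [e for e in events if e["n_muons"] >= 2]

# Pass 2 (Grid or local cluster): harder kinematic selection, plus a derived
# quantity that the final plots will use.
selected = [{**e, "category": "high_pt" if e["pt_leading"] > 60 else "low_pt"}
            for e in skim if e["pt_leading"] > 20]

# Pass 3 (laptop-sized): make the final numbers and plots.
print(len(events), "->", len(skim), "->", len(selected), "events")
```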

24 From the user's point of view. The LHC experiments have developed a number of experiment-specific middleware layers on top of a small set of basic services (back-ends), e.g. DIRAC, PanDA, AliEn, Glide-In; these layers allow the job to benefit from being run in the Grid environment. They have also developed user-friendly, intelligent interfaces, e.g. CRAB and GANGA, that hide the complexity and provide transparent usage of the distributed system, allowing large-scale data processing on distributed resources (the Grid). The layered architecture is: front-end interface; LHC-experiment-specific software; Grid middleware and basic services; computing and storage resources; with the output returned to the user.

25 LHC-experiment-specific frameworks. The experiments' frameworks and data models are specialized for data analysis, to process ESD/AOD: the CMS Physics Analysis Toolkit (PAT), the ATLAS analysis framework, LHCb DaVinci/LoKi/Bender, the ALICE analysis framework. In some cases this means selecting a subset of the framework libraries together with the collaboration-approved analysis algorithms and tools. A user typically develops his or her own algorithm(s) based on these frameworks, but is also willing to replace parts of the official release.

26 Distributed Data Analysis Flow. Distributed analysis complicates the life of the physicist: in addition to the analysis code, he or she has to worry about many other technical issues. The distributed analysis model is data-location driven: the user's analysis runs where the data are located. The user runs interactively on a small data sample while developing the analysis code, then selects a large data sample to run the very same code; the user's analysis code is shipped to the site where the sample is located, and the results are made available to the user for the final plot production. The final analysis is performed locally, on a small cluster or a single computer.

27 Front-end tools: pathena/GANGA for ATLAS, CRAB for CMS, and the equivalent ALICE tools. The goal is to ensure that users are able to efficiently access all available resources (local, batch, Grid, etc.), with easy job management and application configuration.
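
A minimal sketch of the kind of configuration a GANGA user types at the GANGA prompt (it is not standalone Python): a Job object bundles the application, the input data and the backend. The attribute names follow GANGA's standard Job/application/backend pattern, but the application and dataset classes actually available depend on the experiment plugins installed, so treat the details below as indicative.

```python
# GANGA-style job configuration sketch (to be run inside a GANGA session).
j = Job(name="my_analysis")
j.application = Executable(exe="run_analysis.sh", args=["selection.cfg"])
j.inputdata   = None                 # experiment plugins provide dataset types here
j.backend     = Local()              # the same job can go to a batch or Grid backend
j.submit()

# The point of the front-end: switching from a local test to a Grid run is a
# one-line change of backend, while the application configuration stays the same.
# j.backend = LCG()
```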

28 Input data. The user specifies which data to run the analysis on using the experiment-specific dataset catalogues; the specification is based on a query, and the front-end interfaces provide functionality to facilitate the catalogue queries. Each experiment has developed event TAG mechanisms for sparse input-data selection. An important goal of the TAGs is to enable the storage of massive stores of raw data in central locations that have sufficiently capable storage, processing and network infrastructure to handle them, while also permitting remote scientists to work with the data: the TAG metadata are used to select smaller, higher-value samples that can feasibly be downloaded and processed at locations with more modest resources.
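
An illustration of TAG-based sparse selection: a TAG record holds a handful of per-event summary quantities plus a pointer back to the full event, so a query over the TAGs identifies the few events worth fetching from the large central store. The field names and values below are invented for illustration.

```python
# TAG-metadata selection sketch (illustrative records only).

TAGS = [
    {"run": 152166, "event": 101, "n_muons": 2, "met_gev": 12.0, "file": "raw_0001.root"},
    {"run": 152166, "event": 102, "n_muons": 0, "met_gev": 85.0, "file": "raw_0001.root"},
    {"run": 152214, "event":  17, "n_muons": 2, "met_gev": 60.0, "file": "raw_0412.root"},
]

def select(tags, predicate):
    """Return the back-pointers of the events whose TAG passes the predicate."""
    return [(t["file"], t["run"], t["event"]) for t in tags if predicate(t)]

# Only the events matching the query need to be retrieved from the central store.
wanted = select(TAGS, lambda t: t["n_muons"] >= 2 and t["met_gev"] > 30)
print(wanted)   # -> [('raw_0412.root', 152214, 17)]
```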

30 Monitoring system. Web-based monitoring is a crucial feature both for users and for administrators, and the LHC experiments have developed powerful and flexible monitoring systems. Activities: follow specific analysis jobs and tasks; identify and investigate inefficiencies and failures; commission sites and services; identify trends and predict future requirements. Targets: data transfers, job and task processing, site and service availability.

31 Task monitoring: the Dashboard generates a wide selection of plots.

32 Positive impact of monitoring on infrastructure quality. The Dashboard generates weekly reports with monitoring metrics related to data analysis on the Grid, and the LHC experiments take action in order to improve the success rate of user analysis jobs. Job outcomes are classified as: successes, application failures, user configuration errors, remote stage-out issues, a few per cent of failures reading data at the site, and Grid failures.
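
A sketch of the kind of weekly summary the slide refers to: jobs are binned into the outcome categories listed above and a success rate is derived. The counts are invented for illustration.

```python
# Weekly job-outcome summary sketch (invented counts).

from collections import Counter

WEEKLY_OUTCOMES = Counter({
    "success": 91_300,
    "application failure": 4_200,
    "user configuration error": 2_100,
    "remote stage-out issue": 1_300,
    "failure reading data at site": 600,
    "grid failure": 500,
})

total = sum(WEEKLY_OUTCOMES.values())
print(f"success rate: {WEEKLY_OUTCOMES['success'] / total:.1%}")
for outcome, count in WEEKLY_OUTCOMES.most_common():
    print(f"{outcome:30s} {count:7,d}  ({count / total:5.1%})")
```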

34 Astronomy with high-energy particles. It aims to answer the following questions: What is the Universe made of? What are the properties of neutrinos, and what is their role in cosmic evolution? What do neutrinos tell us about the interior of the Sun and the Earth, and about supernova explosions? What is the origin of high-energy cosmic rays? What is the sky view at extreme energies? Can we detect gravitational waves, and what will they tell us about violent cosmic processes and basic physics laws?

35 Astronomy with high-energy particles: astrophysics vs. astroparticle physics. Astrophysics: the sources are stars (evolution), galaxies, clusters, the CMBR; the messengers are electromagnetic (radio, IR, VIS-UV, X-ray); the datasets are image-based; the detectors are optical/radio telescopes. Astroparticle physics: the sources are supernova remnants, GRBs, AGNs, dark matter annihilations; the messengers are elementary particles (γ, ν, p, e); the datasets are event-based; the detectors are particle telescopes.

36 Astroparticle data flow. Signals from the detectors are digitized and packaged into events, which then must undergo processing to reconstruct the physical meaning of each event. Typically: fast acquisition, a lot of storage needed, RAW plus calibration data, and post-processing of a selection of events (event by event). The typical steps in an experiment are: 1. register the passage of a particle in a detector element; 2. digitize the signals; 3. trigger on interesting signals; 4. read out the detector elements and build an event, written to disk/tape; 5. perform higher-level triggering/filtering on events, perhaps long after they are recorded; 6. reconstruct the particle hypotheses, usually via non-linear fits; 7. statistical analysis of the extracted observations. A small sketch of such a pipeline follows.
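
A toy event pipeline following the numbered steps above. Each stage is a deliberately trivial stand-in (real experiments use dedicated DAQ hardware and reconstruction software); the point is only the shape of the data flow.

```python
# Toy acquisition -> trigger -> reconstruction -> statistics pipeline.

import random

random.seed(2)

def readout():                       # steps 1-2: detector signals, digitized
    return [random.gauss(0, 1) + (5 if random.random() < 0.01 else 0)
            for _ in range(16)]      # 16 channels, occasional large pulse

def trigger(signals, threshold=4.0): # step 3: keep only interesting events
    return max(signals) > threshold

def reconstruct(signals):            # step 6: crude "fit" -> one physics quantity
    return {"amplitude": max(signals), "channel": signals.index(max(signals))}

events = (readout() for _ in range(10_000))          # steps 1-2
triggered = [s for s in events if trigger(s)]        # steps 3-5 (built + filtered)
reconstructed = [reconstruct(s) for s in triggered]  # step 6
mean_amp = sum(e["amplitude"] for e in reconstructed) / len(reconstructed)
print(f"{len(reconstructed)} events kept, mean amplitude {mean_amp:.2f}")  # step 7
```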

37 Astronomy and Grid computing. Astronomy experiments produce petabytes of data and have challenging goals for efficient access to these data: data reduction and analysis require a lot of computing resources, the data must be distributed to all collaborators across Europe, users need access to shared resources and standardized analysis tools, and better and easier data management is needed. Many astronomy experiments have therefore adopted the Grid as their computing model and ported the applications needed to extract a final result, such as: simulation; data processing and reconstruction; data transfer; storage; data analysis.

38 Bioinformatics and the Grid. Topics include the formal representation of biological knowledge, the maintenance of biological databases, and simulations (molecular dynamics, biochemical pathways). One of the major challenges for the bioinformatics community is to provide the means for biologists to analyse the sequences delivered by the complete genome sequencing projects. Grid technology is an opportunity to normalize access for an integrated exploitation: it allows software, servers and information systems to be presented through homogeneous means.

39 Bioinformatics and the Grid. Gridification of the bio applications means: allowing the distribution of large datasets over different sites, avoiding single points of failure or bottlenecks; enforcing the use of common standards for data exchange, making exchanges between sites easier; enlarging the datasets available for large-scale studies by breaking the barriers between remote sites. In addition: allowing a distributed community to share its computational resources, so that a small laboratory can carry out large-scale experiments if needed; opening new application fields that were not even thinkable without a common Grid infrastructure.

40 Summary. The LHC provides access to conditions not seen since the early Universe, and the analysis of LHC data has the potential to change how we view the world; this brings substantial computing and sociological challenges. The LHC generates data on a scale not seen anywhere before, and the LHC experiments critically depend on parallel solutions to analyse their enormous amounts of data. A lot of sophisticated data management tools have been developed, and many scientific applications benefit from powerful Grid computing to share the resources used to obtain a scientific result.

42 Major differences between the front-ends. Both Ganga and the ALICE tools provide an interactive shell to configure and automate analysis jobs (Python and CINT, respectively); in addition, Ganga provides a GUI. CRAB has a thin client: most of the work (automation, recovery, monitoring, etc.) is done in a server, whereas in the other cases this functionality is delegated to the VO-specific workload management system. Ganga offers a convenient overview of all user jobs (the job repository), enabling automation. Both CRAB and Ganga are able to pack local user libraries and the environment automatically, making use of their knowledge of the experiments' configuration tools; for ALICE the user provides .par files with the sources.