1 Data handling and processing at the LHC experiments, Astronomy and Bioinformatics
Farida Fassi (CC-IN2P3/CNRS), EPAM 2011, Taza, Morocco
2 The presentation will be LHC-centric, which is very relevant for the phase we are now in; less emphasis will be given to Astronomy and Bioinformatics. The scope is narrowed to the perspective of the physicists, discussing the issues that affect them directly.
Outline
- Motivation and requirements
- Data management: from the trigger up to offline, passing by the Conditions Database
- Reprocessing chain
- Distributed analysis: analysis data flow, end-user interface descriptions and monitoring
- Data handling and processing aspects in Astronomy
- Brief introduction to Bioinformatics and grid computing
3
4 The LHC: to find the Higgs boson and new physics beyond the Standard Model
Nominal working conditions:
- p-p beams: √s = 14 TeV; L = 10^34 cm^-2 s^-1; bunch crossing every 25 ns
- Pb-Pb beams: √s = 5.5 TeV; L = 10^27 cm^-2 s^-1
ALICE is dedicated to heavy-ion physics: the study of QCD under extreme conditions.
2010: √s = 7 TeV (first collisions on March 30th); peak L ~ 10^32 cm^-2 s^-1 (November); recorded luminosity = 43.17 pb^-1; first ion collisions recorded in November.
(Diagram: the PS and SPS injector chain feeding the LHC ring, with the ALICE, ATLAS, CMS and LHCb experiments.)
5 LHC Data Challenge
The LHC generates 40×10^6 collisions/s. Combined, the 4 experiments record:
- ~100 interesting collisions per second
- ~10 PB (10^16 B) per year (~10^10 recorded collisions/y)
The LHC data correspond to 20×10^6 DVDs/year, a space equivalent to 400,000 large PC disks, and a computing power of ~10^5 of today's PCs. Using parallelism and a hierarchical architecture is the only way to analyze this amount of data in a reasonable amount of time.
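The ~10 PB/year figure follows directly from the recorded event count and an assumed average event size; a minimal back-of-envelope sketch (the 1 MB/event average is an assumption for illustration, not a slide figure):

```python
# Back-of-envelope check of the LHC data-volume figure quoted above.
# EVENTS_PER_YEAR is the rough slide number; the event size is assumed.

EVENTS_PER_YEAR = 1.0e10    # ~10^10 recorded collisions per year
EVENT_SIZE_BYTES = 1.0e6    # ~1 MB per recorded event (assumed average)

def yearly_volume_pb(events=EVENTS_PER_YEAR, size=EVENT_SIZE_BYTES):
    """Recorded data volume per year, in petabytes."""
    return events * size / 1e15

print(yearly_volume_pb(), "PB/year")   # ~10 PB/year, as quoted above
```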
6 The way the LHC experiments use the Grid
Tier-0:
- Store RAW data; serve RAW data to Tier-1s
- Run first-pass calibration/alignment and first-pass reconstruction
- Distribute data to the Tier-1s
Tier-1s:
- Store RAW data (forever); re-reconstruction; serve a copy of RECO
- Archival of simulation; distribute data to the Tier-2s
Tier-2s:
- Primary resources for physics analysis and detector studies by users
- MC simulation, distributed to the Tier-1s
7 LHC Computing Model: the Grid interfaces and main elements
The LHC experiments' Grid tools interface to all middleware types and provide uniform access to the Grid environment:
- The VOMS (Virtual Organization Membership Service) database contains the privileges of all collaboration members; it is used to allow collaboration jobs to run on experiment resources and to store their output files on disk
- The Distributed Data Management system catalogues all collaboration data and manages the data transfers
- The Production system schedules all organized data processing and simulation activities
- The tool interfaces allow analysis job submission: jobs go to the sites holding the input data, and output data are stored locally or sent back to the submitting site
Such a complex system is very powerful, but presents challenges for ensuring quality: failures are expected and must be managed.
8
9 Requirements for the Reconstruction Software
LHC collisions will occur at 40 MHz, while the offline system can stream data to disk only at 150-300 Hz.
Offline operation workflows, trigger strategy: a trigger sequence in which, after the L1 (hardware-based) response reduces the event rate from 40 MHz to 100 kHz, the trigger and reconstruction code provides the further factor ~1000 reduction to 150-300 Hz.
Offline reconstruction must provide:
- Prompt feedback on detector status and data quality, and samples for physics analysis
- Up-to-date alignment and calibration, via calibration workflows with short latency
- Samples for calibration purposes
- Data validation and certification for analysis
- Data quality monitoring (DQM)
10 Before Tier-0
Data are organized into inclusive streams, based on trigger chains:
- ~200 Hz physics streams
- ~20 Hz express streams
- ~20 Hz calibration/monitoring streams
Several streams are designed to handle calibration and alignment data efficiently. Alignment and calibration payloads must be provided in a timely manner for the reconstruction chain to proceed.
Luminosity is only known per luminosity section; data are split across multiple streamer files within a luminosity section.
11 Trigger system
LEVEL 1 reduces the rate from 40 MHz to 100 kHz:
- Hardware-based, fast decision logic; uses only coarse reconstruction
- If the trigger decision is positive, an L1Accept is issued
The High Level Trigger (HLT) reduces the rate from 100 kHz to O(100 Hz):
- Uses full detector data (including tracker data); event processing with programs running in a computer farm
- Reconstructs μ, e/γ, jets, Et, etc.
- Subdivides processed data into data streams according to physics needs, calibration, alignment and data quality
Staged view:
- LVL1 trigger (<100 kHz): coarse-granularity data; calorimeter- and muon-based; identifies Regions of Interest
- High Level Triggers (software-based):
  - LVL2 trigger (~3 kHz): partial event reconstruction in the Regions of Interest; full-granularity data; trigger algorithms optimized for fast rejection
  - Event Filter (~200 Hz): full event reconstruction seeded by LVL2; trigger algorithms similar to offline
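The staged rate reduction above can be sketched as a cascade of filters, each keeping a fraction of its input rate; this is a toy illustration, with the acceptance fractions chosen purely to reproduce the quoted rates:

```python
# Toy sketch of the staged trigger cascade: 40 MHz -> 100 kHz -> ~3 kHz -> ~200 Hz.
# Acceptance fractions are hypothetical, derived from the quoted rates.

INPUT_RATE_HZ = 40_000_000
LEVELS = [
    ("LVL1 (hardware, coarse data)", 100_000 / 40_000_000),
    ("LVL2 (Regions of Interest)",     3_000 / 100_000),
    ("Event Filter (full event)",        200 / 3_000),
]

def cascade_rates(input_rate=INPUT_RATE_HZ, levels=LEVELS):
    """Output rate after each trigger level, as (name, rate_hz) pairs."""
    rates, rate = [], input_rate
    for name, acceptance in levels:
        rate *= acceptance
        rates.append((name, rate))
    return rates

for name, rate in cascade_rates():
    print(f"{name}: {rate:,.0f} Hz")
```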
12 What do we have to do with the data?
The first pass of data reconstruction is done at Tier-0; software and calibration constants are updated ~daily.
- Express stream: a subset of the physics data, used to check data quality and calculate calibration constants
- Calibration streams: partial events, used by specific subdetectors
- Physics streams: based on the trigger
Express processing: provides fully reconstructed events within about 1 hour, for monitoring and fast physics analysis.
Prompt processing: the first-pass reconstruction is performed on the RAW data; physics datasets can be held up to 48 h to allow the Prompt Calibration workflows to run and produce new conditions.
13 Distributed Database: Conditions DB
LHC data processing and analysis require access to large amounts of non-event data (detector conditions, calibrations, etc.) stored in relational databases. The Conditions DB is critical for data reconstruction at CERN, using alignment and calibration constants produced within 24 hours: the first-pass processing.
Conditions that need continuous updates:
- Beam-spot position, measured every 23 s
- Tracker problematic channels
Conditions that need monitoring:
- Calorimeter problematic channels: mask hot channels
- Tracker alignment: monitor movements of large structures
The LHC experiments use different technologies to replicate the Conditions DB to all Tier-1 sites via continuous real-time updates.
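The access pattern behind a conditions database can be sketched as an interval-of-validity (IOV) lookup: each payload is valid from its start time until the next payload's start time. The folder class and beam-spot numbers below are invented for illustration; real systems (e.g. COOL/Oracle) are far richer:

```python
import bisect

# Minimal sketch of an interval-of-validity (IOV) lookup, the access
# pattern behind a conditions DB. Payload values here are invented.

class ConditionsFolder:
    def __init__(self):
        self._since = []      # sorted IOV start times
        self._payloads = []

    def store(self, since, payload):
        i = bisect.bisect_left(self._since, since)
        self._since.insert(i, since)
        self._payloads.insert(i, payload)

    def lookup(self, time):
        """Return the payload valid at `time` (latest IOV starting <= time)."""
        i = bisect.bisect_right(self._since, time) - 1
        if i < 0:
            raise KeyError("no conditions valid at this time")
        return self._payloads[i]

beamspot = ConditionsFolder()
beamspot.store(0,    {"x": 0.10, "y": 0.05})   # hypothetical beam-spot values
beamspot.store(1000, {"x": 0.12, "y": 0.04})   # updated measurement

print(beamspot.lookup(500))    # first IOV applies
print(beamspot.lookup(2000))   # second IOV applies
```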
14 CERN Analysis Facility
The CERN Analysis Facility (CAF) farm is dedicated to the latency-critical activities of the LHC experiments, such as calibration and alignment, detector and/or trigger commissioning, or high-priority physics analysis. CAF access is restricted to users dedicated to these activities.
CAF-supported workflow: the first workflow being supported is the beam-spot determination. The beam spot is the luminous region produced by the collisions of the LHC proton beams; it needs to be measured precisely for a correct offline data reconstruction. The data source for the beam-spot workflow is the Tier-0.
15 ALICE and CMS data types
CMS hierarchy of data tiers:
- RAW data (~1.5 MB/event): as from the detector
- Full Event: contains RAW plus all the objects created by the reconstruction pass
- RECO (~500 kB/event): a subset of the Full Event, sufficient for reapplying calibrations after reprocessing (refitting, but not re-tracking)
- AOD (~100 kB/event): a subset of RECO, sufficient for the large majority of standard physics analyses; contains tracks, vertices, etc., and in general enough information to, for example, apply a different b-tagging; can contain very partial hit-level information
ALICE has broadly similar data types, content and formats to CMS.
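The per-event sizes above translate directly into dataset footprints; a small sketch for a hypothetical 10^9-event sample (the sample size is an assumption for illustration):

```python
# Turn the per-event sizes quoted above into dataset volumes:
# how much disk does a sample of N events cost in each data tier?

EVENT_SIZE_KB = {"RAW": 1500, "RECO": 500, "AOD": 100}  # slide figures

def sample_size_tb(n_events, tier, sizes=EVENT_SIZE_KB):
    """Disk footprint in TB of n_events stored in the given data tier."""
    return n_events * sizes[tier] * 1e3 / 1e12   # kB -> bytes -> TB

for tier in ("RAW", "RECO", "AOD"):               # hypothetical 10^9 events
    print(tier, sample_size_tb(1e9, tier), "TB")
```

The two-orders-of-magnitude gap between RAW and AOD is why most analyses run on AOD only.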
16 ATLAS data types
- RAW: event data from TDAQ
- ESD (Event Summary Data): output of reconstruction: calorimeter cells, track hits, vertices, particle ID, etc.
- AOD (Analysis Object Data): physics objects for analysis, such as e, μ, jets, etc.
- DPD (Derived Physics Data): equivalent of the old ntuples (format to be finalized)
- TAG: reduced set of information for event selection
RAW through ESD/AOD is collaboration production; DPD is a group/user activity.
17 LHCb data types
- RAW: distributed to the Tier-1s
- SDST: the RAW data is reconstructed (calorimeter energy clusters, particle ID, tracks, ...); at reconstruction, only enough information is stored to allow a physics pre-selection, the stripping, to run at a later stage
- DST: produced by the stripping and streaming
- μDST: group-level production
User physics analysis is performed on the stripped data; the output of the stripping is self-contained, i.e. there is no need to navigate through files. Analysis generates semi-private data: ntuples and/or personal DSTs.
18 Data Quality: Aims
Knowledge of the quality of the data underpins all particle-physics results: only good data can be used to produce valid physics results, and careful monitoring is necessary to understand data conditions and to diagnose and eliminate detector problems.
The Data Quality (DQ) system provides the means to:
- Allow experts and shifters to investigate data shortly after it is recorded, in accessible formats
- Derive calibrations and other necessary reconstruction parameters
- Mask or fix any detector issues found
- Rapidly provide a calibrated set of processed physics event streams
- Determine the data quality for each DQ region (~100 in total) and the suitability of any run for physics analysis, using a flag (good, bad, etc.)
- Record these flags and allow data analysis teams to conveniently make selections on combinations of them
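The last two points amount to per-run, per-region flag bookkeeping; a minimal sketch, with run numbers, region names and flag values invented for illustration:

```python
# Sketch of DQ-flag selection: each run carries one flag per DQ region,
# and an analysis keeps only runs where its required regions are 'good'.
# Run numbers and region names below are invented.

DQ_FLAGS = {
    152166: {"pixel": "good", "calo": "good", "muon": "good"},
    152214: {"pixel": "good", "calo": "bad",  "muon": "good"},
    152221: {"pixel": "good", "calo": "good", "muon": "bad"},
}

def good_runs(flags, required_regions):
    """Runs where every required DQ region is flagged 'good'."""
    return sorted(
        run for run, regions in flags.items()
        if all(regions.get(r) == "good" for r in required_regions)
    )

# An analysis needing only pixel and calorimeter quality:
print(good_runs(DQ_FLAGS, ["pixel", "calo"]))
```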
19 Data Reprocessing (1)
When the software and/or calibration constants improve, the collaborations need to organize data processing for the physics groups in the most efficient way. As the LHC experiments' computing resources are on the Grid, reprocessing is managed by the central production system; it needs dedicated effort to ensure high-quality results.
The reconstruction results are input to additional physics-specific treatment by the physics working groups. This step also requires massive data access and a lot of CPU, and in addition it often needs rapid software updates.
To reconstruct on the grid and produce and distribute the bulk outputs to the collaboration for analysis, the following are required:
20 Data Reprocessing (2)
- Efficient usage of the computing resources on the grid, which needs a stable and flexible production system
- Full integration with the Data Management system, allowing automated data delivery to the final destination
- Prevention of bottlenecks in large-scale data access to the conditions DB
- Exclusion of site-dependent failures, like unavailable resources
21 Monte Carlo (MC) Production
MC production is crucial for detector studies and physics analysis, mainly to identify backgrounds and evaluate acceptances and efficiencies. Event simulation and reconstruction are managed by the central production system. The production chain is:
- Generation: no input, small output (10 to 50 MB ntuples); pure CPU: a few minutes, up to a few hours if hard filtering is present
- Simulation (hits): GEANT4; small input; CPU- and memory-intensive: 24 to 48 hours; large output: ~500 MB (the smallest is ~100 kB!)
- Digitization: lower CPU/memory requirements: 5 to 10 hours; I/O-intensive: persistent reading of pile-up through the LAN; large output: similar to simulation
- Reconstruction: even less CPU: ~5 hours; smaller output: ~200 MB
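The chain above is strictly sequential: each step consumes the previous step's output. A toy sketch of that dependency structure, using the rough CPU and size figures from the slide:

```python
# Toy sketch of the MC production chain: each step needs the output of the
# previous one. CPU hours and output sizes are the rough slide figures.

STEPS = [
    # (name, needs_input, ~CPU hours, ~output MB)
    ("generation",     False, 0.5,  30),
    ("simulation",     True,  36,  500),
    ("digitization",   True,   7,  500),
    ("reconstruction", True,   5,  200),
]

def run_chain(steps=STEPS):
    """Walk the chain, checking each step has the input it needs."""
    produced, log = None, []
    for name, needs_input, cpu_h, out_mb in steps:
        if needs_input and produced is None:
            raise RuntimeError(f"{name} has no input")
        produced = f"{name}.out"          # stand-in for the real output file
        log.append((name, cpu_h, out_mb))
    return log

for name, cpu, size in run_chain():
    print(f"{name}: ~{cpu} CPU h, ~{size} MB out")
```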
22
23 Data Analysis and the LHC Analysis Flow
The full data-processing chain goes from reconstructed event data up to producing the final plots for publication. Data analysis is an iterative process:
- Reduce the data samples to more interesting subsets (selection)
- Compute higher-level information, redo some reconstruction, etc.
- Calculate statistical quantities
For the LHC experiments, data is generated at the experiments, then processed and arranged in geographically distributed Tiers (T1, T2, T3). The analysis processes, reduces, transforms and selects parts of the data iteratively, until it can fit on a single computer. How is this realized?
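The iterative reduction described above can be sketched as successive skims, each applying a tighter selection and adding derived quantities; the event content and cuts below are invented for illustration:

```python
import random

# Sketch of iterative data reduction: loose pre-selection, tighter cut,
# then derived quantities on the surviving sample. Event content invented.

random.seed(1)
events = [{"pt": random.uniform(0, 100), "n_jets": random.randint(0, 6)}
          for _ in range(100_000)]

def skim(sample, selection):
    """One reduction pass: keep only events passing the selection."""
    return [e for e in sample if selection(e)]

pass1 = skim(events, lambda e: e["n_jets"] >= 2)   # loose pre-selection
pass2 = skim(pass1,  lambda e: e["pt"] > 50)       # tighter cut
for e in pass2:                                    # higher-level quantity
    e["ht_proxy"] = e["pt"] * e["n_jets"]

print(len(events), "->", len(pass1), "->", len(pass2))
```

In the distributed case the early, data-heavy passes run at the sites holding the data, and only the final small sample travels to the user.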
24 From the user's point of view
The LHC experiments developed a number of experiment-specific middleware components on top of a small set of basic services (backends), e.g. DIRAC, PanDA, AliEn, Glide-In; these middleware layers allow the job to benefit from being run in the grid environment.
They also developed user-friendly and intelligent interfaces, e.g. CRAB and GANGA, to hide the complexity and provide transparent usage of the distributed system, allowing large-scale data processing on distributed resources (the Grid).
The stack: [LHC-experiment-specific] front-end interface; LHC-experiment-specific software; Grid middleware; basic services; computing & storage resources; user output.
25 LHC-experiment-specific frameworks
Specializations of the LHC experiments' frameworks and data models for data analysis, to process ESD/AOD: the CMS Physics Analysis Toolkit (PAT), the ATLAS Analysis Framework, LHCb's DaVinci/LoKi/Bender, the ALICE Analysis Framework. In some cases a subset of the framework libraries is selected.
They contain collaboration-approved analysis algorithms and tools. The user typically develops his or her own algorithm(s) based on these frameworks, but may also want to replace parts of the official release.
26 Distributed Data Analysis Flow
Distributed analysis complicates the life of the physicist: in addition to the analysis code, he or she has to worry about many other technical issues. The distributed analysis model is data-location driven: the user's analysis runs where the data are located.
- The user runs interactively on a small data sample while developing the analysis code
- The user then selects a large data sample to run the very same code on
- The user's analysis code is shipped to the site where the sample is located
- The results are made available to the user for the final plot production
The final analysis is performed locally, on a small cluster or a single computer.
27 Front-End Tools
pathena/GANGA (ATLAS), CRAB (CMS), and the corresponding ALICE tools.
The goal is to ensure users are able to efficiently access all available resources (local, batch, grid, etc.), with easy job management and application configuration.
28 Input Data
The user specifies what data to run the analysis on using the LHC experiments' specific dataset catalogs. The specification is based on a query, and the front-end interfaces provide functionality to facilitate the catalog queries.
Each experiment has developed event TAG mechanisms for sparse input-data selection. An important goal of TAGs is to enable massive stores of raw data to be kept in central locations that have sufficiently capable storage, processing and network infrastructure to handle them, while also permitting remote scientists to work with the data: TAG metadata is used to select smaller-scale, higher-quality subsets that can feasibly be downloaded and processed at locations with more modest resources.
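The TAG idea is to query a lightweight per-event metadata record first, and fetch only the matching events from the bulk store; a minimal sketch, with the TAG fields, event numbers and file names invented for illustration:

```python
# Sketch of TAG-based sparse selection: query small per-event metadata,
# then fetch only the files containing matching events. Data invented.

TAGS = [
    {"event": 1, "n_muons": 2, "missing_et": 12.0, "file": "raw_001"},
    {"event": 2, "n_muons": 0, "missing_et": 80.0, "file": "raw_001"},
    {"event": 3, "n_muons": 2, "missing_et": 95.0, "file": "raw_002"},
]

def select(tags, predicate):
    """Run the query against the lightweight TAG store, not the bulk data."""
    return [t for t in tags if predicate(t)]

hits = select(TAGS, lambda t: t["n_muons"] >= 2 and t["missing_et"] > 50)
files_to_fetch = sorted({t["file"] for t in hits})

print([t["event"] for t in hits])   # matching events
print(files_to_fetch)               # only these files need downloading
```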
29
30 Monitoring system
Web monitoring is a crucial feature both for users and administrators; the LHC experiments developed powerful and flexible monitoring systems.
Activities:
- Follow specific analysis jobs and tasks
- Identify and investigate inefficiencies and failures
- Commission sites and services
- Identify trends and predict future requirements
Targets: data transfers; job and task processing; site and service availability.
31 Task Monitoring
The Dashboard generates a wide selection of plots.
32 Positive impact of monitoring on infrastructure quality
The Dashboard generates weekly reports with monitoring metrics related to data analysis on the Grid; the LHC experiments take action in order to improve the success rate of user analysis jobs. Job outcomes are classified into: successes; application failures (user configuration errors, remote stage-out issues); a few % of failures reading data at the site; Grid failures.
33
34 Astronomy with high-energy particles Astronomy aims to answer the following questions: What is the Universe made of? What are the properties of neutrinos? What is their role in cosmic evolution? What do neutrinos tell us about the interior of Sun and Earth, and about Supernova explosions? What's the origin of high energy cosmic rays? What's the sky view at extreme energies? Can we detect gravitational waves? What will they tell us about violent cosmic processes and basic physics laws?
35 Astronomy with high-energy particles
Astrophysics:
- Sources: stars (evolution), galaxies, clusters, CMBR
- Messengers: electromagnetic (radio, IR, VIS-UV, X-ray)
- Datasets: image-based
- Detectors: optical/radio telescopes
Astroparticle physics:
- Sources: supernova remnants, GRBs, AGNs, dark-matter annihilations, ...
- Messengers: elementary particles (γ, ν, p, e)
- Datasets: event-based
- Detectors: particle telescopes
36 Astroparticle Data Flow
Signals from the detectors are digitized and packaged into events, which must then undergo processing to reconstruct the physical meaning of the event. Typically: fast acquisition, a lot of storage needed, RAW + calibration data, post-processing of a selection of events (event by event). The typical steps in an experiment are:
1. Register the passage of a particle in a detector element
2. Digitize the signals
3. Trigger on interesting signals
4. Read out the detector elements and build them into an event, written to disk/tape
5. Perform higher-level triggering/filtering on events, perhaps long after they are recorded
6. Reconstruct the particle hypotheses, usually via non-linear fits
7. Statistical analysis of the extracted observations
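Step 6 can be illustrated with the simplest possible case: fitting a straight-line track hypothesis to noisy detector hits (real reconstruction uses non-linear fits; the layer geometry, true parameters and noise level below are invented):

```python
import random

# Minimal illustration of track reconstruction: fit a straight line to
# noisy hits across detector layers, then read off the track parameters.
# Geometry, true values and noise are invented for this sketch.

random.seed(42)
TRUE_SLOPE, TRUE_OFFSET = 0.5, 2.0
layers = list(range(10))     # hypothetical detector-layer positions
hits = [TRUE_OFFSET + TRUE_SLOPE * z + random.gauss(0, 0.1) for z in layers]

def fit_line(xs, ys):
    """Least-squares straight-line fit: returns (slope, offset)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

slope, offset = fit_line(layers, hits)
print(round(slope, 2), round(offset, 2))   # close to the true 0.5 and 2.0
```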
37 Astronomy and Grid computing
Astronomy experiments produce petabytes of data and have challenging goals for efficient access to these data. Data reduction and analysis require a lot of computing resources, and the data must be distributed to all collaborators across Europe. Benefits: user access to shared resources and standardized analysis tools; better and easier data management.
Many astronomy experiments have adopted the Grid as a computing model and adapted the applications needed to extract a final result, such as: simulation; data processing and reconstruction; data transfer; storage; data analysis.
38 Bioinformatics and the Grid
Typical activities: formal representation of biological knowledge; maintenance of biological databases; simulations (molecular dynamics, biochemical pathways).
One of the major challenges for the bioinformatics community is to provide the means for biologists to analyse the sequences produced by the complete genome-sequencing projects. Grid technology is an opportunity to normalize access for an integrated exploitation: it allows software, servers and information systems to be presented through homogeneous means.
39 Bioinformatics and the Grid
Gridification of the bio applications means:
- Allowing the distribution of large datasets over different sites, avoiding single points of failure or bottlenecks
- Enforcing the use of common standards for data exchange, making exchanges between sites easier
- Enlarging the datasets available for large-scale studies by breaking the barriers between remote sites
In addition:
- Allowing a distributed community to share its computational resources, so that a small laboratory can proceed with large-scale experiments if needed
- Opening new application fields that were not even thinkable without a common grid infrastructure
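The first point, distributing a large sequence dataset over independent jobs, can be sketched very simply; the sequences, the per-sequence analysis (GC content) and the job count below are invented for illustration:

```python
# Sketch of distributing sequence analysis over independent grid jobs:
# split the dataset, run the same per-sequence analysis in each chunk.
# Sequences and job count are invented for this illustration.

def chunk(sequences, n_jobs):
    """Distribute sequences round-robin over n_jobs independent jobs."""
    jobs = [[] for _ in range(n_jobs)]
    for i, seq in enumerate(sequences):
        jobs[i % n_jobs].append(seq)
    return jobs

def gc_content(seq):
    """A typical per-sequence analysis: fraction of G/C bases."""
    return sum(base in "GC" for base in seq) / len(seq)

sequences = ["ATGCGC", "TTTTAA", "GGGCCC", "ATATAT", "CGCGAT"]
jobs = chunk(sequences, 2)
results = [[gc_content(s) for s in job] for job in jobs]  # one list per job

print([len(j) for j in jobs])   # sequences assigned to each job
```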
40 Summary
The LHC provides access to conditions not seen since the early Universe, and the analysis of LHC data has the potential to change how we view the world; this entails substantial computing and sociological challenges. The LHC will generate data on a scale not seen anywhere before, and the LHC experiments will depend critically on parallel solutions to analyse their enormous amounts of data. A lot of sophisticated data-management tools have been developed. Many scientific applications benefit from powerful grid computing to share the resources used to obtain a scientific result.
41
42 Major Differences
Both GANGA and the ALICE tools provide an interactive shell to configure and automate analysis jobs (Python, CINT); in addition, GANGA provides a GUI. CRAB has a thin client: most of the work (automation, recovery, monitoring, etc.) is done in a server, whereas in the other cases this functionality is delegated to the VO-specific WMS. GANGA offers a convenient overview of all user jobs (the job repository), enabling automation. Both CRAB and GANGA are able to pack local user libraries and the environment automatically, making use of their knowledge of the configuration tool; for ALICE, the user provides .par files with the sources.