FINDING THE HIGGS IN THE HAYSTACK(S) Stephen J. Gowdy (CERN), 12th September 2012, XLDB Conference
Overview Large Hadron Collider (LHC) Compact Muon Solenoid (CMS) experiment The Challenge Worldwide LHC Computing Grid (WLCG) Data Organisation Analysis Techniques Databases Future Trends 2
Large Hadron Collider: a hadron is a composite particle made of quarks 3
Big machine characteristics 17-mile circular tunnel, 100m underground, straddling the French-Swiss border. Protons currently travel at 99.9999964% of the speed of light; each proton enters Switzerland (CH) over 11,000 times per second. Will not reach design beam energy until 2014. Interactions potentially every 25ns (40MHz); each interaction has multiple collisions, called pileup, currently around 30 collisions per event 4
Accelerator Complex Older machines feed newer machines: protons start in LINAC2, then go to the PS via the BOOSTER; from the PS they are injected into the SPS, then into the LHC at 450GeV and accelerated to 4TeV. New fills are needed roughly once per day 5
[Aerial photo: CMS, the LHC and SPS rings, and the CERN Main Site] 6
Compact Muon Solenoid: a muon is a (comparatively) long-lived big brother to the electron 7
Particle Identification 101 9
Trigger Architecture [Level-1 trigger diagram: trigger-tower matching in ECAL and HCAL (E_T), electron isolation and jet detection; muon track segments in endcap and barrel (|eta| coverage roughly < 2.1, 0.8 to 2.4, and < 1.2 for the different muon systems); sorting; E_T^miss and E_T^total; best 4 candidates (p_T, eta, phi, quality) for endcap and barrel; final decision and partitioning; interface to TTC and TTS (Trigger Throttling System)] 11
Data Rates RAW (i.e. unprocessed) data is about ~1MB/event. Potential detector acquisition rate: 1MB * 40MHz = 40TB/s (the full data volume would be even larger, but not all detectors can be read out at 40MHz). The hardware trigger decision allows a 100kHz rate, looking at individual detectors to make a fast choice; data rate up to 100GB/s. The High Level Trigger is done on a filter farm; output rate is nominally 300Hz ~= 300MB/s 12
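As a back-of-the-envelope check of these figures, a minimal Python sketch (the 1MB event size and the three rates are just the nominal numbers quoted above):

```python
# Back-of-the-envelope data rates for the trigger chain (nominal figures above).
EVENT_SIZE_BYTES = 1e6  # ~1MB per RAW event

rates_hz = {
    "bunch crossings (no trigger)": 40e6,   # 40MHz
    "hardware (Level-1) trigger":   100e3,  # 100kHz accept rate
    "High Level Trigger output":    300,    # ~300Hz to storage
}

for stage, rate in rates_hz.items():
    gb_per_s = rate * EVENT_SIZE_BYTES / 1e9
    print(f"{stage:30s} {gb_per_s:10.1f} GB/s")
```

This reproduces the 40TB/s, 100GB/s and ~300MB/s quoted on the slide.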
The Challenge: why it isn't easy 13
A Higgs event 14
A Haystack: 40 reconstructed vertices, high pile-up run, 25th October 2011 15
Haystacks So that was one event; the 2012 average is 30 collisions per event. By the end of 2012 we will have almost 7 billion events recorded, after the reduction from 40MHz to O(300Hz), and that doesn't include simulated data. We are looking for about half a million Higgs particles, assuming the predicted cross sections are correct, and many are much, much harder to find than 4 muons 16
Worldwide LHC Computing Grid (WLCG): like an electric grid that supplies computing power 17
Tiered System Tier-0 at CERN: data gets sorted and receives its first-pass reconstruction. Tier-1 centres (CMS has seven large regional facilities): provide custodial tape storage and large-scale re-reconstruction. Tier-2 centres (frequently universities or groups of universities): simulation and end-user analysis 18
Schematic [Diagram: CMS Detector -> Filter Farm -> CERN Tier-0 -> Tier-1s (Fermilab, IN2P3, ASGC, KIT, CNAF, ...) -> Tier-2s (Florida, UCSD, UCLA, ...) -> Tier-3 (MyLaptop)] 19
Traffic on a CERN Holiday LHCOPN (Optical Private Network) CMS is green 20
Resources
CPU (kHS06): Tier-0 121 (21%), Tier-1 137 (23%), Tier-2 324 (56%); total 582 kHS06 ~= 150 kSI2k
Disk (TB): Tier-0 4,800 (9%), Tier-1 21,000 (40%), Tier-2 27,000 (51%); total ~51,800 TB
Tape (TB): Tier-0 23,000 (33%), Tier-1 47,000 (67%); total ~90,000 TB 21
Data Organisation: lining up the bytes in a consumable order 22
Data Tiers Streamer files are written to disk by the filter farm, then read and reorganised into Primary Datasets (PD) based on trigger selections (physics motivation); the output is the custodial RAW data. Reconstruction runs on the RAW PDs and outputs RECO and AOD (Analysis Object Data). Simulation also produces similar data tiers, plus truth information 23
Data Ordering ROOT is used as the persistency framework. Depending on the expected reading pattern, the ordering of data in the files is adjusted. RAW & RECO are expected to be read a whole event at a time, so the ordering in the file is by event (events 1, 2, 3, ..., n). For AOD only a subset of the data may be read, e.g. passing frequently over a single variable to make plots, so data is grouped by attribute (attribute 1 for events 1..n, attribute 4 for events 1..n, and so on) 24
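A minimal sketch of the idea in plain Python (this is not the actual ROOT/CMS file format, just an illustration of why attribute grouping helps when an analysis repeatedly reads a single variable):

```python
# Toy illustration of event-ordered vs attribute-ordered layouts.
events = [{"pt": 10.0 + i, "eta": 0.1 * i, "phi": 0.01 * i} for i in range(5)]

# RAW/RECO-style layout: one record per event, all attributes interleaved.
event_ordered = [(e["pt"], e["eta"], e["phi"]) for e in events]

# AOD-style layout for fast plotting: one contiguous block per attribute.
attribute_ordered = {
    "pt":  [e["pt"]  for e in events],
    "eta": [e["eta"] for e in events],
    "phi": [e["phi"] for e in events],
}

# A pt plot from the attribute-ordered layout touches a single block...
pts = attribute_ordered["pt"]

# ...whereas the event-ordered layout forces a pass over every full record.
pts_from_events = [record[0] for record in event_ordered]

assert pts == pts_from_events
```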
Skims Train-model-style event selection: various analyses include their event selection and run together over the data. Selection is done using the reco output, which is more detailed and accurate than the trigger info, so it can cut a lot harder. The first skims are done at Tier-1 on the Tier-0 output, called PromptSkims as they are started ASAP. Currently 81 datasets are written out from the Tier-0 output 25
Datasets Files are collected into datasets, which should be processed together; this actually uses a database (Oracle). Each dataset has provenance attached to it and can be superseded by a reprocessing. An end-user tool queries the database and creates jobs to process a dataset, typically across all the Tier-2s hosting it 26
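A hypothetical sketch of that end-user workflow (the function names, dataset name and chunk size below are made up for illustration, not the real CMS tools or schema):

```python
# Query a dataset catalogue for the files in a dataset, then split them
# into jobs that can be sent to the Tier-2s hosting the dataset.
FILES_PER_JOB = 10

def query_files(dataset_name):
    # Stand-in for the catalogue query; returns the dataset's file names.
    return [f"{dataset_name}/file_{i:04d}.root" for i in range(42)]

def make_jobs(dataset_name):
    files = query_files(dataset_name)
    return [files[i:i + FILES_PER_JOB] for i in range(0, len(files), FILES_PER_JOB)]

jobs = make_jobs("/SomePrimaryDataset/Run2012X/AOD")
print(f"{len(jobs)} jobs, {len(jobs[0])} files in the first job")
```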
Analysis Techniques: narrowing the haystacks 27
Discriminating Variables Each analysis will find the variables that enhance its signal-to-noise ratio. A high-energy muon is an easy one, i.e. something going really fast doesn't bend so much in the magnetic field. May end up losing a lot of signal to reduce the background by a larger factor; optimise S/√B or S/√(S+B). [Plot: pseudo-data, signal and background vs momentum of muon (GeV)] 28
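A toy example of that trade-off, as a Python sketch: scan a momentum cut and pick the threshold that maximises S/√(S+B) (the signal and background yields are invented numbers, purely to illustrate the figure of merit):

```python
import math

# (cut in GeV, expected signal above cut, expected background above cut)
scan = [
    (10, 100.0, 10000.0),
    (20,  95.0,  4000.0),
    (30,  80.0,  1000.0),
    (40,  60.0,   200.0),
    (50,  35.0,    40.0),
]

def significance(s, b):
    return s / math.sqrt(s + b)

best_cut, best_sig = max(((cut, significance(s, b)) for cut, s, b in scan),
                         key=lambda pair: pair[1])
print(f"best cut: p > {best_cut} GeV, S/sqrt(S+B) = {best_sig:.2f}")
```

Here the best cut keeps only 35 of the 100 signal events, but the background drops by a much larger factor, so the significance still improves.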
Multivariate Analysis Many different types: simple rectangular cuts (multiple 1-d cuts); maximum likelihood approaches (combine the probability of all input variables); Fisher discriminants (input variables are projected to another space to avoid correlations); neural networks. Most of these methods rely on training; some packages can apply many methods 29
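As an illustration of one of these methods, a minimal Fisher discriminant in plain numpy (this is not TMVA; the two-variable toy samples are generated on the spot):

```python
import numpy as np

rng = np.random.default_rng(0)
signal     = rng.multivariate_normal([2.0, 1.0], [[1.0, 0.5], [0.5, 1.0]], 1000)
background = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], 1000)

# Fisher direction: w = S_W^-1 (mu_signal - mu_background),
# with S_W the summed within-class covariance of the input variables.
mu_s, mu_b = signal.mean(axis=0), background.mean(axis=0)
s_w = np.cov(signal, rowvar=False) + np.cov(background, rowvar=False)
w = np.linalg.solve(s_w, mu_s - mu_b)

# One number per event; the analysis then cuts on this single discriminant
# instead of on the correlated input variables separately.
disc_sig, disc_bkg = signal @ w, background @ w
print(f"mean discriminant: signal {disc_sig.mean():.2f}, background {disc_bkg.mean():.2f}")
```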
TMVA (Toolkit for MVA in ROOT) 30
New Boson Plot H -> ZZ -> llll Use five angles and two masses as discriminators 31
Databases: not XLDBs though 32
Conditions Database Largest database use (not in size, ~300GB). Provides calibration, geometry and alignment information. Used by all running jobs, which can be more than 100k jobs worldwide. A network of squid caches is used: database queries are transformed into HTTP requests using home-grown technology (Frontier). Works because the data is written once, read many times 33
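A hypothetical sketch of the Frontier idea in Python (the proxy and server URLs and the parameter names are made up; only the pattern of turning a read-only query into a cacheable HTTP GET is the point):

```python
import urllib.parse
import urllib.request

SQUID_PROXY     = "http://squid.example-site.org:3128"      # local site cache
FRONTIER_SERVER = "http://conditions.example.org/Frontier"  # central server

def fetch_conditions(query):
    # Encode the read-only query into the URL, so identical queries from
    # different jobs produce identical URLs and can be served by the squid.
    url = FRONTIER_SERVER + "?" + urllib.parse.urlencode({"query": query})
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": SQUID_PROXY}))
    with opener.open(url) as response:
        return response.read()

# Every job at a site asking for the same calibration issues the same GET,
# so only the first request travels all the way to the central database.
```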
Squids Aggregate: 500k requests/min, 500MB/s. Offline servers: 4k requests/min, 0.5MB/s 34
Other Databases PhEDEx: manages file transfers, single Oracle instance at CERN. DBS (Dataset Bookkeeping System): contains meta-data about datasets and files; main instance in Oracle at CERN, user instances available elsewhere with MySQL. Job tracking databases use both Oracle and MySQL; a recent system archives information in CouchDB 35
Reading Rate [plot; labels: 6TB/day, 250TB/day] 36
Future Trends: need to wear shades 37
Federated Storage Aiming towards an architecture where all storage is visible globally. [Diagram: a user application at Site C opens /store/foo; the site does not host it, so the query goes up through the US regional redirector to the global redirector, which queries the EU regional redirector; the file is found at an EU site (Sites A-D shown) and the open is redirected there] 38
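A toy model of that redirection hierarchy (plain Python, not the real protocol or site names; it just shows a site falling back to regional and global redirectors to locate a file):

```python
# Which sites in each region host which files.
SITES = {
    "US": {"SiteC": set()},
    "EU": {"SiteA": {"/store/foo"}, "SiteB": set()},
}

def find_in_region(region, path):
    # Regional redirector: ask every site in the region.
    for site, files in SITES[region].items():
        if path in files:
            return site
    return None

def global_redirect(path, asking_region):
    # Global redirector: try every other region in turn.
    for region in SITES:
        if region != asking_region:
            site = find_in_region(region, path)
            if site is not None:
                return region, site
    return None

# A job at a US site opens /store/foo, which is only hosted in the EU.
hit = find_in_region("US", "/store/foo") or global_redirect("/store/foo", "US")
print(hit)  # -> ('EU', 'SiteA')
```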
Clouds: for a rainy day Helix Nebula: European initiative to provide a unified system; shows the importance of standards. Proof of concept demonstrated on Amazon. Costs still prohibitively expensive, estimated at an order of magnitude more than running our own data centres, which remains more cost-effective. May be interesting for adding short-term capacity 39
Clouds: internal cloud CERN is moving to an agile infrastructure and commissioning a new data centre in Hungary. The filter farm will run as a cloud during the LHC shutdown, using OpenStack across 15k cores; this allows flexibility for redeployment, as the farm is also needed for detector work 40
Summary Database technology is used in various roles; the whole size is around 10TB, not huge. Our Big Data: 20PB of RAW data. CMS uses a worldwide computing infrastructure to deliver physics results. We've found a needle, now we need to figure out what kind it is: http://lanl.arxiv.org/abs/1207.7235 41
XLDB Europe 2013 @ CERN CERN will be happy to host a European Satellite XLDB. Planned date: 25+26 June 2013, during the LHC long shutdown, which will also allow discussions of LHC data management issues to be included. We invite everyone to help reach out to places in Europe with challenging XLDB-related issues; please contact dirk.duellmann@cern.ch and becla@slac.stanford.edu 42