Data Handling for LHC: Plans and Reality. Tony Cass, Leader, Database Services Group, Information Technology Department. 11th July 2012
Outline: HEP, CERN, LHC and LHC Experiments; LHC Computing Challenge; The Technique (in outline, in more detail); Towards the Future; Summary
Familiar, but not Fundamental. (Periodic Table courtesy of Wikipedia.)
The Standard Model. Fundamental and well tested, but: Why do particles have mass? Why is there no antimatter? Are these the only particles? A 4th generation? (LEP showed there are just three light neutrino generations.) Do fermions have bosonic partners and vice versa? How does gravity fit in?
Other interesting questions. How do quarks and gluons behave at ultra-high temperatures and densities? What is dark matter? Supersymmetric particles?
How to find the answers? Smash things together! (Images courtesy of HyperPhysics.)
CERN Methodology. The fastest racetrack on the planet: trillions of protons race around the 27 km ring in opposite directions over 11,000 times a second, travelling at 99.999999991 per cent of the speed of light.
Energy of a 1 TeV Proton
Energy of 7 TeV Beams. Two nominal beams together can melt ~1,000 kg of copper. Current beams: ~100 kg of copper.
CERN Methodology. The emptiest space in the solar system: to accelerate protons to almost the speed of light requires a vacuum as empty as interplanetary space. There is 10 times more atmosphere on the moon than there will be in the LHC.
CERN Methodology. One of the coldest places in the universe: with an operating temperature of about -271 degrees Celsius, just 1.9 degrees above absolute zero, the LHC is colder than outer space.
CERN Methodology. The hottest spots in the galaxy: when two beams of protons collide, they generate temperatures 1000 million times hotter than the heart of the sun, but in a minuscule space.
CERN Methodology. The biggest, most sophisticated detectors ever built: to sample and record the debris from up to 600 million proton collisions per second, scientists have built gargantuan devices that measure particles with micron precision.
Compact Detectors!
Outline: HEP, CERN, LHC and LHC Experiments; LHC Computing Challenge; The Technique (in outline, in more detail); Towards the Future; Summary
We are looking for rare events! Number of events = luminosity × cross section. 2010 luminosity: 45 pb⁻¹; total cross section ~70 billion pb → ~3 trillion events* (*N.B. only a very small fraction saved! ~250x more events to date). Higgs (m_H = 120 GeV): 17 pb → ~750 events; e.g. potentially ~1 Higgs in every 300 billion interactions! (Plot: Emily Nurse, ATLAS.)
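The event counts on this slide follow directly from N = L·σ; a quick check, with the values taken from the slide (unit handling only, no new physics input):

```python
# Expected event counts: N = integrated luminosity x cross section.
# Values from the slide: 2010 luminosity ~45 pb^-1, total cross
# section ~70 billion pb, Higgs (m_H = 120 GeV) cross section 17 pb.
luminosity = 45.0        # integrated luminosity, pb^-1
total_xsec = 70e9        # total cross section, pb
higgs_xsec = 17.0        # Higgs cross section, pb

total_events = luminosity * total_xsec   # ~3.15e12: the "~3 trillion"
higgs_events = luminosity * higgs_xsec   # ~765: the "~750 events"

print(f"total interactions: {total_events:.2e}")
print(f"Higgs events: {higgs_events:.0f}")
```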
So the four LHC Experiments. ATLAS: general purpose; origin of mass; supersymmetry; 2,000 scientists from 34 countries. CMS: general purpose; origin of mass; supersymmetry; 1,800 scientists from over 150 institutes. ALICE: heavy-ion collisions, to create quark-gluon plasmas; 50,000 particles in each collision. LHCb: to study the differences between matter and antimatter; will detect over 100 million b and b-bar mesons each year.
So the four LHC Experiments
generate lots of data. The accelerator generates 40 million particle collisions (events) every second at the centre of each of the four experiments' detectors.
generate lots of data. These are reduced by online computers to a few hundred good events per second (e.g. an ATLAS Z→μμ event from 2012 data with 25 reconstructed vertices), which are recorded on disk and magnetic tape at 100-1,000 MB/s: ~15 PB per year for all four experiments. [Chart: CASTOR data written, 01/01/2010 to 29/06/2012, in PB, by experiment: ALICE, AMS, ATLAS, CMS, COMPASS, LHCb, NA48, NA61, NTOF and user data.] Current forecast: ~23-25 PB/year and 100-120M files/year, i.e. ~20-25K 1 TB tapes/year; the archive will need to store 0.1 EB in 2014, ~1 billion files in 2015.
Outline: HEP, CERN, LHC and LHC Experiments; LHC Computing Challenge; The Technique (in outline, in more detail); Towards the Future; Summary
What is the technique? Break up a massive data set
What is the technique? into lots of small pieces and distribute them around the world
What is the technique? analyse in parallel
What is the technique? gather the results
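The four steps above are a scatter-gather pattern. A minimal single-machine sketch, with local worker threads standing in for grid sites and an invented selection "cut" standing in for real physics analysis:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for the split / distribute / analyse / gather pattern.
# On WLCG the "pieces" are file-based datasets shipped to grid sites;
# here they are just strided slices of a range handed to local threads.

def analyse(piece):
    # Stand-in for per-event selection: count "events" passing a cut.
    return sum(1 for event in piece if event % 7 == 0)

dataset = range(1_000_000)
# 1. Break up a massive data set into lots of small pieces ...
pieces = [dataset[i::8] for i in range(8)]
# 2./3. ... distribute them and analyse in parallel ...
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(analyse, pieces))
# 4. ... gather the results.
total = sum(partial)
print(total)
```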
What is the technique? …and discover the Higgs boson! A nice result, but is it novel?
Is it Novel? Maybe not novel as such, but the implementation is: terascale computing that is widely appreciated!
Outline: HEP, CERN, LHC and LHC Experiments; LHC Computing Challenge; The Technique (in outline, in more detail); Towards the Future; Summary
Requirements! Computing Challenges. Summary of computing resource requirements, all experiments, 2008 (LCG TDR, June 2005): ~100,000 PCs, with 15 PB/year to tape and an O(100 PB) disk cache.
CPU (MSPECint2000s*): CERN 25, all Tier-1s 56, all Tier-2s 61, total 142.
Disk (PB): CERN 7, all Tier-1s 31, all Tier-2s 19, total 57.
Tape (PB): CERN 18, all Tier-1s 35, total 53.
(*4,000 HS06 = 1 MSPECint2000.) A problem, and a solution: worldwide collaboration with the Tier-1s.
Timely Technology: The Grid! The WLCG project was deployed to meet LHC computing needs; the EDG and EGEE projects organised development in Europe (OSG and others in the US).
Grid Middleware Basics. Compute Element: standard interface to local workload management systems (batch scheduler). Storage Element: standard interface to local mass storage systems. Resource Broker: a tool to analyse user job requests (input data sets, CPU time, data output requirements) and route them to sites according to data and CPU-time availability. Many implementations of the basic principles: Globus, VDT, EDG/EGEE, NorduGrid, OSG.
Job Scheduling in Practice. Issue: grid sites generally want to maintain a high average CPU utilisation, which is easiest if there is a local queue of work to select from when another job ends. But users are interested in turnaround times as well as job throughput, and turnaround suffers if jobs are held centrally until a processing slot is known to be free at a target site. Solution: pilot-job frameworks. Per-experiment code submits a job which chooses a work unit to run from a per-experiment queue when it is allocated an execution slot at a site. Pilot-job frameworks separate site responsibility for allocating CPU resources from experiment responsibility for allocating priority between different research sub-groups.
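The pilot-job idea can be sketched in a few lines. The task names and the local queue below are illustrative stand-ins for a real framework's central task queue (e.g. PanDA or DIRAC), not their actual APIs:

```python
import queue

# Toy pilot-job loop. In a real framework the pilot lands on a worker
# node via the site batch system, then pulls payloads from a central
# per-experiment task queue; here the "queue" is local and in-memory.

task_queue = queue.Queue()
for task in ["reco_run_183347", "analysis_user_42", "mc_prod_0007"]:
    task_queue.put(task)   # hypothetical work-unit names

def run_pilot(max_tasks=10):
    """Occupy one execution slot and drain work units from the queue."""
    done = []
    while len(done) < max_tasks:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            break          # nothing left to do: release the slot
        done.append(task)  # stand-in for actually executing the payload
    return done

executed = run_pilot()
print(executed)
```

The point of the design shows up in the two responsibilities: the site only sees one batch job (the pilot), while the ordering of the queue, i.e. priority between research sub-groups, stays entirely on the experiment side.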
Data Issues: reception and long-term storage; delivery for processing and export; distribution; metadata distribution. [Diagram: data flows of 700 MB/s, 700 MB/s, 420 MB/s, 1430 MB/s and 2600 MB/s (with 3600 MB/s and >4000 MB/s peaks).] This is scheduled work only, and we need the ability to support 2x for recovery!
(Mass) Storage Systems. After evaluation of commercial alternatives in the late 1990s, two tape-capable mass storage systems have been developed for HEP: CASTOR, an integrated mass storage system, and dCache, a disk pool manager that interfaces to multiple tape archives (Enstore at FNAL, IBM's TSM). dCache is also used as a basic disk storage manager at Tier-2s, along with the simpler DPM.
A Word About Tape. Our data set may be massive, but it is made up of many small files: ~195 MB on average, and only increasing slowly after LHC startup [chart: CERN archive file-size distribution, in %]. This is bad for tape speeds [chart: drive write performance vs file size, CASTOR tape format (ANSI AUL), IBM and SUN drives]: the average write drive speed is < 40 MB/s (cf. native drive speeds of 120-160 MB/s), with only small increases with new drive generations.
Tape Drive Efficiency. So we have to change the tape-writing policy: from CASTOR's present 3 synchronising tape marks per file, to 1 per file, to 1 per 4 GB [charts: drive write performance with buffered vs non-buffered tape marks, and average drive performance (MB/s) for CERN archive files under the three policies].
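Why the tape-mark policy matters can be seen with a toy throughput model: each synchronising tape mark stops the drive, so small files pay a large fixed cost. The 3 s cost per sync and the 160 MB/s native speed below are illustrative assumptions, not measured CASTOR figures:

```python
# Toy model of tape write efficiency: effective speed is file size
# divided by streaming time plus a fixed penalty per synchronising
# tape mark. Both constants are illustrative assumptions.

NATIVE_MBPS = 160.0   # assumed native drive speed, MB/s
SYNC_S = 3.0          # assumed cost of one synchronising tape mark, s

def effective_speed(file_mb, syncs_per_file):
    stream_s = file_mb / NATIVE_MBPS
    return file_mb / (stream_s + syncs_per_file * SYNC_S)

avg_file = 195.0  # MB, the CERN archive average from the slide
for policy, syncs in [("3 sync/file", 3), ("1 sync/file", 1),
                      ("1 sync / 4GB", avg_file / 4096)]:
    print(f"{policy:>12}: {effective_speed(avg_file, syncs):6.1f} MB/s")
```

With these assumptions the average-size file writes at well under 40 MB/s under the 3-syncs-per-file policy but approaches native speed with one sync per 4 GB, which is the shape of the curves on the slide.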
Users aren't the only writers! Bulk data storage requires space. Fortunately, tape capacity will continue to double every 2-3 years (35 and 50 TB tape demonstrations in 2010), and CERN has ~50K slots: ~0.25 EB with new T10KC cartridges. Unfortunately, you have to copy data from old cartridges to new ones or you run out of slots, and data rates for repack will soon exceed LHC rates: 55 PB in 2012 = 1.7 GB/s sustained; 120 PB in 2015 = 3.8 GB/s sustained (cf. pp LHC rates of ~0.7 GB/s and a PbPb peak rate of 2.5 GB/s). Repacking in one year needs ~55 drives @ 35 MB/s, ~28 drives @ 63 MB/s or ~18 drives @ 104 MB/s [chart: time to migrate 55 PB (2012) in drive-days, by file size (<10K up to >2G) and tape-mark policy (3 TM/file, 1 TM/file, 1 TM/4 GB); small files (<500M) dominate]. And: all LEP data fits on ~150 cartridges, or 30 new T10KCs. Automatic data duplication becomes a necessity.
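The repack rates and drive counts quoted above are straightforward arithmetic:

```python
# Repack sizing: sustained rate and drive count needed to copy the
# whole archive in one year. Figures from the slide: 55 PB (2012),
# 120 PB (2015), per-drive speeds of 63 and 104 MB/s.

def sustained_gbps(petabytes, days=365):
    """Average rate (GB/s) needed to move `petabytes` in `days`."""
    return petabytes * 1e6 / (days * 86400)   # PB -> GB, days -> s

def drives_needed(petabytes, drive_mbps, days=365):
    """Number of drives at `drive_mbps` to sustain that rate."""
    return sustained_gbps(petabytes, days) * 1000 / drive_mbps

print(f"2012: {sustained_gbps(55):.1f} GB/s")    # ~1.7 GB/s
print(f"2015: {sustained_gbps(120):.1f} GB/s")   # ~3.8 GB/s
print(f"drives @ 63 MB/s:  {drives_needed(55, 63):.0f}")
print(f"drives @ 104 MB/s: {drives_needed(55, 104):.0f}")
```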
Media Verification. Data in the archive cannot just be written and forgotten about. Q: can you retrieve my file? A: let me check… er, sorry, we lost it. Proactive and regular verification of archive data is required: ensure cartridges can be mounted; ensure data can be read and verified against metadata (checksum, size, …); do not wait until media migration to detect problems. Scan opportunistically when resources are available.
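A minimal sketch of that verification step: read each file back and compare its size and checksum (such as adler32) against the catalogue metadata. The in-memory "catalogue" and "tape" dictionaries here are illustrative stand-ins:

```python
import zlib

# Sketch of proactive media verification: compare what can be read
# back against the catalogue's recorded size and adler32 checksum.
catalogue = {
    "file_a": {"size": 5, "adler32": zlib.adler32(b"hello")},
    "file_b": {"size": 5, "adler32": 0xDEADBEEF},  # deliberately wrong
}
tape_contents = {"file_a": b"hello", "file_b": b"world"}

def verify(name):
    data = tape_contents[name]   # stand-in for a real tape read
    meta = catalogue[name]
    return (len(data) == meta["size"]
            and zlib.adler32(data) == meta["adler32"])

bad = [name for name in catalogue if not verify(name)]
print(bad)  # files needing repair or re-replication
```

Detecting `file_b` now, during an opportunistic scan, is the whole point: the alternative is discovering the loss only when a physicist (or a media migration) asks for the file.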
Storage vs Recall Efficiency. Efficient data acceptance: have lots of input streams spread across a number of storage servers, wait until the storage servers are ~full, and write the data from each storage server to tape. Result: data recorded at the same time is scattered over many tapes. How is the data read back? Generally, files are requested grouped by time of creation. How to optimise for this? Group files onto a small number of tapes. Oops…
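The recall-friendly alternative, grouping files onto tapes by creation time, can be sketched as follows. The dates, file names and two-files-per-tape capacity are illustrative:

```python
# Sketch of recall-friendly tape placement: instead of flushing many
# parallel streams (which scatters one day's data across many tapes),
# sort by creation time so a later recall of a time window touches
# only a few tapes.

files = [("2012-07-01", "f1"), ("2012-07-01", "f2"),
         ("2012-07-02", "f3"), ("2012-07-02", "f4"),
         ("2012-07-02", "f5")]

def assign_tapes(files, per_tape=2):
    """Fill tapes in order from a creation-time-sorted file list."""
    tapes, current = [], []
    for _, name in sorted(files):
        current.append(name)
        if len(current) == per_tape:
            tapes.append(current)
            current = []
    if current:
        tapes.append(current)
    return tapes

print(assign_tapes(files))
# files created together land on the same or adjacent tapes
```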
Keep users away from tape
CASTOR & EOS
Data Access Realism. Mass storage systems work well for recording, export and retrieval of production data. Good: this is what they were designed for! But some features of the CASTOR system developed at CERN are unused or ill-adapted: experiments want to manage data availability themselves; file sizes, file-placement policies and access patterns interact badly (alleviated by experiment management of data transfer between tape and disk); analysis use favours low latency over guaranteed data rates (aggravated by experiment management of data: automated replication of busy datasets is disabled). But we should not be too surprised: storage systems were designed many years before analysis patterns were understood (if they even are today…).
Data Distribution. The LHC experiments need to distribute millions of files between the different sites. The File Transfer Service (FTS) automates this: handling failures of the underlying distribution technology (gridftp); ensuring effective use of the bandwidth with multiple streams; and managing bandwidth use, ensuring that ATLAS, say, is guaranteed 50% of the available bandwidth between two sites if there is data to transfer.
Data Distribution. FTS uses the Storage Resource Manager as an abstract interface to the different storage systems. A Good Idea, but this is not (IMHO) a complete storage abstraction layer, and anyway it cannot hide fundamental differences in approaches to MSS design. There is lots of interest in the Amazon S3 interface these days; this doesn't try to do as much as SRM, but HEP should try to adopt de facto standards. Once you have distributed the data, a file catalogue is needed to record which files are available where. LFC, the LCG File Catalogue, was designed for this role as a distributed catalogue to avoid a single point of failure, but other solutions are also used. And as many other services rely on CERN, the need for a distributed catalogue is no longer (seen as) so important.
Looking more widely I. Only a small subset of the data distributed is actually used: experiments don't know a priori which datasets will be popular, and CMS sees 8 orders of magnitude in access rates between the most and least popular. Hence dynamic data replication: create copies of popular datasets at multiple sites.
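A toy version of popularity-driven replication: count accesses per dataset and add replicas for the hot ones. The dataset names and thresholds are invented; real systems (e.g. ATLAS's PanDA Dynamic Data Placement) use richer policies:

```python
from collections import Counter

# Sketch of dynamic data replication driven by access popularity.
# Every dataset keeps one base replica; "hot" datasets earn extras.

accesses = ["ds_higgs", "ds_higgs", "ds_higgs", "ds_minbias",
            "ds_higgs", "ds_susy", "ds_higgs"]   # hypothetical log

def plan_replicas(accesses, hot_threshold=3, max_replicas=4):
    counts = Counter(accesses)
    plan = {}
    for dataset, n in counts.items():
        # one base replica, plus one extra per `hot_threshold` accesses
        plan[dataset] = min(1 + n // hot_threshold, max_replicas)
    return plan

print(plan_replicas(accesses))
```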
Looking more widely II. Network capacity is readily available, and it is reliable: so let's simply copy data from another site if it is not available locally, rather than recalling it from tape or failing the job. Inter-connectedness is increasing, with LHCOne designed to deliver (multi-)10 Gb links between Tier-2s. [Diagram: the MONARC 2000 model: CERN (n·10^7 MIPS, m PB robot) linked at N x 622 Mbit/s to centres such as FNAL (4·10^7 MIPS, 110 TB robot) and universities (n·10^6 MIPS, m TB robot).] A fibre cut during tests in 2009 reduced capacity, but alternative links took over.
Metadata Distribution. Conditions data is needed to make sense of the raw data from the experiments: data on items such as temperatures, detector voltages and gas compositions is needed to turn the ~100M-pixel image of the event into a meaningful description in terms of particles, tracks and momenta. This data is in an RDBMS, Oracle at CERN, and presents interesting distribution challenges: one cannot tightly couple databases across the loosely coupled WLCG sites, for example. Oracle Streams technology was improved to deliver the necessary performance [chart: average Streams throughput, rising from ~1,700-4,600 LCR/s with Oracle 10g to ~25,000-40,000 LCR/s with Oracle 11gR2 (optimised), for row sizes of 100 B, 500 B and 1000 B], and HTTP caching systems were developed to address the need for cross-DBMS distribution.
Job Execution Environment. Jobs submitted to sites depend on large, rapidly changing libraries of experiment-specific code. Major problems ensue if updated code is not distributed to every server across the grid (remember, there are x0,000 servers…), and shared filesystems can become a bottleneck if used as a distribution mechanism within a site. Approaches: the pilot-job framework can check whether the execution host has the correct environment, or a global caching file system: CernVM-FS. [2011 figures, ATLAS today: 22/1.8M files; 921/115 GB.]
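The fetch-once, cache-locally idea behind CernVM-FS can be sketched with a content-addressed store. The dictionaries and function names below are illustrative, not the CernVM-FS client API:

```python
import hashlib

# Sketch of content-addressed caching in the spirit of CernVM-FS:
# a file is fetched over the "network" once, by content hash, and
# served from the local cache on every later access.

remote_repo = {}   # hash -> content: stand-in for the HTTP backend
local_cache = {}   # the worker node's local cache
fetches = 0        # how many times we actually hit the network

def publish(content: bytes) -> str:
    """Add content to the repository, keyed by its hash."""
    digest = hashlib.sha1(content).hexdigest()
    remote_repo[digest] = content
    return digest

def open_file(digest: str) -> bytes:
    global fetches
    if digest not in local_cache:        # cache miss: remote fetch
        fetches += 1
        local_cache[digest] = remote_repo[digest]
    return local_cache[digest]

h = publish(b"libAnalysis.so v42")       # hypothetical library file
open_file(h); open_file(h); open_file(h)
print(fetches)  # only the first access fetched remotely
```

Because the key is the content hash, an updated library gets a new key and old cached copies are never stale: the property that makes this safe as a distribution mechanism for rapidly changing experiment code.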
Outline: HEP, CERN, LHC and LHC Experiments; LHC Computing Challenge; The Technique (in outline, in more detail); Towards the Future; Summary
Towards the Future: Learning from our mistakes. We have just completed a review of WLCG operations and services, based on 2+ years of operations, with the aim of simplifying and harmonising during the forthcoming long shutdown. Key areas to improve are data management & access and exploiting many/multi-core architectures, especially with use of virtualisation. Also: clouds and identity management.
Integrating With The Cloud? [Diagram, slide courtesy of Ulrich Schwickerath: a central task queue receives user work; a VO service issues instance requests to sites A, B and C, whose instances pull payloads; an image maintainer feeds a shared image repository (VMIC); cloud bursting extends capacity to a commercial cloud.]
Trust!
One step beyond?
Outline: HEP, CERN, LHC and LHC Experiments; LHC Computing Challenge; The Technique (in outline, in more detail); Towards the Future; Summary
Summary. WLCG has delivered the capability to manage and distribute the large volumes of data generated by the LHC experiments, and the excellent WLCG performance has enabled physicists to deliver results rapidly. HEP datasets may not be the most complex or (any longer) the most massive, but in addressing the LHC computing challenges the community has delivered the world's largest computing Grid, practical solutions to requirements for large-scale data storage, distribution and access, and a global trust federation enabling world-wide collaboration.
Thank You! And thanks to Vlado Bahyl, German Cancio, Ian Bird, Jakob Blomer, Eva Dafonte Perez, Fabiola Gianotti, Frédéric Hemmer, Jan Iven, Alberto Pace and Romain Wartel of CERN, Elisa Lanciotti of PIC and K. De, T. Maeno and S. Panitkin of ATLAS for various unattributed graphics and slides.