First Experience with LCG Operation and the future...
CERN openlab Board of Sponsors, 3rd April 2009
Ian Bird, LCG Project Leader
The LHC Computing Challenge
- Signal/noise: 10^-9
- Data volume: high rate x large number of channels x 4 experiments = 15 PetaBytes of new data each year
- Compute power: event complexity x number of events x thousands of users = 100k of (today's) fastest CPUs, 45 PB of disk storage
- Worldwide analysis & funding: computing funded locally in major regions & countries; efficient analysis everywhere -> GRID technology
Ian.Bird@cern.ch
WLCG: what and why?
- A distributed computing infrastructure to provide the production and analysis environments for the LHC experiments
- Managed and operated by a worldwide collaboration between the experiments and the participating computer centres
- The resources are distributed for funding and sociological reasons
- Our task is to make use of the resources available to us, no matter where they are located
- We know it would be simpler to put all the resources in 1 or 2 large centres; this is not an option... today
Tier 0 at CERN: acquisition, first-pass processing, storage & distribution. 1.25 GB/sec (ions)
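A back-of-envelope sketch of how the Tier-0 ingest rate above relates to the 15 PB/year figure from the earlier slide. The effective seconds of data taking per year is my assumption (the common ~10^7 s accelerator rule of thumb), not a number from the slides.

```python
# Rough consistency check of the Tier-0 rates quoted in this talk.
# ASSUMPTION: ~1e7 effective seconds of data taking per year
# (a rule of thumb, not a figure from the slides).

PEAK_RATE_GBS = 1.25        # GB/s into the Tier 0 (heavy-ion running)
EFFECTIVE_SECONDS = 1.0e7   # assumed data-taking seconds per year

annual_pb = PEAK_RATE_GBS * EFFECTIVE_SECONDS / 1.0e6  # GB -> PB
print(f"~{annual_pb:.1f} PB/year at the sustained peak rate")  # ~12.5 PB/year
```

At ~12.5 PB/year this is the same order as the quoted 15 PB/year of new data across all four experiments.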
Tier 0 / Tier 1 / Tier 2
- Tier-0 (CERN): data recording; initial data reconstruction; data distribution
- Tier-1 (11 centres): permanent storage; re-processing; analysis
- Tier-2 (~130 centres): simulation; end-user analysis
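The tier model above can be sketched as a small data structure; roles and centre counts are taken directly from the slide.

```python
# The WLCG tier model as listed on the slide: roles and (approximate)
# numbers of centres per tier.
TIERS = {
    "Tier-0": {"sites": 1,   "roles": ["data recording",
                                       "initial data reconstruction",
                                       "data distribution"]},
    "Tier-1": {"sites": 11,  "roles": ["permanent storage",
                                       "re-processing",
                                       "analysis"]},
    "Tier-2": {"sites": 130, "roles": ["simulation",
                                       "end-user analysis"]},
}

for tier, info in TIERS.items():
    print(f"{tier}: ~{info['sites']} centre(s); " + "; ".join(info["roles"]))
```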
First events
CERN, Amsterdam/NIKHEF-SARA, Bologna/CNAF, Taipei/ASGC, TRIUMF, BNL, NDGF, FNAL, Lyon/CCIN2P3, FZK, Barcelona/PIC, RAL
CERN + Tier 1 accounting, 2008 (monthly, Jan-Dec):
- CPU time delivered (MSI2K-days)
- Ratio of CPU : wall-clock times (%)
- Disk storage used (TeraBytes)
- Tape storage used (TeraBytes)
CMS data transfer history (May 6th 2008; LHCC referees: CMS Computing)
10M-file test @ ATLAS (from S. Campana; M.C. Vetterli, Simon Fraser, LHCC review, CERN, Feb. 09)
Number of jobs/month (ALICE, ATLAS, CMS, LHCb): now ~500k/day. Main outstanding issues related to service/site reliability.

CPU delivered (MSI2k), from the APEL accounting portal, Aug. 08 to Jan. 09:

            ALICE   ATLAS     CMS    LHCb    Total
  Tier-1s    6.24   32.03   30.73    2.50    71.50  (34.3%)
  Tier-2s    9.61   52.23   55.04   20.14   137.02  (65.7%)
  Total     15.85   84.26   85.77   22.64   208.52
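A quick arithmetic check of the accounting figures above: the row totals and the Tier-1/Tier-2 shares follow from the per-experiment numbers.

```python
# Consistency check of the APEL accounting numbers (MSI2k, Aug. 08 - Jan. 09),
# transcribed from the table above.
tier1 = {"ALICE": 6.24, "ATLAS": 32.03, "CMS": 30.73, "LHCb": 2.50}
tier2 = {"ALICE": 9.61, "ATLAS": 52.23, "CMS": 55.04, "LHCb": 20.14}

t1_total = sum(tier1.values())   # 71.50
t2_total = sum(tier2.values())   # 137.02
grand = t1_total + t2_total      # 208.52

print(f"Tier-1s: {t1_total:.2f} MSI2k ({100 * t1_total / grand:.1f}%)")
print(f"Tier-2s: {t2_total:.2f} MSI2k ({100 * t2_total / grand:.1f}%)")
```

The computed shares (34.3% Tier-1, 65.7% Tier-2) match the table.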
Database replication: LCG 3-D
- The 3-D project is now finished and runs as a production service; in full production
- Several GB/day of user data can be sustained to all Tier 1s
- ~100 DB nodes at CERN and several tens of nodes at Tier 1 sites: a very large distributed database deployment
- Used for several applications: experiment calibration data; replicating (central, read-only) file catalogues
Reliabilities
[Charts: site reliability for CERN + Tier 1s, Jul-06 to Dec-08 (average, average of 8 best sites, target); Tier 2 reliabilities (average, top 50%, top 20%)]
- Improvement during CCRC and later is encouraging
- Tests do not show the full picture: e.g. they hide experiment-specific issues, and the OR of service instances is probably too simplistic
- Next steps: a) publish VO-specific tests regularly; b) rethink the algorithm for combining service instances
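To illustrate the "OR of service instances probably too simplistic" point, here is a minimal sketch (my illustration, not the actual SAM/monitoring algorithm): OR-ing redundant instances reports a site as up even when the specific instance an experiment depends on is down.

```python
# Sketch of combining test results for redundant service instances.
# ASSUMPTION: this is an illustrative model, not the real WLCG algorithm.

def combine_or(instances):
    """Site counts as up if ANY instance passes its test."""
    return any(instances)

def combine_and(instances):
    """Pessimistic alternative: up only if ALL instances pass."""
    return all(instances)

# Example: two storage endpoints at a site; only one is working.
srm_instances = [True, False]

print("OR :", combine_or(srm_instances))   # True  - site reported reliable
print("AND:", combine_and(srm_instances))  # False - degradation is visible
```

Neither extreme matches what a given experiment actually experiences, which is why the slide proposes publishing VO-specific tests instead.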
Improving reliability
- Testing
- Task forces / challenges
- Monitoring: appropriate, followed up
Multi(many)-core:
- Better memory-use efficiency; challenges: co-scheduling many similar processes onto a single box; parallelizing the applications (multi-thread, MPI, ...)
New technologies:
- Clouds + virtualisation: use as overflow resources for peak periods (demonstrated); running our facilities as clouds (use of virtualisation, management tools, etc.); buying resources directly... needs education of the funding agencies; lessons: simplicity of interfaces, usage patterns, ...
- Virtualisation helps improve service reliability; simplifies facility management (tbc) and leaves apps to deal with dependencies
- Grid -> grid of cloud-like objects
- Filesystems: Lustre, Hadoop, NFS 4.1, etc. Can we use these to improve our service reliability? Usability?
- Messaging systems: use for integrating systems; web services across languages etc. did not deliver
Enabling Grids for E-sciencE: what are the limitations & possible solutions? (EGEE-III, INFSO-RI-222667)
Challenges cont...
- Simplification of data management: clouds don't help much here; the abstraction (SRM) has added complexity. How much is required? How can we simplify? What are the lessons to learn?
- Database access: grid authn/authz would help...
- New Tier 0 centre: we will run out of power; a new centre is planned, but will it be ready when we need it???
- Moving from EGEE to a sustainable European grid infrastructure, whilst maintaining a solid service
EGEE -> EGI + NGIs
- EGI blueprint published in December; endorsed in January by 20 NGIs
- March policy board selected Amsterdam as the location of EGI.org (the body responsible for managing EGI)
- Initiation of the transition process to create an EGI council: an MoU to be prepared as an interim measure to identify NGIs prepared to commit as described in the blueprint (starting with Letters of Intent); anticipate 1st council meeting in May
- Task force to be established for preparation of EGI proposals, for EC calls anticipated to close in November
- EGEE has outlined a fairly detailed transition plan for the final year of the project, but can only go so far
EGI and WLCG
- WLCG cannot take the risk of assuming EGI will be in place at the end of EGEE-III
- We plan to ensure that the services provided to us today by EGEE are assured by our Tier 1 sites
- Support the formation of a gLite consortium to support the middleware
- In parallel, we work with EGI_DS and the EC to try and ensure that EGI and the NGIs will deliver what we need
Conclusions
- We have built a working system that will be used for first data taking
- But it has taken a lot longer than anticipated... and was a lot harder... and the reality of grids does not quite match the hype...
- We now have an opportunity to rethink how we want this to develop in the future: clearer ideas of what is needed; must consider the risks, maintainability, reliability, and complexity
- Change of funding model and new technologies provide opportunities
- Challenges: data management and reliability...
- Should remember: our goal is to enable the experiments' computing, not necessarily to develop computer science (unless we have to...)
WLCG timeline 2009-2010 (Jan 2009 - Feb 2011)
- EGEE-III ends; EGI...???
- CCRC'09; tests to be scheduled
- pp running; HI?
- Switch to SL5/64-bit completed?
- 2009 capacity commissioned; 2010 capacity commissioned
- Deployment of glexec/SCAS; CREAM; SRM upgrades; SL5 WN