Operating the Distributed NDGF Tier-1
Michael Grønager, Technical Coordinator, NDGF
International Symposium on Grid Computing 08, Taipei, April 10th, 2008
Talk Outline
- What is NDGF?
- Why a distributed Tier-1?
- Services: Computing, Storage, Databases, VO-specific
- Operation
- Results
Nordic DataGrid Facility
A co-operative Nordic data and computing grid facility:
- Nordic production grid, leveraging national grid resources
- Common policy framework for the Nordic production grid
- Joint Nordic planning and coordination
- Operates a Nordic storage facility for major projects
- Co-ordinates and hosts major e-science projects (e.g., the Nordic WLCG Tier-1)
- Develops grid middleware and services
NDGF 2006-2010: funded (2 M€/year) by the national research councils of the Nordic countries (DK, FI, NO, SE) via NOS-N
Nordic DataGrid Facility
Nordic participation in big science:
- WLCG: the Worldwide LHC Computing Grid
- Gene databases for the bio-informatics sciences
- Screening of reservoirs suitable for CO2 sequestration
- ESS: the European Spallation Source
- Astronomy projects
- Other...
Why a Distributed Tier-1?
- Computer centers are small and distributed
- Even the biggest adds up to 7
- Strong Nordic HEP community
- Technical reasons:
  - Added redundancy
  - Only one 24x7 center
  - Fast inter-Nordic network
Organization (Tier-1 related)
Tier-1 Services
- Storage: tape and disk
- Computing: well connected to storage
- Network: part of the LHC OPN
- Databases: 3D (for e.g. ATLAS), LFC for indexing files
- File Transfer Service
- Information systems
- Monitoring (a simple connectivity sketch follows below)
- Accounting
- VO services: ATLAS-specific, ALICE-specific
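As a hypothetical illustration of the simplest layer of watching these central services, the sketch below runs a plain TCP connect probe against each endpoint. All hostnames are placeholders, not actual NDGF endpoints; the ports are the conventional defaults for each service type. Real monitoring, of course, goes through the SAM framework mentioned later.

    # Hypothetical sketch: probe central Tier-1 service endpoints with a
    # plain TCP connect. Hostnames are illustrative placeholders.
    import socket

    SERVICES = {
        "FTS":        ("fts.example.ndgf.org", 8443),  # File Transfer Service
        "LFC":        ("lfc.example.ndgf.org", 5010),  # file catalogue
        "dCache SRM": ("srm.example.ndgf.org", 8443),  # storage door
        "3D Oracle":  ("db.example.ndgf.org", 1521),   # conditions database
    }

    def is_up(host, port, timeout=5.0):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout):
                return True
        except OSError:
            return False

    for name, (host, port) in SERVICES.items():
        status = "UP" if is_up(host, port) else "DOWN"
        print("%-12s %s" % (name, status))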
Resources at Sites
- Storage is distributed
- Computing is distributed
- Many services are distributed
- But the sites are heterogeneous...
Computing
A distributed compute center uses a grid for its LRMS...
- Needs to run on all kinds of Linux distributions
- Must use resources optimally
- Must be easy to deploy
NorduGrid/ARC! (submission sketch below)
- Already deployed
- Runs on all Linux flavors
- Uses resources optimally:
  - gLite keeps nodes idle during upload/download
  - ARC uses the CE for data handling
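A minimal sketch of what submission through ARC looks like, assuming the classic ARC client (ngsub) and a valid grid proxy; the xRSL keywords are standard, but the payload is illustrative and the -f option usage should be checked against the installed client version.

    # Minimal sketch: submit a job through the NorduGrid/ARC client.
    # Assumes the classic ARC client (ngsub) and a valid grid proxy.
    import subprocess, tempfile

    XRSL = """&
     (executable="/bin/sh")
     (arguments="runjob.sh")
     (inputFiles=("runjob.sh" ""))
     (stdout="job.out")
     (stderr="job.err")
     (jobName="ndgf-demo")
     (cpuTime="60")
    """

    with tempfile.NamedTemporaryFile("w", suffix=".xrsl", delete=False) as f:
        f.write(XRSL)
        path = f.name

    # The broker matches the description against the information system;
    # the chosen CE stages input before the job occupies a worker node --
    # the CE, not the node, does the data handling.
    subprocess.run(["ngsub", "-f", path], check=True)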
Storage
dCache:
- Java-based, so it runs even on Windows!
- Separation between resources and services
- Open source
- Pools at the sites; doors and admin nodes hosted centrally
- NDGF is part of the development:
  - Added GridFTP2 to bypass door nodes in transfers (transfer sketch below)
  - Various improvements and tweaks for distributed use
- Central services at the GÉANT endpoint
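A sketch of moving data into dCache through a GridFTP door with the standard Globus client; the door hostname and pnfs path are placeholders. With the GridFTP2 extension mentioned above, the door can redirect the data channel directly to the pool node, so bulk data does not have to pass through the door host.

    # Sketch: copy a file into dCache via a GridFTP door (placeholder URL).
    # Requires a valid grid proxy; -p 4 opens four parallel TCP streams.
    import subprocess

    src = "file:///data/run001/raw.root"
    dst = ("gsiftp://door.example.ndgf.org:2811"
           "/pnfs/ndgf.org/data/atlas/raw.root")

    subprocess.run(["globus-url-copy", "-p", "4", src, dst], check=True)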
Network
- Dedicated 10GE to CERN via GÉANT (LHC OPN), terminating at Örestaden
- Dedicated 10GE NORDUnet connections between the participating Tier-1 sites
- NDGF AS: AS39590
[Figure: network diagram showing CERN and the LHC OPN, the national NRENs (FI, SE, DK, NO) with national switches and sites, the central host(s), and the HPC2N, PDC and NSC sites on the NORDUnet IP network]
Other Tier-1 Services
- Catalogue: RLS & LFC
- FTS: File Transfer Service (usage sketch below)
- 3D: Distributed Database Deployment
- Accounting: SGAS -> APEL
- Service Availability Monitoring via ARC-CE SAM sensors
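A hedged sketch of how the catalogue and FTS are driven from the standard gLite command-line clients; all endpoints and paths are placeholders, and a valid grid proxy is assumed.

    # Sketch: list a catalogue directory, then schedule a copy with FTS.
    # Endpoints and paths are illustrative placeholders.
    import os, subprocess

    os.environ["LFC_HOST"] = "lfc.example.ndgf.org"    # catalogue endpoint
    subprocess.run(["lfc-ls", "-l", "/grid/atlas/data"], check=True)

    fts = ("https://fts.example.ndgf.org:8443"
           "/glite-data-transfer-fts/services/FileTransfer")
    src = "srm://srm.example.ndgf.org/pnfs/ndgf.org/data/atlas/file1"
    dst = "srm://srm.cern.ch/castor/cern.ch/grid/atlas/file1"

    # Prints a job ID that can be polled with glite-transfer-status.
    subprocess.run(["glite-transfer-submit", "-s", fts, src, dst], check=True)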
ATLAS Services
- So far part of Dulcinea
- Moving to PanDA via the aCT (ARC Control Tower, aka "the fat pilot"; conceptual sketch below)
- PanDA improves gLite performance through better data handling (similar to ARC)
- Moving from RLS to LFC
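The sketch below is conceptual only, not the actual PanDA or aCT code: it illustrates the "fat pilot" idea, where the control tower pulls work from the task queue and stages input before a worker node is occupied, so nodes spend their time computing rather than waiting on transfers. All function names and the queue URL are hypothetical.

    # Conceptual "fat pilot" loop -- NOT the real PanDA/aCT implementation.
    import time

    def fetch_job(queue_url):
        """Placeholder: ask a PanDA-style task queue for work."""
        ...

    def stage_input(job):
        """Placeholder: pre-stage input files via the CE, ARC-style."""
        ...

    def dispatch(job):
        """Placeholder: hand the prepared job to the batch system."""
        ...

    while True:
        job = fetch_job("https://panda.example.org/jobs")  # hypothetical URL
        if job is None:
            time.sleep(60)      # no work; poll again later
            continue
        stage_input(job)        # data handling happens off the worker node
        dispatch(job)           # node starts with input already local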
ALICE Services
- Many VO boxes, one per site: Aalborg, Bergen, Copenhagen, Helsinki, Jyväskylä, Linköping, Lund, Oslo, Umeå
- Central VO box integrating the distributed dCache with xrootd (read sketch below)
- Ongoing efforts to integrate ALICE and ARC
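A small sketch of reading from the xrootd namespace that fronts the distributed dCache, using the standard xrdcp client; the redirector host and file path are placeholders.

    # Sketch: fetch a file through the ALICE xrootd interface (placeholder
    # host/path). xrdcp is the standard xrootd copy client.
    import subprocess

    subprocess.run(
        ["xrdcp",
         "root://alice.example.ndgf.org//alice/data/esd001.root",
         "/tmp/esd001.root"],
        check=True,
    )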
NDGF Facility - 2008Q1
Operations
Operation
- 1st line support (in operation): NORDUnet NOC, 24x7
- 2nd line support (in operation): Operator on Duty, 8x365
- 3rd line support (in operation): NDGF operation staff and sys admins at the sites
- Shared tickets with NUNOC
People
Results - Accounting
According to the EGEE Accounting Portal for 2007:
- NDGF contributed 4% of all EGEE computing
- NDGF was the 5th biggest EGEE site
- NDGF was the 3rd biggest ATLAS Tier-1 worldwide
- NDGF was the biggest European ATLAS Tier-1
Results - Reliability
- NDGF has been running SAM tests since 2007Q3 (worked metric example below)
- Overall 2007Q4 reliability was 96%
- Which made us the most reliable Tier-1 in the world
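For context, a worked example of how WLCG-style availability and reliability are derived from SAM test history; the downtime numbers below are purely illustrative, not NDGF's actual 2007Q4 figures.

    # Illustrative numbers only -- not NDGF's real 2007Q4 data.
    #   availability = uptime / total_time
    #   reliability  = uptime / (total_time - scheduled_downtime)
    hours_in_quarter   = 92 * 24   # ~2007Q4
    scheduled_downtime = 40        # announced maintenance, hours
    unscheduled_down   = 50        # hours failing SAM tests

    uptime = hours_in_quarter - scheduled_downtime - unscheduled_down
    availability = uptime / hours_in_quarter
    reliability  = uptime / (hours_in_quarter - scheduled_downtime)
    print("availability %.1f%%  reliability %.1f%%"
          % (100 * availability, 100 * reliability))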
Results - Efficiency
- The efficiency of the NorduGrid cloud (NDGF + Tier-2/3s using ARC) was 93%
- Mainly due to:
  - High middleware efficiency
  - High reliability
- Which in turn came from:
  - The distributed setup
  - A professional operation team
Worries
- Can reconstruction run on a distributed setup?
  - High data throughput, low CPU consumption
- NDGF, TRIUMF and BNL reprocessed M5 data in February during CCRC08-1:
  - Shown to work
  - The bottleneck was the 3D DB (which runs on only one machine)
Looking ahead...
- The distributed Tier-1 is a success:
  - High efficiency
  - High reliability
  - Passed the CCRC08-1 tests
- Partnering with EGEE on:
  - Operation (taking part in CIC-on-Duty)
  - Interoperability
- Tier-2s are being set up
- CMS will use gLite interoperability to run on ARC
Thanks! Questions?