Features and Future
Frédéric Hemmer - CERN, Deputy Head of IT Department
BEGrid seminar, Brussels, October 27, 2006
www.eu-egee.org
www.glite.org
Outline
- Overview of EGEE
- EGEE gLite Middleware
  - Foundation services
  - High Level services examples
- Software process
  - Short Term plans
  - Software Process & ETICS
BEGrid Seminar, Brussels - October 27, 2006
The EGEE Project - Enabling Grids for E-sciencE
- Started in April 2004
- Now in 2nd phase with 91 partners in 32 countries
- Objectives:
  - Large-scale, production-quality grid infrastructure for e-science
  - Attract new resources and users from industry as well as science
  - Maintain and further improve the gLite Grid middleware
Applications on EGEE
Many applications from a growing number of domains:
- Astrophysics: MAGIC, Planck
- Computational Chemistry
- Earth Sciences: Earth Observation, Solid Earth Physics, Hydrology, Climate
- Financial Simulation: E-GRID
- Fusion
- Geophysics: EGEODE
- High Energy Physics: 4 LHC experiments (ALICE, ATLAS, CMS, LHCb); BaBar, CDF, DØ, ZEUS
- Life Sciences: Bioinformatics (Drug Discovery, GPS@, Xmipp_MLrefine, etc.); Medical imaging (GATE, CDSS, gptm3d, SiMRI 3D, etc.)
- Multimedia
- Material Sciences
Applications have moved from testing to routine and daily usage: ~80-90% efficiency, > 165 Virtual Organizations (VOs)
[Chart: jobs/day, Jan 2005 - Apr 2006]
User Forum Book of abstracts: http://doc.cern.ch/archive/electronic/egee/tr/egee-tr-2006-005.pdf
App deployment plan: https://edms.cern.ch/document/722131/2
Presentations, posters and demos at EGEE06: http://www.eu-egee.org/egee06
EGEE Grid Sites: Q1 2006
- EGEE: steady growth over the lifetime of the project
- EGEE: > 180 sites in 40 countries; > 24,000 processors, ~5 PB storage
[Charts: number of sites and number of CPUs, Apr 2004 - Feb 2006]
Sites per country:
Austria 2        | India 2       | Russia 12
Belgium 3        | Ireland 15    | Serbia 1
Bulgaria 4       | Israel 3      | Singapore 1
Canada 7         | Italy 25      | Slovakia 4
China 3          | Japan 1       | Slovenia 1
Croatia 1        | Korea 1       | Spain 13
Cyprus 1         | Netherlands 3 | Sweden 4
Czech Republic 2 | FYROM 1       | Switzerland 1
Denmark 1        | Pakistan 2    | Taipei 4
France 8         | Poland 5      | Turkey 1
Germany 10       | Portugal 1    | UK 22
Greece 6         | Puerto Rico 1 | USA 4
Hungary 1        | Romania 1     | CERN 1
EGEE: What do we deliver?
- Infrastructure operation
  - Currently includes ~200 sites across 40 countries
  - Continuous monitoring of grid services & automated site configuration/management
  - http://gridportal.hep.ph.ic.ac.uk/rtm/launch_frame.html
- Middleware
  - Production-quality middleware distributed under a business-friendly open source licence
- User support
  - Managed process from first contact through to production usage
  - Training, expertise in grid-enabling applications, online helpdesk
  - Networking events (User Forum, conferences, etc.)
- Interoperability
  - Expanding geographical reach and interoperability with collaborating e-infrastructures
Middleware Layers
- Applications
- Higher-Level Grid Services: workload management, replica management, visualization, workflow, grid economies, ...
- Foundation Grid Middleware: security model and infrastructure; Computing (CE) and Storage Elements (SE); accounting; information and monitoring
- Applications have access both to Higher-Level Grid Services and to Foundation Grid Middleware
- Higher-Level Grid Services help users build their computing infrastructure, but should not be mandatory
- Foundation Grid Middleware is deployed on the EGEE infrastructure
  - Must be complete and robust
  - Should allow interoperation with other major grid infrastructures
  - Should not assume the use of Higher-Level Grid Services
The gLite Middleware Approach
- Exploit experience and existing components from VDT (Condor, Globus), EDG/LCG, and others
  - gLite is a distribution that combines components from many different providers!
- Develop, test, certify & distribute a generic middleware stack useful to EGEE (and other) applications
- Pluggable components
- Follow an SOA approach, WS-I compliant where possible
- Focus is on re-engineering and hardening
- Business-friendly open source license; plan to switch to Apache 2.0
gLite Grid Middleware Services
- Access: CLI, API
- Security: Authentication, Authorization, Auditing
- Information & Monitoring: Information & Monitoring, Application Monitoring
- Data Management: Metadata Catalog, File & Replica Catalog, Storage Element, Data Movement
- Workload Management: Accounting, Job Provenance, Package Manager, Computing Element, Workload Management
- Site services: Site Proxy
Overview paper: http://doc.cern.ch//archive/electronic/egee/tr/egee-tr-2006-001.pdf
Grid Foundation: Computing Element
- The CE accepts batch jobs (and job control requests) from the WMS or clients through a gatekeeper, performs AAA, passes them to the Local Resource Management System (LRMS), monitors their execution and returns results to the submitter; state is published to the Information System
- Three flavours available now:
  - LCG-CE (GT2 GRAM): in production now, but will be phased out by the end of the year
  - gLite-CE (GSI-enabled Condor-C): already deployed, but still needs thorough testing and tuning
  - CREAM (WS-I based interface)
- Contribution to the OGF-BES group for a standard WS-I based CE interface
- Grid site components: glexec + LCAS/LCMAPS, BDII, R-GMA, CEMon, worker nodes (WN) behind the LRMS
- BLAH is the interface to the local resource manager (via plug-ins) for CREAM and gLite-CE
- Information pass-through: pass parameters to the LRMS to help job scheduling
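The CE flow described above (accept a job through a gatekeeper, perform AAA, hand it to the LRMS via a BLAH-style plug-in, report status) can be sketched as follows. This is a minimal illustration; all class and function names are hypothetical, not the gLite API.

```python
# Sketch of the CE job flow: accept -> AAA -> LRMS submission -> status.
# Names are illustrative stand-ins, not real gLite interfaces.

class LRMSPlugin:
    """Stand-in for a BLAH plug-in wrapping a batch system (PBS, LSF, ...)."""
    def __init__(self):
        self.queue = {}
        self.next_id = 0

    def submit(self, job, hints):
        # "Information pass-through": scheduling hints travel with the job
        self.next_id += 1
        self.queue[self.next_id] = ("queued", job, hints)
        return self.next_id

    def status(self, job_id):
        return self.queue[job_id][0]

class ComputingElement:
    def __init__(self, lrms, authorized_dns):
        self.lrms = lrms
        self.authorized = set(authorized_dns)

    def accept(self, user_dn, job, hints=None):
        # AAA step: reject users whose credentials are not mapped locally
        if user_dn not in self.authorized:
            raise PermissionError(f"{user_dn} not authorized on this CE")
        return self.lrms.submit(job, hints or {})

ce = ComputingElement(LRMSPlugin(), {"/DC=org/CN=alice"})
job_id = ce.accept("/DC=org/CN=alice", {"executable": "/bin/hostname"},
                   hints={"max_wall_time": "60"})
print(ce.lrms.status(job_id))  # -> queued
```

The real CE adds monitoring, result staging and publication to the information system; the point here is only the accept/authorize/submit pipeline.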
Grid Foundation: Storage Element
- Site File Name (SFN): identifies a Storage Element and the logical name of the file inside it
- Physical File Name (PFN): the argument of a file open
- Storage Resource Manager (SRM):
  - hides the storage system implementation (disk or active tape)
  - checks the access rights to the storage system and the files
  - translates SFNs to PFNs
  - disk-based: DPM, dCache; tape-based: Castor, dCache
- File I/O: POSIX-like access from local nodes or the grid (GFAL)
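The SFN-to-PFN translation and access check performed by an SRM, as described above, can be sketched like this. The mapping, the hostname, and the file names are illustrative assumptions; a real SRM queries its storage backend and the site's authorization services.

```python
# Sketch of SRM name translation: SFN (SE + logical name) -> PFN (open() path).
# SimpleSRM, its mapping, and the example names are hypothetical.

class SimpleSRM:
    def __init__(self, pool_root, acl):
        self.pool_root = pool_root   # where this SE's disk pool lives
        self.acl = acl               # logical name -> set of allowed users

    def get_pfn(self, sfn, user):
        """An SFN names the SE plus the logical file path inside it,
        e.g. srm://se.example.org/dteam/higgs.dat."""
        host, _, logical = sfn.removeprefix("srm://").partition("/")
        # SRM checks access rights before revealing a physical location
        if user not in self.acl.get(logical, set()):
            raise PermissionError(f"{user} may not access {logical}")
        # PFN: the physical path a client would pass to a POSIX-like open
        return f"{self.pool_root}/{logical}"

srm = SimpleSRM("/storage/pool01", {"dteam/higgs.dat": {"alice"}})
pfn = srm.get_pfn("srm://se.example.org/dteam/higgs.dat", "alice")
print(pfn)  # -> /storage/pool01/dteam/higgs.dat
```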
Example: The Disk Pool Manager (DPM)
- Light-weight disk-based Storage Element
- Easy to install, configure and manage; easy to join or remove resources
- Integrated security (authentication/authorization) based on VOMS groups and roles
- All control and I/O services have security built in: GSI or Kerberos 5
- SRMv1 and SRMv2.1 interfaces; SRMv2.2 being added now
- Components: DPM daemon and database, request daemon, SRM server and daemon, Name Server (NS daemon and database), GridFTP server and RFIO daemon on the data servers; grid clients speak SRM, GridFTP or RFIO
Grid Foundation: Accounting
- Resource usage by VO, group or single user
- Resource metering: sensors running on resources to determine usage
- Pricing policies: associate a cost with resource usage; if enabled, allow market-based resource brokering
- Privacy: access to accounting data granted only to authorized people (user, provider, VO manager)
- Basic functionality in APEL, full functionality in DGAS
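The metering-plus-pricing idea above can be sketched in a few lines: sensors emit usage records, the accounting service aggregates them per VO (or group, or user), and a pricing policy optionally attaches a cost. The record fields and rates here are invented for illustration; they are not APEL's or DGAS's schema.

```python
# Sketch of accounting: aggregate usage records, then apply pricing.
# Field names and prices are hypothetical.
from collections import defaultdict

PRICES = {"cpu_hours": 0.10, "storage_gb_days": 0.01}  # illustrative rates

def aggregate(records, key="vo"):
    """Sum each metric per VO (or per 'user', 'group', ...)."""
    usage = defaultdict(lambda: defaultdict(float))
    for r in records:
        for metric, amount in r["usage"].items():
            usage[r[key]][metric] += amount
    return usage

def bill(usage):
    """Attach a cost to aggregated usage, if pricing is enabled."""
    return {who: round(sum(PRICES.get(m, 0) * v for m, v in metrics.items()), 2)
            for who, metrics in usage.items()}

records = [
    {"vo": "atlas",  "user": "alice", "usage": {"cpu_hours": 120}},
    {"vo": "atlas",  "user": "bob",   "usage": {"cpu_hours": 30,
                                                "storage_gb_days": 500}},
    {"vo": "biomed", "user": "carol", "usage": {"cpu_hours": 10}},
]
per_vo = aggregate(records)
print(bill(per_vo))  # -> {'atlas': 20.0, 'biomed': 1.0}
```

Aggregating by `key="user"` instead of `"vo"` gives the per-user view; the privacy rule on the slide would sit in front of whichever view is requested.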
High Level Services: Job Information
- Logging and Bookkeeping (L&B) service: tracks jobs during their lifetime (in terms of events)
- Job Provenance: stores long-term job information; supports job rerun
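The L&B idea above, deriving a job's current state from the stream of events logged during its lifetime, reduces to a small fold over the event list. The event and state names here are simplified examples, not the real L&B vocabulary.

```python
# Sketch of event-based job tracking: the latest recognized event
# determines the job's current state. Names are illustrative.

TRANSITIONS = {
    "registered": "submitted",
    "matched": "scheduled",
    "running": "running",
    "done_ok": "done",
    "abort": "aborted",
}

def current_state(events):
    state = "unknown"
    for ev in events:                      # events arrive in time order
        state = TRANSITIONS.get(ev, state) # unrecognized events are ignored
    return state

print(current_state(["registered", "matched", "running", "done_ok"]))  # -> done
```

Keeping the full event list rather than just the final state is what lets Job Provenance reconstruct and rerun a job later.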
High Level Services: Workload Management
- Resource brokering, workflow management, I/O data management
- Web Service interface: WMProxy
- Task Queue: keeps non-matched jobs
- Information SuperMarket: optimized cache of the information system
- Match Maker: assigns jobs to resources according to user requirements
- Job submission & monitoring: Condor-G, Condor-C, ICE (to CREAM)
- External interactions: Information System, data catalogs, Logging & Bookkeeping, policy management system (G-PBox)
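The Task Queue / Information SuperMarket / Match Maker interplay above can be sketched as follows: unmatched jobs wait in a queue and are assigned to the first cached resource that satisfies their requirements. The requirement lambdas are simplified stand-ins for JDL/ClassAd expressions, and the CE attributes are invented for the example.

```python
# Toy matchmaker: jobs carry a requirements predicate, resources come from
# a cached snapshot of the information system. Names are illustrative.

task_queue = [
    {"id": 1, "req": lambda ce: ce["free_slots"] > 0 and "atlas" in ce["vos"]},
    {"id": 2, "req": lambda ce: ce["max_wall_hours"] >= 48},
]
# "Information SuperMarket": cached CE descriptions
supermarket = [
    {"name": "ce01.example.org", "free_slots": 4, "vos": {"atlas"},
     "max_wall_hours": 24},
    {"name": "ce02.example.org", "free_slots": 0, "vos": {"cms"},
     "max_wall_hours": 72},
]

def match_maker(jobs, resources):
    matched, unmatched = {}, []
    for job in jobs:
        ce = next((r for r in resources if job["req"](r)), None)
        if ce:
            matched[job["id"]] = ce["name"]
        else:
            unmatched.append(job)   # stays in the Task Queue for a retry
    return matched, unmatched

matched, still_queued = match_maker(task_queue, supermarket)
print(matched)  # -> {1: 'ce01.example.org', 2: 'ce02.example.org'}
```

The real Match Maker also ranks matching resources rather than taking the first one; the first-match shortcut is only to keep the sketch short.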
High Level Services: FTS
- Reliable and manageable File Transfer Service for VOs
- Transfers are treated as jobs; may be split onto multiple channels
- Channels are point-to-point or catch-all (only one end fixed); more flexible channel definitions are on the way
- New features that will be available in production soon:
  - Cleaner error reporting and service monitoring interfaces
  - Proxy renewal and delegation
  - SRMv2.2 support
- Longer-term development:
  - Optimized SRM interaction: split preparation from transfer
  - Better service management controls
  - Notification of finished jobs
  - Pre-staging tape support
  - Catalog & VO plug-ins framework: allow catalog registration as part of the transfer workflow
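The transfer-as-job and channel model above can be sketched like this: each transfer job is routed to a channel, where a point-to-point channel fixes both endpoints and a catch-all channel fixes only one. Channel names, site names, and file paths are invented for the example.

```python
# Sketch of FTS channel assignment. A channel with src=None (or dst=None)
# is a catch-all: it matches any value at that end. Names are illustrative.

channels = [
    {"name": "CERN-RAL",  "src": "cern.ch", "dst": "rl.ac.uk"},  # point-to-point
    {"name": "STAR-CERN", "src": None,      "dst": "cern.ch"},   # catch-all
]

def assign_channel(transfer):
    for ch in channels:
        if ((ch["src"] is None or ch["src"] == transfer["src"]) and
                (ch["dst"] is None or ch["dst"] == transfer["dst"])):
            return ch["name"]
    return None   # no channel serves this pair; the job stays pending

jobs = [
    {"src": "cern.ch",   "dst": "rl.ac.uk", "file": "/data/run1.root"},
    {"src": "gridka.de", "dst": "cern.ch",  "file": "/data/run2.root"},
    {"src": "gridka.de", "dst": "in2p3.fr", "file": "/data/run3.root"},
]
for j in jobs:
    print(j["file"], "->", assign_channel(j))
# /data/run1.root -> CERN-RAL
# /data/run2.root -> STAR-CERN
# /data/run3.root -> None
```

Treating each channel as its own queue is what lets a site throttle or drain one link (for an intervention, say) without touching traffic on the others.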
High Level Services: EDS (Encrypted Data Storage)
- Encrypts and decrypts data on-the-fly
- Key-store: Hydra
  - Keys split over N instances; at least M (< N) must be available for decryption
  - Provides both fault tolerance and security
- Storage: currently dCache, will be DPM; catalog will be LFC; client I/O will be GFAL
- Demonstrated with the SRM-DICOM demo at the EGEE Pisa conference (Oct 05)
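The M-of-N property above (any M of the N Hydra key-store instances suffice to recover a decryption key, fewer reveal nothing) is the classic threshold secret-sharing idea. A minimal sketch using Shamir's scheme follows; the field, parameters, and function names are illustrative, not Hydra's actual implementation.

```python
# Minimal M-of-N secret sharing (Shamir's scheme), the idea behind a split
# key-store: a random degree-(M-1) polynomial with the key as its constant
# term is evaluated at N points; any M points reconstruct it, fewer cannot.
import random

P = 2**127 - 1  # a Mersenne prime; all arithmetic is modulo P

def split(secret, n, m):
    """Split `secret` into n shares such that any m reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(m - 1)]
    def poly(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x=0 recovers the secret from m shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        # pow(den, P-2, P) is the modular inverse of den (Fermat)
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

key = 123456789
shares = split(key, n=5, m=3)        # 5 key-store instances, any 3 suffice
assert reconstruct(shares[:3]) == key
assert reconstruct(shares[1:4]) == key
```

Losing up to N-M instances costs nothing (fault tolerance), while compromising fewer than M instances yields no information about the key (security), which is exactly the trade-off the slide claims for Hydra.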
Short Term Plans: Main Focus for the Developers
- Give support on the production infrastructure (GGUS, 2nd line support)
- Fix defects found in the production software
- Support SL(C)4 and 64-bit architectures (x86-64 first)
- Participate in Task Forces together with application and site experts; improve scalability
- Improve robustness and usability (efficiency, error reporting, ...)
- Address requests for functionality improvements from users, site administrators, etc. (through the Technical Coordination Group)
- Improve adherence to international standards and interoperability with other infrastructures
- Deploy and expose to users new components on the preview testbed
- Interoperability with Shibboleth
Work plans available at: https://twiki.cern.ch/twiki/bin/view/egee/egeegliteworkplans
Highlights: Shibboleth
- Shibboleth
  - Federation of campus infrastructures; developed by Internet2
  - Allows Single Sign-On for web-based resources
  - Based on SAML (Security Assertion Markup Language)
- SWITCHaai
  - An Authentication and Authorization Infrastructure (AAI) based on Shibboleth, with about 160,000 users in the Swiss higher education sector
  - Activity started in 2002; in production since last summer
  - About 12,000 use SWITCHaai on a regular basis
- Interoperability with gLite
  - Specific to the EGEE-2 infrastructure
  - NOT a replacement for X.509, VOMS, ...
  - The user's home institution is the Identity Provider
  - Attributes come both from the home institution and from the VO
WMS Performance Results (by A. Sciabà)
- ~20,000 jobs submitted: 3 parallel UIs, 33 Computing Elements, 200 jobs/collection, bulk submission
- Performance:
  - ~2.5 h to submit all jobs (0.5 seconds/job)
  - ~17 hours to transfer all jobs to a CE (3 seconds/job); 26,000 jobs/day
- Job failures: negligible fraction of failures due to the gLite WMS; either application errors or site problems
[Chart: number of jobs vs time (sec)]
Failure reason      | Job fraction (%)
Application error   | 28
Remote batch system | 3.9
CRL expired         | 3.3
Worker Node problem | 1.1
Gatekeeper down     | 0.2
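The per-job rates above roughly imply the quoted totals, which a back-of-the-envelope check confirms (the slide's figures are slightly rounded):

```python
# Sanity check of the quoted WMS throughput figures.
jobs = 20000
submit_rate = 0.5      # seconds per job (bulk submission)
transfer_rate = 3.0    # seconds per job (WMS -> CE)

submit_hours = jobs * submit_rate / 3600
transfer_hours = jobs * transfer_rate / 3600
jobs_per_day = 86400 / transfer_rate

print(f"submission: {submit_hours:.1f} h")        # ~2.8 h  (slide: ~2.5 h)
print(f"transfer:   {transfer_hours:.1f} h")      # ~16.7 h (slide: ~17 h)
print(f"sustained:  {jobs_per_day:.0f} jobs/day") # 28800   (slide: ~26,000)
```

The small gaps suggest the measured per-job times were averages over a run with some overhead, but the orders of magnitude are consistent.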
gLite Software Process
- Development (guided by directives) produces software
- Integration: integration tests; failures go back to error fixing
- Certification: deployment on a dedicated testbed, functional tests; failures and serious problems go back to error fixing
- Pre-production deployment: scalability tests; failures go back to error fixing
- Release: packages, installation guide, release notes, etc.
- Deployment to the production infrastructure; problems found there feed back into error fixing as well
gLite Software Process (cont.)
- Technical Coordination Group (TCG)
  - Gathers & prioritizes user requirements from HEP, Biomed, (industry), sites
  - gLite development is client-driven!
- Software from EGEE-JRA1 and other projects
  - JRA1 preview test-bed (currently being set up): early exposure of uncertified components to users
- SA3 Integration Team
  - Ensures components are deployable and work
  - Deployment modules implement high-level gLite node types (WMS, CE, R-GMA Server, VOMS Server, FTS, etc.)
  - Build system now spun off into the ETICS project
- SA3 Certification Team
  - Dedicated test-bed; tests release candidates and patches
  - Develops test suites
- SA1 Pre-Production System
  - Scale tests by users
ETICS
- Clients access the service via a browser (web application) or via command-line tools (web service)
- Project DB, Report DB, and build/test artefacts
- The NMI Scheduler dispatches builds and tests to worker nodes running the NMI Client
- Together these make up the ETICS infrastructure
Summary
- EGEE is a global effort, and the largest multi-science Grid infrastructure worldwide
- gLite 3.0 is an important milestone in the EGEE programme
  - New components from gLite 1.x, developed in the first phase of EGEE, are being deployed for the first time on the production infrastructure
  - Addressing application and operations requirements in terms of functionality and scalability
- New build and integration environment from ETICS
- Controlled software process and certification
- Development is application-driven (TCG)
- Collaboration with other projects for interoperability and the definition/adoption of international standards
www.glite.org
www.eu-egee.org
Grids in Europe
- Large European investment in developing Grid technology
- Sample of national Grid projects:
  - Austrian Grid Initiative
  - Belgium: BEgrid
  - DutchGrid
  - France: Grid 5000
  - Germany: D-Grid; Unicore
  - Greece: HellasGrid
  - Grid Ireland
  - Italy: INFNGrid; GRID.IT
  - NorduGrid
  - Portuguese Grid
  - Swiss Grid
  - UK e-science: National Grid Service; OMII; GridPP
- Multi-national, multi-science Grid infrastructures are a priority of the EC: DEISA, EGEE, plus several supporting projects
Evolution
- Scope: National -> European -> Global e-infrastructure
- Maturity: Testbeds -> Routine Usage -> Utility Service
Why Sustainability?
- Scientific applications start to depend on Grid infrastructures
  - e.g. EGEE supports well over 100 VOs, running over 50,000 jobs/day (Sep. 06)
  - They require long-term support
[Chart: EGEE workload, jobs/month per VO (alice, atlas, biomed, cms, compchem, dteam, egeode, egrid, esr, fusion, geant4, lhcb, magic, planck, ops, other VOs), Jan 2005 - Aug 2006]
- New scientific collaborations have been formed thanks to the Grid infrastructure
  - e.g. WISDOM (http://wisdom.healthgrid.org)
- Business and industry are getting very interested, but need a long-term perspective
  - e.g. over 20 companies were present at the Business Track during the EGEE 06 conference, September 2006