The ARDA project: Grid analysis prototypes of the LHC experiments

RAL, 13 May 2004 (http://cern.ch/arda)
The ARDA project: Grid analysis prototypes of the LHC experiments
Massimo Lamanna, ARDA Project Leader (Massimo.Lamanna@cern.ch)
www.eu-egee.org / cern.ch/lcg
EGEE is a project funded by the European Union under contract IST-2003-508833

Contents
- ARDA project: mandate and organisation
- ARDA activities during 2004
  - General pattern
  - LHCb, CMS, ATLAS, ALICE
- Conclusions and outlook

ARDA working group recommendations: our starting point
- New service decomposition
- Strong influence of the AliEn system, the Grid system developed by the ALICE experiment and used by a wide scientific community (not only HEP): the role of experience and existing technology
- Web service framework
- EGEE middleware: interfacing to existing middleware to enable its use in the experiment frameworks
- Early deployment of (a series of) prototypes to ensure functionality and coherence
These recommendations are the starting point of the ARDA project.

EGEE and LCG
- Strong links already established between EDG and LCG; this will continue in the scope of EGEE
- The core infrastructure of the LCG and EGEE grids will be operated as a single service and will grow out of the LCG service
  - LCG includes many US and Asia partners; EGEE includes other sciences
  - A substantial part of the infrastructure is common to both
- Parallel production lines as well:
  - LCG-2: 2004 data challenges
  - Pre-production prototype with the EGEE middleware: the ARDA playground for the LHC experiments
(Timeline diagram: VDT/EDG feeding LCG-1 and LCG-2, the EGEE web-service middleware feeding EGEE-1 and EGEE-2, with ARDA bridging the two lines)

End-to-end prototypes: why?
- Provide fast feedback to the EGEE middleware development team
  - Avoid uncoordinated evolution of the middleware
  - Keep coherence between users' expectations and the final product
- Experiments ready to benefit from the new middleware as soon as possible
  - Frequent snapshots of the middleware available
  - Expose the experiments (and the community in charge of deployment) to the current evolution of the whole system
- Experiment systems are very complex and still evolving: move forward towards new-generation real systems (analysis!)
- Prototypes should be exercised with realistic workloads and conditions
  - No academic exercises or synthetic demonstrations
  - LHC experiment users absolutely required here (EGEE pilot application)
- A lot of work (experience and useful software) is embedded in the current experiment data challenges: a concrete starting point
  - Adapt/complete/refactor the existing: we do not need another system!

End-to-end prototypes: how?
- The initial prototype will have a reduced scope
  - Component selection for the first prototype: experiment components not used in the first prototype are not ruled out (and the ones used/selected might be replaced later on)
  - Not all use cases/operation modes will be supported
- Every experiment has a production system (with multiple backends, like PBS, LCG, G2003, NorduGrid, ...); we focus on end-user analysis on an EGEE-middleware-based infrastructure
- Adapt/complete/refactor the existing experiment (sub)system: a collaborative effort, not a parallel development
- Attract and involve users: many users are absolutely required
- Informal use cases are still being defined, e.g. (see the sketch below):
  - A physicist selects a data sample (from current data challenges)
  - With an example/template as a starting point, (s)he prepares a job to scan the data
  - The job is split into sub-jobs and dispatched to the Grid; some error recovery is performed automatically; the results are merged back into a single output
  - The output (histograms, ntuples) is returned together with simple information on the job-end status
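
The informal use case maps onto a simple split/dispatch/recover/merge loop. The following toy Python sketch only illustrates that pattern; run_subjob(), the file names and the retry counts are hypothetical stand-ins, not the ARDA or experiment interfaces.

```python
import random

# Toy stand-in for a real Grid submission of one sub-job.
def run_subjob(files):
    if random.random() < 0.2:                  # simulate an occasional Grid failure
        raise RuntimeError("sub-job failed")
    return {"n_events": 1000 * len(files)}     # pretend output (e.g. a histogram summary)

def analyse(dataset, n_subjobs=4, max_retries=2):
    chunks = [dataset[i::n_subjobs] for i in range(n_subjobs)]   # split the data sample
    outputs, report = [], []
    for i, chunk in enumerate(chunks):
        for attempt in range(1 + max_retries):                   # automatic error recovery
            try:
                outputs.append(run_subjob(chunk))
                report.append((i, "done", attempt))
                break
            except RuntimeError:
                continue
        else:
            report.append((i, "failed", max_retries))
    merged = sum(o["n_events"] for o in outputs)                 # merge back into a single output
    return merged, report                                        # result plus job-end status summary

if __name__ == "__main__":
    files = ["dc04_file_%d.root" % n for n in range(16)]
    print(analyse(files))
```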

ARDA @ Regional Centres
- Deployability is a key factor of middleware success
- A few Regional Centres will have the responsibility of providing early installations for ARDA
  - Understand deployability issues
  - Extend the ARDA test bed: the ARDA test bed will be the next step after the most complex EGEE middleware test bed
  - Stress and performance tests could ideally be located outside CERN; this applies to experiment-specific components (e.g. a metadata catalogue)
  - Leverage Regional Centre local know-how: database technologies, web services
- Pilot sites might enlarge the available resources and give fundamental feedback on deployability, complementing the EGEE SA1 activity (EGEE/LCG operations)
  - Running ARDA pilot installations
  - Experiment data available where the experiment prototype is deployed

Coordination and forum activities
- The coordination activities flow naturally from the fact that ARDA will be open to providing demonstration benches
  - Since it is neither necessary nor possible that all projects be hosted inside the ARDA experiment prototypes, some coordination is needed to ensure that new technologies can be exposed to the relevant community, through a transparent process
  - ARDA should organise a set of regular meetings (one per quarter?) to discuss results, problems and new/alternative solutions, and possibly agree on a coherent programme of work
  - The ARDA project leader organises this activity, which will be truly distributed and led by the active partners
- ARDA is embedded in EGEE NA4, namely NA4-HEP
- Special relation with the LCG GAG, the LCG forum for Grid requirements and use cases
  - Experiment representatives coincide with the EGEE NA4 experiment representatives
  - ARDA will channel this information to the appropriate recipients
- ARDA workshop (January 2004 at CERN; open; over 150 participants)
- ARDA workshop (June 21-23 at CERN; by invitation): "The first 30 days of EGEE middleware"
- NA4 meeting in mid July (NA4/JRA1 and NA4/SA1 sessions foreseen; organised by M. Lamanna and F. Harris)
- ARDA workshop (September 2004?; open)

People
- ARDA team: Massimo Lamanna, Birger Koblitz, Andrey Demichev, Viktor Pose (Russia), Dietrich Liko, Frederik Orellana, Wei-Long Ueng, Tao-Sheng Chen (Taiwan), Derek Feichtinger, Andreas Peters, Julia Andreeva, Juha Herrala, Andrew Maier, Kuba Moscicki (grouped by experiment, ALICE / ATLAS / CMS / LHCb, in the original slide layout)
- Experiment interfaces: Piergiorgio Cerello (ALICE), David Adams (ATLAS), Lucia Silvestris (CMS), Ulrik Egede (LHCb)

Example of activity
- Existing system as starting point: every experiment has different implementations of the standard services
  - Used mainly in production environments: few expert users, coordinated update and read actions
- ARDA: interface with the EGEE middleware; verify (and help evolve) such components for analysis environments
  - Many users (robustness), concurrent read actions (performance)
- One prototype per experiment; a common application layer might emerge in the future
- ARDA's emphasis is to enable each experiment to do its job

Milestones (one set per experiment x):
  1.x.1  May 2004        E2E prototype definition agreed with experiment x (due very soon)
  1.x.2  September 2004  E2E prototype for experiment x using basic EGEE middleware
  1.x.3  November 2004   E2E prototype for experiment x with improved functionality
  1.x    December 2004   E2E prototype for experiment x, capable of analysis
  2.x    December 2005   E2E prototype for experiment x, capable of analysis and production (already started)

LHCb
- The LHCb system within ARDA uses GANGA as its principal component (see next slide)
- The LHCb/GANGA plan: enable physicists (via GANGA) to analyse the data being produced during 2004 for their studies
  - This naturally matches the ARDA mandate: having the prototype where the LHCb data will be is the key
- At the beginning the emphasis will be on validating the tool, focusing on usability and on the splitting and merging functionality for user jobs
- The DIRAC system (the LHCb grid system, used mainly in production so far) could be a useful playground to understand the detailed behaviour of some components, like the file catalogue

GANGA: Gaudi/Athena and Grid Alliance
- Gaudi/Athena are the LHCb/ATLAS frameworks (Athena uses Gaudi as a foundation)
- A single desktop for a variety of tasks
  - Helps configuring and submitting analysis jobs
  - Keeps track of what users have done, hiding all technicalities completely (Resource Broker, LSF, PBS, DIRAC, Condor)
  - Job registry stored locally or in the roaming profile
  - Automates the configure/submit/monitor procedures
  - Provides a palette of possible choices and specialised plug-ins (pre-defined application configurations, batch/grid systems, etc.)
- A friendly user interface (CLI/GUI) is essential
  - GUI wizard interface: helps users explore new capabilities and browse the job registry
  - Scripting/command-line interface: automates frequent tasks; a Python shell is embedded into the Ganga GUI
(Architecture diagram: the GANGA internal model connects the user interfaces (GUI/CLI), the GAUDI program (job options, algorithms, histograms, monitoring results) and the collective and resource Grid services: bookkeeping service, workload manager, profile service, file catalogue, storage and computing elements)
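
As an illustration of the "single desktop" idea, here is a minimal sketch of the kind of Python session Ganga aims at. The import style and the class names (Job, Executable, LSF) follow later public Ganga releases and are an assumption here, not the 2004 interface.

```python
# Hedged sketch: inside the ganga shell these objects are available directly;
# the package-style import below follows later public releases.
from ganga import Job, Executable, LSF

j = Job(name="toy-analysis")
j.application = Executable(exe="/bin/echo", args=["running my analysis"])
j.backend = LSF()      # swapping in another plug-in (DIRAC, Condor, a Resource Broker, ...)
                       # changes where the job runs, not how the user configures it
j.submit()             # configuration, submission and monitoring handled by Ganga
print(j.status)        # the job is tracked in the job registry
```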

ARDA contribution to Ganga
- Integration with the EGEE middleware
  - While waiting for the EGEE middleware, we developed an interface to Condor
  - Use of Condor DAGMan for splitting/merging and error-recovery capability (see the sketch below)
- Design and development: command-line interface, future evolution of Ganga
- Release management, software process and integration: testing, tagging policies, etc.
- Infrastructure: installation, packaging, etc.
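
The split/merge and error-recovery pattern built on Condor DAGMan can be pictured as follows. This is a hedged sketch with illustrative file names and sub-job counts; it relies only on the standard JOB, PARENT/CHILD and RETRY DAGMan keywords.

```python
# Generate a small DAGMan description: N analysis sub-jobs, each retried on
# failure, followed by a single merge job.  File names are illustrative.
N_SUBJOBS = 4

lines = []
for i in range(N_SUBJOBS):
    lines.append("JOB analyse%d analyse%d.sub" % (i, i))   # one Condor submit file per sub-job
    lines.append("RETRY analyse%d 3" % i)                  # error recovery: up to 3 resubmissions
lines.append("JOB merge merge.sub")                        # merging step runs after all sub-jobs
lines.append("PARENT %s CHILD merge"
             % " ".join("analyse%d" % i for i in range(N_SUBJOBS)))

with open("analysis.dag", "w") as dag:
    dag.write("\n".join(lines) + "\n")
# Submit with: condor_submit_dag analysis.dag
```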

LHCb metadata catalogue
- Used in production (for large productions)
- Web-service layer being developed (main developers in the UK); Oracle backend
- ARDA contributes testing focused on the analysis usage:
  - Robustness
  - Performance under high concurrency (read mode): measured network rate vs. number of concurrent clients (a sketch of such a test follows below)
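
A concurrency measurement of this kind can be approximated with a few lines of Python. The sketch below is illustrative only: the endpoint URL and the getJobs() method are hypothetical, not the real LHCb bookkeeping interface.

```python
import threading
import time
import xmlrpc.client

# Hypothetical read-only load test: N_CLIENTS virtual users query the bookkeeping
# web service concurrently and the aggregate data rate is reported.
ENDPOINT = "http://bookkeeping.example.org:8080/RPC2"   # hypothetical service URL
N_CLIENTS, N_QUERIES = 20, 50

bytes_read = 0
lock = threading.Lock()

def client_loop():
    global bytes_read
    proxy = xmlrpc.client.ServerProxy(ENDPOINT)
    for _ in range(N_QUERIES):
        reply = proxy.getJobs("DC04")                   # hypothetical query method
        with lock:
            bytes_read += len(str(reply))

threads = [threading.Thread(target=client_loop) for _ in range(N_CLIENTS)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print("%d clients: %.2f MB/s aggregate" % (N_CLIENTS, bytes_read / elapsed / 1e6))
```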

CERN/Taiwan tests
- Clone the bookkeeping database in Taiwan, install the web-service layer, and run performance tests against both the CERN and Taiwan instances
- Bookkeeping server (Oracle DB, web and XML-RPC service) performance tests: CPU load, network send/receive, process time, database I/O
- Client host performance tests: CPU load, network send/receive, process time
- Virtual users drive the load; a network monitor records the rates

CMS
- The CMS system within ARDA is still under discussion
- Providing easy access to (and possibly sharing of) data for the CMS users is a key issue
- RefDB is the bookkeeping engine used to plan and steer the production across its different phases (simulation, reconstruction, and to some degree the analysis phase)
  - It contains all the necessary information except the physical file locations (in the RLS) and the information related to the transfer management system (TMDB)
  - The actual mechanism to provide these data to analysis users is under discussion
- Performance measurements are underway (same philosophy as for the LHCb metadata catalogue measurements)
(Diagram: RefDB in CMS DC04. RefDB sends reconstruction instructions to McRunjob; reconstruction jobs run on the T0 worker nodes and write reconstructed data to the GDB CASTOR pool, tapes and export buffers; summaries of successful jobs go back to RefDB; a transfer agent checks what has arrived, updates the RLS and updates the TMDB)

ATLAS
- The ATLAS system within ARDA has been agreed
- ATLAS has a complex strategy for distributed analysis, addressing different areas with specific projects (fast response, user-driven analysis, massive production, etc.; see http://www.usatlas.bnl.gov/ada/)
- The starting point is the DIAL system
- The AMI metadata catalogue is a key component
  - MySQL as a backend
  - Genuine web-service implementation
  - Robustness and performance tests from ARDA
- In the start-up phase, ARDA provided some help in developing the ATLAS production tools (being finalised)

What is DIAL?
- DIAL connects interactive analysis (e.g. ROOT, JAS, ...) to distributed processing running the data-specific application
(Diagram: the interactive analysis session hands a dataset and a job to a scheduler, which drives the distributed processing)

AMI studies in ARDA
- AMI is the ATLAS metadata catalogue; it contains file metadata (simulation/reconstruction version) but not physical file names
- Many problems are still open:
  - Large network traffic overhead due to schema-independent tables
  - A SOAP proxy is supposed to provide database access
    - Web services are stateless (no automatic handles for the concept of session, transaction, etc.): 1 query = 1 (full) response
    - Large queries might crash the server (one possible client-side mitigation is sketched below)
    - Should the proxy re-implement all database functionality?
- Good collaboration in place with ATLAS Grenoble
- Behaviour studied using many concurrent clients
(Diagram: many users querying the MySQL metadata database through the SOAP proxy)
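
One client-side mitigation for the "large query crashes the server" problem is to page the query explicitly, so each stateless call returns a bounded chunk. The sketch below is illustrative only; the proxy object and its select() method are hypothetical, not the real AMI interface.

```python
# Hedged sketch: page a large catalogue query through a stateless proxy so that
# no single call returns the full result set.  proxy.select() is hypothetical.
def paged_query(proxy, query, page_size=500):
    offset = 0
    while True:
        rows = proxy.select(query, limit=page_size, offset=offset)
        if not rows:
            return
        for row in rows:
            yield row
        offset += page_size

# Usage (with any proxy exposing such a select() call):
#   for row in paged_query(ami_proxy, "dataset LIKE 'dc2%'"):
#       process(row)
```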

ALICE: Grid-enabled PROOF (SuperComputing 2003 demo)
- Strategy: ALICE/ARDA will evolve the analysis system presented by ALICE at SuperComputing 2003, using the new EGEE middleware (at SC2003, AliEn was used)
- Activity on PROOF: robustness, error recovery
(Diagram: a user session talks to the PROOF master server, which reaches the PROOF slaves at sites A, B and C through TcpRouter services)

ALICE-ARDA prototype improvements
- At SC2003 the setup was heavily tied to the middleware services:
  - Somewhat inflexible configuration
  - No chance to use PROOF on federated grids like LCG in AliEn
  - The TcpRouter service needs incoming connectivity at each site
  - Libraries cannot be distributed using the standard rootd functionality
- Improvement ideas:
  - Distribute another daemon with ROOT which replaces the need for a TcpRouter service
  - Connect each slave proofd/rootd via this daemon to two central proofd/rootd master multiplexer daemons, which run together with the PROOF master
  - Use Grid functionality for daemon start-up and booking policies through a plug-in interface from ROOT
  - Put PROOF/ROOT on top of the Grid services
  - Improve dynamic configuration and error recovery

ALICE-ARDA improved system
- The remote PROOF slaves look like local PROOF slaves on the master machine
- The booking service is usable on local clusters as well
(Diagram: PROOF slave servers connect through proxy proofd/rootd daemons and the Grid booking services to the PROOF master; a rough sketch of the outbound-relay idea follows below)
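
The "proxy proofd/rootd" idea boils down to the slave side opening outbound connections towards a multiplexer next to the PROOF master, so that sites need no incoming connectivity. The following rough sketch (modern asyncio, hypothetical host names and ports, no PROOF framing) only illustrates that relay pattern; the real implementation lives inside ROOT/PROOF.

```python
import asyncio

# Hypothetical endpoints: the central multiplexer next to the PROOF master, and
# the local proofd/rootd on the slave node.
MASTER_HOST, MASTER_PORT = "proof-master.example.org", 9000
LOCAL_HOST, LOCAL_PORT = "127.0.0.1", 1093

async def pipe(reader, writer):
    # Copy bytes in one direction until the peer closes the connection.
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def main():
    # Outbound connection to the multiplexer (no incoming connectivity needed at
    # the site) plus a local connection to the slave's proofd/rootd.
    up_reader, up_writer = await asyncio.open_connection(MASTER_HOST, MASTER_PORT)
    local_reader, local_writer = await asyncio.open_connection(LOCAL_HOST, LOCAL_PORT)
    await asyncio.gather(pipe(up_reader, local_writer), pipe(local_reader, up_writer))

asyncio.run(main())
```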

Conclusions and outlook
- ARDA is starting
  - Main tool: experiment prototypes for analysis
  - A detailed project plan is being prepared
- Good feedback from the LHC experiments
- Good collaboration with EGEE NA4
- Good collaboration with the Regional Centres; more help needed
- Looking forward to contributing to the success of EGEE, helping the EGEE middleware to deliver a fully functional solution
- ARDA main focus: collaborate with the LHC experiments to set up the end-to-end prototypes
  - Aggressive schedule: the first milestone for the end-to-end prototypes is December 2004