Workload Management. Stefano Lacaprara, Department of Physics, INFN and University of Padova. CMS Physics Week, FNAL, 12/16 April 2005

Outline
1. Workload Management: the CMS way: General Architecture; Present; Future
2. CMS SW deployment; Job clustering; Data discovery and Location; Data access monitoring; Job Monitoring; Output handling; Users Support; Open Science Grid integration

General Architecture: Workload Management in CMS
- The baseline solution for CMS uses (and sometimes abuses) only a fraction of the Grid (LCG-2) functionality.
- It does not try to solve the general problem, but focuses on the specific use case most common for CMS users: access to distributed data with batch jobs running a CMS application.
- The current architecture rests on the following assumptions:
  - data are already located at the remote sites;
  - local POOL catalogs are available at the remote sites;
  - CMS software is deployed and available at the remote sites.

General Architecture: Simplify Data Management a lot!
- Data are distributed on a dataset basis; a dataset is atomic, i.e. complete and unbreakable.
- Each dataset has different data tiers (Hits, Digis, DST, ...), each considered independently.
- The user's input is a dataset with the desired data tier(s).
- Data discovery and location are based on CMS-specific services, RefDB and PubDB:
  - RefDB: central database that knows about all produced datasets;
  - PubDB: remote database (one per site publishing data) with local information about dataset access, including the CE and the location of the local file catalog;
  - RefDB knows which PubDB(s) publish a given dataset.
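
As a purely illustrative aside, a minimal Python sketch of the data model just described (names are hypothetical, not CMS code): the dataset is the atomic unit of distribution and each data tier is handled independently.

```python
# Illustrative only: a dataset is atomic and carries one or more data tiers,
# each considered independently by the discovery services.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Dataset:
    name: str                                       # dataset name
    owner: str                                      # owner (production) name
    tiers: List[str] = field(default_factory=list)  # e.g. ["Hit", "Digi", "DST"]

sample = Dataset("SomeDataset", "SomeOwner", ["Digi", "DST"])
```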

General Architecture: CRAB, the CMS Remote Analysis Builder
- CMS-specific tool for Workload Management: it performs all the tasks needed to actually run user code in the Grid environment.
- User-friendly interface to Grid services for the CMS user; the user is only expected to be able to develop and run her analysis code interactively.
- Directives are given to CRAB via a configuration file (see the sketch below):
  - the Dataset/Owner she wants to access;
  - the data tiers she needs (DST, Digis, ...);
  - job splitting directives (events per job, total number of events, ...);
  - the name of the executable;
  - the .orcarc configuration cards: the same ones she uses locally!
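
A minimal sketch of the kind of configuration meant here, shown in Python for concreteness; the section and parameter names are illustrative assumptions, not the actual CRAB configuration keys of that release.

```python
# Illustrative only: parse a CRAB-like configuration file. Section and key
# names are assumptions for this sketch, not the real CRAB schema.
from configparser import ConfigParser
from io import StringIO

example_cfg = """
[USER]
executable     = MyAnalysis
orcarc         = .orcarc

[DATA]
dataset        = SomeDataset
owner          = SomeOwner
data_tiers     = DST,Digi

[SPLITTING]
events_per_job = 1000
total_events   = -1
"""

cfg = ConfigParser()
cfg.read_file(StringIO(example_cfg))
print(cfg.get("DATA", "dataset"), cfg.getint("SPLITTING", "events_per_job"))
```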

General Architecture: CRAB functionalities
- User job preparation: pack user private libraries and executable, prepare JDLs, wrappers, etc.
- Dataset discovery and location.
- Job splitting.
- Changes to the ORCA configuration files needed to run on a remote site (catalogs, splitting, etc.).
- Job submission and tracking.
- Simple monitoring.
- Automatic output retrieval at the end, or saving to an SE or a gsiftp server (e.g. CASTOR!).
- Grid details are hidden from the user.

General Architecture: CRAB status
- Early stage of development (version 0.1.1); actively developed to cope with the (many) user requirements.
- Actively used by many CMS end users, O(10) of them, with little or no Grid knowledge; several physics presentations are already based on data accessed via CRAB.
- Successfully used to access, from any UI, data at Tier-1s (and some Tier-2s):
  - FNAL (US) (yesterday O(200) CRAB jobs running!), CNAF (Italy), PIC (Spain), CERN, FZK (Germany), IN2P3 (France), RAL (UK, still being worked on);
  - Tier-2s: Legnaro, Bari, Perugia (Italy).
- Estimated grand total: O(10^7) events accessed.

General Architecture: Workflow (UI, submission tool, RB, CE/WN, SE)
- The user develops code on a local UI and uses CRAB for Grid submission; the input is the data to be accessed and the user code.
- CRAB prepares the job (private code, splitting, submission, ...) and creates a wrapper job (JDL plus job) to be submitted to the Grid.
- The RB (or the tool itself) uses the Data Location Service to find a suitable site (CE).
- The job arrives on a Worker Node and runs over the local data using the local file catalog.
- The output is retrieved back or stored on a Storage Element.

Present: dataset discovery and location
- Dataset discovery relies on RefDB (at CERN) and PubDB (one per site publishing data).
- RefDB knows which PubDBs publish a given dataset; each PubDB publishes its site (CE).
- The local PubDB knows the dataset details (number of events, ...) and the URL of the local file catalog(s).
- The submission tool queries RefDB and the eligible PubDBs, finds the dataset locations (CEs) and passes them to the RB as a requirement; the RB ships the job to one of the possible CEs.
- On the WN, the local file catalog (XML or MySQL) provides the file locations (used by COBRA).
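
A sketch of that two-step lookup, assuming hypothetical query_refdb/query_pubdb helpers that stand in for the real RefDB and PubDB front-ends (which are web/database services, not a Python API):

```python
# Hypothetical helpers are passed in as callables so the flow is self-contained.
def discover_dataset(dataset, owner, query_refdb, query_pubdb):
    """Return the sites (CEs) holding the dataset, with local catalog info."""
    locations = []
    for pubdb_url in query_refdb(dataset, owner):      # RefDB: which PubDBs publish it
        info = query_pubdb(pubdb_url, dataset, owner)  # PubDB: site-local details
        locations.append({
            "ce": info["ce"],                # CE close to the data
            "catalog": info["catalog_url"],  # local file catalog (XML or MySQL)
            "events": info["n_events"],
        })
    return locations

# Toy usage with dummy data, just to show the flow; the resulting CE list is
# what the submission tool would turn into a JDL requirement for the RB.
locs = discover_dataset(
    "SomeDataset", "SomeOwner",
    query_refdb=lambda d, o: ["http://pubdb.site1.example", "http://pubdb.site2.example"],
    query_pubdb=lambda u, d, o: {"ce": "ce." + u.split("//")[1],
                                 "catalog_url": u + "/catalog", "n_events": 100000},
)
print([l["ce"] for l in locs])
```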

Future: data location and access
- Dataset Bookkeeping Service (DBS, CMS-specific): the higher-level interface to the physicist. It provides the query mechanism; its output is a set of data chunk(s). A data chunk is an unbreakable unit (an atom) whose granularity is defined by DM/WM (today it is the dataset ...).
- Dataset Location Service (DLS): given the key identifying a data chunk, it returns the list of SE(s), from which the RB gets the CE(s). The lookup can be done at the UI or at the RB level and uses only abstract data, not files!
- Local file catalog: available at the local sites, it provides the GUID to PFN mapping.
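
A sketch of how the future chain could look, with all names hypothetical: DBS turns the physics-level query into abstract data-chunk keys, DLS maps each key to SEs, and file names appear only on the WN through the local catalog.

```python
# Hypothetical services are passed in as callables; this is a sketch of the
# flow described above, not an actual DBS/DLS API.
def locate_chunks(physics_query, dbs_query, dls_lookup, se_to_ce):
    plan = {}
    for chunk_key in dbs_query(physics_query):           # DBS: query -> data chunk keys
        ses = dls_lookup(chunk_key)                       # DLS: key -> list of SEs
        plan[chunk_key] = [se_to_ce(se) for se in ses]    # RB side: SE -> close CE
    return plan                                           # note: no file names yet

def resolve_files(guids, local_catalog):
    """Only on the WN: map GUIDs to PFNs via the site-local catalog."""
    return [local_catalog[guid] for guid in guids]
```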

CMS SW deployment
- The user job runs against pre-installed CMS software; lots of childhood problems here.
- An installation tool (xcmsi) is available and working; the installation is done via an ad-hoc job run by the privileged Grid user cms SoftwareManager.
- A meeting was held last week with Operations (Nick) and the xcmsi team to drive the future: set up an automatic mechanism to deploy releases as soon as they are available.

Job clustering
- A typical user job is split into several subjobs, each accessing a fraction of the total input data; the subjobs are identical but for a few bits (same input sandbox, same requirements, etc.). A splitting sketch is given below.
- The job cluster (or bulk) needs to be seen as a single entity, allowing bulk operations (submission, query, status, cancel, ...) while still giving access to the individual subjobs.
- Splitting can be done at the UI or (possibly) at the RB level, using data location and resource matching.
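
A minimal sketch of event-based splitting of the kind a CRAB-like tool performs; the parameter names mirror the illustrative configuration shown earlier and are assumptions, not the real CRAB splitting code.

```python
# Divide the requested events into subjobs of at most events_per_job each;
# every subjob differs only in its event range.
def split_events(total_events, events_per_job):
    """Return (first_event, n_events) pairs, one per subjob."""
    jobs = []
    first = 0
    while first < total_events:
        n = min(events_per_job, total_events - first)
        jobs.append((first, n))
        first += n
    return jobs

print(split_events(2500, 1000))   # [(0, 1000), (1000, 1000), (2000, 500)]
```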

Data Discovery and Location
- The current implementation based on RefDB and PubDB will be refactored into DBS and DLS.
- Data discovery is (and will remain) CMS-specific (DBS); data location is not: DLS can be a common tool for LCG.
- The CMS choice is to avoid file-based data discovery: the user (and the user application) does not access single files but data chunks, and does not need to know which files she will access from the WN.
- Files need to be known only at the WN level, not before! CMS wants to decouple data discovery from file access.

Dataset access monitoring
- Analyze the data access pattern: which data have been accessed by users? Which datasets need to be replicated? How efficiently are we accessing remote resources?
- Action: not easy today; central monitoring is needed and is not trivial to set up. We will investigate the LCG Logging and Bookkeeping service and MonALISA.
- It could also spot CMS-specific site access problems (e.g. incomplete catalogs, etc.) or problems with a specific dataset.

Job Monitoring
- Active discussion with the BOSS team on a useful integration with CRAB; agreement on the BOSS 4 architecture, which will be fully used by CRAB.
- It covers: job tracking; the interface to the batch scheduler (local or Grid); job monitoring; application real-time monitoring (on a "good if you can have it, life goes on if you can't" policy); job logging and bookkeeping.

Output handling
- The user wants the output on her computer, or on storage accessible from her computer (via POSIX or any usable protocol, e.g. RFIO). In general it is not interesting to keep the output on the Grid (different for the production use cases).
- If the storage has the proper server installed (e.g. gsiftp), it is possible to simply copy the output when the job is done (see the sketch below).
- What about publishing user output so that other people (e.g. the same analysis group) can use it? What about promoting private output (e.g. a re-reconstruction) to "official": provenance, etc.? The new DM/WM architecture should allow both.
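
A sketch of the "copy the output when done" step a job wrapper might perform, assuming globus-url-copy is available on the WN and that the destination gsiftp endpoint (a placeholder below) is writable by the user:

```python
import subprocess

def stage_out(local_file, dest_url):
    """Copy a local output file to gsiftp-accessible storage; if this fails,
    the output stays in the sandbox for retrieval by the submission tool."""
    cmd = ["globus-url-copy", "file://" + local_file, dest_url]
    return subprocess.run(cmd).returncode == 0

# e.g. stage_out("/tmp/job1/output.root",
#                "gsiftp://se.example.org/castor/user/output.root")
```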

User Support
- Critical and time-consuming issue, which means CRAB is actively used!
- Analysis is performed by generic users with little or no specific knowledge of the Grid; the user does not want to become a Grid expert too: there are already so many things she needs to know to do analysis, too many!
- Support covers pure Grid problems, CRAB support, data access support (problems with catalogs, missing files, problems with MSS, etc.) and ORCA problems...
- First-level triage: interoperation with the Grid/site support.

Open Science Grid integration
- A very useful discussion has begun with Frank Wuerthwein and his group.
- General idea: provide just one interface (CRAB) to the final user; the back ends for LCG and OSG could (will) be different, due to the differences between the two Grid architectures.
- Many common tools and services will be shared, notably the Data Management infrastructure (DBS, DLS, etc.).

Summary
- The first CMS working prototype for distributed user analysis is available and used by real users.
- Many lessons learnt from CRAB usage. First lesson: people are using it. Second lesson: real effort must be put into deployment; the more problems that do not need to be addressed by CMS directly, the better.